Evolution
AI³
Adaptive Information
Adaptive Innovation
Adaptive Infrastructure
a·dap·tive adj. Showing or having a capacity to make fit for new or special situations; flexible; a successful adjustment.

Date:   February 28, 2011

Photo courtesy goldonomic.com

Wikipedia + UMBEL + Friends May Offer One Approach

In the first part of this series we argued for the importance of reference structures to provide the structures and vocabularies to guide interoperability on the semantic Web. The argument was made that these reference structures are akin to human languages, requiring a sufficient richness and terminology to enable nuanced and meaningful communications of data across the Web and within the context of their applicable domains.

While the idea of such reference structures is great, and perhaps even intuitive when likened to human languages, it raises the question: what is the basis for such structures? Just as in human languages we have dictionaries, thesauri, grammar and style books or encyclopedias, what are the analogous reference sources for the semantic Web?

In this piece, we tackle these questions from the perspective of the entire Web. Similar challenges and approaches occur, of course, for virtually every domain and specific community. But, by focusing on the entirety of the Web, perhaps we can discern the grain of sand at the center of the pearl.

Bootstrapping the Semantic Web

The idea of bootstrapping is common in computers, compilers or programming. Every computer action needs to start from a basic set of instructions from which further instructions or actions are derived. Even starting up a computer (“booting up”) reflects this bootstrapping basis. Bootstrapping is the answer to the classic chicken-or-egg dilemma by embedding a starting set of instructions that provides the premise at start up [1]. The embedded operand for simple addition, for example, is the basis for building up more complete mathematical operations.

So, what is the grain of sand at the core of the semantic Web that enables it to bootstrap meaning? We start with the basic semantics and “instructions” in the core RDF, RDFS and OWL languages. These are very much akin to the basic BIOS instructions for computer boot up or the instruction sets leveraged by compilers. But, where do we go from there? What is the analog to the compiler or the operating system that gives us more than these simple start up instructions? In a semantics sense, what are the vocabularies or languages that enable us to understand more things, connect more things, relate more things?

To date, the semantic Web has given us perhaps a few dozen commonly used vocabularies, most of which are quite limited and simple pidgin languages such as DC, FOAF, SKOS, SIOC, BIBO, etc. We also have an emerging catalog of “things” and concepts from Wikipedia (via DBpedia) and similar. (Recall, in this piece, we are trying to look Web-wide, so the many fine building blocks for domain purposes such as found in biology, medicine, finance, astronomy, etc., are excluded.) The purposes and scope of these vocabularies widely differ and attack quite different slices of the information space. SKOS, for example, deals with describing simple knowledge structures like taxonomies or thesauri; SIOC is for describing social media.

By virtue of adoption, each of these core languages has proved its usefulness and role. But, as skew lines in space, how do these vocabularies relate to one another? And, how can all of the specific domain vocabularies also relate to those and one another where there are points of intersection or overlap? In short, after we get beyond the starting instructions for the semantic Web, what is our language and vocabulary? How do we complete the bootstrap process?

Clearly, like human languages, we need rich enough vocabularies to describe the things in our world and a structure of the relationships amongst those things to give our communications meaning and coherence. That is precisely the role provided by reference structures.

The Use and Role of ‘Gold Standards’

To prevent reference structures from being rubber rulers, some fixity or grounding is needed to establish a common understanding of their referents. Such fixed references are often called ‘gold standards’. In money, of course, this used to be a fixed weight of gold, until that basis was abandoned in the 1970s. In the metric system, there are a variety of fixed weights and measures that are employed. In the English language, the Oxford English Dictionary (OED) is the accepted basis for the lexicon. And so on.

Yet, as these examples show, none of these gold standards is absolute. Money now floats; multiple systems of measurement compete; a variety of dictionaries are used for English; most languages have their own reference sets; etc. The key point in all gold standards, however, is that there is wide acceptance for a defined reference for determining alignments and arbitrating differences.

Gold standards or reference standards play the role of referees or arbiters. What is the meaning of this? What is the definition of that? How can we tell the difference between this and that? What is the common way to refer to some thing?

Let’s provide one example in a semantic Web context. Let’s say we have a dataset and its schema A that we are aligning with another dataset with schema B. If I say two concepts align exactly across these datasets and you say differently, how do we resolve this difference? On one extreme, each of us can say our own interpretation is correct, and to heck with the other. On the other extreme, we can say both interpretations are correct, in which case both assertions are meaningless. Perhaps papering over these extremes is OK when only two competing views are in play, but what happens when real problems with many actors are at stake? Shall we propose majority rule, chaos, or the strongest prevails?

These same types of questions have governed human interaction from time immemorial. One of the reasons to liken the problem of interoperability on the semantic Web to human languages, as argued in Part I, is to seek lessons and guidance from how our languages have evolved. The importance of finding common ground in our syntax and vocabularies (and also, critically, in how we accept changes to those) is the basis for communication. Each of these understandings needs to be codified and documented so that they can be referenced, and so that we can have some confidence of what the heck it is we are trying to convey.

For reference structures to play their role in plugging this gap — that is, to be much more than rubber rulers — they need to have such grounding. Naturally, these groundings may themselves change with new information or learning inherent to the process of human understanding, but they still should retain their character as references. Grounded references for these things — ‘gold standards’ — are key to this consensual process of communicating (interoperating).

Some ‘Gold Standards’ for the Semantic Web

The need for gold standards for the semantic Web is particularly acute. First, by definition, the scope of the semantic Web is all things and all concepts and all entities. Second, because it embraces human knowledge, it also embraces all human languages with the nuances and varieties thereof. There is an immense gulf in referenceability from the starting languages of the semantic Web in RDF, RDFS and OWL to this full scope. This gulf is chiefly one of vocabulary (or lack thereof). We know how to construct our grammars, but we have few words with understood relationships between them to put in the slots.

The types of gold standards useful to the semantic Web are similar to those useful to our analogy of human languages. We need guidance on structure (syntax and grammar), plus reference vocabularies that encompass the scope of the semantic Web (that is, everything). Like human languages, the vocabulary references should have analogs to dictionaries, thesauri and encyclopedias. We want our references to deal with the specific demands of the semantic Web in capturing the lexical basis of human languages and the connectedness (or not) of things. We also want bases by which all of this information can be related to different human languages.

To capture these criteria, then, I submit we should consider a basic starting set of gold standards:

  • RDF/RDFS/OWL — the data model and basic building blocks for the languages
  • Wikipedia — the standard reference vocabulary of things, concepts and entities, plus other structural guidances
  • WordNet — lexical language references as an aid to natural language processing, and
  • UMBEL — the structural reference for the connectedness of things for basic coherence and inference, plus a vocabulary for mapping amongst reference structures and things.

Each of these potential gold standards is next discussed in turn. The majority of discussion centers on Wikipedia and UMBEL.

RDF/RDFS/OWL: The Language

Naturally, the first suggested gold standard for the semantic Web is the set of RDF/RDFS/OWL language components. Other writings have covered their uses and roles [2]. In relation to their use as a gold standard, two documents, one on RDF semantics [3] and the other an OWL primer [4], are great starting points. Since these languages are now in place and are accepted bases of the semantic Web, we will concentrate on the remaining members of the standard reference set.

Wikipedia: The Vocabulary (and More)

The second suggested gold standard for the semantic Web is Wikipedia, principally as a sort of canonical vocabulary base or lexicon, but also for some structural aspects. Wikipedia now contains about 3.5 million English articles, far more than any other knowledge base, and has more than 250 language versions. Each Wikipedia article acts as more or less a reference for the thing it represents. In addition, the size, scope and structure of Wikipedia make it an unprecedented resource for researchers engaged in natural language processing (NLP), information extraction (IE) and semantic Web-related tasks.

For some time I have been maintaining a listing called SWEETpedia of academic and research articles focused on the use of Wikipedia for these tasks. The latest version tracks some 250 articles [5], which I guess to be about one half or more of all such research extant. This research shows a broad variety of potential roles and contributions from Wikipedia as a gold standard for the semantic Web, some of which is detailed in the tables below.

An excellent report by Olena Medelyan et al. from the University of Waikato in New Zealand, Mining Meaning from Wikipedia, organized this research up through 2008 and provided detailed commentary and analysis of the role of Wikipedia [6]. They noted, for example, that Wikipedia has potential use as an encyclopedia (its intended use), a corpus for testing and modeling NLP tasks, as a thesaurus, a database, an ontology or a network structure. The Intelligent Wikipedia project from the University of Washington has also done much innovative work on “automatically learned systems [that] can render much of Wikipedia into high-quality semantic data, which provides a solid base to bootstrap toward the general Web” [7].

However, as we proceed through the next discussions, we’ll see that the weakest aspect of Wikipedia is its category structure. Thus, while Wikipedia is unparalleled as the gold standard for a reference vocabulary for the Web, and has other structural uses as well, we will need to look elsewhere for how that content is organized.

Major Wikipedia Initiatives

Many groups have recognized these advantages for Wikipedia, and have built knowledge bases around it. Many of these groups have also recognized the category (schema) weaknesses in Wikipedia and have proposed alternatives. Some of these major initiatives, which also collectively represent a large number of the research articles in SWEETpedia, include:

Project | Schema Basis | Comments
DBpedia | Wikipedia Infoboxes | excellent source for URI identifiers; structure extraction basis used by many other projects
Freebase | User Generated | schema are for domains based on types and properties; at one time had a key dependence on Wikipedia; has since grown much from user-generated data and structure; now owned by Google
Intelligent Wikipedia | Wikipedia Infoboxes | a broad program and a general set of extractors for obtaining structure and relationships from Wikipedia; was formerly known as KOG; from Univ of Washington
SIGWP | Wikipedia Ontology | the Special Interest Group of Wikipedia (Research or Mining); a general group doing research on Wikipedia structure and mining; schema basis is mostly from a thesaurus; group has not published in two years
UMBEL | UMBEL Reference Concepts | RefConcepts based on the Cyc knowledge base; provides a tested, coherent concept schema, but one with gaps regarding Wikipedia content; has 28,000 concepts mapped to Wikipedia
WikiNet | Extracted Wikipedia Ontology | part of a long-standing structure extraction effort from Wikipedia leading to an ontology; formerly known as WikiRelate; from the Heidelberg Institute for Theoretical Studies (HITS)
Wikipedia Miner | N/A | generalized structure extractor; part of a wider basis of Wikipedia research at the Univ of Waikato in New Zealand
Wikitology | Wikipedia Ontology | general RDF and ontology-oriented project utilizing Wikipedia; effort now concluded; from the Ebiquity Group at the Univ of Maryland
YAGO | WordNet | maps WordNet to Wikipedia, with structured extraction of relations for characterizing entities


It is interesting to note that none of the efforts above uses the Wikipedia category structure “as is” for its schema.

Structural Sources within Wikipedia

The surface view of Wikipedia is topic articles placed into one or more categories. Some of these pages also include structured data tables (or templates) for the kind of thing the article is; these are called infoboxes. An infobox is a fixed-format table placed at the top right of articles to consistently present a summary of some unifying aspect that the articles share. For example, see the listing for my home town, Iowa City, which has a city infobox.
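
To make the infobox idea concrete, here is a minimal sketch of how a flat infobox might be read as attribute-value pairs. The sample wikitext is hypothetical and heavily simplified (real infoboxes nest templates, links and references, which this does not handle), and `parse_infobox` is an illustrative helper name, not part of any Wikipedia or DBpedia tooling.

```python
import re

# Hypothetical, simplified wikitext for a city infobox (values are illustrative only).
SAMPLE = """{{Infobox settlement
| name = Example City
| settlement_type = City
| population_total = 12345
}}"""

def parse_infobox(wikitext):
    """Return (template_name, {attribute: value}) for one flat infobox."""
    lines = wikitext.strip().splitlines()
    # The first line carries the template name after the opening braces.
    template = lines[0].lstrip("{").strip()
    attributes = {}
    for line in lines[1:]:
        # Attribute lines look like "| key = value"; the closing "}}" is skipped.
        match = re.match(r"\|\s*([^=|]+?)\s*=\s*(.*)$", line.strip())
        if match:
            attributes[match.group(1)] = match.group(2).strip()
    return template, attributes

template, attrs = parse_infobox(SAMPLE)
print(template, attrs)
```

The template name hints at the kind of thing the article describes, while the attributes are the key-value pairs that structure extractors harvest.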

However, this cursory look at Wikipedia in fact masks much additional and valuable structure. Some early researchers noted this [8]. The recognition of structure has also been a key driver for the interest in Wikipedia as a knowledge base (in addition to its global content scope). The following table is a fairly complete listing of structure possibilities within Wikipedia (see Endnotes for any notes):

Wikipedia Structure | Potential Applications | Note
Corpus
  Entire Corpus | knowledge base; graph structure; corpus for n-grams, other constructions | [9]
Categories
  Category | category suggestion; semantic relatedness; query expansion; potential parent category |
  Contained Articles | semantically-related terms (siblings) |
  Hierarchy | hyponymic and meronymic relations between terms |
  Listing Pages/Categories | semantically-related terms (siblings) |
  Patterned Categories | functional metadata | [9]
Infobox Templates
  Attributes | synonyms; key-value pairs |
  Values | units of measure; fact extraction | [9]
  Items | category suggestion; entity suggestion |
  Geolocational | coordinates; places; geolocational; (may also appear in full article text) |
Issue Templates
  Multiple Types | exclusion candidates; other structural analysis; examples include Stub, Message Boxes, Multiple Issues | [9]
Category Templates [13]
  Category Name | disambiguation; relatedness |
  Category Links | semantic relatedness |
Articles
  First Paragraph | definition; abstract |
  Full Text | complete discussion; related terms; context; translations; NLP analysis basis; relationships; sentiment |
  Redirects | synonymy; spelling variations, misspellings; abbreviations; query expansion |
  Title | named entities; domain specific terms or senses |
  Subject | category suggestion (phrase marked in bold in first paragraph) |
  Section Heading(s) | category suggestion; semantic relatedness | [9]
  See Also | related concepts; query expansion | [9]
  Further Reading | related concepts | [9,10]
  External Links | related concepts; external harvest points |
Article Links
  Context | related terms; co-occurrences |
  Label | synonyms; spelling variations; related terms; query expansion |
  Target | link graph; related terms |
  LinksTo | category suggestion; functional metadata |
  LinkedFrom | category suggestion; functional metadata |
References
  Citations | external harvest points | [9,10]
Media
  Images | thumbnails; image recognition for disambiguation; controversy (edit/upload frequency) | [11]
  Captions | related concepts; related terms; functional metadata | [9]
Disambiguation Pages
  Article Links | sense inventory |
Discussion Pages
  Discussion Content | controversy |
  Redux for Article Structure | see Articles for uses |
History Pages
  Edit Frequency | topicalness; controversy (diversity of editors, reversions) |
  Edit Basis | lexical errors | [9]
Lists
  Hyponyms | instances; named entity candidates |
Alternate Language Versions
  Redux for All Structures | see all items above; translation; multilingual alignment; entity disambiguation | [12]

The potential for Wikipedia to provide structural understandings is evident from this table. However, it should be noted that, aside from some stray research initiatives, most effort to date has come from the major initiatives noted earlier or has focused on analyzing linking and infoboxes. There is much additional research that could be powered by the Wikipedia structure as it presently exists.

From the standpoint of the broader semantic Web, the potential of Wikipedia in the areas of metadata enhancement and mapping to multiple human languages [12] is particularly strong. We are only now at the very beginning phases of tapping this potential.

Structural Weaknesses

The three main weaknesses with Wikipedia are its category structure [14], inconsistencies and incompleteness. The first weakness means Wikipedia is not a suitable organizational basis for the semantic Web; the next two weaknesses, due to the nature of Wikipedia’s user-generated content, are constantly improving.

Our recent effort to map between UMBEL and Wikipedia, undertaken as part of the recent UMBEL v 1.00 release, spent considerable time analyzing the Wikipedia category structure [15]. Of the roughly half million categories in Wikipedia, only about 85,000 were found to be suitable candidates to participate in an actual schema structure. Further breakdowns are shown by this table resulting from our analysis:

Wikipedia Category Breakdowns
Removals | 20.7%
  Administrative | 15.7%
  Misc Cleaning | 5.0%
Functional (not schema) | 61.8%
  Fn Dates | 10.1%
  Fn Nationalities | 9.6%
  Fn Listings, related | 0.8%
  Fn Occupations | 1.0%
  Fn Prepositions | 40.4%
Candidates | 17.4%
  SuperTypes | 1.7%
  General Structure | 15.7%
TOTAL | 100.0%
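
As a quick arithmetic check on the breakdown above, the short sketch below sums the sub-rows of each group and estimates the number of candidate categories. The percentages are copied from the table; treating "roughly half million" as exactly 500,000 is an assumption made only for illustration.

```python
# Percentages copied from the category breakdown table above.
breakdown = {
    "Removals": {"Administrative": 15.7, "Misc Cleaning": 5.0},
    "Functional": {"Dates": 10.1, "Nationalities": 9.6, "Listings, related": 0.8,
                   "Occupations": 1.0, "Prepositions": 40.4},
    "Candidates": {"SuperTypes": 1.7, "General Structure": 15.7},
}
reported = {"Removals": 20.7, "Functional": 61.8, "Candidates": 17.4}

# Sum the sub-rows of each group and compare with the reported group totals.
subtotals = {group: round(sum(parts.values()), 1) for group, parts in breakdown.items()}
# Small gaps are expected because the table rounds each entry to one decimal.
gaps = {group: round(subtotals[group] - reported[group], 1) for group in reported}

# Assumption: "roughly half million" is taken as exactly 500,000 categories.
TOTAL_CATEGORIES = 500_000
estimated_candidates = round(TOTAL_CATEGORIES * reported["Candidates"] / 100)
print(subtotals, gaps, estimated_candidates)
```

Under that assumption the estimate comes to 87,000 candidate categories, in line with the "about 85,000" figure quoted in the text, and the only subtotal that differs from its reported group total is the functional one, by a rounding-sized 0.1 points.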

Fully 1/5 of the categories are administrative or internal in nature. The large majority of categories are, in fact, not structural at all, but what we term functional categories, which means the category contains faceting information (such as subclassifying musicians into British musicians) [16]. Functional categories can be a rich source of supplementary metadata for their assigned articles (though no one has yet processed Wikipedia in this manner), but they are not a useful basis for structural conceptual relationships or inferencing.
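
The distinction between functional and structural categories can be illustrated with a rough name-based heuristic. This is only a sketch under stated assumptions: the word lists are hypothetical and deliberately tiny, the rules are not the ones used in the analysis described here, and `classify_category` is an illustrative name.

```python
import re

# Hypothetical, deliberately tiny word lists; a real analysis would need far larger ones.
NATIONALITIES = {"British", "French", "American"}
PREPOSITIONS = {"of", "in", "by", "from", "for", "at"}

def classify_category(name):
    """Guess whether a category name is administrative, functional or a schema candidate."""
    words = name.split()
    if name.startswith("Wikipedia") or "stubs" in words:
        return "administrative"
    if re.search(r"\b\d{4}s?\b", name):          # e.g. "1950 births", "1980s albums"
        return "functional: date"
    if words[0] in NATIONALITIES:                 # e.g. "British musicians"
        return "functional: nationality"
    if any(word in PREPOSITIONS for word in words[1:]):  # e.g. "Musicians from London"
        return "functional: preposition"
    return "candidate"

for example in ["British musicians", "Musicians from London", "1950 births", "Musicians"]:
    print(example, "->", classify_category(example))
```

Only the last example survives as a structural candidate; the others carry facets (nationality, place, date) that read more naturally as metadata on the articles than as nodes in a concept hierarchy.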

This weakness in the Wikipedia category system has been known for some time [17], but researchers and others still attempt to do mappings on mostly uncleaned categories. Though most researchers recognize and remove internal or administrative categories in their efforts, using the indiscriminate remainder of categories still leads to poor precision in resulting mappings. In fact, in comparison to one of the more rigorous assessments to date [18], our analysis still showed a 6.8% error rate in hand inspected categories.

Other notable category problems include circular references, skipped intermediate categories, misassigned categories and incomplete assignments.
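
Circular references, at least, are straightforward to detect mechanically. The sketch below runs a depth-first search over a toy category graph (the category names and links are invented for illustration) and returns the first cycle it finds; it is one standard way to do this, not necessarily the method used in the analysis above.

```python
def find_cycle(subcategories):
    """Return one cycle as a list of category names, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for child in subcategories.get(node, []):
            state = color.get(child, WHITE)
            if state == GRAY:
                # The child is already on the current path, so we have looped back.
                return path[path.index(child):] + [child]
            if state == WHITE:
                cycle = visit(child)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for start in list(subcategories):
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None

# Toy data: parent category -> list of subcategories (hypothetical).
toy = {"Music": ["Musicians"], "Musicians": ["Bands"], "Bands": ["Music"], "Art": ["Music"]}
print(find_cycle(toy))
```

For a graph the size of the full Wikipedia category system, an iterative version would be preferable to avoid Python's recursion limit.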

Nonetheless, Wikipedia categories do have a valuable use in the analysis of local relationships (one degree of relatedness) and for finding missing category candidates. And, as noted, the functional categories are also a rich and untapped source of additional article metadata.

Like any knowledge base, Wikipedia also has inconsistent and incomplete coverage of topics [19]. However, as more communities accept Wikipedia as a central resource deserving completeness, we should see these gaps continue to get filled.

The DBpedia Implementation

One of the first database versions of Wikipedia built for semantic Web purposes is DBpedia. DBpedia has an incipient ontology useful for some classification purposes. Its major structural organization is built around the Wikipedia infoboxes, which are applied to about a third of Wikipedia articles. DBpedia also has multiple language versions.

DBpedia is a core hub of Linked Open Data (LOD), which now has about 300 linked datasets; has canonical URIs used by many other applications; has extracted versions and tools very useful for further processing; and has recently moved to incorporate live updates from the source Wikipedia [20]. For these reasons, the DBpedia version of Wikipedia is the suggested implementation version.

WordNet: Language Relationships

The third suggested gold standard for the Semantic Web is WordNet, a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. There are over 50 languages covered by wordnet approaches, most mapped to this English WordNet [21].
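
The synset idea can be pictured with a toy data structure. The sketch below is hand-built and does not use the WordNet database or any WordNet library; the two synsets and their glosses are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    words: frozenset                                  # the set of synonymous words
    gloss: str                                        # short, general definition
    hypernyms: list = field(default_factory=list)     # more general synsets

# Invented examples; the glosses are not WordNet's actual definitions.
vehicle = Synset(frozenset({"vehicle"}), "a conveyance that transports people or objects")
car = Synset(frozenset({"car", "auto", "automobile"}), "a motor vehicle with four wheels", [vehicle])
SYNSETS = [vehicle, car]

def synonyms(word):
    """All other words that share a synset with `word`."""
    result = set()
    for synset in SYNSETS:
        if word in synset.words:
            result |= synset.words - {word}
    return result

def hypernym_chain(synset):
    """Follow the first hypernym link upward until none remains."""
    chain = []
    while synset.hypernyms:
        synset = synset.hypernyms[0]
        chain.append(synset)
    return chain

print(synonyms("car"), [s.gloss for s in hypernym_chain(car)])
```

The synonym sets play the thesaurus role, the glosses the dictionary role, and the links between synsets are the semantic relations that text analysis tools traverse.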

Though it has been used in many ontologies [22], WordNet is most often mapped for its natural language purposes and not used as a structure of conceptual relationships per se. This is because it is designed for words and not concepts. It contains hundreds of basic semantic inconsistencies and also lacks much domain applicability. Entities, of course, are also lacking. In those cases where WordNet has been embraced as a schema basis, much work is generally expended to transform it into an ontology suitable for knowledge representation.

Nonetheless, for word sense disambiguation and other natural language processing tasks, as well as for aiding multi-lingual mappings, WordNet and its various other language variants are a language reference gold standard.

UMBEL: A Coherent Structure

So, with these prior gold standards we gain a basic language and grammar; a base (canonical) vocabulary and some structure guidance; and a reference means for processing and extracting information from input text. Yet two needed standards remain.

One needed standard is a conceptual organizing structure (or schema) by which the canonical vocabulary of concepts and instances can be related. This core structure should be constructed in a coherent [23] manner and expressly designed to support inferencing and (some) reasoning. This core structure should be sufficiently large to embrace the scope of the semantic Web, but not so detailed as to make it computationally inefficient. Thus, the core structure should be a framework that allows more focused and purposeful vocabularies to be “plugged in”, depending on the domain and task at hand. Unfortunately, the candidate category structures from our other gold standards in Wikipedia and WordNet do not meet these criteria.
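
A small example may help show what "designed to support inferencing" buys. Assuming a subclass relation that is transitive (as `rdfs:subClassOf` is), the sketch below derives every ancestor of a concept from a handful of direct assertions. The concept names are hypothetical and are not taken from UMBEL or Cyc.

```python
def superclasses(concept, sub_class_of):
    """All ancestors of `concept` under a transitive subclass relation."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in sub_class_of.get(current, ()):
            if parent not in seen:     # also guards against circular assertions
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical direct assertions: concept -> its immediate parent concepts.
SUB_CLASS_OF = {
    "Musician": ["Artist"],
    "Artist": ["Person"],
    "Person": ["Agent"],
}

print(superclasses("Musician", SUB_CLASS_OF))
```

Three direct assertions yield three inferred memberships for anything typed as a musician. The same mechanism produces nonsense when the hierarchy is incoherent, which is why the coherence of the core structure matters.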

A second needed standard is a bit of additional vocabulary “glue” specifically designed for the purposes of the semantic Web and ontology and domain incorporation. We have multiple and disparate world views and contexts, as well as the things described by them [24]. To get them to interoperate — and to acknowledge differences in alignment or context — we need a set of relational predicates (vocabulary) that can capture a range of mappings from the exact to the approximate [25]. Unlike other reference vocabularies that attempt to capture canonical definitions within defined domains, this vocabulary is expressly required by the semantic Web and its goal to federate different data and schema.
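
To illustrate a range of mappings from the exact to the approximate, the sketch below partitions a few mapping assertions by the strength of their predicate. It uses existing OWL and SKOS mapping properties as stand-ins rather than the mapping vocabulary discussed here, the grouping into "exact" and "approximate" is an assumption made for illustration, and the `a:` and `b:` terms are hypothetical.

```python
# Real OWL/SKOS properties, grouped here by an illustrative notion of strength.
EXACT = {"owl:sameAs", "owl:equivalentClass", "skos:exactMatch"}
APPROXIMATE = {"skos:closeMatch", "skos:broadMatch", "skos:narrowMatch", "skos:relatedMatch"}

# Hypothetical mappings between schema A and schema B, as (subject, predicate, object).
mappings = [
    ("a:Musician", "skos:exactMatch", "b:Musician"),
    ("a:Band", "skos:closeMatch", "b:MusicalGroup"),
    ("a:Guitarist", "skos:broadMatch", "b:Musician"),
]

def partition(triples):
    """Split mappings into those safe to treat as identical and those that are only approximate."""
    exact = [t for t in triples if t[1] in EXACT]
    approximate = [t for t in triples if t[1] in APPROXIMATE]
    return exact, approximate

exact, approximate = partition(mappings)
print(len(exact), "exact;", len(approximate), "approximate")
```

Keeping the approximate mappings distinct lets an application merge data only where the assertion is strong enough, while still using the weaker links for browsing or query expansion.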

UMBEL has been expressly designed to address both of these needs [26]. UMBEL is a coherent categorization structure for the semantic Web and a mapping vocabulary designed for dataset and conceptual interoperability. UMBEL’s 28,000 reference concepts (RefConcepts) are based on the Cyc knowledge base [27], which itself is expressly designed as a common sense representation of the world with express variations in context supported via its 1000 or so microtheories. Cyc, and UMBEL which is based upon it, are by no means the “correct” or “only” representations of the world, but they are coherent ones and thus internally consistent.

UMBEL’s role to allow datasets to be “plugged in” and related through some fixed referents was expressed by this early diagram [28]:

Lightweight Binding to an Upper Subject Structure Can Bring Order

The idea — which is still central to this kind of reference structure — is that a set of reference concepts can be used by multiple datasets to connect and then inter-relate. These are shown by the nested subjects (concepts) in the umbrella structure.
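
The binding idea can be sketched in a few lines: if each dataset maps its local terms to a shared reference concept, terms from different datasets become related without any direct mapping between them. All dataset names, terms and `ref:` concepts below are hypothetical.

```python
from collections import defaultdict

# (dataset, local term) -> reference concept; every name here is hypothetical.
BINDINGS = {
    ("dataset_A", "a:Singer"): "ref:Musician",
    ("dataset_B", "b:MusicArtist"): "ref:Musician",
    ("dataset_B", "b:Town"): "ref:City",
    ("dataset_C", "c:Municipality"): "ref:City",
}

def terms_by_concept(bindings):
    """Group local terms from all datasets under the reference concept they bind to."""
    grouped = defaultdict(list)
    for (dataset, term), concept in bindings.items():
        grouped[concept].append((dataset, term))
    return dict(grouped)

for concept, terms in terms_by_concept(BINDINGS).items():
    print(concept, "<-", terms)
```

With n datasets, each needs only its own bindings to the reference structure rather than a pairwise mapping to every other dataset.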

UMBEL, of course, is not the only coherent structure for such interoperability purposes. Other major vocabularies (such as LCSH; see below) or upper-level ontologies (such as SUMO, DOLCE, BFO or PROTON, etc.) can fulfill portions of these roles, as well. In fact, the ultimate desire is for multiple reference structures to emerge that are mapped to one another, similar to how human languages can inter-relate. Yet, even in that desired vision, there is still a need for a bootstrapped grounding. UMBEL is the first such structure expressly designed for the two needed standards.

Mappings to the Other Standards

UMBEL is already based on the central semantic Web languages of RDF, RDFS, SKOS, and OWL 2. The recent version 1.00 now maps 60% of UMBEL to Wikipedia, with efforts for the remainder in process. UMBEL provides mappings to WordNet, via its Cyc relationships. More of this is in process and will be exposed. And the mappings between UMBEL and GeoNames [29] for locational purposes are also nearly complete.

The Gold Resides in Combining These Standards

Each of these reference structures — RDF/OWL, Wikipedia, WordNet, UMBEL — is itself coherent and recognized or used by multiple parties for potential reference purposes on the semantic Web. The advocacy of them as standards is hardly radical.

However, the gold lies in the combination of these components. It is in this combination that we can see a grounded knowledge base emerge that is sufficient for bootstrapping the semantic Web.

The challenge in creating this reference knowledge base is in the mapping between the components. Fortunately, all of the components are already available in RDF/OWL. WordNet already has significant mappings to Wikipedia and UMBEL. And 60% of UMBEL is already mapped to Wikipedia. The remaining steps for completing these mappings are very near at hand. Other vocabularies, such as GeoNames [29], would also beneficially contribute to such a reference base.

Yet to truly achieve a role as a gold standard, these mappings should be fully vetted and accurate. Automated techniques that embed errors are unacceptable. Gold standards should not themselves be a source for propagation of errors. Like dictionaries or thesauri, we need reference structures that are of high quality and deserving of reference. We need canonical structures and canonical vocabularies.

But, once done, these gold standards themselves become reference sources that can aid automatic and semi-automatic mappings of other vocabularies and structures. Thus, the real payoff is not that these gold standards themselves get actually embedded in specific domain uses or whatever, but that they can act as reference referees for helping align and ground other structures.

Like the bootstrap condition, more and more reference structures may be brought into this system. A reference structure does not mean reliance; it need not even have more than minimal use. As new structures and vocabularies are brought into the mix, appropriate to specific domains or purposes, reference to other grounding structures will enable the structures and vocabularies to continue to expand. So, not only are reference concepts necessary for grounding the semantic Web, but we also need to pick good mapping predicates for properly linking these structures together.

In this manner, many alternative vocabularies can be bootstrapped and mapped and then used as the dominant vocabularies for specific purposes. For example, at the level of general knowledge categorization, vocabularies such as LCSH, the Dewey Decimal Classification, UDC, etc., can be preferentially chosen. Other specific vocabularies are at the ready, with many already used for domain purposes. Once grounded, these various vocabularies can also interoperate.

Grounding in gold standards enables the freedom to switch vocabularies at will. Establishing fixed reference points via such gold standards will power a virtuous circle of more vocabularies, more mappings, and, ultimately, functional interoperability no matter the need, domain or world view.

This is the last of a two-part series on the importance and choice of reference structures (Part I) and gold standards (Part II) on the semantic Web.

[1] For example, according to the Wikipedia entry on Machine code, “A machine code instruction set may have all instructions of the same length, or it may have variable-length instructions. How the patterns are organized varies strongly with the particular architecture and often also with the type of instruction. Most instructions have one or more opcode fields which specifies the basic instruction type (such as arithmetic, logical, jump, etc) and the actual operation (such as add or compare) and other fields that may give the type of the operand(s), the addressing mode(s), the addressing offset(s) or index, or the actual value itself.”
[2] See, for example, M.K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Information blog, April 8, 2009; see http://www.mkbergman.com/483/advantages-and-myths-of-rdf/ and M.K. Bergman, 2010. “Ontology Tutorial Series,” AI3:::Adaptive Information blog, September 27, 2010; see http://www.mkbergman.com/916/ontology-tutorial-series/.
[3] Patrick Hayes, ed., 2004. RDF Semantics, W3C Recommendation 10 February 2004. See http://www.w3.org/TR/rdf-mt/.
[4] Pascal Hitzler et al., eds., 2009. OWL 2 Web Ontology Language Primer, a W3C Recommendation, 27 October 2009; see http://www.w3.org/TR/owl2-primer/.
[5] See SWEETpedia from the AI3:::Adaptive Information blog, which currently lists about 250 articles and citations.
[6] Olena Medelyan, Catherine Legg, David Milne and Ian H. Witten, 2008. Mining Meaning from Wikipedia, Working Paper Series ISSN 1177-777X, Department of Computer Science, The University of Waikato (New Zealand), September 2008, 82 pp. See http://arxiv.org/ftp/arxiv/papers/0809/0809.4530.pdf. This paper and its findings are discussed more in M.K. Bergman, 2008. “Research Shows Natural Fit between Wikipedia and Semantic Web,” AI3:::Adaptive Information blog, October 15, 2008; see http://www.mkbergman.com/460/research-shows-natural-fit-between-wikipedia-and-semantic-web/.
[7] For a comprehensive treatment, see Fei Wu, 2010. Machine Reading: from Wikipedia to the Web, a doctoral thesis to the Department of Computer Science, University of Washington, 154 pp; see http://ai.cs.washington.edu/www/media/papers/Wu-thesis-2010.pdf. To my knowledge, this paper also was the first to use the “bootstrapping” metaphor.
[8] Quite a few research papers have characterized various aspects of the Wikipedia structure. One of the first and most comprehensive was Torsten Zesch, Iryna Gurevych, Max Mühlhäuser, 2007b. Analyzing and Accessing Wikipedia as a Lexical Semantic Resource, and the longer technical report. See http://www.ukp.tu-darmstadt.de/software/JWPL. Also, 2008. In Proceedings of the Biannual Conference of the Society for Computational Linguistics and Language Technology, pp. 213–221. Also, for another early discussion, see Linyun Fu, Haofen Wang, Haiping Zhu, Huajie Zhang, Yang Wang and Yong Yu, 2007. Making More Wikipedians: Facilitating Semantics Reuse for Wikipedia Authoring. See http://data.semanticweb.org/pdfs/iswc-aswc/2007/ISWC2007_RT_Fu.pdf
[9] This structural basis in Wikipedia is largely untapped.
[10] Citations and references appear to be highly selective (biased) in Wikipedia; nonetheless, those available are useful seeding points for more suitable harvests.
[11] Images have been used as thumbnails and linked references to the articles they are hosted in, but have not been analyzed much for semantics or file names.
[12] There are a variety of efforts underway to use Wikipedia as a multi-language cross-reference based on its 250 language versions; search, for example, on “multiple language” in SWEETpedia. Both named entity and concept matches can be used to correlate in multiple languages. This is greatly aided by inter-language links.
[13] When present, these appear at the bottom of an article and have many related categories; see this one for the semantic Web.
[14] See further http://en.wikipedia.org/wiki/Wikipedia:Category and http://en.wikipedia.org/wiki/Wikipedia:Categorization_FAQ for a discussion of use and guidelines for Wikipedia categories.
[15] For the release notice, see http://umbel.org/content/finally-umbel-v-100. Annex H to the UMBEL Specifications provides a description of the mapping methodologies and results.
[16] Functional categories combine two or more facets in order to split or provide more structured characterization of a category. For example, Category:English cricketers of 1890 to 1918, has as its core concept the idea of a cricketer, a sports person. But, this is also further characterized by nationality and time period. Functional categories tend to have an A x B x C construct, with prepositions denoting the facets. From a proper characterization standpoint, the items in this category should be classified as a Person –> Sports Person –> Cricketer, with additional facets (metadata) of being English and having the period 1890 to 1918 assigned.
[17] See, for example, Massimo Poesio et al., 2008. ELERFED: Final Report, see http://www.cl.uni-heidelberg.de/~ponzetto/pubs/poesio07.pdf, wherein they state, “We discovered that in the meantime information about categories in Wikipedia had grown so much and become so unwieldy as to limit its usefulness.” Additional criticisms of the category structure may be found in S. Chernov, T. Iofciu, W. Nejdl and X. Zhou, 2006. “Extracting Semantic Relationships between Wikipedia Categories,” in Proceedings of the 1st International Workshop: SemWiki’06—From Wiki to Semantics., co-located with the 3rd Annual European Semantic Web Conference ESWC’06 in Budva, Montenegro, June 12, 2006; and L Muchnik, R. Itzhack, S. Solomon and Y. Louzon, 2007. “Self-emergence of Knowledge Trees: Extraction of the Wikipedia Hierarchies,” in Physical Review E 76(1). Also, this blog post from Bob Bater at KOnnect, “Wikipedia’s Approach to Categorization,” September 22, 2008, provides useful comments on category issues; see http://iskouk.wordpress.com/2008/09/22/wikipedias-approach-to-categorization/.
[18] Olena Medelyan and Cathy Legg, 2008. Integrating Cyc and Wikipedia: Folksonomy Meets Rigorously Defined Common-Sense, in Proceedings of the WIKI-AI: Wikipedia and AI Workshop at the AAAI08 Conference, Chicago, US. See http://www.cs.waikato.ac.nz/~olena/publications/Medelyan_Legg_Wikiai08.pdf.
[19] As two references among many, see A. Halavais and D. Lackaff, 2008. “An Analysis of Topical Coverage of Wikipedia,” in Journal of Computer-Mediated Communication 13 (2): 429–440; and A. Kittur, E. H. Chi and B. Suh, 2009. “What’s in Wikipedia? Mapping Topics and Conflict using Socially Annotated Category Structure,” in Proceedings of the 27th Annual CHI Conference on Human Factors in Computing Systems, pp 4–9.
[20] See DBpedia.org, especially DBpedia reference.
[21] See http://www.globalwordnet.org/gwa/wordnet_table.htm for a listing of known wordnets by language.
[22] For example, see this listing in Wikipedia.
[23] M.K. Bergman, 2008. “When is Content Coherent?,” AI3:::Adaptive Information blog, July 25, 2008; see http://www.mkbergman.com/450/when-is-content-coherent/.
[24] For a couple of useful references on this topic, first see this discussion regarding contexts (and the possible relation to Cyc microtheories): Ramanathan V. Guha, Rob McCool, and Richard Fikes, 2004. “Contexts for the Semantic Web,” in Sheila A. McIlraith, Dimitris Plexousakis, and Frank van Harmelen, eds., International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science, pp. 32-46. Springer, 2004. See http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.58.2368&rep=rep1&type=pdf. For another discussion about local differences and contexts and the difficulty of reliance on “common” understandings, see: Krzysztof Janowicz, 2010. “The Role of Space and Time for Knowledge Organization on the Semantic Web,” in Semantic Web 1: 25–32; see http://iospress.metapress.com/content/636610536×307213/fulltext.pdf.
[25] OWL already provides the exact predicates; see further M.K. Bergman, 2010. “The Nature of Connectedness on the Web,” AI3:::Adaptive Information blog, November 22, 2010; see http://www.mkbergman.com/935/the-nature-of-connectedness-on-the-web/ and the UMBEL mapping predicates in this vocabulary listing.
[26] UMBEL is a reference of 28,000 concepts (classes and relationships) derived from the Cyc knowledge base. The reference concepts of UMBEL are mapped to Wikipedia, DBpedia ontology classes, GeoNames and PROTON. UMBEL is designed to facilitate the organization, linkage and presentation of heterogeneous datasets and information. It is meant to lower the time, effort and complexity of developing, maintaining and using ontologies, and aligning them to other content. See further the UMBEL Specifications (including Annexes A – H), Vocabulary and RefConcepts.
[27] Cyc is an artificial intelligence project that has assembled a comprehensive ontology and knowledge base of everyday common sense knowledge, with the goal to provide human-like reasoning. The OpenCyc version 3.0 contains nearly 200,000 terms and millions of relationship assertions. Started in 1984, by 2010 an estimated 1000 person years had been invested in its development.
[28] This image and more related to the general question of interoperability in relation to a reference structure is provided in M.K. Bergman, 2007, “Where are the Road Signs for the Structured Web?,” AI3:::Adaptive Information blog, May 29, 2007; see http://www.mkbergman.com/375/where-are-the-road-signs-for-the-structured-web/.
[29] GeoNames is a geographical database available for free download under a Creative Commons Attribution license. It contains over 10 million geographical names and consists of 7.5 million unique features, of which 2.8 million are populated places. All features are categorized into one out of nine feature classes and further subcategorized into one out of 645 feature codes. Given the importance of locational information, GeoNames is a natural complement to the gold standards mentioned herein. See further its Web site, which also showcases a nifty browser of mappings to Wikipedia.

Posted by AI3's author, Mike Bergman

Posted on February 28, 2011 at 12:07 am in Semantic Web, Structured Web, UMBEL | Comments (2)
The URI link reference to this post is: http://www.mkbergman.com/947/in-search-of-gold-standards-for-the-semantic-web/
The URI to trackback this post is: http://www.mkbergman.com/947/in-search-of-gold-standards-for-the-semantic-web/trackback/
Date:   February 21, 2011

Hitting the Sweet Spot

Reference Structures Provide a Third Way

Since the first days of the Web there has been an ideal that its content could extend beyond documents and become a global, interoperating storehouse of data. This ideal has become what is known as the “semantic Web”. And within this ideal there has been a tension between two competing world views of how to achieve this vision. At the risk of being simplistic, we can describe these world views as informal v. formal, sometimes expressed as “bottom up” v. “top down” [1,2].

The informal view emphasizes freeform and diversity, using more open tagging and a bottom-up approach to structuring data [3]. This group is not anarchic, but it does support the idea of open data, open standards and open contributions. This group tends to be oriented to RDF and is (paradoxically) often not very open to non-RDF structured data forms (as, for example, microdata or microformats). Social networks and linked data are quite central to this group. RDFa, tagging, user-generated content and folksonomies are also key emphases and contributions.

The formal view tends to support more strongly the idea of shared vocabularies with more formalized semantics and design. This group uses and contributes to open standards, but is also open to proprietary data and structures. Enterprises and industry groups with standard controlled vocabularies and interchange languages (often XML-based) more typically reside in this group. OWL and rules languages are more often the basis for this group’s formalisms. The formal view also tends to split further into two camps: one that is more top down and engineering oriented, typically with a more closed world approach to schema and ontology development [4]; and a second that is more adaptive and incremental and relies on an open world approach [5].
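The difference between the closed world and open world approaches can be made concrete with a minimal sketch. The fact base, predicate names and function names below are hypothetical and purely illustrative; they are not drawn from any particular RDF or OWL toolkit.

```python
from enum import Enum


class Answer(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


# Hypothetical toy fact base of (subject, predicate, object) statements.
FACTS = {
    ("Toronto", "locatedIn", "Canada"),
    ("Chicago", "locatedIn", "USA"),
}


def ask_closed_world(statement, facts=FACTS):
    """Closed world: whatever is not asserted is treated as false."""
    return Answer.TRUE if statement in facts else Answer.FALSE


def ask_open_world(statement, facts=FACTS, denied=frozenset()):
    """Open world: a missing assertion is unknown unless explicitly denied."""
    if statement in facts:
        return Answer.TRUE
    if statement in denied:
        return Answer.FALSE
    return Answer.UNKNOWN


query = ("Toronto", "hasAirport", "YYZ")  # never asserted in FACTS
print(ask_closed_world(query).value)  # false
print(ask_open_world(query).value)    # unknown
```

Under the open world reading, new statements can be added later without contradicting earlier answers, which is why that approach lends itself to incremental development.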

Again, at the risk of being simplistic, the informal group tends to view many OWL and structured vocabularies, especially those that are large or complex, as over-engineered, constraining or limiting freedom. This group often correctly points to the delays and lack of adoption associated with more formal efforts. The informal group rarely speaks of ontologies, preferring the term vocabularies. In contrast, the formal group tends to view bottom-up efforts as chaotic, poorly structured and too heterogeneous to allow machine reasoning or interoperability. Some in the formal group advocate certification or prescribed training programs for ontologists.

Readers of this blog and customers of Structured Dynamics know that we more often focus on the formal world view and more specifically from an open world perspective. But, like human tribes or different cultures, there is no one true or correct way. Peaceful coexistence resides in the understanding of the importance and strength of different world views.

Shared communication is the way in which we, as humans, learn to understand and bridge cultural and tribal differences. These very same bases can be used to bridge the differences of world views for the semantic Web. Shared concepts and a way to communicate them (via a common language) — what I call reference structures [6] — are one potential “sweet spot” for bridging these views of the semantic Web [7].

Referring to Referents as Reference

According to Merriam Webster and Wikipedia, a reference is the intentional use of one thing, a point of reference or reference state, to indicate something else. When reference is intended, what the reference points to is called the referent. References are indicated by sounds (like onomatopoeia), pictures (like roadsigns), text (like bibliographies), indexes (by number) and objects (a wedding ring), but many other methods can be used intentionally as references. In language and libraries, references may include dictionaries, thesauri and encyclopedias. In computer science, references may include pointers, addresses or linked lists. In semantics, reference is generally construed as the relationships between nouns or pronouns and objects that are named by them.

The Building Blocks of Language

Structures, or syntax, enable multiple referents to be combined into more complex and meaningful (interpretable) systems. Vocabularies refer to the set of tokens or words available to act as referents in these structures. Controlled vocabularies attempt to limit and precisely define these tokens as a means of reducing ambiguity and error. Larger vocabularies increase richness and nuance of meaning for the tokens. Combined, syntax, grammar and vocabularies are the building blocks for constructing understandable human languages.

Many researchers believe that language is an inherent capability of humans, especially including children. Language acquisition is expressly understood to be the combined acquisition of syntax, vocabulary and phonetics (for spoken language). Language development occurs via use and repetition, in a social setting where errors are corrected and communication is a constant. Via communication and interaction we learn and discover nuance and differences, and acquire more complex understandings of syntax structures and vocabulary. The contact sport of communication is itself a prime source for acquiring the ability to communicate. Without the structure (syntax) and vocabulary acquired through this process, our language utterances are mere babblings.

Pidgin languages emerge when two parties try to communicate, but do not share the same language. Pidgin languages result in much simplified vocabularies and structure, which lead to frequent miscommunication. Small vocabularies and limited structure share many of these same limitations.

Communicating in an Evanescent Environment

Information theory, going back to Shannon, holds that the “fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point” [8]. This assertion applies to all forms of communication, from the electronic to animal and human language and speech.

Every living language is undergoing constant growth and change. Current events and culture are one driver of new vocabulary and constructs. We all know the apocryphal observation that northern peoples have many more words for snow, for example. Jargon emerges because specific activities, professions, groups, or events (including technical change) often have their own ideas to communicate. Slang is local or cultural usage that provides context and communication, often outside of “formal” or accepted vocabularies. These sources of environmental and other changes cause living languages to be constantly changing in terms of vocabulary and (also, sometimes) structure.

Natural languages become rich in meaning and names for entities to describe and discern things, from plants to people. When richness is embedded in structure, contexts can emerge that greatly aid removing ambiguity (“disambiguating”). Contexts enable us to discern polysemous concepts (such as bank for river, money institution or pool shot) or similarly named entities (such as whether Jimmy Johnson is a race car driver, football coach, or a local plumber). As with vocabulary growth, contexts sometimes change in meaning and interpretation over time. It is likely the Gay ’90s would not be used again to describe a cultural decade (1890s) in American history.

All this affirms what all of us know about human languages: they are dynamic and changing. Adaptable (living) languages require an openness to changing vocabulary and changing structure. The most dynamic languages also tend to be the most open to the coining of new terminology; English, for example, is estimated to have 25,000 new words coined each year [9].

The Semantic Web as a Human Language

One could argue that similar constructs must be present within the semantic Web to enable either machine or human understanding. At first blush this may sound a bit surprising: Isn’t one premise of the semantic Web machine-to-machine communications with “artificial intelligence” acting on our behalf in the background? Well, hmmm, OK, let’s probe that thought.

Recall there are different visions about what constitutes the semantic Web. In the most machine-oriented version, the machines are posited to replace some of what we already do and anticipate what we already want. Like Watson on Jeopardy, machines still need to know that Toronto is not an American city [10]. So, even with its most extreme interpretation — and one that is more extreme than my own view of the near-term semantic Web — machine-based communication still has these imperatives:

  • Humans, too, interact with data and need to understand it
  • Much of the data to be understood and managed is based on human text (unstructured), and needs to be adequately captured and represented
  • There is no basis to think that machine languages can be any simpler in representing the world than human languages.

These points suggest that machine languages, even in the most extreme machine-to-machine sense, still need to have a considerable capability akin to human languages. Of course, computer programming languages and data exchange languages as artificial languages need not read like a novel. In fact, most artificial languages have more constraints and structure limitations than human languages. They need to be read by machines with fixed instruction sets (that is, they tend to have fewer exceptions and heuristics).

But, even with software or data, people write and interact with these languages, and human readability is a key desirable aspect for modern artificial languages [11]. Further, there are some parts of software or data that also get expressed as labels in user interfaces or for other human factors. The admonition to Web page developers to “view source” is a frequent one. Any communication that is text based — as are all HTTP communications on the Web, including the semantic Web — has this readability component.

Though the form (structure) and vocabulary (tokens) of languages geared to machine use and understanding most certainly differ from that used by humans, that does not mean that the imperatives for reference and structure are excused. It seems evident that small vocabularies, differing vocabularies and small and incompatible structures have the same limiting effect on communications within the semantic Web as they do for human languages.

Yet, that being said, correcting today’s relative absence of reference and structure on the nascent semantic Web should not then mean an overreaction to a solution based on a single global structure. This is a false choice and a false dichotomy, belied by the continued diversity of human languages [12]. In fact, the best analog for an effective semantic Web might be human languages with their vocabularies, references and structures. Here is where we may find the clues for how we might improve the communications (interoperability) of the semantic Web.

A Call for Vehement Moderation

Freeform tagging and informal approaches are quick and adaptive. But, they lack context, coherence and a basis for interoperability. Highly engineered ontologies capture nuance and sophistication. But, they are difficult and expensive to create, lack adoption and can prove brittle. Neither of these polar opposites is “correct” and each has its uses and importance. Strident advocacy of either extreme alone is shortsighted and unsuited to today’s realities. There is not an ineluctable choice between freedom and formalism.

An inherently open and changing world with massive growth of information volumes demands a third way. Reference structures and vocabularies sufficient to guide (but not constrain) coherent communications are needed. Structure and vocabulary in an open and adaptable language can provide the communication medium. Depending on task, this language can be informal (RDF or data struct forms convertible to RDF) or formal (OWL). The connecting glue is provided by the reference vocabularies and structures that bound that adaptable language. This is the missing “sweet spot” for the semantic Web.

Just like human languages, these reference structures must be adaptable ones that can accommodate new learning, new ideas and new terminology. Yet, they must also have sufficient internal consistency and structure to enable their role as referents. And, they need to have a richness of vocabulary (with defined references) sufficient to capture the domain at hand. Otherwise, we end up with pidgin communications.

We can thus see a pattern emerging where informal approaches are used for tagging and simple datasets; more formal approaches are used for bounded domains and the need for precise semantics; and reference structures are used when we want to get multiple, disparate sources to communicate and interoperate. So long as these reference structures are coherent and designed for vocabulary expansion and accommodation for synonyms and other means for terminology mapping, they can adapt to changing knowledge and demands.
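As a rough illustration of how a reference structure with synonym accommodation lets disparate sources interoperate, consider the minimal sketch below. The concept names, labels and helper functions are hypothetical; a real reference structure such as UMBEL would carry far more concepts and richer mapping predicates.

```python
# Hypothetical reference concepts, each with the alternative labels it accepts.
REFERENCE_CONCEPTS = {
    "Automobile": {"automobile", "car", "auto", "motorcar"},
    "Person": {"person", "human", "individual"},
}


def build_label_index(concepts):
    """Invert the concept table so that every known label points to its concept."""
    index = {}
    for concept, labels in concepts.items():
        for label in labels:
            index[label.lower()] = concept
    return index


def map_terms(terms, index):
    """Map each local term to a reference concept; unmatched terms map to None."""
    return {term: index.get(term.strip().lower()) for term in terms}


index = build_label_index(REFERENCE_CONCEPTS)

# Two sources using different local vocabularies for the same ideas.
source_a = map_terms(["Car", "Human"], index)
source_b = map_terms(["motorcar", "individual", "gadget"], index)

# The concepts both sources can now communicate about.
shared = (set(source_a.values()) & set(source_b.values())) - {None}
print(sorted(shared))  # ['Automobile', 'Person']
```

New labels can be added to a concept without touching either source, which is the sense in which such a structure guides communication without constraining it. The unmatched term ("gadget") simply stays unmapped until the vocabulary expands to cover it.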

For too long there has been a misunderstanding and mischaracterization of anything that smacks of structure and referenceability as an attempt to limit diversity, impose control, or suggest some form of “One Ring to rule them all” organization of the semantic Web. Maybe that was true of other suggestions in the past, but it is far from the enabling role of reference structures advocated herein. This reaction to structure has something of the feeling of school children averse to their writing lessons taking over the classroom and then saying No! to more lessons. Rather than Lord of the Rings we get Lord of the Flies.

To try to overcome this misunderstanding, and to embrace the idea of language and communication for the semantic Web, I and others have tried in the past to find various analogies or imagery to describe the roles of these reference structures. (Again, all of those vagaries of human language and communication!) Analogies for these reference structures have included [13]:

  • backbones, to signal their importance as dependable structures upon which we can put “meat on the bones”
  • scaffoldings, to emphasize their openness and infrastructural role
  • roadmaps, as orienting and navigational frameworks for information
  • docking ports, as connection points for diverse datasets on the Web
  • forest paths, to signal common traversals but with much to discover once we step off the paths
  • infoclines, to represent the information interface between different world views,
  • and others.

What this post has argued is the analogy of reference structures to human language and communication. In this role, reference structures should be seen as facilitating and enabling. This is hardly a vision of constraints and control. The ability to articulate positions and ideas in fact leads to more diversity and freedom, not less.

To be sure, there is extra work in using and applying reference structures. Every child comes to know there is work in learning languages and becoming articulate in them. But, as adults, we also come to learn from experience the frustration that individuals with speech or learning impairments have when trying to communicate. Knowing these things, why do we not see the same imperatives for the semantic Web? We can only get beyond incoherent babblings by making the commitment to learn and master rich languages grounded in appropriate reference structures. We are not compelled to be inchoate; nor are our machines.

Yet, because of this extra work, it is also important that we develop and put in place semi-automatic [14] ways to tag and provide linkages to such reference structures. We have the tools and information extraction techniques available that will allow us to reference and add structure to our content in quick and easy ways. Now is the time to get on with it, and stop babbling about how structure and reference vocabularies may limit our freedoms.

This is the first of a two-part series on the importance and choice of reference structures (Part I) and gold standards (Part II) on the semantic Web.

[1] This is reflected well in a presentation from the NSF Workshop on DB & IS Research for Semantic Web and Enterprises, April 3, 2002, entitled “The Emergent, Semantic Web: Top Down Design or Bottom Up Consensus?”. This report defines top down as design and committee-driven; bottom up is more decentralized and based on social processes. Also, see Ralf Klischewski, 2003. “Top Down or Bottom Up? How to Establish a Common Ground for Semantic Interoperability within e-Government Communities,” pp. 17-26, in R. Traunmüller and M Palmirani, eds., E-Government: Modelling Norms and Concepts as Key Issues: Proceedings of 1st International Workshop on E-Government at ICAIL 2003, Bologna, Italy. Also, see David Weinberger, 2006. “The Case for Two Semantic Webs,” KM World, May 26, 2006; see http://www.kmworld.com/Articles/ReadArticle.aspx?ArticleID=15809.
[2] For a discussion about formalisms and the nature of the Web, see this early report by F.M. Shipman III and C.C. Marshall, 1994. “Formality Considered Harmful: Experiences, Emerging Themes, and Directions,” Xerox PARC Technical Report ISTL-CSA-94-08-02, 1994; see http://www.csdl.tamu.edu/~shipman/formality-paper/harmful.html.
[3] Others have posited contrasting styles, most often as “top down” v. “bottom up.” However, in one interpretation of that distinction, “top down” means a layer on top of the existing Web; see further, A. Iskold, 2007. “Top Down: A New Approach to the Semantic Web,” in ReadWrite Web, Sept. 20, 2007. The problem with this terminology is that it offers a completely different sense of “top down” from traditional uses. In Iskold’s argument, his “top down” is a layering on top of the existing Web. On the other hand, “top down” is more often understood in the sense of a “comprehensive, engineered” view, consistent with [1].
[4] See M. K. Bergman, 2009. The Open World Assumption: Elephant in the Room, December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems.
The closed world assumption (CWA) is a key underpinning to most standard relational data systems and enterprise schema and logics. CWA is the logic assumption that what is not currently known to be true, is false. For semantics-related projects there is a corollary problem to the use of CWA which is the need for upfront agreement on what all predicates “mean”, which is difficult if not impossible in reality when different perspectives are the explicit purpose for the integration.
[5] See M.K. Bergman, 2010. “Two Contrasting Styles for the Semantic Enterprise,” AI3:::Adaptive Information blog post, February 15, 2010. See http://www.mkbergman.com/866/two-contrasting-styles-for-the-semantic-enterprise/.
[6] I first used the term in passing in M.K. Bergman, 2007. “An Intrepid Guide to Ontologies,” AI3:::Adaptive Information blog post, May 16, 2007. See http://www.mkbergman.com/374/an-intrepid-guide-to-ontologies/, then more fully elaborated the idea in “Where are the Road Signs for the Structured Web,” AI3:::Adaptive Information blog post, May 29, 2007. See http://www.mkbergman.com/375/where-are-the-road-signs-for-the-structured-web/.
[7] See Catherine C. Marshall and Frank M. Shipman, 2003. “Which Semantic Web?,” in Proceedings of ACM Hypertext 2003, pp. 57-66, August 26-30, 2003, Nottingham, United Kingdom; http://www.csdl.tamu.edu/~marshall/ht03-sw-4.pdf, for a very different (but still accurate and useful) way to characterize the “visions” for the semantic Web. In this early paper, the authors posit three competing visions: 1) the development of standards, akin to libraries, to bring order to digital documents; this is the vision they ascribe to the W3C and has been largely adopted via use of URIs as identifiers, and languages such as RDF and OWL; 2) a vision of a globally distributed knowledge base (which they characterize as being Tim Berners-Lee’s original vision, with examples being Cyc or Apple’s (now disbanded) Knowledge Navigator); and 3) a vision of an infrastructure for the coordinated sharing of data and knowledge.
[8] See Claude E. Shannon’s classic paper “A Mathematical Theory of Communication” in the Bell System Technical Journal in July and October 1948.
[9] This reference is from the Wikipedia entry on the English language: Kister, Ken. “Dictionaries defined.” Library Journal, 6/15/92, Vol. 117 Issue 11, p 43.
[10] See http://www-943.ibm.com/innovation/us/watson/related-content/toronto.html, or simply do a Web search on “watson toronto jeopardy” (no quotes).
[11] Readability is important because programmers spend the majority of their time reading, trying to understand and modifying existing source code, rather than writing new source code. Unreadable code often leads to bugs, inefficiencies, and duplicated code. It has been known for at least three decades that a few simple readability transformations can make code shorter and drastically reduce the time to understand it. See James L. Elshoff and Michael Marcotty, 1982. “Improving Computer Program Readability to Aid Modification,” Communications of the ACM, v.25 n.8, p. 512-521, Aug 1982; see http://doi.acm.org/10.1145/358589.358596. From the Wikipedia entry on Readability.
[12] According to the Wikipedia entry on Language, there are an estimated 3000 to 6000 active human languages presently in existence.
[13] The forest path analogy comes from Atanas Kiryakov of Ontotext. The remaining analogies come from M.K. Bergman in his AI3:::Adaptive Information blog: “There’s Not Yet Enough Backbone,” May 1, 2007 (backbone); “The Role of UMBEL: Stuck in the Middle with you …,” May 11, 2008 (infocline, scaffolding and docking port); “Structure Paves the Way to the Semantic Web,” May 3, 2007 (roadmap).
[14] Semi-automatic methods attempt to apply as much automated screening and algorithmic- or rules-based scoring as possible, and then allow the final choices to be arbitrated by humans. Fully automated systems, particularly involving natural language processing, are not yet acceptable because of (small, but) unacceptably high error rates in precision. The best semi-automated approaches handle all tasks that are rote or error-free, and then limit the final choices to those areas where unacceptable errors are still prevalent. As time goes on, more of these areas can be automated as algorithms, heuristics and methodologies improve. Eventually, of course, this may lead to fully automated approaches.
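The semi-automatic approach described in note [14] can be sketched as a simple triage step: automated scoring settles the clear cases, and only the uncertain middle band goes to a human. This is a minimal illustration only; the function name, the scores and both thresholds are hypothetical, and in practice the scores would come from an information extraction or tagging system.

```python
from typing import Iterable, List, Tuple


def triage(
    candidates: Iterable[Tuple[str, float]],
    accept_at: float = 0.9,      # hypothetical threshold for automatic acceptance
    reject_below: float = 0.3,   # hypothetical threshold for automatic rejection
) -> Tuple[List[str], List[str], List[str]]:
    """Split scored tag candidates into accepted, needs-review and rejected."""
    accepted, needs_review, rejected = [], [], []
    for tag, score in candidates:
        if score >= accept_at:
            accepted.append(tag)        # confident enough to apply automatically
        elif score < reject_below:
            rejected.append(tag)        # confident enough to discard automatically
        else:
            needs_review.append(tag)    # left for a human to arbitrate
    return accepted, needs_review, rejected


# Made-up candidate tags with made-up confidence scores.
candidates = [("Semantic Web", 0.97), ("Ontology", 0.62), ("Banana", 0.05)]
accepted, needs_review, rejected = triage(candidates)
print(accepted)      # ['Semantic Web']
print(needs_review)  # ['Ontology']
print(rejected)      # ['Banana']
```

As algorithms improve, the two thresholds can be moved toward each other so that fewer candidates need human review.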

Posted by AI3's author, Mike Bergman

Posted on February 21, 2011 at 2:27 am in Semantic Web, Structured Web, UMBEL | Comments (3)
The URI link reference to this post is: http://www.mkbergman.com/946/seeking-a-semantic-web-sweet-spot/
The URI to trackback this post is: http://www.mkbergman.com/946/seeking-a-semantic-web-sweet-spot/trackback/
Date:   February 7, 2011

Sweet Tools Listing

Now Presented as a Semantic Component; Grows to 900+ Tools

Sweet Tools, AI3’s listing of semantic Web and -related tools, has just been released with its 17th update. The listing now contains more than 900 tools, about a 10% increase over the last version. Significantly the listing is also now presented via its own semantic tool, the structSearch sComponent, which is one of the growing parts to Structured Dynamics’ open semantic framework (OSF).

So, we invite you to go ahead and try out this new Flex/Flash version with its improved search and filtering! We’re pretty sure you’ll like it.

Summary of Major Changes

Sweet Tools now lists 919 tools, an increase of 84 (or 10.1%) over the prior version of 835 tools. The most notable trend is the continued increase in capabilities and professionalism of (some of) the new tools.

This new release of Sweet Tools — available for direct play and shown in the screenshot to the right — is the first to be presented via Structured Dynamics’ Flex-based semantic component technology. The system has greatly improved search and filtering capabilities; it also shares the superior dataset management and import/export capabilities of its structWSF brethren.

As a result, moving forward, Sweet Tools updates will now be added on a more regular basis, reducing the big burps that past releases have tended to follow. We will also see much expanded functionality over time as other pieces of the structWSF and sComponents stack get integrated and showcased using this dataset.

This release is the first in WordPress, and shows the broad capabilities of the OSF stack to be embedded in a variety of CMS or standalone systems. We have provided some updates on Structured Dynamics’ OSF TechWiki for how to modify, embed and customize these components with various Flex development frameworks (see one, two or three), such as Flash Builder or FlashDevelop.

We should mention that the OSF code group is also seeing external parties exposing these capabilities via JavaScript deployments as well. This recent release expands on the conStruct version with its capabilities described in a post about a year ago.

Retiring the Exhibit Version

However, this release does mark the retirement of the very fine Exhibit version of Sweet Tools (an archive version will be kept available until it gets too long in the tooth). I was one of the first to install a commercial Exhibit system, and the first to do so on WordPress, as I described in an article more than four years ago.

Exhibit has worked great and without a hitch, and through a couple of upgrades. It still has (I think) a superior faceting system and sorting capabilities to what we presently offer with our own sComponent alternative. However, the Exhibit version is really a display technology alone, and offers no search, access control or underlying data management capabilities (such as CRUD), all of which are integral to our current system. It is also not grounded in RDF or semantic technologies, though it does have good structural genes. And, Sweet Tools has about reached the limits of the size of datasets Exhibit can handle efficiently.

Exhibit has set a high bar for usability and lightweight design. As we move in a different direction, I’d like again to publicly thank David Huynh, Exhibit’s developer, and the MIT Simile program for when he was there, for putting forward one of the seminal structured data tools of the past five years.

Updated Statistics

The updated Sweet Tools listing now includes nearly 50 different tools categories. The most prevalent categories are browser tools (RDF, OWL), information extraction, ontology tools, parsers or converters, and general RDF tools. The relative share by category is shown in this diagram:

Sweet Tools Applications

Since the last listing, the fastest growing categories have been utilities (general and RDF) and visualization. Linked data listings have also grown by 200%, but are still a relatively small percentage of the total.

These values should be taken with a couple of grains of salt. First, not all of these additions are organic or new releases. Some are the result of our own tools efforts and investigations, which can often surface prior overlooked tools. Also, even with this large number of application categories, many tools defy characterization, and can reside in multiple categories at once or are even pointing to new ones. So, the splits are illustrative, but not defining.

General language percentages have been keeping pretty constant over the past couple of years. Java remains the leading language with nearly half of all applications, a percentage it has kept steady for four years. PHP continues to grow in popularity, and actually increased the largest percentage amount of any language over this past census. The current language splits are shown in the next diagram:

Sweet Tools Languages

C/C++ and C# have really not grown at all over the past year. Again, however, for the reasons noted, these trends should be interpreted with care.

Dogfood Never Tasted So Good

Tools development is hard and the open source nature of today’s development tends to require a certain critical mass of developer interest and commitment. There are some notable tools that have much use and focus and are clearly professional and industrial grade. Yet, unfortunately, too many of the tools on the Sweet Tools listing are either proofs-of-concept, academic demos, or largely abandoned because of lack of interest by the original developer, the community or the market as a whole.

There is a common statement within the community about how important it is for developers to “eat their own dogfood.” On the face of it, this makes some sense since it conveys a commitment to use and test applications as they are developed.

But looked at more closely, this sentiment carries with it a troublesome reflection of the state of (many) tools within the semantic Web: too much kibble that is neither attractive nor tasty. It is probably time to keep the dogfood in the closet and focus on well-cooked and attractive fare.

We at Structured Dynamics are not trying to hold ourselves up as exemplars or the best chefs of tasty food. We do, however, have a commitment to produce fare that is well prepared and professional. Let’s stop with the dogfood and get on with serving nutritious and balanced fare to the marketplace.

Posted by AI3's author, Mike Bergman

Posted on February 7, 2011 at 1:47 am in Open Source, Semantic Web Tools, Structured Web | Comments (1)
The URI link reference to this post is: http://www.mkbergman.com/942/tasty-new-sweet-tools-release/
The URI to trackback this post is: http://www.mkbergman.com/942/tasty-new-sweet-tools-release/trackback/
Date:   November 26, 2010

There’s an Endless Variety of World Views, and Almost as Many Ways to Organize and Describe Them

Ontology is one of the more daunting terms for those exposed for the first time to the semantic Web. Not only is the word long and without many common antecedents, but it is also a term that has widely divergent use and understanding within the community. It can be argued that this not-so-little word is one of the barriers to mainstream understanding of the semantic Web.

The root of the term is the Greek ontos, or being or the nature of things. Literally — and in classical philosophy — ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined ontology as a “formal specification of a conceptualization.”

While there have been attempts to strap on more or less formal understandings or machinery around ontology, it still has very much the sense of a world view, a means of viewing and organizing and conceptualizing and defining a domain of interest. As is made clear below, I personally prefer a loose and embracing understanding of the term (consistent with Deborah McGuinness’s 2003 paper, Ontologies Come of Age [1]).

There has been a resurgence of interest in ontologies of late. Two reasons have been the emergence of Web 2.0 and tagging and folksonomies, as well as the nascent emergence of the structured Web. In fact, on April 23-24 one of the noted communities of practice around ontologies, Ontolog, sponsored the Ontology Summit 2007, “Ontology, Taxonomy, Folksonomy: Understanding the Distinctions.”

These events have sparked my preparing this guide to ontologies. I have to admit this is a somewhat intrepid endeavor given the wealth of material and diversity of opinions.

This Friday brown bag leftover was first placed into the AI3 refrigerator more than three years ago on May 16, 2007. This reprise is unchanged since its original posting, though there is a more recent executive-level intro to ontologies on the OpenStructs TechWiki.

Overview and Role of Ontologies

Of course, a fancy name is not sufficient alone to warrant an interest in ontologies. There are reasons why understanding, using and manipulating ontologies can bring practical benefit:

  • Depending on their degree of formalism (an important dimension), ontologies help make explicit the scope, definition, and language and meaning (semantics) of a given domain or world view
  • Ontologies may provide the power to generalize about their domains
  • Ontologies, if hierarchically structured in part (and not all are), can provide the power of inheritance
  • Ontologies provide guidance for how to correctly “place” information in relation to other information in that domain
  • Ontologies may provide the basis to reason or infer over their domains (again as a function of their formalism)
  • Ontologies can provide a more effective basis for information extraction or content clustering
  • Ontologies, again depending on their formalism, may be a source of structure and controlled vocabularies helpful for disambiguating context; they can inform and provide structure to the “lexicons” in particular domains
  • Ontologies can provide guiding structure for browsing or discovery within a domain, and
  • Ontologies can help relate and “place” other ontologies or world views in relation to one another; in other words, ontologies can organize ontologies from the most specific to the most abstract.
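
As a rough illustration of the inheritance point in the list above, here is a minimal Python sketch. It assumes a hypothetical three-concept hierarchy; the concept names, the `PARENT` and `PROPERTIES` tables and the `inherited_properties` helper are all invented for this example and do not come from any ontology discussed here.

```python
# Hypothetical toy hierarchy: each concept names its single parent (None marks the root).
PARENT = {
    "Thing": None,
    "Organism": "Thing",
    "Mammal": "Organism",
}

# Properties asserted directly on each concept (illustrative names only).
PROPERTIES = {
    "Thing": {"has_name"},
    "Organism": {"has_habitat"},
    "Mammal": {"has_gestation_period"},
}


def inherited_properties(concept):
    """Collect properties asserted on the concept and on every ancestor."""
    collected = set()
    while concept is not None:
        collected |= PROPERTIES.get(concept, set())
        concept = PARENT[concept]  # walk one step up the hierarchy
    return collected


print(sorted(inherited_properties("Mammal")))
```

A property stated once near the top of the hierarchy is available to everything beneath it, which is the practical payoff of hierarchical structure.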

Both structure and formalism are dimensions for classifying ontologies, which combined are often referred to as an ontology’s “expressiveness.” How one describes this structure and formality differs. One recent attempt is this figure from the Ontology Summit 2007’s wrap-up communique:

Ontology Summit 2007 Communique Diagram

Note the bridging role that an ontology plays between a domain and its content. (By its nature, every ontology attempts to “define” and bound a domain.) Also note that the Summit’s 50 or so participants were focused on the trade-off between semantics v. pragmatic considerations. This was a result of the ongoing attempts within the community to understand, embrace and (possibly) legitimize “less formal” Web 2.0 efforts such as tagging and the folksonomies that can result from them.

There is an M.C. Escher-like recursion of the lizard eating its tail when one observes ontologists creating an ontology to describe the ontological domain. The above diagram, which itself would be different with a slight change in Summit participation or editorship, is, of course, but one representative view of the world. Indeed, a tremendous variety of scientific and research disciplines concern themselves with classifying and organizing the “nature of things.” Their practitioners go by such names as logicians, taxonomists, philosophers, information architects, computer scientists, librarians, operations researchers, systematicists, statisticians, historians, and so forth. (In short, given our ontos, every area of human endeavor has the urge to classify, to organize.) In each of these areas not only do the domains differ, but so do the structures and classification schemes commonly adopted.

There are at least 40 terms or concepts across these various disciplines, most related to Web and general knowledge content, that have organizational or classificatory aspects that — loosely defined — could be called an “ontology” framework or approach:

Actual domains or subject coverage are then mostly orthogonal to these approaches.

Loosely defined, the number of possible ontologies is therefore close to infinite: domain × perspective × schema. (Just kidding, sort of! In fact, UMBC’s Swoogle ontology search service claims 10,000 ontologies presently on the Web; the actual data from August 2006 ranges from about 16,000 to 92,000 ontologies, depending on how “formal” the definition. These counts are also limited to OWL-based ontologies.)

Many have misunderstood the semantic Web because of this diversity and the slipperiness of the concept of an ontology. This misunderstanding becomes flat wrong when people claim the semantic Web implies one single grand ontology or organizational schema, One Ring to Rule Them All. Human and domain diversities make this viewpoint patently false.

Diversity, ‘Naturalness’ and Change

The choice of an ontological approach to organize Web and structured content can be contentious. Publishers and authors perhaps have too many choices: from straight Atom or RSS feeds and feeds with tags to informal folksonomies and then Outline Processor Markup Language or microformats. From there, the formalism increases further to include the standard RDF ontologies such as SIOC (Semantically-Interlinked Online Communities), SKOS (Simple Knowledge Organization System), DOAP (Description of a Project), and FOAF (Friend of a Friend) and the still greater formalism of OWL’s various dialects.

Arguing which of these is the theoretical best method is doomed to failure, except possibly in a bounded enterprise environment. We live in the real world, where multiple options will always have their advocates and their applications. All of us should welcome whatever structure we can add to our information base, no matter where it comes from or how it’s done. The sooner we can embrace content in any of these formats and convert it to a canonical form, the sooner we can move on to needed developments in semantic mediation, the threshold condition for the semantic Web.

So, diversity is inevitable and should be accepted. But that observation need not also embrace chaos.

In my early training in biological systematics, Ernst Haeckel’s recapitulation theory that “ontogeny recapitulates phylogeny” (note the same ontos root, the difference from ontology being growth v. study) was losing favor fast. The theory was that the development of an organism through its embryological phases mirrors its evolutionary history. Today, modern biologists recognize numerous connections between ontogeny and phylogeny, explain them using evolutionary theory, or view them as supporting evidence for that theory.

Yet, as in the construction of phylogenetic trees, systematicists strive for their classifications of the relatedness of organisms to be “natural”, to reflect the true nature of the relationship. Thus, over time, that understanding of a “natural” system has progressed from appearance → embryology → embryology + detailed morphology → species and interbreeding → DNA. While details continue to be worked out, the degree of genetic relatedness is now widely accepted by biologists as a “natural” basis for organizing the Tree of Life.

It is not unrealistic to also seek “naturalness” in the organization of other knowledge domains, to seek “naturalness” in the organization of their underlying ontologies. Like natural systems in biology, this naturalness should emerge from the shared understandings and perceptions of the domain’s participants. While subject matter expertise and general and domain knowledge are essential to this development, they are not the only factors. As tagging systems on the Web are showing, common usage and broad acceptance by the community at hand is important as well.

While it may appear that a domain such as the biological relatedness of organisms is more empirical than the concepts and ambiguous words in most domains of human endeavor, these attempts at naturalness are still not foolish. The phylogeny example shows that understanding changes over time as knowledge is gained. We now accept DNA over the recapitulation theory.

As the formal SKOS organizational schema for knowledge systems recognizes (see below), the ideas of narrower and broader concepts can be readily embraced, as well as concepts of relatedness and aliases (synonyms). These simple constructs, I would argue, plus the application of knowledge being gained in related domains, will enable tomorrow’s understandings to be more “natural” than today’s, no matter the particular domain at hand.
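
The simple constructs just named (broader and narrower concepts, relatedness, and aliases) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CONCEPTS` table and the `narrower` and `resolve` helpers are hypothetical, and the dictionary keys merely echo the SKOS properties `skos:broader`, `skos:related` and `skos:altLabel` rather than implementing SKOS itself.

```python
# Toy concept scheme. Keys loosely mirror skos:broader, skos:related and skos:altLabel.
CONCEPTS = {
    "animal": {"broader": [], "related": [], "alt_labels": ["fauna"]},
    "mammal": {"broader": ["animal"], "related": [], "alt_labels": []},
    "dog": {"broader": ["mammal"], "related": ["wolf"], "alt_labels": ["canine", "hound"]},
    "wolf": {"broader": ["mammal"], "related": ["dog"], "alt_labels": []},
}


def narrower(concept):
    """Derive narrower concepts as the inverse of the stored broader links."""
    return sorted(name for name, entry in CONCEPTS.items() if concept in entry["broader"])


def resolve(label):
    """Map a preferred label or an alias (synonym) to its concept, or None if unknown."""
    for name, entry in CONCEPTS.items():
        if label == name or label in entry["alt_labels"]:
            return name
    return None


print(narrower("mammal"))
print(resolve("hound"))
```

Only the broader links are stored; narrower links are computed from them, so the two directions cannot drift out of step.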

So, in seeking a “naturalness” within our organizational schema, we can also see that change is a constant. We also see that the tools and ideas underlying the seemingly abstract cause of merging and relating existing ontologies to one another will further a greater “naturalness” within our organizations of the world.

A Spectrum of Formalisms

According to the Summit, expressiveness is the extent and ease by which an ontology can describe domain semantics. Structure they define as the degree of organization or hierarchical extent of the ontology. They further define granularity as the level of detail in the ontology. And, as the diagram above alludes, they define other dimensions of use, logical basis, purpose and so forth of an ontology.

The over fifty respondents from 42 communities submitted some 70 different ontologies under about 40 terms to a survey that was used by the Summit to construct their diagram. These submissions included:

. . . formal ontologies (e.g., BFO, DOLCE, SUMO), biomedical ontologies (e.g., Gene Ontology, SNOMED CT, UMLS, ICD), thesauri (e.g., MeSH, National Agricultural Library Thesaurus), folksonomies (e.g., Social bookmarking tags), general ontologies (WordNet, OpenCyc) and specific ontologies (e.g., Process Specification Language). The list also includes markup languages (e.g., NeuroML), representation formalisms (e.g., Entity-Relation model, OWL, WSDL-S) and various ISO standards (e.g., ISO 11179). This [Ontolog] sample clearly illustrates the diversity of artifacts collected under “ontology”.

I think the simplest spectrum for such distinctions is the formalism of the ontology and its approach (and language or syntax, not further discussed here). More formal ontologies have greater expressiveness and structure and inferential power, less formal ones the opposite. Constructing more formal ontologies is more demanding, and takes more effort and rigor, resulting in an approach that is more powerful but also more rigid and less flexible. Like anything else, there are always trade-offs.

Based on work by Leo Obrst of Mitre as interpreted by Dan McCreary, we can view this as a trade-off between semantic clarity and the time and money required to construct the formalism [12, 13]:

Structure and Formalism Increases Semantic Expressiveness

Note this diagram reflects the more conventional, practitioner’s view of the “formal” ontology, which does not include taxonomies or controlled vocabularies (for example) in the definition. This represents the more “closely defined” end of the ontology (semantic) spectrum.

However, since we are speaking here of ontologies and the structured Web or the semantic Web, I believe we need to embrace a concept of ontology aligned to actual practice. Not all content providers can or want to employ ontology engineers to enable formal inferencing of their content. Yet, on the other hand, their content in its various forms does have some meaningful structure, some organization. The trick is to extract this structure for more meaningful use such as data exchange or data merging.

Ontology Approaches on the Web

Under such “loosely defined” bases we can thus see a spectrum of ontology approaches on the Web, proceeding from less structure and formalism to more so:

Type or Schema | Examples | Comments on Structure and Formalism
Standard Web Page | entire Web | General metadata fields in the <head> and internal HTML codes and tags provide minimal, but useful sources of structure; other HTTP and retrieval data can also contribute
Blog / Wiki Page | examples from Technorati, Bloglines, Wikipedia | Provides still greater formalism for the organization and characterization of content (subjects, categories, posts, comments, date/time stamps, etc.). Importantly, with the addition of plug-ins, some of the basic software may also provide other structured characterizations or output (SIOC, FOAF, etc.; highly varied and site-specific given the diversity of publishing systems and plug-ins)
RSS / Atom feeds | most blogs and most news feeds | RSS extends basic XML schema for more robust syndication of content with a tightly controlled vocabulary for feed concepts and their relationships. Because of its ubiquity, this is becoming a useful baseline of structure and formalism; also, the nature of adoption shows much about how ontological structure is an artifact, not driver, for use
RSS / Atom feeds with tags or OPML | Grazr; most newsfeed aggregators can import and export OPML lists of RSS feeds | The OPML specification defines an outline as a hierarchical, ordered list of arbitrary elements. The specification is fairly open, which makes it suitable for many types of list data. See also OML and XOXO
Hierarchical Faceted Metadata | XFML, Flamenco | These and related efforts from the information architecture (IA) community are geared more to library science. However, they directly contribute to faceted browsing, which is one of the first practical instantiations of the semantic Web
Folksonomies | Flickr, del.icio.us | Based on user-generated tags and informal organizations of the same; not linked to any “standard” Web protocols. Both tags and hierarchical structure are arbitrary, but some researchers now believe that, over large enough participant sets, structural consensus and value do emerge
Microformats | Example formats include hAtom, hCalendar, hCard, hReview, hResume, rel-directory, xFolk, XFN and XOXO | A microformat is HTML markup to express semantics with strictly controlled vocabularies. This markup is embedded using specific HTML attributes such as class, rel, and rev. This method is easy to implement and understand, but is not free-form
Embedded RDF | RDFa, eRDF | An embedded format, like microformats, but free-form, and not subject to the approval strictures associated with microformats
Topic Maps | Infoloom, Topic Maps Search Engine | A topic map can represent information using topics (representing any concept, from people, countries, and organizations to software modules, individual files, and events), associations (which represent the relationships between them), and occurrences (which represent relationships between topics and information resources relevant to them)
RDF | Many; DBpedia, etc. | RDF has become the canonical data model since it represents a “universal” conversion format
RDF Schema | SKOS, SIOC, DOAP, FOAF | RDFS or RDF Schema is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources. This becomes the canonical ontology common meeting ground
OWL Lite, OWL DL, OWL Full | Here are some existing OWL ontologies; also see Swoogle for OWL search facilities | The Web Ontology Language (OWL) is a language for defining and instantiating Web ontologies. An OWL ontology may include descriptions of classes, along with their related properties and instances. OWL is designed for use by applications that need to process the content of information instead of just presenting information to humans. It facilitates greater machine interpretability of Web content than that supported by XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. The three language versions are in order of increasing expressiveness
Higher-order “formal” and “upper-level” ontologies | SUMO, DOLCE, PROTON, BFO, Cyc, OpenCyc | These provide comprehensive ontologies and often related knowledge bases, with the goal of enabling AI applications to perform human-like reasoning. Their reasoning languages often use higher-order logics

As a rule of thumb, items that are less “formal” can be converted to a more formal expression, but the most formal forms can generally not be expressed in less formal forms.
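
To show what converting a less formal item into a more formal expression might look like, here is a hedged Python sketch that turns one RSS item into subject-predicate-object triples. The sample feed, the `PREDICATES` mapping and the `item_to_triples` helper are assumptions made for this example; the `dc:` names are used only as readable labels, and a real conversion would emit proper RDF with full URIs.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 fragment used only for illustration.
RSS_SNIPPET = """<rss version="2.0"><channel>
  <item>
    <title>Sample post</title>
    <link>http://example.org/posts/1</link>
    <category>ontology</category>
    <category>semantic web</category>
  </item>
</channel></rss>"""

# Assumed mapping from RSS element names to predicate labels.
PREDICATES = {"title": "dc:title", "category": "dc:subject"}


def item_to_triples(item):
    """Use the item's link as the subject and emit one triple per mapped child element."""
    subject = item.findtext("link")
    triples = []
    for child in item:
        predicate = PREDICATES.get(child.tag)
        if predicate and child.text:
            triples.append((subject, predicate, child.text.strip()))
    return triples


root = ET.fromstring(RSS_SNIPPET)
triples = [t for item in root.iter("item") for t in item_to_triples(item)]
for triple in triples:
    print(triple)
```

Going the other way is lossy: collapsing triples back into a flat feed would drop any statement that has no slot in the feed vocabulary, which is the asymmetry the rule of thumb describes.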

As later sections elaborate, I see RDF as the universal data model for representing this structure into a common, canonical format, with RDF Schema (specifically SKOS, but also supplemented by FOAF, DOAP and SIOC) as the organizing ontology knowledge representation language (KRL).

This is not to say that the various dialects of OWL should be neglected. In bounded environments, they can provide superior reasoning power and are warranted if they can be sufficiently mandated or enforced. But the RDF and RDF-S systems represent the most tractable “meeting place” or “middle ground,” IMHO.

Still-Another “Level” of Ontologies

As if the formalism dimension were not complicated enough, there is also the practice within the ontology community to characterize ontologies by “levels”, specifically upper, middle and lower levels. For example, chances are that you have heard particularly of “upper-level” ontologies.

The following figure helps illustrate this “level” dimension. This diagram is also from Leo Obrst of Mitre [12], and was also used in another 2006 talk by Jack Park and Patrick Durusau (discussed further below for other reasons):

Ontology Levels

Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper levels is more akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus; that is, real ontos stuff) than to “generic common knowledge.” Most all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak [2].

The above diagram conveys a sense of how multiple ontologies can relate to one another both in terms of narrower and broader topic matter and at the same “levels” of generalization. Such “meta-structure” (if you will) can provide a reference structure for relating multiple ontologies to one another.

It is exactly in such bindings or relationships that we can foresee the promise of querying and relating multiple endpoints on the Web with accurate semantics in order to connect dots and combine knowledge bases. Thus, the understanding of the relationships and mappings amongst ontologies becomes a critical infrastructural component of the semantic Web.

The SUMO Example

We can better understand these mapping and inter-relationship concepts by using a concrete example with a formal ontology. We’ll choose to use the Suggested Upper Merged Ontology simply because it is one of the best known. We could have also selected another upper-level system such as PROTON [3] or Cyc [4] or one of the many with narrower concept or subject coverage.

SUMO is one of the formal ontologies that has been mapped to the WordNet lexicon, which adds to its semantic richness. SUMO is written in the SUO-KIF language. SUMO is free and owned by the IEEE. The ontologies that extend SUMO are available under GNU General Public License.

The abstract, conceptual organization of SUMO is shown by this diagram, which also points to its related MILO (MId-Level Ontology), which is being developed as a bridge between the abstract content of the SUMO and the richer detail of various domain ontologies:

At this level, the structure is quite abstract. But one can easily browse the SUMO structure. A nifty tool to do so is the KSMSA (Knowledge Support for Modeling and Simulation) ontology browser. Using a hierarchical tree representation, you can navigate through SUMO, MILO, WordNet, and (with the locally installed version) Wikipedia.

The figure below shows the upper-level entity concept on the left; the right-hand panel shows a drill-down into the example atom entity:

Example SUMO Categories

These views may be a bit misleading because the actual underlying structure, while it has hierarchical aspects as shown here, really is in the form of a directed acyclic graph (showing other relatedness options, not just hierarchical ones). So, alternate visualizations include traditional network graphs.
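
The difference between a strict tree and a directed acyclic graph is that a node may have more than one parent. A minimal Python sketch of that idea follows; the node names and the `PARENTS` table are hypothetical and are not taken from the actual SUMO structure.

```python
# Hypothetical fragment: each node lists its direct parents, and a node may have several.
PARENTS = {
    "Thing": [],
    "PhysicalThing": ["Thing"],
    "InformationThing": ["Thing"],
    "Book": ["PhysicalThing", "InformationThing"],  # two parents, so not a pure tree
}


def ancestors(node):
    """Return every node reachable by following parent links upward."""
    seen = set()
    stack = list(PARENTS[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


print(sorted(ancestors("Book")))
```

A tree browser has to show "Book" under one branch or duplicate it under both, which is why such views can understate the relatedness actually present in the graph.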

The other thing to note is that the “things” covered in the ontology, the entities, are also fairly abstract. That is because the intention of a standard “upper-level” ontology is to cover all relevant knowledge aspects of each entity’s domain. This approach results in a subject and topic coverage that feels less “concrete” than the coverage in, say, an encyclopedia, directory or card catalog.

Ontology Binding and Integration Mechanisms

According to Park and Durusau, upper ontologies are diverse, middle ontologies are even more diverse, and lower ontologies are more diverse still. A key observation is that ontological diversity is a given and increases as we approach real user interaction levels. Moreover, because of the “loose” nature of ontologies on the Web (now and into the future), diversity of approach is a further key factor.

Recall the initial discussion on the role and objectives of ontologies. About half of those roles involve effectively accessing or querying more than one ontology. The objective of “upper-level” ontologies, many with their own binding layers, is also expressly geared to ontology integration or federation. So, what are the possible mechanisms for such binding or integration?

A fundamental distinction within mechanisms to combine ontologies is whether it is a unified or centralized approach (often imposed or required by some party) or whether it is a schema mapping or binding approach. We can term this distinction centralized v. federated.

Centralized Approaches

Centralized approaches can take a number of forms. At the most extreme, adherence to a centralized approach can be contractual. At the other end are reference models or standards. For example, illustrative reference models include:

  • the Data Reference Model (DRM), one of the five reference models of the Federal Enterprise Architecture (FEA)
  • UDEF (Unified Data Element Framework), an approach toward a unified description framework, or
  • the eXtended MetaData Registry (XMDR) project.

Though I have argued that One Ring to Rule them All is not appropriate to the general Web, there may be cases within certain enterprises or where through funding clout (such as government contracts), some form of centralized approach could be imposed [5]. And, frankly, even where compliance cannot be assured, there are advantages in economy, efficiency and interoperability to attempt central ontologies. Certain industries (notably pharmaceuticals and petrochemicals) and certain disciplines (such as some areas of biology, among others) have through trade associations or community consensus done admirable jobs in adopting centralized approaches.

Federated Approaches

However, combining ontologies in the context of the broader Internet is more likely through federated approaches. (Though federated approaches can also be improved when there are consensual standards within specific communities.) The key aspect of a federated approach is to acknowledge that multiple schema need to be brought together, and that each contributing data set and its schema will not be altered directly and will likely remain in place.
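
The key aspect described above, bringing multiple schemas together without altering the contributing data sets, can be sketched as a mapping layer. This Python example is an assumption-laden illustration: the two sources, their field names, the shared terms `name` and `page`, and the `federated_view` helper are all invented for this sketch.

```python
# Two hypothetical sources describing people with different field names.
SOURCE_A = [{"fullName": "Ada Lovelace", "homepage": "http://example.org/ada"}]
SOURCE_B = [{"name": "Alan Turing", "url": "http://example.org/alan"}]

# Per-source mappings from local field names to shared terms (assumed vocabulary).
MAPPINGS = {
    "A": {"fullName": "name", "homepage": "page"},
    "B": {"name": "name", "url": "page"},
}


def federated_view(sources):
    """Build a combined view using shared terms; the source records are never modified."""
    view = []
    for source_id, records in sources.items():
        mapping = MAPPINGS[source_id]
        for record in records:
            view.append({mapping[f]: v for f, v in record.items() if f in mapping})
    return view


merged = federated_view({"A": SOURCE_A, "B": SOURCE_B})
for row in merged:
    print(row)
```

All of the integration knowledge lives in `MAPPINGS`, so each source can stay in place and keep its own schema.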

Thus, the key distinctions within this category are the mechanisms by which those linkages may take place. An important goal in any federated approach is to achieve interoperability at the data or instance level without unacceptable loss of information or corruption of the semantics. Numerous specific approaches are possible, but three example areas (RDF-topic map interoperability, the use of “subject maps,” and binding layers) can illustrate some of the issues at hand.

In 2006 the W3C set up a working group to look at the issue of RDF and topic maps interoperability. Topic maps have been embraced by the library and information architecture community for some time, and have standards that have been adopted under ISO. Somewhat later but also in parallel was the development of the RDF standard by W3C. The interesting thing was that the conceptual underpinnings and objectives between these two efforts were quite similar. Also, because of the substantive thrust of topic maps and the substantive needs of its community, quite a few topic maps had been developed and implemented.

One of the first efforts of the W3C work group was to evaluate and compare five or six extant proposals for how to relate RDF and topic maps [6]. That report is very interesting reading for anyone desirous of learning more about specific issues in combining ontologies and their interoperability. The result of that evaluation then led to some guidelines for best practices in how to complete this mapping [7]. Evaluations such as these provide confidence that interoperability can be achieved between relatively formal schema definitions without unacceptable loss in meaning.

A different, “looser” approach, but one which also grew out of the topic map community, is the idea of “subject maps.” This effort, backed by Park and Durusau noted above, but also with the support of other topic map experts such as Steve Newcomb and Robert Barta via their proposed Topic Maps Reference Model (ISO 13250-5), seems to be one of the best attempts I’ve seen that both respects the reality of the actual Web while proposing a workable, effective scheme for federation.

The basic idea of a subject map is built around a set of subject “proxies.” A subject proxy is a computer representation of a subject that can be implemented as an object, must have an identity, and must be addressable (this point provides the URI connector to RDF). Each contributing schema thus defines its own subjects, with the mappings becoming meta-objects. These, in turn, would benefit from having some accepted subject reference schema (not specifically addressed by the proponents) to reduce the breadth of the ultimate mapped proxy “space.”
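As a rough illustration only, here is how I picture the proxy idea in code. This is a minimal sketch and not the Topic Maps Reference Model itself: the class names `SubjectProxy` and `ProxyMapping`, the `example.org` addresses, and the reference subject are all hypothetical, chosen just to show proxies with identity and an address, plus a mapping that is itself an object.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SubjectProxy:
    """A computer representation of a subject, as defined by one schema."""
    uri: str     # the address (and identity) of the proxy; hypothetical URIs below
    schema: str  # the contributing schema that defines this subject
    label: str   # human-readable name used by that schema


@dataclass
class ProxyMapping:
    """A meta-object stating that several proxies stand for one subject."""
    reference_subject: str                      # assumed entry in a shared reference schema
    proxies: set = field(default_factory=set)   # the SubjectProxy objects being related

    def schemas(self):
        """List the contributing schemas, each of which stays unaltered."""
        return sorted(p.schema for p in self.proxies)


library_cats = SubjectProxy("http://example.org/library/subjects/felines", "library", "Felines")
zoo_cats = SubjectProxy("http://example.org/zoo/taxa/felidae", "zoo", "Felidae")

mapping = ProxyMapping("http://example.org/reference/CatFamily", {library_cats, zoo_cats})
print(mapping.schemas())
```

The point of the sketch is that neither contributing schema is touched; the federation lives entirely in the mapping object.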

I don’t have the expertise to judge further the specifics, but I find the presentation and papers by Park and Durusau, Avoiding Hobson’s Choice In Choosing An Ontology and Towards Subject-centric Merging of Ontologies to be worthwhile reading in any case. I highly recommend these papers for further background and clarity.

As the third example, “binding layers” are a comparatively newer concept. Leading upper-level ontologies such as SUMO or PROTON propose their own binding protocols to their “lower” domains, but that approach takes place within the construct of the parent upper ontology and language. Such designs are not yet generalized solutions. By far the most promising generalized binding solution is the SKOS (Simple Knowledge Organization System). Because of its importance, the next section is devoted to it.

Finally, with respect to federated approaches, there are quite a few software tools that have been developed to aid or promote some of these specific methods. For example, about twenty of the software applications in my Sweet Tools listing of 500+ semantic Web and related tools could be interpreted as aiding ontology mapping or creation. You may want to check out some of these specific tools depending on your preferred approach [8].

The Role of SKOS – the Simple Knowledge Organization System

SKOS, or the Simple Knowledge Organization System, is a formal language and schema designed to represent such structured information domains as thesauri, classification schemes, taxonomies, subject-heading systems, controlled vocabularies, or others; in short, most all of the “loosely defined” ontology approaches discussed herein. It is a W3C initiative more fully defined in its SKOS Core Guide [9].

SKOS is built upon the RDF data model of the subject-predicate-object “triple.” The subjects and objects are akin to nouns, the predicate a verb, in a simple Dick-sees-Jane sentence. Subjects and predicates by convention are related to a URI that provides the definitive reference to the item. Objects may be either a URI resource or a literal (in which case it might be some indexed text, an actual image, number to be used in a calculation, etc.).
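The triple can be sketched in a few lines of plain Python. This is only an illustration of the data model, not an RDF library: the `URI` and `Triple` types and the `example.org` identifiers are my own assumptions for the Dick-sees-Jane sentence.

```python
from typing import NamedTuple, Union


class URI(str):
    """A URI reference; a str subclass so it can be told apart from a literal."""


class Triple(NamedTuple):
    subject: URI
    predicate: URI
    obj: Union[URI, str, int, float]  # object: another resource or a literal value


# Hypothetical example.org identifiers, used only for illustration.
DICK = URI("http://example.org/people/Dick")
JANE = URI("http://example.org/people/Jane")
SEES = URI("http://example.org/vocab/sees")
AGE = URI("http://example.org/vocab/age")

graph = [
    Triple(DICK, SEES, JANE),  # the object is a URI resource
    Triple(JANE, AGE, 7),      # the object is a literal (a number)
]


def object_kind(triple: Triple) -> str:
    """Report whether the object slot holds a resource or a literal."""
    return "resource" if isinstance(triple.obj, URI) else "literal"


kinds = [object_kind(t) for t in graph]
print(kinds)
```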

Being an RDF Schema simply means that SKOS adds some language and defined relationships to this RDF baseline. This is a bit of recursive understanding, since RDFS is itself defined in RDF by virtue of adding some controlled vocabulary and relations. The power, though, is that these schema additions are also easily expressed and referenced.

This RDFS combination can thus be shown as a standard RDF triple graph, but with the addition of the extended vocabulary and relations:

Standard RDF Graph Model

The power of the approach arises from the ability of the triple to express virtually any concept, further extended via the RDFS language defined for SKOS. SKOS includes concepts such as “broader” and “narrower”, which enable hierarchical relations to be modeled, as well as “related” and “member” to support networks and arrays, respectively [9].
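To make the “broader”, “narrower” and “related” relations concrete, here is a small sketch that stores SKOS-style statements as plain tuples and walks them. The concept names under the `ex:` prefix are hypothetical, and the upward walk in `ancestors` is simply a traversal I chose for illustration rather than anything mandated by SKOS.

```python
from collections import defaultdict

# (subject, predicate, object) statements; the ex: concept names are hypothetical.
triples = [
    ("ex:Lions", "skos:broader", "ex:BigCats"),
    ("ex:BigCats", "skos:broader", "ex:Felidae"),
    ("ex:HouseCats", "skos:broader", "ex:Felidae"),
    ("ex:HouseCats", "skos:related", "ex:Pets"),
]

broader = defaultdict(set)  # concept -> its more general concepts
related = defaultdict(set)  # concept -> associated concepts (a network, not a tree)
for s, p, o in triples:
    if p == "skos:broader":
        broader[s].add(o)
    elif p == "skos:related":
        related[s].add(o)
        related[o].add(s)  # skos:related is a symmetric relation


def narrower_of(concept):
    """Children of a concept: narrower is the inverse of broader."""
    return {child for child, parents in broader.items() if concept in parents}


def ancestors(concept):
    """Every concept reachable by following broader links upward."""
    seen, stack = set(), [concept]
    while stack:
        for parent in broader.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("ex:Lions")))
print(sorted(narrower_of("ex:Felidae")))
```

The hierarchy falls out of the broader links alone, while the related links sit alongside it as a separate network over the same concepts.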

We can visualize this transforming power by looking at how an “ontology” in a totally foreign scheme can be related to the canonical SKOS scheme. In the figure below, the left-hand portion shows the native hierarchical taxonomy structure of the UK Archival Thesaurus (UKAT), and the right-hand portion shows the same content as converted to SKOS (with the overlap of categories shown in dark purple). Note that the hierarchical relationships visualize better via a taxonomy, but that the RDF graph model used by SKOS allows a richer set of additional relationships, including related and alternative names:

Example Structural Comparison of Hierarchical Taxonomy with Network Graph

SKOS also has a rich set of annotation and labeling properties to enhance human readability of schema developed in it [9]. There is also a useful draft schema that the W3C’s SWEO (Semantic Web Education and Outreach) group is developing to organize semantic Web-related information [10].

Combined, these constructs provide powerful mechanisms for giving contributory ontologies a common conceptualization. When added to other sibling RDF schema such as FOAF or SIOC or DOAP, still additional concepts can be collated.

Conclusions

While not addressed directly in this piece, it is obviously of first importance to have content with structure before the questions of connecting that information can even arise. Then, that structure must also be available in a form suitable for merging or connection.

At that point, the subjects of this posting come into play.

We see that the daily Web has a diversity of schema or ontologies “loosely defined” for representing the structure of the content. These representations can be transferred to more complex schema, but not in the opposite direction. Moreover, the semantic basis for how to make these mappings also needs some common referents.

RDF provides the canonical data model for the data transfers and representations. RDFS, especially in the form of SKOS, appears to form one basis for the syntax and language for these transformations. And SKOS, with other schema, also appears to offer much of the appropriate “middle ground” for data relationships mapping.

However, lacking in this story is a referential structure for subject relationships [11]. (Also lacking are the ultimately critical domain specifics required for actual implementation.)

Abstract concepts of interest to philosophers and deep thinkers have been given much attention. Sadly, to date, concrete subject structures in which tangible things and tangible actions can be shared are still very, very weak. We are stubbing our toes on the rocks while we gaze at the heavens.

Yet, despite this, simple and powerful infrastructures are well in-hand to address all foreseeable syntactic and semantic issues. There appear to be no substantive limits to needed next steps.

Lastly, many valuable resources for further reading and learning may be found within the Ontolog Community, W3C, TagCommons and Topic Maps groups. Enjoy! And be wary of ontology no longer.


[1] Deborah L. McGuinness. “Ontologies Come of Age”. In Dieter Fensel, Jim Hendler, Henry Lieberman, and Wolfgang Wahlster, editors. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, 2003. See http://www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-(with-citation).htm
[2] I think it would be much clearer to refer to “upper level” ontologies as abstract or conceptual, “mid levels” as mapping or binding, and “lower levels” as domain (without any hierarchical distinctions such as lower or lowest or sub-domain), but current practice is probably too entrenched to change now.
[3] There are many aspects that make PROTON one of the more attractive reference ontologies. The PROTON ontology (PROTo ONtology), developed within the scope of the SEKT project, is attractive because of its understandability, relatively small size, modular architecture and a simple subsumption hierarchy. It is available in an OWL Lite form and is easy to adopt and extend. On the face of it, the Topic class within PROTON, which is meant to serve as a bridge between different ontologies, may also provide a binding layer to specific subject topics as sub-classes or class instances.
[4] See my earlier post on Cyc.
[5] Even with such clout, it is questionable whether anything near complete adherence can be achieved, as Ada showed within the Federal government. However, where circumstances allow it, central schema and ontologies may be worth pursuing because of improved interoperability and lower costs, even where some portions do not adhere and are more chaotic like the standard Web.
[6] See, A Survey of RDF/Topic Maps Interoperability Proposals, W3C Working Group Note 10 February 2006, Pepper, Vitali, Garshol, Gessa, Presutti (eds.)
[7] See, Guidelines for RDF/Topic Maps Interoperability, W3C Editor’s Draft 30 June 2006, Pepper, Presutti, Garshol, Vitali (eds.)
[8] Here are some Sweet Tools that may have a usefulness to ontology federation and creation:
  • Adaptiva — is a user-centered ontology building environment, based on using multiple strategies to construct an ontology, minimising user input by using adaptive information extraction
  • Altova SemanticWorks — is a visual RDF and OWL editor that auto-generates RDF/XML or nTriples based on visual ontology design
  • CMS — the CROSI Mapping System is a structure matching system that capitalizes on the rich semantics of the OWL constructs found in source ontologies and on its modular architecture that allows the system to consult external linguistic resources
  • ConcepTool — is a system to model, analyze, verify, validate, share, combine, and reuse domain knowledge bases and ontologies, reasoning about their implication
  • ConRef — is a service discovery system which uses ontology mapping techniques to support different user vocabularies
  • FOAM — is the Framework for Ontology Alignment and Mapping. It is based on heuristics (similarity) of the individual entities (concepts, relations, and instances)
  • hMAFRA (Harmonize Mapping Framework) — is a set of tools supporting semantic mapping definition and data reconciliation between ontologies. The targeted formats are XSD, RDFS and KAON
  • IF-Map — is an Information Flow based ontology mapping method. It is based on the theoretical grounds of logic of distributed systems and provides an automated streamlined process for generating mappings between ontologies of the same domain
  • IODT — is IBM’s toolkit for ontology-driven development. The toolkit includes EMF Ontology Definition Metamodel (EODM), EODM workbench, and an OWL Ontology Repository (named Minerva)
  • KAON — is an open-source ontology management infrastructure targeted for business applications. It includes a comprehensive tool suite allowing easy ontology creation and management and provides a framework for building ontology-based applications. An important focus of KAON is scalable and efficient reasoning with ontologies
  • LinKFactory — is Language & Computing’s ontology management tool. It provides an effective and user-friendly way to create, maintain and extend extensive multilingual terminology systems and ontologies (English, Spanish, French, etc.). It is designed to build, manage and maintain large, complex, language independent ontologies
  • M3t4.Studio Semantic Toolkit — is Metatomix’s free set of Eclipse plug-ins to allow developers to create and manage OWL ontologies and RDF documents
  • MAFRA Toolkit — the Ontology MApping FRAmework Toolkit allows to create semantic relations between two (source and target) ontologies, and apply such relations in translating source ontology instances into target ontology instances
  • OntoEngine — is a step toward allowing agents to communicate even though they use different formal languages (i.e., different ontologies). It translates data from a “source” ontology to a “target.”
  • OntoPortal — enables the authoring and navigation of large semantically-powered portals
  • OWLS-MX — the hybrid semantic Web service matchmaker OWLS-MX 1.0 utilizes both description logic reasoning, and token based IR similarity measures. It applies different filters to retrieve OWL-S services that are most relevant to a given query
  • pOWL — is a semantic Web development platform for ontologies in PHP. pOWL consists of a number of components, including RAP
  • Protege — is an open source visual ontology editor written in Java with many plug-in tools
  • Semantic Net Generator — is a utility for generating topic maps automatically from different data sources by using rules definitions specified with Jelly XML syntax. This Java library provides Jelly tags to access and modify data sources (also RDF) to create a semantic network
  • SOFA — is a Java API for modeling ontologies and Knowledge Bases in ontology and Semantic Web applications. It provides a simple, abstract and language neutral ontology object model, inferencing mechanism and representation of the model with OWL, DAML+OIL and RDFS languages
  • Terminator — is a tool for creating term to ontology resource mappings (documentation in Finnish)
  • WebOnto — supports the browsing, creation and editing of ontologies through coarse grained and fine grained visualizations and direct manipulation.
[9] The SKOS language has the following classes:
  • CollectableProperty — A property which can be used with a skos:Collection
  • Collection — A meaningful collection of concepts
  • Concept — An abstract idea or notion; a unit of thought
  • ConceptScheme — A set of concepts, optionally including statements about semantic relationships between those concepts. Thesauri, classification schemes, subject heading lists, taxonomies, ‘folksonomies’, and other types of controlled vocabulary are all examples of concept schemes. Concept schemes are also embedded in glossaries and terminologies.
  • OrderedCollection — An ordered collection of concepts, where both the grouping and the ordering are meaningful
. . . and the following properties:
  • altLabel — An alternative lexical label for a resource. Acronyms, abbreviations, spelling variants, and irregular plural/singular forms may be included among the alternative labels for a concept
  • altSymbol — An alternative symbolic label for a resource
  • broader — A concept that is more general in meaning. Broader concepts are typically rendered as parents in a concept hierarchy (tree)
  • changeNote — A note about a modification to a concept
  • definition — A statement or formal explanation of the meaning of a concept
  • editorialNote — A note for an editor, translator or maintainer of the vocabulary
  • example — An example of the use of a concept
  • hasTopConcept — A top level concept in the concept scheme
  • hiddenLabel — A lexical label for a resource that should be hidden when generating visual displays of the resource, but should still be accessible to free text search operations
  • historyNote — A note about the past state/use/meaning of a concept
  • inScheme — A concept scheme in which the concept is included. A concept may be a member of more than one concept scheme
  • isPrimarySubjectOf — A resource for which the concept is the primary subject
  • isSubjectOf — A resource for which the concept is a subject
  • member — A member of a collection
  • memberList — An RDF list containing the members of an ordered collection
  • narrower — A concept that is more specific in meaning. Narrower concepts are typically rendered as children in a concept hierarchy (tree)
  • note — A general note, for any purpose. The other human-readable properties of definition, scopeNote, example, historyNote, editorialNote and changeNote are all sub-properties of note
  • prefLabel — The preferred lexical label for a resource, in a given language. No two concepts in the same concept scheme may have the same value for skos:prefLabel in a given language
  • prefSymbol — The preferred symbolic label for a resource
  • primarySubject — A concept that is the primary subject of the resource. A resource may have only one primary subject per concept scheme
  • related — A concept with which there is an associative semantic relationship
  • scopeNote — A note that helps to clarify the meaning of a concept
  • semanticRelation — A concept related by meaning. This property should not be used directly, but as a super-property for all properties denoting a relationship of meaning between concepts
  • subject — A concept that is a subject of the resource
  • subjectIndicator — A subject indicator for a concept. [The notion of 'subject indicator' is defined here with reference to the latest definition endorsed by the OASIS Published Subjects Technical Committee]
  • symbol — An image that is a symbolic label for the resource. This property is roughly analogous to rdfs:label, but for labelling resources with images that have retrievable representations, rather than RDF literals. Symbolic labelling means labelling a concept with an image.
[10] The SWEO classification ontology is still under active development and has these draft classes. Note, however, the relative lack of actual subject or topic matter:
Classes are currently defined as:
  • article – magazine article
  • blog – blog discussing SW topics
  • book – indicates a textbook, applies to the book’s home page, review or listing in Amazon or such
  • casestudy – Article on a business case
  • conference/event – conferences or events where you can learn about the Semantic Web
  • demo/demonstration – interactive SW demo
  • forum – a forum on semantic web or related topics
  • presentation – Powerpoint or similar slide show
  • person – If this is a person’s home page or blog, see below
  • publication – a scientific publication
  • ontology – a formalisation of a shared conceptualization using OWL, RDFS, SKOS or something else based on RDF
  • organization – If the page is the home page of an organization, research, vendor etc, see below
  • portal – a portal website Semantic Web or related topics, usually hosting information items, mailinglists, community tools
  • project – a research (for example EU-IST) or other project that addresses Semantic Web issues
  • mailinglist – a mailinglist on semantic Web or related topics
  • person – ideally a person that is well known regarding the Semantic Web (people who can do keynote speakers), may also be any related person
  • press – a press release by a company or an article about Semantic Web
  • recommended – If the resource is seen to be in the top 10 of its kind
  • specification – a Semantic Web specification (RDF, RDF/S, OWL, etc)
  • categories – (perhaps using tags or other free form annotation)
  • successstory – Article that can contain advertisement and clearly shows the benefit of semantic web
  • tutorial – a tutorial teaching some aspect of semantic web, an example
  • vocabulary – a RDF vocabulary
  • software project/tool – For product/project home pages
If the page describes an organization, it can be tagged as:
  • vendor
  • research
  • enduser
If the page is a person’s home page or blog or similar, it could be:
  • opinionleader
  • researcher
  • journalist
  • executive
  • geek
The type of audience can also be tagged, for example:
  • general public
  • beginners
  • technicians
  • researchers.
[11] The OASIS Topic Maps Published Subjects Technical Committee was formed a number of years back to promote Topic Maps interoperability through the use of Published Subjects Indicators (PSIs). Their resulting report was a very interesting effort that unfortunately did not lead to wide adoption, perhaps because the effort was a bit ahead of its time or it was in advance of the broader acceptance of RDF. This general topic is the subject of a later posting by me.
[12] See further, Leo Obrst, “The Semantic Spectrum & Semantic Models,” a Powerpoint presentation (http://ontolog.cim3.net/file/resource/presentation/LeoObrst_20060112/OntologySpectrumSemanticModels–LeoObrst_20060112.ppt) made as part of an Ontolog Forum (http://ontolog.cim3.net/) presentation in two parts, “What is an Ontology? – A Briefing on the Range of Semantic Models” (see http://ontolog.cim3.net/cgi-bin/wiki.pl?ConferenceCall_2006_01_12), in January 2006. Leo Obrst is a principal artificial intelligence scientist at MITRE’s (http://www.mitre.org) Center for Innovative Computing and Informatics and a co-convener of the Ontolog Forum. His presentation is a rich source of practical overview information on ontologies.
[13] The actual diagram is an unattributed modification by Dan McCreary (see http://www.danmccreary.com/presentations/sem_int/sem_int.ppt) based on Obrst’s material in [12].

Posted by AI3's author, Mike Bergman

Posted on November 26, 2010 at 2:43 am in Adaptive Information, Brown Bag Lunch, Description Logics, Ontologies, Semantic Web, Structured Web | Comments (3)
The URI link reference to this post is: http://www.mkbergman.com/936/brown-bag-lunch-an-intrepid-guide-to-ontologies/
The URI to trackback this post is: http://www.mkbergman.com/936/brown-bag-lunch-an-intrepid-guide-to-ontologies/trackback/
Date:   November 22, 2010

Horse by Ruth

The Reality is: Most Connections are Proximate

What does it mean to interoperate information on the Web? With linked data and other structured data now in abundance, why don’t we see more information effectively combined? Why express your information as linked data if no one is going to use it?

Interoperability comes down to the nature of things and how we describe those things or quite similar things from different sources. This was the major thrust of my recent keynote presentation to the Dublin Core annual conference. In that talk I described two aspects of the semantic “gap”:

  1. One aspect is the need for vetted reference sources that provide the entities and concepts for aligning disparate content sources on the Web, and
  2. A second aspect is the need for accurate mapping predicates that can represent the often approximate matches and overlaps of this heterogeneous content.

I’ll discuss the first “gap” in a later post. What we’ll discuss here is the fact that most relationships between putatively same things on the Web are rarely exact, and are most often approximate in nature.

“It Ain’t the Label, Mabel”

The use of labels for matching or descriptive purposes was the accepted practice in early libraries and library science. However, with the move to electronic records and machine bases for matching, appreciation for ambiguities and semantics has come to the fore. Labels are no longer an adequate, let alone a sufficient, basis for matching references.

The ambiguity point is pretty straightforward.  Refer to Jimmy Johnson by his name, and you might be referring to a former football coach, a NASCAR driver, a former boxing champ, a blues guitarist, or perhaps even a plumber in your home town. Or perhaps none of these individuals. Clearly, the label “Jimmy Johnson” is insufficient to establish identity.

Of course, not all things are named entities such as a person’s name. Some are general things or concepts. But, here, semantic heterogeneities can also lead to confusion and mismatches. It is always helpful to revisit the sources and classification of semantic heterogeneities, which I first discussed at length nearly five years ago. Here is a schema classifying more than 40 categories of potential semantic mismatches [1]:

Class        Category                         Subcategory
STRUCTURAL   Naming                           Case Sensitivity
                                              Synonyms
                                              Acronyms
                                              Homonyms
             Generalization / Specialization
             Aggregation                      Intra-aggregation
                                              Inter-aggregation
             Internal Path Discrepancy
             Missing Item                     Content Discrepancy
                                              Attribute List Discrepancy
                                              Missing Attribute
                                              Missing Content
             Element Ordering
             Constraint Mismatch
             Type Mismatch
DOMAIN       Schematic Discrepancy            Element-value to Element-label Mapping
                                              Attribute-value to Element-label Mapping
                                              Element-value to Attribute-label Mapping
                                              Attribute-value to Attribute-label Mapping
             Scale or Units
             Precision
             Data Representation              Primitive Data Type
                                              Data Format
DATA         Naming                           Case Sensitivity
                                              Synonyms
                                              Acronyms
                                              Homonyms
             ID Mismatch or Missing ID
             Missing Data
             Incorrect Spelling
LANGUAGE     Encoding                         Ingest Encoding Mismatch
                                              Ingest Encoding Lacking
                                              Query Encoding Mismatch
                                              Query Encoding Lacking
             Languages                        Script Mismatches
                                              Parsing / Morphological Analysis Errors (many)
                                              Syntactical Errors (many)
                                              Semantic Errors (many)

Even with the same label, two items in different information sources can refer generally to the same thing, but may not be the same thing or may define it with a different scope and content. In broad terms, these mismatches can be due to structure, domain, data or language, with many nuances within each type.

The sameAs approach used by many of the inter-dataset linkages in linked data ignores these heterogeneities. In a machine and reasoning sense, indeed even in a linking sense, these assertions can make as little sense as describing the plumber with facts about the blues guitarist.

Cats, Paul Newman and Great Britain

Let’s take three examples where putatively we are talking about the same thing and linking disparate sources on the Web.

Great Britain Usages

The first example is the seemingly simple idea of “cats”. In one source, the focus might be on house cats, in another domestic cats, and in a third, cats as pets. Are these ideas the same thing? Now, let’s bring in some taxonomic information about the cat family, the Felidae. Now, the idea of “cats” includes lynx, tigers, lions, cougars and many other kinds of cats, domestic and wild (and, also extinct!). Clearly, the “cat” label used alone fails us miserably here.

Another example is one that Fred Giasson and I brought up one year ago in When Linked Data Rules Fail [2]. That piece discussed many poor practices within linked data, and used as one case the treatment of articles in the New York Times about the (deceased) actor Paul Newman. The NYT dataset is about various articles written about people historically in the newspaper. Their record about Paul Newman was about their pool of articles with attributes such as first published and so forth, with no direct attribute information about Paul Newman the person. Then, they asserted a sameAs relationship with external records in Freebase and DBpedia, which acts to commingle person attributes like birth, death and marriage with article attributes such as first and last published. Clearly, the NYT has confused the topic (Paul Newman) of a record with the nature of that record (articles about topics). This misunderstanding of the “thing” at hand makes the entailed assertions from the multiple sources illogical and useless [3].

Our third example is the concept or idea or named entity of Great Britain. Depending on usage and context, Great Britain can refer to quite different scopes and things. In one sense, Great Britain is an island. In a political sense, Great Britain can comprise the territory of England, Scotland and Wales. But, even more precise understandings of that political grouping may include a number of outlying islands such as the Isle of Wight, Anglesey, the Isles of Scilly, the Hebrides, and the island groups of Orkney and Shetland. Sometimes the Isle of Man and the Channel Islands, which are not part of the United Kingdom, are fallaciously included in that political grouping. And, then, in a sporting context, Great Britain may also include Northern Ireland. Clearly, these, plus other confusions, can mean quite different things when referring to “Great Britain.” So, without definition, a seemingly simple question such as what the population of Great Britain is could legitimately return quite disparate values (not to mention the time dimension and how that has changed boundaries as well!).

These cases are quite usual for what “things” mean when provided from different sources with different perspectives and with different contexts. If we are to get meaningful interoperation or linkage of these things, we clearly need some different linking predicates.
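The Paul Newman case can be mimicked with a toy example. What follows is a minimal sketch, assuming a naive consumer that pools all properties of two identifiers whenever it sees owl:sameAs; the record identifiers, the placeholder values and the `pool_properties` function are hypothetical and do not reproduce the actual NYT or DBpedia data.

```python
# Hypothetical records: one describes a pool of articles, the other a person.
records = {
    "nyt:newman": {"first_published": "<date>", "last_published": "<date>"},
    "dbpedia:Paul_Newman": {"birth": "<date>", "death": "<date>"},
}


def pool_properties(records, links):
    """Naive identity handling: owl:sameAs says both identifiers denote one
    thing, so every property of one is treated as a property of the other.
    Any other predicate leaves the two records separate."""
    merged = {key: dict(props) for key, props in records.items()}
    for subject, predicate, obj in links:
        if predicate == "owl:sameAs":
            pooled = {**merged[subject], **merged[obj]}
            merged[subject] = pooled
            merged[obj] = dict(pooled)
    return merged


with_same_as = pool_properties(
    records, [("nyt:newman", "owl:sameAs", "dbpedia:Paul_Newman")]
)
with_about = pool_properties(
    records, [("nyt:newman", "umbel:isAbout", "dbpedia:Paul_Newman")]
)

print(sorted(with_same_as["nyt:newman"]))  # article and person attributes commingled
print(sorted(with_about["nyt:newman"]))    # article attributes only
```

With the identity assertion, the article record acquires a birth and a death; with an approximate predicate such as umbel:isAbout the link is still recorded but the two kinds of thing stay distinct.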

Some Attempts at ‘Approximateness’

The realization that many connections across datasets on the Web need to be “approximate” is growing. Here is the result of an informal survey for leading predicates in this regard [4]:

  • skos:broadMatch
  • skos:related
  • ore:similarTo
  • dul:associatedWith
  • umbel:isAbout
  • skos:narrowMatch
  • vmf:isInVocabulary
  • skos:closeMatch
  • owl:equivalentClass
  • skos:mappingRelation
  • ov:similarTo
  • umbel:hasMapping
  • doape:similarThing
  • lvont:nearlySameAs
  • umbel:isRelatedTo
  • umbel:isLike
  • skos:exactMatch
  • sswap:hasMapping
  • umbel:hasCharacteristic
  • lvont:somewhatSameAs
  • dul:isAbout
  • skos:semanticRelation
  • rdfs:seeAlso
  • ore:describes
  • skos:narrowerTransitive
  • map:narrowerThan
  • dul:isConceptualizedBy
  • skos:narrower
  • umbel:isCharacteristicOf
  • prowl:defineUncertaintyOf
  • dc:subject
  • sumo:entails
  • link:uri
  • foaf:isPrimaryTopicOf
  • skos:broaderTransitive
  • dul:isComponentOf
  • foaf:focus
  • skos:relatedMatch
  • map:broaderThan
  • owl:sameAs
  • skos:broader
  • dul:isAssignedTo
  • wn:similarTo
  • sumo:refers
  • rdfs:subClassOf

Besides the standard OWL and RDFS predicates, SKOS, UMBEL and DOLCE [5] provide the largest number of choices above. In combination, these predicates probably provide a good scoping of “approximateness” in mappings.
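For those who want to check the tally, a few lines of Python group the predicates listed above by namespace prefix. The only assumption is that the `dul:` prefix stands for the DOLCE predicates, which is how I read it here.

```python
from collections import Counter

# The surveyed predicates, copied from the list above.
predicates = """
skos:broadMatch skos:related ore:similarTo dul:associatedWith umbel:isAbout
skos:narrowMatch vmf:isInVocabulary skos:closeMatch owl:equivalentClass
skos:mappingRelation ov:similarTo umbel:hasMapping doape:similarThing
lvont:nearlySameAs umbel:isRelatedTo umbel:isLike skos:exactMatch
sswap:hasMapping umbel:hasCharacteristic lvont:somewhatSameAs dul:isAbout
skos:semanticRelation rdfs:seeAlso ore:describes skos:narrowerTransitive
map:narrowerThan dul:isConceptualizedBy skos:narrower umbel:isCharacteristicOf
prowl:defineUncertaintyOf dc:subject sumo:entails link:uri
foaf:isPrimaryTopicOf skos:broaderTransitive dul:isComponentOf foaf:focus
skos:relatedMatch map:broaderThan owl:sameAs skos:broader dul:isAssignedTo
wn:similarTo sumo:refers rdfs:subClassOf
""".split()

# Count how many predicates each namespace prefix contributes.
by_prefix = Counter(name.split(":", 1)[0] for name in predicates)
top_three = by_prefix.most_common(3)
print(len(predicates), top_three)
```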

Rationality and Reasoners

It is time for some leadership to emerge to provide a more canonical set of linking predicates for these real-world connection requirements. It would also be extremely useful to have such a canonical set adopted by some leading reasoners such that useful work could be done against these properties.


[1] See M. K. Bergman, 2006. “Sources and Classification of Semantic Heterogeneities,” AI3:::Adaptive Information blog, June 6, 2006. See http://www.mkbergman.com/232/sources-and-classification-of-semantic-heterogeneities/.
[2] See M. K. Bergman and F. Giasson, 2009. “When Linked Data Rules Fail,” AI3:::Adaptive Information blog, November 16, 2009. See http://www.mkbergman.com/846/when-linked-data-rules-fail/.
[3] On a different disappointing note, the critical errors that we noted a year ago and the NYT’s own acknowledgement on its site that:
“An RDFS description and English language documentation for the NYT namespace will be provided soon. Thanks for your patience.”
have still not been corrected, now a year later. Poor performance like this by a professional publisher gives linked data a bad name.
[4] These predicates have been obtained from personal knowledge and directed searches using the Falcons ontology search service. Simple Web searches on the namespace plus predicate name will provide more detail on any given predicate.
[5] UMBEL (Upper Mapping and Binding Exchange Layer) is an ontology of about 20,000 subject concepts that acts as a reference structure for inter-relating disparate datasets. It is also a general vocabulary of classes and predicates designed for the creation of domain-specific ontologies. For SKOS, see Alistair Miles and Sean Bechhofer, eds., 2009. SKOS Simple Knowledge Organization System Reference, W3C Recommendation, 18 August 2009; http://www.w3.org/TR/skos-reference/. The Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) is one of the more popular upper ontologies.

Posted by AI3's author, Mike Bergman

Posted on November 22, 2010 at 1:23 am in Adaptive Information, Linked Data, Semantic Web, Structured Web | Comments (7)
The URI link reference to this post is: http://www.mkbergman.com/935/the-nature-of-connectedness-on-the-web/
The URI to trackback this post is: http://www.mkbergman.com/935/the-nature-of-connectedness-on-the-web/trackback/
Copyright © 2004–2013 Michael K. Bergman.   This work is licensed under a Creative Commons License