Evolution
AI³
Adaptive Information
Adaptive Innovation
Adaptive Infrastructure
a·dap·tive adj. Showing or having a capacity to make fit for new or special situations; flexible; a successful adjustment.

Date:   April 2, 2008

Part 4 of 4 on Foundations to UMBEL

Just as DBpedia has provided the nucleating point for linking instance data (see Part 2), UMBEL is designed to provide a similar reference structure for concepts. These concepts provide some fixed positions in space to which other sources can link and relate. And, like references for instance data, the existence of reference concepts can greatly diminish the number of links necessary in the Linked Data environment.
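
A quick way to see why shared reference points can diminish link counts is simple counting. The sketch below is only an illustration, assuming a worst case in which each of n sources would otherwise link directly to every other source; the function names are hypothetical and nothing here describes how UMBEL itself is implemented.

```python
def pairwise_links(n: int) -> int:
    """Links needed if each of n sources links directly to every other source."""
    return n * (n - 1) // 2


def hub_links(n: int) -> int:
    """Links needed if each of n sources links once to a shared reference point."""
    return n


# Compare the two linking patterns as the number of sources grows.
for n in (5, 50, 500):
    print(n, pairwise_links(n), hub_links(n))
```

Under these assumptions the pairwise count grows quadratically while the hub count grows linearly, which is the intuition behind using reference concepts as common linking targets.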

Clearly, the combination of the representativeness of UMBEL’s subject concepts (the “scope” of the ontology) and their relationships (the “structure” of the backbone) is fundamental. These factors in turn express the functional capabilities of the system.

First Things First

The first fundamental point deserving emphasis is that a reference structure of almost any nature has value. We can argue later about what is the best reference structure, but the first task is to just get one in place and begin bootstrapping. Indeed, over time, it is likely that a few reference structures will emerge and compete and get supplemented by still further structures. This evolution is expected and natural and desirable in that it provides choice and options.

A reference structure of concepts has the further benefit of providing a logical reference structure for instances as well. While Wikipedia is perhaps the most comprehensive collection of humanity-wide instances, no single source can or will be complete in scope. Thus, we foresee specialty sources ranging from the companies in Wikicompany to plants and animals in the Encyclopedia of Life or thousands of other rich instance sources also acting as reference hubs.

How do these rich instance sources relate to one another? What is the subject concept or topical basis by which they overlap or complement each other? What is the framework and graph structure of knowledge to give this information context? These are the benefits brought by a structure of reference concepts, independent of the specifics of the reference structure itself.

Another key consideration is that broad-scale acceptance is important. An express purpose of UMBEL is to aid the interconnection of related content using broadly accepted foundations.

Alternative Approaches

Since the Web’s inception fifteen years ago, there have been various alternatives tried or in ascendance for organizing and bringing structure to Web content. Some of these may be too static and inflexible, others perhaps too arbitrary or parochial. All approaches to date have had little collective success.

There are also new and exciting developments in social networks and user-driven content and structure arising from areas such as tagging or Wikipedia (and wikis in general). But it is not clear that bottom-up contributions suitable to individual articles or topics can lead to coherent structural frameworks; arguably, they have not done so yet. And then there are sporadic government or corporate or trade association initiatives as well.

Here is a summary of alternate approaches:

  • Existing library systems — Dewey Decimal Classification, Library of Congress, UDC and many other library classification schemes have been touted for the Web and all have failed. Some reasons cited for this failure are that physical books are very different from free digital bits, that Web schema need to evolve quickly, and that stewards and curation are lacking
  • Market share — at various times certain successful vendors have held temporary minor ascendance with content organizational frameworks, generally directory structures. Examples include About, Yahoo!, Open Directory Project (DMOZ), Northern Light, etc. Yet even at their peaks, market shares were low, external adoption was rare, and scope was questioned as arbitrary; interest in directories is now nearly absent
  • WordNet — though of strong interest and use to computational linguists, and quite popular for many content analyses, WordNet has seen little consumer or commercial interest. However, the synset structure and its coverage is extremely valuable for concept disambiguation, and therefore has a role in UMBEL (as it does in many other online systems)
  • Standards efforts — some sporadic success and some notable failures have occurred in the standards arenas. Generally, the successful initiatives tend to be in close communities where there are clear financial benefits for adherence, such as in the exchange of financial or commerce data; broader and more ambitious efforts have tended to be less successful
  • Professional organizations and associations — areas such as finance, pharmaceuticals, biology, physics and many other bounded communities have enjoyed sporadic and sometimes notable success in developing and using domain-specific schema; none have yet transferred beyond their beginning boundaries to the broader Web
  • Government initiatives — there are episodic successes for government-sponsored content organizational initiatives, mostly in metadata, controlled vocabularies and ontologies, often where contractors or suppliers may be compelled to comply. NIH’s National Library of Medicine (and other NIH branches) has also seen significant domain successes, due to its foresight and its receptive biology, genetics and medical communities
  • Upper ontologies — UMBEL investigated this area considerably in the early months of the project. Most of the upper ontologies have relatively sparse subject concept content, being geared to smaller, abstract-oriented “upper” structures. Some, such as SUMO and DOLCE and now PROTON, have concerted initiatives to extend to middle- and domain-level ontologies [1]. To date, penetration of these systems into general Web or commercial realms has been quite limited
  • Wikipedia — a clear and phenomenal success, Wikipedia and related initiatives like Wikinvest and Wikicompany and scores more have proven to be a rich fount for named entities and article-length content, but not for the category and content organization structures in which that content is embedded. This is an area of keen academic and collective interest [2] and it may still result in useful organizational schema as these popular wikis continue to evolve and mature. However, they have not yet done so, and while they are a rich source for entities and data, UMBEL decided to pass on their use for “backbone” structure at this time
  • No collective structure — tagging or folksonomies or doing nothing have perhaps the greatest market share at present.

Since inception, the stated intent of the UMBEL project was to base its subject structure on extant systems. To minimize development time, the structure needed to be drawn from one of the categories above. Possible development of a de novo structure was rejected because of development time and the low probability of gaining acceptance in the face of so many competing alternatives.

Rationale for OpenCyc

The granddaddy of knowledge bases suitable to all human content and knowledge is Cyc. Because of its more than 20-year history, Cyc brings with it considerable strengths and some weaknesses.

Amongst all alternatives, Cyc rapidly emerged as the leading candidate. While its strengths warranted close attention, its weaknesses also suggested a considerable effort to overcome them. This combination compelled the need for a significant investigation and due diligence.

First, here are OpenCyc’s strengths:

  • Venerable and solid — through an estimated 200 person years of engineering and effort, the Cyc structure has been tested and refined through many projects and applications. While a few years back such groundings were unparalleled in the field, we are also now seeing some Internet-wide projects tap into the law of large numbers to get significant inputs of human labor. Cyc has also tapped this venue for ongoing expansion of its KB using the online FACTory game [3]
  • Community — there is a large community of Cyc users and supporters from academic, government, commercial and non-profit realms. Moreover, the formation of The Cyc Foundation has also served as a vehicle for tapping into volunteer effort as well
  • Upgrade Path — OpenCyc has an upgrade path to the more capable ResearchCyc, full Cyc and the services of Cycorp
  • Comprehensive — no existing system has the scope, breadth and coverage of human concepts to match that of Cyc (however, sources for named entities such as Wikipedia have recently passed Cyc in scope; see next section)
  • Common sense — since its founding as a project and then backed by the standalone Cycorp, Cyc has set for itself a challenge that is both more pragmatic and harder than that of other knowledge systems. Cyc has set out to capture the common sense at the heart of human reasoning. This objective means codifying generally unstated logic and rules-of-thumb — not unlike teaching a baby to walk and talk and read — all of which are lengthy tasks of trial and error. However, as Cyc has gained this foundation, it has also led to a more solid basis for its reasoning and conceptual relationships
  • Power and inference — ultimately the purpose of a knowledge base is to support reasoning and inference by computer when presented with an (often small) starting set of assertions or facts. Cyc has literally thousands of microtheories now governing its inference domains, giving it a scope and power unmatched by other systems. The importance of such reasoning lies not in the silly science fiction of autonomous intelligent robots, but in achievable aids to make connections, determine relationships and filter and order results
  • Robust supporting capabilities — such knowledge base-wide capabilities can also be deeply leveraged in such areas as entity extraction, machine translation, natural language processing, risk analysis or one of the other dozens of specialty modules available in Cyc, and
  • Free and open — last, but not least, is the fact that a mostly complete Cyc was released as a free and open source version in 2002. OpenCyc has now been downloaded more than 100,000 times and is in production use for many applications. Non-profits and academics can also obtain access to the full capabilities of the Cyc system through ResearchCyc. This open character is an absolute essential because leading Web applications and leading innovators of the Web eschew proprietary systems.

Even after months of investigation and involvement, the richness of practical uses to which the OpenCyc knowledge base can be applied is still revealing itself.

Drawbacks to OpenCyc

But there are weaknesses and problems with Cyc.

To be sure, some individuals and some historical criticisms of Cyc have raised fears of Big Brother or objected to grandiose claims about artificial intelligence or machine reasoning. These are criticisms of hype, immaturity or ignorance; they are different from the drawbacks observed by our UMBEL project and are not discussed further here.

In UMBEL’s investigation of Cyc, we observed these drawbacks:

  • Obscure upper ontology — the Cyc upper ontology, shown in the figure below, is perhaps not as clean as more recent upper ontologies (Proton, [4] for example, is a very clean system). The various sub-classifications of ‘Thing’ and degrees of “tangibility” seem particularly problematic. However, since these are not direct binding concepts for UMBEL and provide appropriate “glue” for the upper portions of the graph, these criticisms can be easily overlooked
    [Figure: Cyc Upper Ontology]
  • Cruft — twenty years of projects and forays into obscure domains (many for the military or intelligence communities) have left a significant degree of cruft marbled through the knowledge base. Indeed, as our vetting showed, perhaps about 30% of the concepts in Cyc are holdovers from prior projects or relate to internal Cyc-only topics
  • Reasoning concepts — another 15% or so of Cyc concepts are abstract or for reasoning purposes, such as reasoning over colors, beliefs, the sizes of objects, their orientations in space, and so forth. These are certainly legitimate concepts and appropriate to Cyc’s purposes, but are not needed or desired for UMBEL’s purposes
  • Greater expressivity — Cyc is grounded in the LISP language and has many higher-order logic constructs. Paradoxically, this greater expressiveness may make translation to UMBEL more difficult
  • Older conventions — also related to these groundings in an earlier era are the reliance on functions and functional predicates for many relations, and the absence of the current triple data model underlying RDF. While it is true that OWL versions of OpenCyc have been made and are the basis for UMBEL’s work to date, there are also errors in these translations, perhaps due in some instances to the lesser expressiveness of RDF and OWL
  • Documentation — while complete reference materials can ultimately be found, it is difficult to do so and introductory and entry-level tutorials could stand to be augmented
  • Named entities — for many years, but now especially with the emergence of Wikipedia, Cyc has been criticized for its relative paucity of named entity coverage and imbalances of what it does contain. While from UMBEL’s perspective this appears to be strictly correct, such criticism misses the mark of Cyc’s special purpose and contributions as a solid conceptual and common sense framework. Those common-sense portions of the system are more immutable, and can be readily mapped to named entity sources. Indeed, perhaps Cyc will now see new vigor as the Web becomes a superior source for contemporary named entity coverage while Cyc fulfills its more natural (and needed) structural role.

Surprisingly, for a system of its age and evolution, Cyc seems to have adhered well to naming conventions and other standards.

UMBEL’s project diligence thus found the biggest issue going forward to be the cruft in the system. There is a solid structure underneath Cyc, but one that is too often obscured and not made as shiny and clean as it deserves.

The Decision and Design

Five months of nearly full-time due diligence was devoted to this question of the suitability of Cyc as the intellectual grounding for UMBEL.

On balance, OpenCyc’s benefits significantly outweighed its weaknesses. This balance also stands considerably superior to all potential alternatives.

An important factor throughout this deliberation was the commitment of Cycorp and The Cyc Foundation to the aims of UMBEL, and the willingness of those organizations to lend time and effort to promote UMBEL’s aims. Twenty years of development and the investment of decades of human effort and scrutiny provide a foundation of immense solidity.

Though perhaps Wikipedia (or something like it also based on broad Web input) might emerge with the scope and completeness of Cyc, that prospect is at minimum some years away and by no means certain. No current framework other than Cyc can meet UMBEL’s immediate purposes. Moreover, as stated at the outset, UMBEL’s purpose is pragmatic. We will leave it to others to argue the philosophical nuances of ontology design and “truth” while we get on with the task of creating context of real value.

The next decision was to base all UMBEL subject concepts on existing concepts in OpenCyc.

This means that UMBEL inherits all of the structural relations already in OpenCyc. It also means that UMBEL can act as a sort of contextual middleware between unstructured Web content and the inferential and tools infrastructure within OpenCyc (and beyond into ResearchCyc and Cyc for commercial purposes) and back again to the Web. We term this “roundtripping” and the capability is available for any of the 21,000 subject concepts vetted from OpenCyc within UMBEL.

Having made these commitments, our next effort was to break out the brushes, roll up the sleeves, and plunge into a Spring session of deep cleaning. This effort to vet and clean OpenCyc will be documented in the Technical Report to accompany the first release of the UMBEL ontology. We think you’ll like its shiny new look. :)

This is Part 4 of 4 on the foundations to UMBEL. This four-part series covers a Re-Introduction to UMBEL, UMBEL: Making Linked Data Classy, Subject Concepts and Named Entities, and Basing UMBEL’s Backbone on OpenCyc. These articles are lead-ins to the discussion of the actual UMBEL ontology. That series will begin next.

[1] Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc, John Sowa’s Top-Level Categories and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus), though Cyc is a clear exception with its stated emphasis on capturing “common knowledge.”
[2] See, for example, this listing of about 100 academic articles devoted to structure and linguistic uses of Wikipedia: http://www.mkbergman.com/?p=417.
[3] FACTory is a game that lets people enter knowledge into the Cyc knowledge base. Via this online game, Cyc tries to determine the truth or falsehood of a series of facts. When enough people have agreed that a fact is true or not, Cyc considers it confirmed and stops asking about it. See http://game.cyc.com/helpfiles/HowToPlay.html.
[4] There are many aspects that make PROTON one of the more attractive reference ontologies. The PROTON ontology (PROTo ONtology), developed within the scope of the SEKT project, is attractive because of its understandability, relatively small size, modular architecture and a simple subsumption hierarchy. It is available in an OWL Lite form and is easy to adopt and extend. On the face of it, the Topic class within PROTON, which is meant to serve as a bridge between different ontologies, may also provide a binding layer to specific subject topics as sub-classes or class instances.

Posted by AI3's author, Mike Bergman

Posted on April 2, 2008 at 9:17 pm in Adaptive Information, Semantic Web, Structured Web, UMBEL | Comments (2)
The URI link reference to this post is: http://www.mkbergman.com/433/basing-umbels-backbone-on-opencyc/
Date:   April 1, 2008

Part 3 of 4 on Foundations to UMBEL

UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight ontology for relating Web content and data to a standard set of subject concepts. It is being designed to apply to all types of data on the Web from RSS and Atom feeds to tagging to microformats and topic maps to RDF and OWL (among others). The project Web site is at http://www.umbel.org.

The first portion and priority for UMBEL is to prepare the lightweight subject concept ontology, the focus of this four-part foundations series. After the UMBEL ontology is released in first draft, the project will then turn to the binding protocols for non-RDF formats.

The previous part in this series discussed at length RDF classes and instances or individuals. We are now tightening these terms down to reflect the specific intents and usage within UMBEL. UMBEL’s main classes categorize subject concepts; notable instances are specifically termed named entities.

UMBEL defines subject concepts as a distinct subset of the more broadly understood concept [1] such as used in the SKOS RDFS controlled vocabulary [2], conceptual graphs, formal concept analysis or the very general concepts common to many upper ontologies [3]. We define subject concepts as a special kind of concept: namely, ones that are concrete, subject-related and non-abstract [4].

UMBEL contrasts subject concepts with abstract concepts and with named entities. Abstract concepts represent abstract or ephemeral notions such as truth, beauty, evil or justice, or are thought constructs useful to organizing or categorizing things but are not readily seen in the experiential world. Named entities are the real things or instances in the world that are themselves natural and notable class members of subject concepts.

Subject Concepts

Subject concepts are a special kind of concept: namely, ones that are concrete, subject-related and non-abstract. Note in other systems or ontologies similar constructs may alternatively be called topics, subjects, concepts or perhaps interests. UMBEL has adopted the term subject concept to distinguish from these uses, which have different nuances of meaning and use, as well as to highlight the subject or topic nature of UMBEL's concrete concepts.

Each subject concept is a class. While each subject concept has a preferred label (using SKOS terminology), that label is representative of, or a proxy for, the concept, and is not to be confused with the thing itself. Every UMBEL subject concept can be expressed and referred to by a different preferred label in alternate languages. Indeed, in a given language, different preferred labels may be swapped out without affecting the identity or use of the subject concept itself. The name for a subject concept is therefore merely a handle.
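
The idea that a label is merely a handle can be sketched in a few lines. This is a minimal illustration, not UMBEL's actual data model: the `SubjectConcept` class, the `ex:Automobile` identifier and the labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SubjectConcept:
    # Stable identifier (hypothetical; not an actual UMBEL or OpenCyc URI).
    concept_id: str
    # One preferred label per language code; labels are handles, not identity.
    pref_labels: Dict[str, str] = field(default_factory=dict)

    def label(self, lang: str, fallback: str = "en") -> str:
        """Return the preferred label for lang, else the fallback language, else the id."""
        return self.pref_labels.get(
            lang, self.pref_labels.get(fallback, self.concept_id)
        )


concept = SubjectConcept("ex:Automobile", {"en": "automobile", "de": "Automobil"})
identity_before = concept.concept_id

# Swapping the English preferred label leaves the concept's identity untouched.
concept.pref_labels["en"] = "car"
print(concept.concept_id == identity_before, concept.label("en"), concept.label("de"))
```

Anything that refers to the concept does so through `concept_id`, so relabeling or adding a language requires no change to existing links.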

Subject concepts are the core constituents of the UMBEL framework. All subject concepts are based on existing concepts in OpenCyc, the open source version of the Cyc knowledge base (see Part 4). About 21,000 of them have been distilled and are part of the UMBEL backbone.

Semsets

Semsets are semantically close terms or phrases synonymous or nearly so with the meanings of a subject concept or a named entity. Semsets are akin to WordNet synsets or Cyc aliases, but can also include more contemporary jargon or slang as may be drawn from Web tagging or folksonomies. The term semset has been chosen to distinguish this consolidated meaning.

Semsets may apply to either subject concepts or named entities. In the latter case, their use is closer to the sense of an alias (such as nicknames, or "great satan" or "uncle sam" for the "United States").
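
A semset can be pictured as a set of terms that all resolve to the same target. The following is a rough sketch under assumed names: the `ex:` identifiers, the automobile terms and the `build_index`/`lookup` helpers are hypothetical, while the "uncle sam" and "great satan" aliases come from the example above.

```python
from collections import defaultdict
from typing import Dict, Set

# Target (subject concept or named entity) -> its semset of close terms.
semsets: Dict[str, Set[str]] = {
    "ex:UnitedStates": {"united states", "uncle sam", "great satan"},
    "ex:Automobile": {"automobile", "car", "auto"},
}


def build_index(semsets: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert the semsets so each normalized term maps to the targets it may denote."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for target, terms in semsets.items():
        for term in terms:
            index[term.casefold()].add(target)
    return index


def lookup(index: Dict[str, Set[str]], term: str) -> Set[str]:
    """Return candidate targets for a term, ignoring case and surrounding spaces."""
    return index.get(term.strip().casefold(), set())


index = build_index(semsets)
print(lookup(index, "Uncle Sam"))
```

The index maps a term to a set of targets rather than a single one because the same tag or slang term could plausibly belong to more than one semset.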

Abstract Concepts

Abstract concepts represent abstract or ephemeral notions such as truth, beauty, evil or justice, or are thought constructs useful to organizing or categorizing things but are not readily seen in the experiential world. They are included in the UMBEL specification because they help maintain the integrity of the UMBEL subject concept graph.

Like subject concepts, abstract concepts are based strictly on those already in OpenCyc. Abstract concepts may be viewed in the UMBEL graph, and may be used for ontology mapping, but are not generally displayed when doing standard content mapping or concept look-ups via Web services. For various domain extraction or relatedness determinations, abstract concepts may be excluded from UMBEL's internal processing.
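
The treatment of abstract concepts as present in the graph but hidden from standard look-ups can be sketched as a simple filter. The toy hierarchy below is an assumption for illustration only: `Thing` and `PartiallyTangibleThing` are Cyc names mentioned in this series, but the links shown and the `ancestors` helper are hypothetical.

```python
from typing import Dict, List, Set

# child -> parent links in a toy subsumption chain (not actual OpenCyc structure).
parents: Dict[str, str] = {
    "Automobile": "Vehicle",
    "Vehicle": "PartiallyTangibleThing",
    "PartiallyTangibleThing": "Thing",
}

# Concepts flagged as abstract: kept in the graph, hidden by default.
abstract: Set[str] = {"PartiallyTangibleThing", "Thing"}


def ancestors(concept: str, include_abstract: bool = False) -> List[str]:
    """Walk up the chain, skipping abstract concepts unless asked to include them."""
    chain: List[str] = []
    node = parents.get(concept)
    while node is not None:
        if include_abstract or node not in abstract:
            chain.append(node)
        node = parents.get(node)
    return chain


print(ancestors("Automobile"))                         # standard look-up view
print(ancestors("Automobile", include_abstract=True))  # ontology-mapping view
```

Note that the walk still passes through the abstract nodes in both cases, which mirrors their role in holding the graph together even when they are not displayed.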

Named Entities

Named entities are the real things or instances in the world that are themselves natural and notable class members of subject concepts. The initial named entities are drawn from Wikipedia as processed via YAGO, and other online fact-based repositories. Named entities are the instances of the subject concepts in the standard definition of the term [5].

Named entities and the sources for them are also a major avenue for growth and expansion of UMBEL moving forward. Named entities are more contemporary and changing, while the reference subject concept backbone is more fixed and stable.

Each named entity is mapped to a governing subject concept for ontology purposes. There are no relations between named entities except as mediated through a subject concept(s). As noted, named entities may also have semset aliases.
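
The rule that named entities relate to one another only through subject concepts can be illustrated with a small mapping. This is a sketch under stated assumptions: the `ex:` concept identifiers, the inclusion of Toyota and the `related_entities` helper are hypothetical and are not drawn from UMBEL.

```python
from typing import Dict, Set

# Named entity -> governing subject concept(s); there are no entity-to-entity links.
entity_concepts: Dict[str, Set[str]] = {
    "Ford Motor Company": {"ex:AutomobileManufacturer"},
    "Toyota": {"ex:AutomobileManufacturer"},
    "Henry Ford": {"ex:Person"},
}


def related_entities(entity: str) -> Dict[str, Set[str]]:
    """Return other entities that share a subject concept, with the shared concepts."""
    own = entity_concepts.get(entity, set())
    related: Dict[str, Set[str]] = {}
    for other, concepts in entity_concepts.items():
        shared = own & concepts
        if other != entity and shared:
            related[other] = shared
    return related


print(related_entities("Ford Motor Company"))
```

In this toy data, Henry Ford and Ford Motor Company are not reported as related, because no shared subject concept mediates between them; any such relation would have to be expressed at the concept level.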

Subject Concepts v. Abstract Concepts

The following table helps draw the distinction between subject concepts and abstract concepts. Technical documentation at the time of the UMBEL ontology release will list the 520 or so abstract concepts presently within UMBEL. Looking at those can help draw the distinction.

Subject Concepts:
  • Nouns or noun phrases
  • These are concrete kinds of things or ideas in the real world
  • Broad, collective, reference concepts, often hierarchically related
  • Similar to “topics” or “subjects”, these other terms are used in somewhat different ways in alternative schemas
  • Collections or classes of like “kinds” of items
  • Quite stable in scope, breadth and structure
  • Grounded in the OpenCyc knowledge base, which is the source of its relationships and graph structure
  • Named entities are members of subject concepts

Abstract Concepts:
  • These are either: 1) abstract (truth, beauty, evil) concepts, or 2) artificial thought constructs for organizing things but not encountered as standalone concepts in their own right (e.g., PartiallyTangibleThing)
  • Collections or classes of like “kinds” of items
  • Class members may be either other abstract concepts or subject concepts
  • Class members are never named entities
  • Tend to reside higher in the subsumption structure
  • Generally hidden from the UMBEL subject concept reference “backbone” structure
  • May be used for ontology mapping purposes
  • Grounded in the OpenCyc knowledge base, which is the source of its relationships and graph structure

Subject Concepts v. Named Entities

The following table helps draw the distinction between subject concepts and named entities. Technical documentation at the time of the UMBEL ontology release will describe certain "gray" categories and the determination as to whether they should be treated as one or the other.

For example, most geographical places clearly belong to the named entity category. But, on somewhat arbitrary grounds, all nations, countries, states and provinces were assigned as subject concepts so that they would act as classes with other entities mapped to them. It should also be noted that entities or concepts in the gray zone may be treated both as a named entity and a subject concept.

Subject Concepts:
  • Broad, collective, reference concepts. In a hierarchical category structure, subject concepts represent the “root” or “branch” nodes
  • Nouns or noun phrases
  • Called “subject concepts” (or sometimes as a shorthand, “concepts”). Though similar to “topics” or “subjects”, these other terms are used in somewhat different ways in alternative schemas and are therefore not used interchangeably here
  • These are not abstract (truth, beauty, evil) concepts, but are about concrete kinds of things or ideas in the real world; abstract concepts are often properly part of what are known as “upper ontologies” but they are not applicable for UMBEL’s purposes
  • Collections or classes of like “kinds” of items
  • Quite stable in scope, breadth and structure
  • Grounded in the OpenCyc knowledge base, which is the source of its relationships and graph structure
  • Basis for the UMBEL subject concept reference “backbone” structure
  • Named entities are members of subject concepts

Named Entities:
  • Atomic, specific objects, often famous or well-known, that belong to reference “types” such as persons, places, organizations, events, products, time intervals, etc. In a hierarchical category structure, named entities represent the “leaves”
  • Nouns or noun phrases
  • Called “named entities”, not entities alone, to prevent confusion with other general senses of the term “entity” and in keeping with named entity recognition (NER)
  • Very concrete, atomic entities
  • The number and scope is fluid and growing, and potentially of huge size as specific objects are named
  • Often expressed as a proper noun (with some capitalization), but not necessarily so. Common animal, plant, object, substance names also can be named entities
  • Major sources are Wikipedia (YAGO), and similar sources such as Wikinvest, Wikicompany, etc.
  • Named entities are maintained and treated separately from the UMBEL subject concept ontology
  • Every named entity belongs to at least one subject concept

Though there are shades of gray between subject concepts and named entities, we have found this distinction to be a powerful means for gaining clarity in UMBEL’s design. It provides a clean path for keeping an ontology lightweight while in essence providing infinite extensibility for all manner of named entities and the datasources that contain them. Moreover, the ability to classify named entities into types orthogonal to subject concepts also provides useful guidance for presentation templates that may be automatically invoked in data meshups. But, that is a topic for another day. :)

This is Part 3 of 4 on the foundations to UMBEL. This four-part series covers a Re-Introduction to UMBEL, UMBEL: Making Linked Data Classy, Subject Concepts and Named Entities, and Basing UMBEL’s Backbone on OpenCyc. These articles are lead-ins to the discussion of the actual UMBEL ontology that will soon follow.

[1] As the term is used in mainstream cognitive science and philosophy of mind, a concept is an abstract idea or a mental symbol, typically associated with a corresponding representation in a language or symbology. Definition is from Wikipedia; see further http://en.wikipedia.org/wiki/Concept.
[2] SKOS stands for Simplified Knowledge Organization Systems; it is a controlled vocabulary based on RDF Schema designed to allow the creation of formal languages to represent thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured information. See http://www.w3.org/2004/02/skos/.
[3] Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget's Thesaurus), though Cyc is a clear exception with its stated emphasis on capturing “common knowledge.”
[4] A subject concept bears some resemblance to dc:subject or foaf:interest in other ontologies. However, unlike those approaches, UMBEL’s subject concepts: 1) come from a reference set to pick from, with synonym-like relationships similar to WordNet synsets; and 2) are not semantically literal descriptions for the terms, but rather "proxies" for the concepts they represent. This referential character of subject concepts makes them readily transferable to multiple human languages.
[5] In a named entity, the word named applies to entities that have "rigid designators", as defined by Kripke, for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind terms like biological species and substances. The BBN categories proposed in 2002 consist of 29 types and 64 subtypes; Sekine’s extended hierarchy, also proposed in 2002, is made up of 200 subtypes. We use Sekine (http://nlp.cs.nyu.edu/ene/version6_1_0eng.html) as our guide. For example, Sekine's top 15 named entity classes are: Name_Other, Person, Organization, Location, Facility, Product, Event, Natural_Object, Title, Unit, Vocation, Disease, God, Id_Number and Color; the remaining types are subsumed under these. See further http://en.wikipedia.org/wiki/Named_entity_recognition. Generally, named entities are the instances of UMBEL classes.

Posted by AI3's author, Mike Bergman

Posted on April 1, 2008 at 11:35 pm in Adaptive Information, Semantic Web, Structured Web, UMBEL | Comments (1)
The URI link reference to this post is: http://www.mkbergman.com/432/subject-concepts-and-named-entities/
Date:   March 31, 2008

Part 2 of 4 on Foundations to UMBEL

Arguably, Linked Data is the breakthrough that has triggered re-evaluation and increased comprehension of the semantic Web vision.

Linked Data follows recommended practices for identifying, exposing and connecting data. A robust Linking Open Data (LOD) community has developed around the practice in the past year with the size of compliant data now exceeding several billion RDF triples.

Like any new development, there has been the need for best practices to be articulated and documented. Some of the best guides are How to Publish Linked Data on the Web from Chris Bizer, Richard Cyganiak and Tom Heath; Cool URIs for the Semantic Web from Leo Sauermann and Richard Cyganiak; the Linked Data for the Web chapter from Joshua Tauberer; and Deploying Linked Data from OpenLink Software. Also, to see and experience Linked Data just follow Kingsley Idehen’s blog and prolific mailing list postings, almost always with valuable demos and links.

The techniques these documents most often explain deal with such items as exposing and dereferencing URIs, content negotiation and naming and distinguishing so-called (unfortunately) information resources and non-information resources [1]. The above references cover these topics with good clarity. The general tenor of these guides is on the techniques of exposing and publishing Linked Data.

OK About Exposing, What About Linking?

UMBEL is really a mechanism for aiding the linkage of data, not exposing it per se, so we will leave the discussion of exposing and publishing Linked Data to these other venues. Those other venues deal well with the Data portion of Linked Data. We want to focus here on the Linked portion.

At first blush, it is surprising how little is actually said or written about this linkage portion. For example, in the best-practices guide How to Publish Linked Data there is only minor discussion of external links in Section 2.2, and the sole dedicated discussion of links is limited to Section 6.

To quote in part:

RDF links enable Linked Data browsers and crawlers to navigate between data sources and to discover additional data. The application domain will determine which RDF properties are used as predicates. . . . It is common practice to use the owl:sameAs property for stating that another data source also provides information about a specific non-information resource. An owl:sameAs link indicates that two URI references actually refer to the same thing. . . . RDF links can be set manually, which is usually the case for FOAF profiles, or they can be generated by automated linking algorithms. This approach is usually taken to interlink large datasets.

Upon reflection, though, perhaps less coverage of the linkage portion of Linked Data is not that surprising. The Linked Data practice is barely one year old and, while growing at a most impressive rate, is still in the very earliest phases. Frankly, until recently, there has not been really that much data to Link.

We can see the status of Linked Data via the now-famous Linked Open Data diagram maintained by Richard Cyganiak (see [2] for the most recent interactive version; this one is current as of the date of posting):

Linked Data Web

Many have used this figure before (including me) to make general statements about the state of Linked Data. In this post, however, I want to comment on some different aspects.

While new data sources (or bubbles) are being added constantly, I count 43 “mappings” on this diagram (the arrows, and ignoring bi-directional) and 34 different sources (the bubbles). Nineteen of those mappings involve DBpedia, the exposed data of Wikipedia, and 11 involve FOAF.

owl:sameAs relations between possible datasets are in essence a pairwise mapping, similar to how a group of people might toast one another by clinking glasses. This type of pairwise relationship is kind of like an additive analog to a factorial, which is actually a quadratic function more specifically known as a triangular number. As new datasets get added, we see a progression of the form 1, 3, 6, 10, 15, 21, 28, 36, etc., representing the number of these possible pairwise mappings (“glass clinks”) between datasets.

The actual equation for this progression is n*((n+1)/2), where n is the number of datasets (N) – 1. Nominally, then, the number of 34 dataset bubbles could lead to as many as 561 pairwise mappings. But, again, only 43 are shown.
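The arithmetic above is easy to check. Here is a minimal Python sketch; the function name possible_pairwise_mappings is ours, chosen for illustration, and the counts of 34 datasets and 43 mappings are simply the ones read off the diagram.

```python
def possible_pairwise_mappings(num_datasets: int) -> int:
    """Upper bound on undirected dataset-to-dataset mappings among N datasets.

    Uses the triangular-number form n * (n + 1) / 2 with n = N - 1.
    """
    if num_datasets < 2:
        return 0  # fewer than two datasets means nothing to pair up
    n = num_datasets - 1
    return n * (n + 1) // 2


# The progression as datasets are added (N = 2, 3, 4, ...).
progression = [possible_pairwise_mappings(N) for N in range(2, 10)]
print(progression)  # [1, 3, 6, 10, 15, 21, 28, 36]

# The 34 bubbles counted on the diagram.
potential = possible_pairwise_mappings(34)
print(potential)  # 561

# Share of potential dataset-level mappings actually drawn (43 arrows).
print(f"{43 / potential:.1%}")  # 7.7%
```

So fewer than one in ten of the possible dataset-to-dataset mappings is actually present on the diagram.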

Of course, we are still only talking about potential pairwise mappings between datasets, and not the number of actual instance mappings themselves. DBpedia alone contains 1.5 million or so instances.

We can express this progression in the Big O notation of computer science as the quadratic form O(n²). Now, with instances numbering into the millions, compounded across even a very few datasets and their pairwise mappings, we are still talking about potentially astronomical numbers to express as linked triples.

Yet the actual number of mapped triples is much lower than these potential maximums. The amount of Linked Data remains tractable. Why?

The first and obvious answer is that not all pairwise mappings make sense and not all instances are equivalent (sameAs). This factor will always be true. But it does not alone account for the efficiency.

The second, less obvious answer is that certain of the datasets act as reference nodes or hubs. By having them, everything need not be mapped to everything else. We can express linkages as N to one or N to a few, and not the quadratically growing N-to-N. Newly added datasets may often, and for a notable portion of instances, only need to be mapped to the reference nodes in order to link their data into the network.
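To make the contrast concrete, the sketch below compares the number of dataset-level mappings needed when every dataset is mapped to every other with the number needed when datasets are mapped only to a small set of hubs. It is a simplified model under stated assumptions: both function names are hypothetical, each non-hub dataset is assumed to map to every hub, and hubs are assumed to map to one another.

```python
def full_mesh_links(num_datasets: int) -> int:
    """Mappings needed if every dataset is mapped to every other (N-to-N)."""
    return num_datasets * (num_datasets - 1) // 2


def hub_links(num_datasets: int, num_hubs: int = 1) -> int:
    """Mappings needed if non-hub datasets map only to the hubs (N to a few).

    Assumption for this sketch: each non-hub dataset maps to every hub,
    and the hubs are also mapped to each other.
    """
    if not 0 < num_hubs <= num_datasets:
        raise ValueError("num_hubs must be between 1 and num_datasets")
    non_hubs = num_datasets - num_hubs
    return non_hubs * num_hubs + num_hubs * (num_hubs - 1) // 2


print(full_mesh_links(34))  # 561
print(hub_links(34, 1))     # 33
print(hub_links(34, 2))     # 65
```

With hubs, the count grows linearly with each added dataset rather than quadratically.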

DBpedia, with its scope and richness of notable instances, therefore, plays an essential role in Linked Data as expressed to date. Other comprehensive and authoritative sources can act similarly. In this manner, the development of the Linked Data graph may mirror the hub aspects of the existing World Wide Web document graph.

Rich reference nodes acting as hubs appear to be a key to the scalability of the Linked Data Web.

A Short Aside on These Links

[A publisher exposes Linked Data by making the URI of an RDF resource ("data") accessible via HTTP. When encountered, an agent (such as a browser or crawler) can then dereference this resource to a Web-based URL address for retrieval.

Any attribute or relationship that describes such Linked Data may be accessed at time of retrieval. A relationship defines an external link when either the subject or object of the triple is an external URI. If that resource's URI has also been properly exposed, we can now trace it to still new relations and resources, akin to data surfing (so long as the trail of resources remains exposed). In a parallel analogy to document hyperlinks, some have termed this 'hyperdata' [4].

But that leads to a funny thing. Without this fetch or retrieval, there is generally no explicit publishing or knowledge of the external links for these resources. In other words, these external links are not “publicly” obvious (so to speak), until the Linked Data resource is discovered or stumbled upon. So, while our current recipes give us best practices for how to expose and publish resources (the Data), we have no similar guidance — and, frankly, no practice — for the Linked portion of Linked Data.

To carry the analogy a bit further, while Linked Data is acting to break the barriers of data silos, the relations and linkages of that data remain in those silos until accessed. This may indeed be the proper thing, but somehow it has a feeling of the early years of the document Web before services like Yahoo! began publishing listings of useful links.

Given that the mappings between data sources represent new and often expensive manual or automated effort, it seems like invested value is not being sufficiently shared. Fortunately, there is nothing preventing us from explicitly dereferencing these linked mapping triples along with standard resource triples. We just need to begin doing so.

But I digress. :) ]
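The rule stated in the aside, that a relationship is an external link when the subject or object of the triple is an external URI, can be sketched in a few lines of Python. This is only an illustration: the example.org URIs and the locatedIn predicate are made up, "external" is approximated here as "has a different host name than the publisher", and triples are plain tuples rather than objects from any RDF library.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlparse

Triple = Tuple[str, str, str]


def uri_host(term: str) -> str:
    """Return the lowercase host of an http(s) URI, or '' for literals and other terms."""
    parsed = urlparse(term)
    return parsed.netloc.lower() if parsed.scheme in ("http", "https") else ""


def external_links(triples: Iterable[Triple], local_host: str) -> List[Triple]:
    """Keep the triples whose subject or object lives on a host other than local_host."""
    local = local_host.lower()
    found = []
    for subject, predicate, obj in triples:
        hosts = {uri_host(subject), uri_host(obj)} - {""}
        if any(host != local for host in hosts):
            found.append((subject, predicate, obj))
    return found


# Hypothetical triples published by example.org about one place.
EXAMPLE_TRIPLES = [
    ("http://example.org/place/quebec-city",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "Quebec City"),
    ("http://example.org/place/quebec-city",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Quebec_City"),
    ("http://example.org/place/quebec-city",
     "http://example.org/vocab/locatedIn",
     "http://example.org/place/quebec"),
]

for triple in external_links(EXAMPLE_TRIPLES, "example.org"):
    print(triple)
```

Only the owl:sameAs triple is reported, since it is the only one whose object sits on another host. A listing like this is the kind of thing a publisher could expose alongside the resource triples themselves.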

Another Little Known Secret of Linked Data

The careful reader may have noticed a couple of earlier implications. Current Linked Data is useful for linking data for given instances from different data sources (say, for combining political, demographic and mapping information for a geographic place like Quebec City) via owl:sameAs. But that predicate is an instance-level relationship that only works for the very same individuals [3]. Our current ability to make external linkages is largely constrained to the instance level.
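As an illustration of what instance-level linking buys us, the sketch below groups URIs joined by owl:sameAs and pools the statements made about them. It is a minimal, assumption-laden example: the example.org URIs, the ex: predicates and the placeholder values are invented, triples are plain tuples, and the grouping uses a simple union-find rather than any particular RDF toolkit.

```python
from typing import Dict, Iterable, Set, Tuple

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
Triple = Tuple[str, str, str]


def merge_same_as(triples: Iterable[Triple]) -> Dict[str, Set[Tuple[str, str]]]:
    """Pool (predicate, object) pairs across all URIs linked by owl:sameAs."""
    triples = list(triples)
    parent: Dict[str, str] = {}

    def find(uri: str) -> str:
        # Follow parent pointers to the representative URI of the group.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Use the lexicographically smaller URI as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            union(subject, obj)

    merged: Dict[str, Set[Tuple[str, str]]] = {}
    for subject, predicate, obj in triples:
        if predicate != OWL_SAME_AS:
            merged.setdefault(find(subject), set()).add((predicate, obj))
    return merged


# Hypothetical data about one place from two sources.
TRIPLES = [
    ("http://example.org/geo/quebec-city", OWL_SAME_AS,
     "http://dbpedia.org/resource/Quebec_City"),
    ("http://example.org/geo/quebec-city", "ex:mapData", "placeholder map data"),
    ("http://dbpedia.org/resource/Quebec_City", "ex:demographics",
     "placeholder demographic data"),
]

for uri, facts in merge_same_as(TRIPLES).items():
    print(uri, sorted(facts))
```

Note what the sketch cannot do: it combines statements about the very same individual, but says nothing about how two different individuals, or the classes they belong to, relate to one another.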

Moreover, such instance-level links lack context and a conceptual framework for inferencing or determining relatedness between concepts or in relation to other instances. Today’s state-of-the-art is not really about linking “things” (as quoted before) when we establish Linked Data. It is more about linking atomic instances, the members of “things”. We have no current framework for relating things at the concept level.

Put another way, Linked Data presently lacks practical frameworks or mechanisms for linking at the class level.

On the face of it, this sounds contradictory to what we know about RDF and how it is designed. The language and its formalisms, and indeed many of the popular RDF schema, have a rich set of classes. Classes are easy to design and spin off on the fly.

So, the commentary about lack of a framework is NOT about the lack of logic or vocabulary or even schema or ontologies (though there are certainly some gaps there). Rather, it is based on the lack of reference nodes or structures upon which to base those connections. When there is no fixed or defined point in information space, everything floats; there is no framework for connections. There is no grounding.

So, technically, while class-level data connections are not prevented and can be made with Linked Data, to our knowledge few or none presently exist [5]. This is a little known secret with far-reaching implications.

Just as DBpedia provided the nucleating hub for linking instance data, UMBEL is designed to provide a similar reference node for concepts. These concepts provide some fixed positions in the information space to which other sources can link and relate. And, like references for instance data, the existence of reference concepts can greatly diminish the number of links necessary for an efficient Linked Data environment.
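A toy sketch may help show why fixed reference concepts matter. Below, two datasets each map a local class to a concept in a shared reference structure, and that structure is then used to find the nearest concept the two classes have in common. Everything here is hypothetical: the concept names, the broader-than relations, the dataset class names and the function names are invented for illustration and are not taken from the UMBEL ontology.

```python
from typing import Dict, List, Optional

# Hypothetical reference structure: concept -> its broader concept.
BROADER: Dict[str, str] = {
    "Physicist": "Scientist",
    "Chemist": "Scientist",
    "Scientist": "Person",
}

# Hypothetical mappings from local dataset classes to reference concepts.
CLASS_TO_CONCEPT: Dict[str, str] = {
    "datasetA:PhysicsResearcher": "Physicist",
    "datasetB:ChemLabStaff": "Chemist",
}


def lineage(concept: str) -> List[str]:
    """Return the concept followed by its successively broader concepts."""
    chain = [concept]
    while chain[-1] in BROADER:
        chain.append(BROADER[chain[-1]])
    return chain


def nearest_shared_concept(class_a: str, class_b: str) -> Optional[str]:
    """Find the most specific reference concept that both local classes fall under."""
    concept_a = CLASS_TO_CONCEPT.get(class_a)
    concept_b = CLASS_TO_CONCEPT.get(class_b)
    if concept_a is None or concept_b is None:
        return None  # an unmapped class has no fixed point to relate through
    lineage_b = set(lineage(concept_b))
    for concept in lineage(concept_a):
        if concept in lineage_b:
            return concept
    return None


print(nearest_shared_concept("datasetA:PhysicsResearcher", "datasetB:ChemLabStaff"))
# Scientist
print(nearest_shared_concept("datasetA:PhysicsResearcher", "datasetC:Unmapped"))
# None
```

Neither dataset needed to know about the other; each only needed one mapping to the reference structure. The unmapped class illustrates the "everything floats" problem when there is no fixed point to link to.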

Though the nature of the reference set is important (and we describe UMBEL’s choice of OpenCyc in Part 4 of this series), a more fundamental point is that a reference structure of almost any nature has this value. We can argue later about what is the best reference structure.

But the first task is to just get one in place and begin bootstrapping. Indeed, over time, it is likely that a few reference structures will emerge and compete and get supplemented by still further structures. This evolution is expected and natural and desirable in that it provides choice and options.

A reference structure of concepts has the further benefit of providing a logical reference structure for instance data as well. While a DBpedia (based on Wikipedia) is perhaps the most comprehensive collection of humanity-wide instances, no single source can or will be complete in scope. Thus, we foresee specialty sources ranging from the companies in Wikicompany to plants and animals in the Encyclopedia of Life or thousands of other rich instance sources also acting as reference hubs.

How does each of these rich instance sources relate to the others? What is the subject concept or topical basis by which they overlap or complement one another? What is the framework and graph structure of knowledge to give this information context?

These roadmaps and signposts are UMBEL’s formal purpose.

Taking Linked Data to the Next Level

Mapping between classes is a much different — and more complicated — matter than mapping instances. As editors we are still grappling with design choices here and are playing with ideas such as confidence metrics to capture the relative accuracy of set matching methods. Later ontology documentation will discuss these designs further.

The rationale for UMBEL and our observations on the state of current Linked Data are not meant to be critical. The community is early in its understanding of how to do Linked Data and scale it. Personally as editors, and then on behalf of our company, we have clearly committed to Linked Data as a practice and objective.

In summary, our review of the current state of Linked Data suggests that we:

  1. Need reference sets to aid scalability
  2. Need context (UMBEL classes) for adding new reference instances, and
  3. Need context for relating classes to one another.

Reference sets are a real key, both for instances and concepts. Using them by no means implies centrality or a loss of the distributed advantages of the Web. DBpedia (Wikipedia) has not had this effect for instances and UMBEL will not do so for concepts.

Nor does the use of reference sets imply the need to reach some global consensus or to close out any alternatives. Reference hubs and choice and freedom are not in conflict. Placing data in context will show clear advantages over data absent context. The argument will be settled as simply as that.

Now that Linked Data has put forward the recipes and mechanisms for opening up and sharing data on the Web, it is time to take the initiative to the next level by providing the contextual signposts and roadmaps for those linkages.

This is Part 2 of 4 on the foundations to UMBEL. This four-part series covers a Re-Introduction to UMBEL, UMBEL: Making Linked Data Classy, Subject Concepts and Named Entities, and Basing UMBEL’s Backbone on OpenCyc. These articles are lead-ins to the discussion of the actual UMBEL ontology that will soon follow.

[1] I separately have written on the current terminology challenges in The Shaky Semantics of the Semantic Web and will not further discuss here.
[2] See http://richard.cyganiak.de/2007/10/lod/; current figure updated as of 3/31/08; with growth, this diagram will soon become too crowded.
[3] From the W3C’s OWL Web Ontology Language Reference (W3C Recommendation 10 February 2004), “The built-in OWL property owl:sameAs links an individual to an individual. Such an owl:sameAs statement indicates that two URI references actually refer to the same thing: the individuals have the same ‘identity’.” See http://www.w3.org/TR/owl-ref/#sameAs-def. Note, also, that many external links are actually to controlled vocabularies and definitions and not to “linkable” instance data.
[4] Danny Ayers has been one of those suggesting the hyperdata terminology. While there may be further technical distinctions, the general idea is to draw equivalences between hyperlinking in the initial Web of Documents and the emerging Web of Data. In any case, this brief discussion should help make it clear that simply publishing RDF is very much insufficient to qualify as Linked Data.
[5] Of course, connections to classes in external ontologies are the basis for common vocabularies and the semantics of Linked Data. The contrast we are drawing is to class relationships of instance data.

Posted by AI3's author, Mike Bergman

Posted on March 31, 2008 at 10:46 pm in Adaptive Information, Semantic Web, Structured Web, UMBEL | Comments (1)
The URI link reference to this post is: http://www.mkbergman.com/431/umbel-making-linked-data-classy/
Date:   March 30, 2008

Part 1 of 4 on Foundations to UMBEL

UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight ontology for relating Web content and data to a standard set of subject concepts. It is being designed to apply to all types of data on the Web from RSS and Atom feeds to tagging to microformats and topic maps to RDF and OWL (among others). The project Web site is at http://www.umbel.org.

UMBEL was first announced in July 2007 and has been a direct subject of several prior posts on this blog.

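However, much internal development and refinement has occurred, especially in the past few months [1]. Over the next few days, this posting, a re-introduction to UMBEL, will be followed by these additional parts:

  • Part 2: UMBEL: Making Linked Data Classy
  • Part 3: Subject Concepts and Named Entities
  • Part 4: Basing UMBEL’s Backbone on OpenCyc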

These articles are lead-ins to the discussion of the actual UMBEL ontology that will soon follow.

Purpose

UMBEL has two purposes: 1) to provide a lightweight structure of subject concepts as a reference to what Web content or data “is about”, what is called a concept scheme in SKOS [4]; and 2) to define a variety of binding protocols for different Web data formats to map to this “backbone.” The project’s immediate priority is to first create this reference backbone [2]. That is the focus of these current postings.

Think of the backbone as a set of roadsigns to help find related content. UMBEL is like a map of an interstate highway system, a way of getting from one big place to another. Once in the right vicinity, other maps (or ontologies), more akin to detailed street maps, are then necessary to get to specific locations or street addresses.

By definition, these more fine-grained maps are beyond UMBEL’s scope. But UMBEL can help provide the context for placing such detailed maps in relation to one another and in relation to the Big Picture of what related content is about.

These subject concepts also provide the mapping points for the many, many thousands (indeed, millions) of specific named entities that are the notable instances of these subject concepts. Examples might include the names of specific physicists, cities in a country, or a listing of financial stock exchanges. UMBEL mappings enable us to link a given named entity to the various subject classes of which it is a member.

And, because of relationships amongst subject concepts in the backbone, we can also relate that entity to other related entities and concepts. The UMBEL backbone traces the major pathways through the content graph of the Web. For some visualizations of this subject graph, see So, What Might The Web’s Subject Backbone Look Like?

Relation to Linked Data

Today, the actual linkages in Linked Data, the first meaningful expression of the semantic Web, occur almost exclusively via direct sameAs relationships between instances. An easy way to think of one of these notable instances is as the topic of a specific article in Wikipedia. People, places, important historical events, and so forth are all examples of such named entities.

Current Linked Data is therefore useful for linking data for given instances from different data sources (say, for combining political, demographic and mapping information for a geographic place like Quebec City). But, these instance-level links lack context and a conceptual framework for inferencing or determining relatedness between concepts or in relation to other instances. For these purposes, Linked Data needs a class structure (Part 2).

As noted, UMBEL’s class structure is based on subject concepts, which are a distinct subset of the more broadly understood concept [3], such as those used in the SKOS RDFS controlled vocabulary [4], conceptual graphs, formal concept analysis or the very general concepts common to many upper ontologies [5]. We define subject concepts as a special kind of concept: namely, ones that are concrete, subject-related and non-abstract [6].

UMBEL contrasts its subject concepts with abstract concepts and with named entities. Abstract concepts represent abstract or ephemeral notions such as truth, beauty, evil or justice, or are thought constructs useful to organizing or categorizing things but are not readily seen in the experiential world. Named entities are the real things or instances in the world that are themselves natural and notable instances (members) of subject concepts (classes). More detailed discussion of this terminology is provided in Part 3.

Basic Approach

UMBEL thus sets for itself objectives that include an identification of subject concepts and their relationships; a premise of emphasizing representational concepts over unattainable precision or exactitude; and a means for relating any notable thing of the world to that structure. Moreover, meeting these objectives should be based on best systems and practices, informed where possible by social acceptance and consensus.

W-O-W-Y is the shorthand we apply to the semantic framework for meeting these UMBEL objectives. W-O-W-Y is derived from the constituent UMBEL building blocks of WordNet (W) [7], OpenCyc (O), Wikipedia (W) [8] and YAGO (Y) [9]. Each resource contributes in a different way.

Via the WOWY framework, OpenCyc provides the basis for the reference subject backbone (Part 4), WordNet (supplemented by others) provides the “synsets” for relating natural language nouns and phrases to these concepts, and Wikipedia as processed by YAGO (among a growing list of other resources) provides the starting dictionary of relevant named entities important to the Web public.

The initial UMBEL ontology contains roughly 21,000 subject concepts distilled from OpenCyc and vetted for their relational structure. A further 1.5 million named entities are also currently mapped to that structure.

Coincident with the pending release of the draft UMBEL ontology of these subject concepts will be a multi-volume technical report that details the exact mapping and vetting procedures used.


[1] UMBEL’s editors are Michael Bergman and Frédérick Giasson.
[2] UMBEL’s second purpose for binding mechanisms is not further discussed in this series. That design includes binding mechanisms to work with HTML, tagging or other standard practices, including various RDF schema and more formal ontologies. At minimum, support is intended for Atom, microformats, OPML, OWL, RDF, RDFa, RDF Schema, RSS, tags (via Tag Commons or MOAT or SCOT), and topic maps. Finally, through fulfilling its two purposes, the UMBEL project will also provide a data set registration service, information and collaboration Web site, tools clearinghouse, and support for language translations and some tools development. These latter initiatives are part of the longer-term UMBEL project development plan.
[3] As the term is used in mainstream cognitive science and philosophy of mind, a concept is an abstract idea or a mental symbol, typically associated with a corresponding representation in a language or symbology. Definition is from Wikipedia; see further http://en.wikipedia.org/wiki/Concept.
[4] SKOS stands for Simple Knowledge Organization System; it is a controlled vocabulary based on RDF Schema designed to allow the creation of formal languages to represent thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured information. See http://www.w3.org/2004/02/skos/.
[5] Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus), though Cyc is a clear exception with its stated emphasis on capturing “common knowledge.”
[6] A subject concept bears some resemblance to dc:subject or foaf:interest in other ontologies. However, unlike those approaches, UMBEL’s subject concepts: 1) come from a reference set to pick from, with synonym-like relationships similar to WordNet synsets; and 2) are not semantically literal descriptions for the terms, but rather “proxies” for the concepts they represent. This referential character of subject concepts makes them readily transferable to multiple human languages.
[7] WordNet is the oldest and largest database of English nouns, adjectives, verbs and adverbs and their linguistic relationships. Terms are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept, which are interlinked by means of conceptual, semantic and lexical relations. The database is the acknowledged authority for use in computational linguistics and natural language processing. UMBEL principally uses WordNet nouns; the most recent version 3.0 of the database has 117,798 unique nouns with 82,115 associated synsets. WordNet is available as open source; it is provided and maintained by the Cognitive Science Laboratory at Princeton University; see http://wordnet.princeton.edu/.
[8] Since about 2005 — and at an accelerating pace — Wikipedia has emerged as the leading online knowledge base for conducting semantic Web and related research; see here for a listing of about 100 academic papers. The system is being tapped for both data and structure. The tremendous growth of content and topics within Wikipedia is well documented (see, as examples, the internal Wikipedia statistics W1, W2, W3, and W4). As of early 2008, Wikipedia had about 2.25 million articles in English and versions in 256 languages and variants.
[9] YAGO (“yet another great ontology”) was developed by Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum at the Max-Planck-Institute for Computer Science at Saarbrücken University. A WWW 2007 paper and a more recent and detailed technical report describing the project are available, plus there is an online query point for the YAGO service and the data can be downloaded. YAGO combines Wikipedia and WordNet to derive a rich hierarchical ontology with high precision and recall. The system contains nearly 100 data relations (predicates).

Posted by AI3's author, Mike Bergman

Posted on March 30, 2008 at 8:54 pm in Adaptive Information, Semantic Web, Structured Web, UMBEL | Comments (5)
The URI link reference to this post is: http://www.mkbergman.com/430/a-re-introduction-to-umbel/
Date:   March 12, 2008

Let’s Stop Presenting a Muddled Mess to the Broader Public

Despite all of the breakthroughs of the past year, the semantic Web community at times looks determined to snatch defeat from the jaws of victory. It is ironic that the very thing the semantic Web is about — meaning — is also the very grounds of its current challenge. Sometimes we look like the gang that could not talk straight.

This semantic challenge arises from a variety of causes. Some are based on different contexts or perspectives (or sometimes lack thereof). Some are due to the sheer, tough slogging of defining and explicating a new field, its concepts, and standards and technology. And some, unfortunately, may at times arise from old-school thinking that to define or brand something is to “own” it. (Or, worse still, to overhype it and then not deliver.)

We are now in the second wave of the semantic Web. The first wave was made up of those dozens of individuals, mostly from academia and the W3C, who have been laboring diligently on standards, tools and language for more than a decade. Most of the community’s leaders come from this group and they are largely the stewards of the vision. The second wave, which arguably began when the iron association of RDF with XML was broken, has perhaps hundreds or thousands of members. Many are still researchers, but many are also tools builders and explicators and evangelists. Some, even, are entrepreneurs.

These two groups constitute the current semantic Web community, which is an active and increasingly visible one of blogs, reports, conferences, pragmatic tools and new companies and ventures. Financial interests and the business and technical press are also becoming more involved and active.

These two waves — or however you want to bound the current community; frankly the definition is unimportant — need to recognize that our communications must achieve better clarity as we make outreach and spread into the broader public. Muddled concepts, akin to the unfortunate earlier RDF-XML association, if not cleared up, can also hinder adoption.

We have a responsibility to think hard about what should be our common language and the clarity of our concepts moving forward. Let’s not repeat language mistakes of the past.

We should not rush to embrace “market speak”. Nor should we fear questioning current terminology or constructs where it is obviously slowing adoption or requires explanatory legerdemain. The semantic Web is about making meaningful connections and relationships; we should follow that same guidance for our language and conceptual metaphors.

Common language is like a pearl, with each new layer accreting upon the one before it. Current terms, definitions, acronyms, standards, and practices form the basic grain. But we are many players, and do not speak with one voice. Yet, even were we to do so, whatever we think best may not be adopted by the broader public. What is adopted as common language has a dynamic all its own. Practice — not advocacy — defines usage.

However, we do know that the concepts underlying the semantic Web are both foreign and difficult for the broader public. We can take solace from HTML and other standards and protocols of the Web, which show that such difficulties are not ultimately barriers to adoption. If it has value (and all of us know the semantic Web does), it will be adopted. But, on the other hand, insofar as our language is unnecessarily technical, or perhaps muddled conceptually or difficult to convey, we will unfortunately see a slower rate of adoption.

What we have is good, indeed very good, but it could be much better. And, more often than not, it is language rather than ideas that gets in the way.

The ‘Big Picture’ is Not a Snapshot

The casual observer of the semantic Web cannot help but see the occasional storms that roil our blogs and mailing lists. The storm activity has been especially high recently, and it has been a doozy. It has ranged from matters as fundamental as defining what we are about and our space to heated flashpoints around what had seemed to be settled concepts (we’ll address the latter in a bit).

The first challenge begins with how we even name our collective endeavor. For a decade, this name has been the ‘Semantic Web’. But, either due to real or perceived past disappointments or the imperatives of a new era, this label has proven wanting to many. Benjamin Nowack recently compiled some of the leading alternatives being promulgated to define the Semantic Web space:

This is a useful list; but it says more than the breadth of its compilation. There is a logical flaw in trying to define the semantic Web as either a “thing” or as a static achievement. These important distinctions get swamped in the false usefulness of a list.

For example, my name is associated with one of those names on that list, the structured Web, but I never intended it as an alternative or marketing replacement to the semantic Web. I also have been a vocal advocate for the Linked Data concept.

We, as its current ambassadors and advocates, need to stress two aspects of the semantic Web to the broader public. First, the semantic Web is inherently about meaning. If we ourselves are not doing all we can to explicate the meaning of the semantic Web in our language and terminology, then we are doing the vision and our responsibilities a disservice.

Second, the semantic Web is also an ideal, a vision perhaps, that will not appear as a bright line in the dark or a flash of light from the blue. It will take time and much effort to approximate that vision, and its ultimate realization is as unlikely as the timeless human desire for universal communication as captured by the apocryphal Tower of Babel.

Today’s circumstance is not a competition for a static label, but a constant evolution from unstructured to structured data and then increasing meaning about that data. Our true circumstance today — that is, the current “state” of this evolution to the semantic Web — is Linked Data, as placed in this diagram:[1]

Transition in Web Structure

Document Web
  • Document-centric
  • Document resources
  • Unstructured data and semi-structured data
  • HTML
  • URL-centric
  • circa 1993

Structured Web
  • Data-centric
  • Structured data
  • Semi-structured data and structured data
  • XML, JSON, RDF, etc
  • URI-centric
  • circa 2003

Linked Data
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S
  • URI-centric
  • circa 2007

Semantic Web
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S, OWL
  • URI-centric
  • circa ???

Many of the so-called alternative terms for the semantic Web are really attempts to capture this dynamic of evolution and development. On the other hand, watch out for those terms that try to “brand” or label the semantic Web as a static event; they are the ones that lack meaning.

It is a cliché that conflict sells newspapers. But, we also know that newspaper sales are dropping daily. Old thinking that tries to “brand” the semantic Web space is ultimately bound to fail because it is fundamentally at odds with the purpose of the semantic Web itself: linking relevant information together with meaning. Meaningless labels are counter to this aim.

Aside from the hijackers, the community itself should certainly want better language and communications. That, after all, is the purpose of this post. But there is nothing to be ashamed of in the banner of the ‘semantic Web’, the original and still most integral view of Tim Berners-Lee. Let’s just be clear that this is not a bright-line achievement but an ongoing process, and that there may be many labels that effectively capture the evolution of the vision as it stands at any point in time.

Today, that state of the art is Linked Data.

Data Federation in Perspective

It is easy to forget the recent past. To appreciate the current state of the semantic Web and its prospects, it is essential to understand the progress in data federation over the past decade or two. Incredible progress has been made in overcoming what had been perceived, within my working lifetime, as close to intractable data interoperability issues.[2]

“Data federation”  — the recognition that value could be unlocked by connecting information from multiple, separate data stores  — became a research emphasis within the biology and computer science communities in the 1980s. It also gained visibility as “data warehousing” within enterprises by the early-90s. However, within that period, diversity and immaturity in hardware, operating systems, databases, software and networking hampered the sharing of data.

From the perspective of 25 years ago, when we were at the bottom, the “data federation” challenge was an imposing pyramid:

Rapid Progress in Climbing the Data Federation Pyramid

When the PC came on the scene, there were mainframes from weird 36-bit Data General systems to DEC PDP minicomputers to the PCs themselves. Even on PCs, there were multiple operating systems, and many then claimed that CP/M was likely to be ascendant, let alone the upstart MS-DOS or the gorilla threat of OS/2. Hardware differences were all over the map, operating systems were a laundry list two pages long, and nothing worked with anything else. Computing in that era was an Island State.

Client-server and all of the “N-tier” speak soon followed, and it was sort of an era of progress but still costly and proprietary answers to get things to talk to one another. Yet there was beginning to emerge a rationality, at least at the enterprise level, for how to link resources together from the mainframe to the desktop. Computing in that era was the Nation State.

But still, it was incredibly difficult to talk with other nations. And that is where the Internet, specifically the Web protocol and the Mosaic (later Netscape) browser, came in. Within five years (actually less) from 1994 onward, the Internet took off like a rocket, doubling in size every 3-6 months.

In the early years of trying to find standards for representing semi-structured data (though not yet called that), the major emphasis was on data transfer protocols. In the financial realm, one standard dating from the late 1970s was electronic data interchange (EDI). In science, there were literally tens of exchange formats proposed with varying degrees of acceptance, notably abstract syntax notation (ASN.1), TeX (a typesetting system created by Donald Knuth, and its variants such as LaTeX), hierarchical data format (HDF), common data format (CDF), and the like, as well as commercial formats such as PostScript and others.

Of course, midway into these data representation efforts came the shift to the Internet Age, blowing away many previous notions and limits. The Internet, with its TCP/IP protocols and its HTML and XML standards for representing and transferring “semi-structured” data, overcame physical, syntactical and data exchange heterogeneities, helping us climb the data federation pyramid still higher.

Yet, as the pyramid shows, despite massive progress in scaling it, the remaining challenges are now centered at the layer of semantics. Related to that are the issues of what data you can trust and with what authority.

A Rose by Any Other Name

Now that the focus has evolved to the level of meaning, the standards work of the semantic Web of the past ten years faces the challenge of being communicated and inculcated into the broader public. Frankly, this is proving to be rough sledding, and is one of the reasons, perhaps, why others are seeking new marketing terms and to diminish the semantic Web label.

To be sure, as with any new field over history from agriculture to rocket science, there is technical terminology associated with the semantic Web. Things such as RDF, OWL, ontologies, SPARQL, GRDDL, and many, many other new terms have been discussed and presented in beginning guides elsewhere (including by me). Such specificity and terminology is expected and natural in new technical areas. In itself, new terminology is not automatically grounds for hindering adoption.

But, in a growing crescendo on semantic Web mailing lists, and one with which I agree in some measure, some of the conceptual and terminological underpinnings of the semantic Web are being shaken hard.

As background, realize that the subjects and relationships available on the semantic Web are defined as requiring a unique identifier, one which corresponds to the addressing scheme of the Web now familiar to us as URLs. However, since URLs were designed for single documents or Web pages, they now need to be abstracted a level further to a URI to reflect the potential data objects formerly masked within a document-centric URL view.

In the initial Web, a document was a document, and a URL could clearly be related to its representation. But, now that the resource objects are shifting from documents to data, a document could potentially contain multiple data items, or could be a reference to the actual data elsewhere. This potential indirection has proven to be a real bugaboo to semantic Web researchers.

The authors of the semantic Web are mostly computer scientists and not linguists or semanticists. They have looked at the idea of resources quite literally, and have tended to ask whether the target of a reference is its actual document object (an “information resource”, linked to via a URL) or an indirection source (a “non-information resource”, referred to by its URI). This tenuous distinction between information and non-information appears arbitrary and, in any case, is difficult for most to understand. (It also fundamentally confuses the notions of resource and representation.)

Unfortunately, this same literal perspective has tended to perpetuate itself at other levels of confusion. Attempting to access a “non-information” resource forces the need to resolve to an actual retrieval address, a process called “dereferencing”. Lacking true semantic sophistication, some resources felt to represent the same object (but lacking any metrics for actually determining this equivalence) have also been related to one another through an (in some instances) indefensible “same as” relationship. URIs may also be absent for subject or object nodes in an RDF triple, leaving an empty reference in those instances: the so-called blank nodes, or “bnodes.”
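To make these terms concrete, here is a minimal sketch in plain Python (no RDF library) of triples that include a blank node and a “same as” assertion. The example.org and example.com URIs, the helper names bnode and same_as_groups, and the sample data are all illustrative assumptions; only the owl:sameAs and FOAF property URIs are real vocabulary terms. The sketch treats “same as” as symmetric and transitive, which is why a single careless assertion can quietly merge things that are not truly equivalent.

```python
from itertools import count

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

_bnode_ids = count(1)


def bnode():
    """Return a fresh blank-node label: a node with no URI of its own."""
    return f"_:b{next(_bnode_ids)}"


def same_as_groups(triples):
    """Group terms linked by owl:sameAs, treated as symmetric and transitive."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # shorten the path as we go
            term = parent[term]
        return term

    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            parent[find(subject)] = find(obj)

    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), set()).add(term)
    return list(groups.values())


# Hypothetical data: two URIs minted by different sites for one person,
# plus an acquaintance who has a name but no URI (a blank node).
alice_a = "http://example.org/people/alice"
alice_b = "http://example.com/staff#a42"
friend = bnode()

triples = [
    (alice_a, FOAF_NAME, '"Alice"'),
    (alice_a, FOAF_KNOWS, friend),
    (friend, FOAF_NAME, '"Bob"'),
    (alice_a, OWL_SAME_AS, alice_b),
]

print(same_as_groups(triples))
```

Note that nothing in the code can check whether the two URIs really denote the same person; the equivalence is simply asserted, which is the weakness described above.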

The power of the RDF triple easily trumps these shortcomings, but the shortcomings remain nonetheless. These shortcomings manifest themselves at times in wrong conceptual underpinnings or relationships or at other times in making communication of the system very difficult to newcomers.

Most of these problems can be readily fixed. RDF itself is a very solid foundation and has no real shortcomings at this point. The remaining problems are mostly terminological and where logic changes might be warranted, those are minor.

There has, however, been keen reluctance to recognize these logical and terminological shortcomings. Those failures are somewhat an artifact of the underlying semantics. The Web and the Internet are, of course, representational systems, and not ones of true resources or objects. That hang-up alone has cost the community major bonus bucks in getting its story and terminology straight.

Alas, while the system works splendidly, it sorely needs true semantic attention from linguists, philosophers and ontologists who better understand representation v. resources. This is a missing step that, if taken, could truly be the secret sauce.

Lastly, there are parties trying to coin new terms and terminology in order to “own” the evolving semantic Web. This was a strategy that worked in decades past for enterprise systems where vendors tried to define new “spaces”. But, in the context of the semantic Web where the objective is interoperability and not proprietary advantage, such approaches are doomed to failure, as well as unfortunately acting as brakes on broader adoption.

The Web is a Dirty Place

Semantic Web terminology nuances and subtleties that dance on the head of pins moreover belie another fundamental truth: The Web is a dirty place. If users or automatic software can screw it up, it will be screwed up. If darker forces can find bypasses and trapdoors, those will be opened as well.

This should be no shocking truth to anyone who has worked on the Web for any period longer than a few years. In fact, in one of my former lives, we made our money by exploiting that very diversity and imperfect chaos by finding ways to talk to hundreds of thousands of dynamic search forms and databases. Talk about screwed up! Every search engine vendor and any Web site developer who has had to struggle with getting CSS to work cross-browser knows intimately, at least in part, of what I speak.

It is thus perplexing how many in the semantic Web community — who should truly know better — have continued to promote and advocate systems that are fragile, easily broken, and not robust. While RDF is certainly the best lingua franca for data interoperability, it is incredibly difficult to set up Web sites to publish in that format with proper dereferencing and exposure of URIs and endpoints. The simpler protocols of the standard Web public and standard Web developers reflect adoption realities that should not be dismissed, but actively embraced and co-opted. Denials and either-or arguments set up artificial boundaries and limit community growth and acceptance.

Ten-page how-to manuals for publishing semantic Web data, and complicated stories about 303 See Other redirects and Apache technical configurations, are a losing proposition. This is not the way to gain broad-scale adoption. And it certainly is not a great way to make new friends.
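For readers who have not met it, the pattern behind those redirect stories can be sketched in a few lines. This is a simplified illustration and not a server configuration: the /id/, /data/ and /page/ path layout and the function names respond and dereference are assumptions made for the example, and the handling of the Accept header ignores quality values and wildcards.

```python
def respond(path, accept="text/html"):
    """Return (status, headers) for a GET request under the 303 pattern."""
    if path.startswith("/id/"):
        # The URI names a thing rather than a document, so the server
        # answers 303 See Other and points at a document about the thing.
        name = path[len("/id/"):]
        wants_rdf = "application/rdf+xml" in accept
        location = f"/data/{name}.rdf" if wants_rdf else f"/page/{name}.html"
        return 303, {"Location": location, "Vary": "Accept"}
    if path.startswith(("/data/", "/page/")):
        # These URIs name ordinary documents and are served directly.
        is_rdf = path.endswith(".rdf")
        content_type = "application/rdf+xml" if is_rdf else "text/html"
        return 200, {"Content-Type": content_type}
    return 404, {}


def dereference(path, accept="text/html", max_hops=5):
    """Follow redirects as a client would; return the final (path, status)."""
    for _ in range(max_hops):
        status, headers = respond(path, accept)
        if status != 303:
            return path, status
        path = headers["Location"]
    raise RuntimeError("too many redirects")


print(dereference("/id/alice"))
print(dereference("/id/alice", accept="application/rdf+xml"))
```

Even in this stripped-down form, a publisher has to maintain three URIs per subject and get the server to treat them differently, which gives some sense of why the real setup feels heavy to newcomers.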

I’m no technogeek and I have in good faith struggled myself to adhere to these techniques. I frankly still don’t get it and wonder how many others ever will, let alone have any interest or desire to do so. Anything that is harder than writing a Web page is too hard. Tasks need to be approachable in bite-size chunks and in true REST style.

There have been some very important voices of reason speaking to these issues on the specialty mailing lists in the semantic Web community, but I perceive denial and an unwillingness to engage in a meaningful dialog and reach a quick resolution to these matters. Community, I cannot say this more clearly: Wake up! It is time for a mature acceptance of reality to set in and to get on with the real task of solving problems within the context of the real Web and its denizens.

Meaningful Semantics is Tougher Than it Looks

The flip side to this unnecessary complexity is an almost childlike simplicity about what the true “semantics” of the semantic Web really means. Standard current arguments tend to be framed around whether the same person or named entity in one representation is the same as, or somehow different from, other representations.

Granted, this is but a continuation of the resource v. representation issue earlier mentioned. And, while the community gets diverted in such fruitless and unnecessary debates, real semantic complexity gets overlooked. Yet these are complex and vexing boils just ready to erupt through the surface.

This strange fixation on topics that should be easily resolved, and the lack of attention to real semantic topics, is perhaps a function of the community’s heritage. Most semantic Web researchers have a background in computer science. Linguists, semioticians, philosophers, ontologists, and others from disciplines with real semantic training and background certainly are active in the community, but are greatly outnumbered.

A short story will illustrate this point. About two years ago I began collecting a listing of as many semantic Web tools as I could find, with the result that Sweet Tools has become one of the community’s go-to listings. But, shortly after I began publishing the list, while getting nice compliments from the W3C’s semantic Web tools site, it was also noted in passing that my site “also lists tools that are related to Semantic Web though not necessarily using the core technology (e.g., natural language ontology tools). . . .”[3] If resolving and understanding the meaning of language is not at the core of the semantic Web, it is hard to know what is.

Frankly, it is this semantics aspect that will push real progress on the semantic Web back into the future. While Linked Data is showing us the technical aspects of how we can bring disparate data together, and ontologies will help us place that data into the right ballpark (all of which brings, and will continue to bring, real benefits and value), much work still remains.

Semantic mediation (that is, resolving semantic heterogeneities) must address about 40 discrete categories of potential mismatches, from units of measure to terminology, language, and many others. These mismatches may derive from structure, domain, data or language.[4] Possible drivers of semantic mismatches include world view, perspective, syntax, structure, and versioning and timing:

  • One schema may express a similar world view with different syntax, grammar or structure
  • One schema may be a new version of the other
  • Two or more schemas may be evolutions of the same original schema
  • There may be many sources modeling the same aspects of the underlying domain (“horizontal resolution” such as for competing trade associations or standards bodies), or
  • There may be many sources that cover different domains but overlap at the seams (“vertical resolution” such as between pharmaceuticals and basic medicine).

These differences in purpose and provenance are the sources of these mismatches. Pluempitiwiriyawej and Hammer classify heterogeneities into three broad classes:[5]

  • Structural conflicts arise when the schema of the sources representing related or overlapping data exhibit discrepancies. Structural conflicts can be detected when comparing the underlying ontologies. The class of structural conflicts includes generalization conflicts, aggregation conflicts, internal path discrepancy, missing items, element ordering, constraint and type mismatch, and naming conflicts between the element types and attribute names.
  • Domain conflicts arise when the semantics of the data sources that will be integrated exhibit discrepancies. Domain conflicts can be detected by looking at the information contained in the ontologies and using knowledge about the underlying data domains. The class of domain conflicts includes schematic discrepancy, scale or unit, precision, and data representation conflicts.
  • Data conflicts refer to discrepancies among similar or related data values across multiple sources. Data conflicts can only be detected by comparing the underlying actual data. The class of data conflicts includes ID-value, missing data, incorrect spelling, and naming conflicts between the element contents and the attribute values.

Moreover, mismatches or conflicts can occur between set elements (a “population” mismatch) or attributes (a “description” mismatch).
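A toy comparison can illustrate how conflicts from the three classes show up side by side. The sketch below is an illustration under stated assumptions, not the classification method of the cited report: the record layout, the (number, unit) convention for measured values, and the function name find_conflicts are hypothetical, and only a handful of the categories are checked.

```python
def find_conflicts(a, b):
    """Report (class, category, field) mismatches between two flat records.

    Values are either strings or (number, unit) tuples. Only fields of
    record a are examined, so the comparison is one-directional.
    """
    conflicts = []
    b_by_folded = {key.lower(): key for key in b}

    for key_a, value_a in a.items():
        key_b = b_by_folded.get(key_a.lower())
        if key_b is None:
            conflicts.append(("structural", "missing attribute", key_a))
            continue
        if key_a != key_b:
            # Same field name apart from letter case.
            conflicts.append(("structural", "naming: case sensitivity", key_a))

        value_b = b[key_b]
        if isinstance(value_a, tuple) and isinstance(value_b, tuple):
            if value_a[1] != value_b[1]:
                conflicts.append(("domain", "scale or units", key_a))
        elif isinstance(value_a, str) and isinstance(value_b, str):
            if value_a != value_b and value_a.lower() == value_b.lower():
                conflicts.append(("data", "naming: case sensitivity", key_a))
    return conflicts


# Hypothetical records describing the same company in two sources.
source_a = {"Name": "Acme Corp", "weight": (2.0, "kg"), "city": "Iowa City"}
source_b = {"name": "ACME CORP", "weight": (4.4, "lb")}

for conflict in find_conflicts(source_a, source_b):
    print(conflict)
```

The structural checks need only the field names, the unit check needs knowledge about what the values mean, and the data check needs the values themselves, which mirrors how the three classes are said to be detected.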

The table below thus builds on their schema by adding the fourth major explicit category of language, leading to about 40 distinct potential sources of semantic heterogeneities:

Class / Category / Subcategory

STRUCTURAL
  • Naming: Case Sensitivity; Synonyms; Acronyms; Homonyms
  • Generalization / Specialization
  • Aggregation: Intra-aggregation; Inter-aggregation
  • Internal Path Discrepancy
  • Missing Item: Content Discrepancy; Attribute List Discrepancy; Missing Attribute; Missing Content
  • Element Ordering
  • Constraint Mismatch
  • Type Mismatch

DOMAIN
  • Schematic Discrepancy: Element-value to Element-label Mapping; Attribute-value to Element-label Mapping; Element-value to Attribute-label Mapping; Attribute-value to Attribute-label Mapping
  • Scale or Units
  • Precision
  • Data Representation: Primitive Data Type; Data Format

DATA
  • Naming: Case Sensitivity; Synonyms; Acronyms; Homonyms
  • ID Mismatch or Missing ID
  • Missing Data
  • Incorrect Spelling

LANGUAGE
  • Encoding: Ingest Encoding Mismatch; Ingest Encoding Lacking; Query Encoding Mismatch; Query Encoding Lacking
  • Languages: Script Mismatches; Parsing / Morphological Analysis Errors (many); Syntactical Errors (many); Semantic Errors (many)

Most of these line items are self-explanatory, but a few may not be. See further [4].
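One of the less obvious line items, an ingest encoding mismatch, is easy to reproduce. In this sketch (the function name ingest and the sample word are assumptions for illustration), text stored as UTF-8 is read under a wrong assumed encoding, after which a correctly encoded query no longer matches it.

```python
def ingest(raw_bytes, assumed_encoding):
    """Decode bytes the way an ingest step would, under an assumed encoding."""
    return raw_bytes.decode(assumed_encoding)


original = "Zürich"
raw = original.encode("utf-8")

# The source wrote UTF-8, but the ingest step assumes Latin-1.
garbled = ingest(raw, "latin-1")
print(garbled)  # ZÃ¼rich

# A correctly encoded query now fails to find the garbled entry.
index = [garbled]
print(original in index)  # False

# If the wrong assumption is known, the damage here is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

The repair works only because Latin-1 maps every byte to a character; with other wrongly assumed encodings the ingest step may fail outright or lose information, and when the encoding is not declared at all (the “lacking” cases) there is nothing to reverse against.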

Mature and accepted ontologies — largely lacking in most topics and domains — will be key sources to help overcome some of these heterogeneities, but are still quite far off. Yet ontologies alone will never be complete enough to resolve all mismatches.

Some Respectful Recommendations

This whirlwind tour of the state of the semantic Web shows tremendous progress and tremendous efforts still to come. The vision of the semantic Web is certainly one of a process — an ongoing journey — and one that defies facile labels. Further, there is real meaning behind the semantic Web and its striving to find meaning in heretofore disconnected and disparate data.

This is a journey of substance and value. It is one that is exciting and challenging and rewarding. As someone once joked to me, “Welcome to the multi-generational, full-employment act!”

The value to be gained from the semantic Web enterprise is such that perhaps we cannot avoid the hucksters and hypesters and spinmeisters. At times there seems to be a chilly wind of big bucks and excessive VC froth blowing through the community. I guess the best advice is to stay attuned to where real meaning and substance occurs, and hold your wallet in the other parts of town.

We as a community could be doing better, however, in the nature and language of the substance that we do offer to the broader public. With that aim, let me offer some thoughts on immediate steps we can take to promote that objective:

  • The community’s ultimate banner needs to be Berners-Lee’s Semantic Web vision. Whether we use upper or lower case does not matter. The Economist magazine, for example, now lowercases the Web and Internet; these are ultimately journalistic and style differences. Talking about upper and lower case Semantic Web starts to look silly in that light, though personally I use semantic Web (because I still capitalize Web and Internet! and don’t like to convey total upper case “states”). But, hey, that is my own preference and ultimately who cares?
  • As a community, we should try diligently to convey the semantic Web as a process and not a fixed event or stage of the Web. Insofar as terms and labels can help the public understand our current state and capabilities, use all of the terms and labels within our language to communicate and convey meaning under this umbrella. In any event, let us abandon silly version numbers that mean nothing and embrace the idea of process and stages, not fixed realities,
  • Let us pledge to not try to introduce language and terminology for branding, hype and “ownership” reasons. Not only does it fail, but it undercuts the entire community’s prospects and timing of when it will be successful. Just as our community has moved to open standards, open source and open data, realize that hype and branding crap is closed crap, and counter to our collective interest, and
  • We need to abandon or revise the earlier layer cake by removing its prominent role for XML.

Finally, if they would accept it (though it is presumptuous of me since I don’t know them and have not asked them), have the W3C ask Pat Hayes and Roy Fielding to work out the resource and representational terminology issues. Perhaps we need a bit of benevolent dictatorship at this point rather than more muddling by committee.

Besides my finding that I personally agree with and like most of what these two write, Pat is a great choice because he is the author of the best description of the semantic underpinnings of RDF [6], and Roy is a great choice because he clearly understands the representational nature of the Web architecture as exemplified in his REST thesis [7]. The time is now, and the need is great, to get the issues of representation, resources and reference cleaned up once and for all. The W3C TAG, though dedicated and obviously well-intentioned, has arguably not helped matters in these regards. I would gladly give Pat and Roy my proxy (assuming I had one :) ) on issues of terminology and definitions.


[1] Last July I wrote a piece entitled, More Structure, More Terminology and (hopefully) More Clarity. It, and related posts on the structured Web, had as its thesis that the Web was naturally evolving from a document-centric basis to a “Web of Data”. We already have much structured data available and the means through RDFizers and other techniques to convert that structure to Linked Data. Linked Data thus represented a very doable and pragmatic way station on the road to the semantic Web. It is a journey we can take today; indeed, many already are as growth figures attest.

[2] V.M. Markowitz and O. Ritter, “Characterizing Heterogeneous Molecular Biology Database Systems,” in Journal of Computational Biology 2(4): 547-546, 1995.

[3] That statement has changed in nuance a number of times over the months, and was finally removed from the site in about January 2008.

[4] Much of this section was drawn from a posting by me on Sources and Classification of Semantic Heterogeneities in June 2006.

[5] Charnyote Pluempitiwiriyawej and Joachim Hammer, “A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources,” Technical Report TR00-004, University of Florida, Gainesville, FL, 36 pp., September 2000. See ftp.dbcenter.cise.ufl.edu/Pub/publications/tr00-004.pdf.

[6] Patrick Hayes, ed., 2004. RDF Semantics, W3C Recommendation 10 February 2004. See http://www.w3.org/TR/rdf-mt/.

[7] Roy T. Fielding, 2000. Architectural Styles and the Design of Network-based Software Architectures, doctoral thesis, Department of Information and Computer Science, University of California at Irvine, 179 pp. See http://www.ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation_2up.pdf

Posted by AI3's author, Mike Bergman

Posted on March 12, 2008 at 12:00 am in Semantic Web, Structured Web | Comments (3)
The URI link reference to this post is: http://www.mkbergman.com/426/the-shaky-semantics-of-the-semantic-web/
Copyright © 2004–2013 Michael K. Bergman.   This work is licensed under a Creative Commons License