Posted: September 2, 2009

Segmented UMBEL (Upper Mapping and Binding Exchange Layer)

The Significant Advantages to a Logically Segmented TBox

The Message Understanding Conferences (MUC) were initiated in 1987 and financed by DARPA to encourage the development of new and better methods of information extraction (IE). It was a seminal series that resulted in basic measures of retrieval and semantic efficacy, recall (R) and precision (P) and the combined F-measure, and other core terminology and constructs used by IE today.

By the sixth version in the series (MUC-6), in 1995, the task of recognition of named entities and coreference was added. That initial slate of named entities included the basic building blocks of person (PER), location (LOC), and organization (ORG); to these were added the numeric building blocks of time, percentage or quantity. The very terminology of named entity was coined for this seminal meeting, as was the idea of inline markup [1].

What is a ‘Nameable Thing’?

The intuition surrounding “named entity” and nameable “things” was that they were discrete and disjoint. A rock is not a person and is not a chemical or an event. As initially used, all “named entities” were distinct individuals. But, there also emerged the understanding that some classes of things could also be treated as more-or-less distinct nameable “things”: beetles are not the same as frogs and are not the same as rocks. While some of these “things” might be a true individual with a discrete name, such as Kermit the Frog, or The Rock at Northwestern University, most instances of such things are unnamed.

The “nameability” (or logical categorization) of things is perhaps best kept separate from other epistemological issues of distinguishing sets, collections, or classes from individuals, members or instances.

In a closed-world system it is easier to enforce clean distinctions. The Cyc knowledge base, for example, the basis for UMBEL (Upper Mapping and Binding Exchange Layer),  makes clear the distinction between individuals and collections. In the semantic Web and RDF, this can become smeared a bit with the favored terminology shifting to instances and classes, and in pragmatic, real-world terms we (as humans) readily distinguish John Smith as distinct from Jane Doe but don’t generally (unless we’re entomologists!) make such distinctions for individual beetles, let alone entire genera or species of beetles.

Under precise conditions, these distinctions are important. The fact that Cyc, for example, is assiduous in its application of these distinctions is a major reason for the overall coherence of its knowledge base. But, for most circumstances, we think it is OK to accept a distinction between “nameable” things such as frogs and beetles, but also to accept that there may be nameable individuals at times in those groupings such as Kermit that are truly an individual in that more refined sense.

This digression sets the background for a natural progression from that first MUC-6 conference. If we could cluster persons or organizations, why not other categories of distinct and disjoint things such as frogs or beetles or rocks?

From the first six entity categories of MUC-6 we begin to see an expansion to broader coverage. Readers of this blog will recall that I have been a fan for quite some time of the expanded coverage of 64 classes of entities proposed by BBN or the 200 proposed by Sekine [2] (as discussed, for example in the April 2008 Subject Concepts and Named Entities article). Again, the intuition was that real things in the real world could be logically categorized into discrete and disjoint categories.

Thus, “named entities” inexorably moved to become a categorization system, where the degree of familiarity and distinction dictated whether it was the individual (with a unique name, such as Abraham Lincoln or Mt. Rushmore) or groupings such as animal or plant species and their common names (such as beetle or oak) that was the standard “handle” for assigning a name to the “nameable thing”.

While many can argue these individual <–> grouping distinctions and whether we are talking about true, unique, named individuals or names of convenience, I think that (at least for this blog post and discussion) doing so misses the real, fundamental point.

The real, fundamental point is that some “things” (whether individuals, instances or classes) are distinct from other “things”. Such disjoint distinctions are a powerful concept that should not be lost sight of by “angels dancing on the head of a pin” epistemological arguments. A frog is not a rock, even though neither is an “individual”, so how can we take advantage of that reality?

What Works for Entities, Works for Concepts

Nearly from the outset of our work with UMBEL as a ‘TBox’ [3] — that is, as a set of 20,000 or so common “subject concepts” — the natural question was what the relation or correspondence was of these concepts to the underlying “things” (entities) that they organized. As we probed the disjoint categories within the Sekine 200 entity types, for example, we began to see significant parallels and overlap. Also gnawing at our sense of order was the rather artificial and arbitrary class of concepts in UMBEL that we termed “Abstract Concepts”.

We introduced Abstract Concepts in the first release of UMBEL. When introduced, we defined “Abstract concepts [as] representing abstract or ephemeral notions such as truth, beauty, evil or justice, or [as] thought constructs useful to organizing or categorizing things but are not readily seen in the experiential world.” In pragmatic terms, Abstract Concepts in UMBEL were often pivotal nodes in the UMBEL subject graph necessary to maintain a high degree of concept interconnectivity.

In any world view that attempts to be more-or-less comprehensive, there is a gradation of concepts from the concrete and observable to the abstract and ephemeral. The recognition that some of these concepts may be more abstract, then, was not the issue. The issue was that there was no definable basis for segregating a concrete Subject Concept from the more Abstract Concept. Where was the bright line? What was the actionable distinction?

Off and on we have probed this question for more than a year, and have looked at what might constitute a more natural and logical ordering and segmentation within UMBEL. After many tests and detailed analysis, we are now releasing the first results of our investigations.

For, like nameable entities or things, we can see a logical segmentation of (mostly) disjoint concepts within the UMBEL TBox. Here are the summary percentages of these high-level splits:

Disjoint Concepts 90%
Attributes 1%
Classifications 9%
TOTAL 100%

(Because the analysis is still being refined, exact counts and percentages for the 20,000 concepts in UMBEL are not provided.)

Why a Logical Segmentation?

As we dove deeper into these ideas, not only could we see the basis for a logical segmentation within UMBEL’s concepts, but manifest benefits from doing so as well. Remember that UMBEL’s concept structure performs two main roles. It:  1) provides a coherent framework for relating and “mapping” other external ontologies; and 2) provides conceptual binding points for organizing entities and instances [4]. Via logical segmentation, we get benefits for both roles.

Here are some of the broad areas of benefit from a logical UMBEL segmentation that we have identified:

  • Template-driven: as we discuss elsewhere, Structured Dynamics also uses its ontologies to “drive applications” and the user interfaces (UI) that support them. By proper segmentation of UMBEL concepts, we are able to determine to what “cluster” of things (which we call either dimensions or superTypes; see below) a given thing belongs. This identification means we can also determine how best to display information about that “thing”. This determination can include either the attributes or the display templates appropriate for that thing. For example, location-based things or time-based things might invoke map or calendar or timeline type displays. Moreover, because of the logical segmentation of concepts, we can also use the power of the concept graph to infer more generic display templates when specific matches are absent.
  • Computational Efficiency: as the percentages above indicate, once we identify the superType to which a given instance belongs, we can eliminate nearly all remaining UMBEL concepts from consideration. This logical winnowing leads to computational efficiencies at all levels in the system. The fastest computational work is the work that is never done, and when large chunks of data are removed from consideration, many performance advantages accrue.
  • Disambiguation: via this approach we now can assess concept matches in addition to entity matches. This means we can triangulate between the two assessments to aid disambiguation. Because of these logical segmentations, we also have multiple “clusters” (that is, either the concept, type, superType or dimension) upon which to do our disambiguation evaluations, either between concepts and entities or within the various concept clusters. We can do so via either multiple semantic vectors (for statistical-based methods) or multiple features (for machine learning methods). In other words, because of logical segmentation, we have increased the informational power of our concept graph.
  • Structure and Integrity Testing: the very mindset of looking for logical segmentation has led to much learning about the UMBEL structure and OpenCyc upon which it is based. In the process, missing nodes (concepts), erroneous assignments, and superfluous nodes are all being discovered. Further, many of these tests can be automated using basic logical and inference approaches. The net result is a constant improvement to the scope and completeness of the structure. Lastly, these same approaches can be applied when mapping external ontologies to UMBEL, providing similar consistency benefits.
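
To make the template-driven benefit a little more concrete, here is a minimal sketch of how a display template might be inferred by walking up a concept graph when no specific match exists. The concept names, the parent links, and the template names are all hypothetical illustrations, not actual UMBEL identifiers or Structured Dynamics code.

```python
# Hypothetical toy data: these names are illustrative, not taken from UMBEL.
PARENT = {
    "City": "GeopoliticalPlace",
    "GeopoliticalPlace": "Place",
    "Conference": "Event",
    "Place": None,
    "Event": None,
}

# Assumed mapping from a few general concepts to display templates.
TEMPLATES = {
    "Place": "map",
    "Event": "timeline",
}


def find_template(concept, default="generic-record"):
    """Walk up the parent chain until a concept with a template is found."""
    seen = set()  # guards against accidental cycles in the toy graph
    while concept is not None and concept not in seen:
        if concept in TEMPLATES:
            return TEMPLATES[concept]
        seen.add(concept)
        concept = PARENT.get(concept)
    return default


print(find_template("City"))        # falls back to the template for Place
print(find_template("Conference"))  # falls back to the template for Event
print(find_template("Unknown"))     # no match anywhere, so the default is used
```

The same lookup also hints at the efficiency argument: once a thing is known to sit under one branch, nothing outside that branch needs to be examined.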

With these benefits in mind, we have undertaken concerted analysis of UMBEL to discern what this “logical segmentation” might be. This investigation has occurred over three concentrated periods over the past year. (Intervening priorities or other work prevented concentrating solely on this task.)

We have now completed our first full iteration of investigation. In this post, and then the subsequent release of UMBEL version 0.80 in the coming weeks, the fruits of this effort should be evident. However, it should also be noted that we are still learning much from this new mindset and approach. Refinement of the UMBEL structure is likely to continue for some time to come.

UMBEL Analysis

Most things and concepts about them are based on real, observable, physical things in the real world. Because most of these things cannot occupy both the same moment in time and the same location in physical space, a useful criterion for looking at these things and concepts is disjointedness.

In a broad sense, then, we can split our concepts of the world between those ideas that are disjoint because they pertain to separable objects or ideas and those that are cross-cutting or organizational or classificatory. Attributes, such as color (pink, for example), are often cross-cutting in that they can be used to describe quite disparate things. Inherent classification schemes such as academic fields of study or library catalog systems — while useful ways to organize the world — are not themselves in-and-of the world or discrete from other ideas. Thus, classificatory or organizational concepts are inherently not disjoint.

With the criterion of disjointedness in hand, then, we began an evaluation process of the UMBEL subject concepts. We looked to organizational schema such as the entity types of Sekine or BBN for some starting guidance. We also kept in mind that we also wanted our categories to inform logical clusterings of possible data presentation, such as media types or locations or time.

For terminology, we adopted the term superType to denote the largest cluster designation upon which this disjointedness may occur. As a way to test the basic coherence of these superTypes, we also collected them into larger groups which we termed dimensions.

Our analysis process began with branch-by-branch testing of the UMBEL concept graph using automated scripts, attempting to find pivotal nodes where child instance members were disjoint from other superTypes. This we term the “top-down” method.
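
As an illustration only, the “top-down” idea can be sketched as a branch-by-branch search for the highest nodes whose entire subtree shares nothing with concepts already claimed by another superType. The tiny graph and the function names below are assumptions made for this example; they are not the scripts actually used in the analysis.

```python
# Hypothetical toy graph (parent -> children); not real UMBEL content.
CHILDREN = {
    "Thing": ["LivingThing", "Artifact"],
    "LivingThing": ["Animal", "Plant"],
    "Animal": ["Frog", "Beetle"],
    "Plant": ["Oak"],
    "Artifact": ["Car"],
}


def descendants(node, children):
    """Return the node plus everything reachable below it."""
    found = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found


def pivotal_nodes(root, children, other_members):
    """Keep the highest nodes whose whole subtree is disjoint from
    the concepts already assigned to other superTypes."""
    if descendants(root, children).isdisjoint(other_members):
        return [root]
    result = []
    for child in children.get(root, []):
        for node in pivotal_nodes(child, children, other_members):
            if node not in result:
                result.append(node)
    return result


# Concepts assumed to be already assigned to a Plants superType.
plants = {"Plant", "Oak"}
print(pivotal_nodes("Thing", CHILDREN, plants))
```

In this toy case the search stops at Animal and Artifact, since each of those subtrees is fully disjoint from the Plants members while their ancestors are not.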

This automated analysis was then supplemented with a complete manual inspection of all unassigned and assigned concepts, with a “bottom up” assignment of concepts or corrections to the automated approach. This inspection then led to new insights and identification of missing concepts that needed to be added into UMBEL.

We are still converging between these two methods. Optimally, we should be able to tease out all UMBEL superTypes with a relatively small number of union, intersection, or complement set operations. In its current form, we are close, but there are still some rough spots.
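
A rough sketch of what such set operations could look like follows. The concept names and the groupings are invented for illustration and do not reflect the actual UMBEL assignments.

```python
# Hypothetical concept sets gathered from branches of a toy graph.
ALL_CONCEPTS = {"Frog", "Beetle", "Oak", "Car", "Aspirin", "Bread", "Color"}
under_animal = {"Frog", "Beetle"}
under_plant = {"Oak"}
under_artifact = {"Car", "Aspirin", "Bread"}
under_drug = {"Aspirin"}
under_food = {"Bread"}

# Union: combine branches into one larger grouping.
living_things = under_animal | under_plant

# Relative complement: Products is what remains of the artifacts
# after the Drug and Food splits are removed.
products = under_artifact - (under_drug | under_food)

# Intersection: an empty result indicates two groupings are disjoint.
overlap = living_things & under_artifact

# Anything not yet covered still needs a home (here, an attribute).
unassigned = ALL_CONCEPTS - (living_things | under_artifact)

print(products, overlap, unassigned)
```

The fewer such operations needed to define each superType, the easier the definitions are to explain and to re-run as the structure changes.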

Nonetheless, this analysis method has led us to identify some 33 superTypes [5], clustered into 9 dimensions. Of these, 29 superTypes and 8 dimensions are mostly disjoint. The one dimension of Classificatory includes the four cross-cutting superTypes of attributes and organizational schema that can apply to any of the 29 disjoint superTypes.

UMBEL superTypes

Here is the schema, with the descriptions of each:

Dimension superType Description/Sub-types
Natural World Natural Phenomena This superType includes natural phenomena and natural processes such as weather, weathering, erosion, fires, lightning, earthquakes, tectonics, etc. Clouds and weather processes are specifically included. Also includes climate cycles, general natural events (such as hurricanes) that are not specifically named, and biochemical processes and pathways.
Natural Substances Notable inclusions are minerals, compounds, chemicals, or physical objects that are not the outcome of purposeful human effort, but are found naturally occurring. Other natural objects (such as rock, fossil, etc.) are also found under this superType.
Earthscape The Earthscape superType consists mostly of the collection of cartographic features that occur on the surface of the Earth. Positive examples include Mountain, Ocean, and Mesa. Artificial features such as canals are excluded. Most instances of these features have a fixed location in space. Underground and underwater are also explicitly contained. This superType is explicitly disjoint with Extraterrestrial (see below).
Extraterrestrial This superType includes all natural things not specifically terrestrial, including celestial bodies (planets, asteroids, stars, galaxies, etc., that can be located within a sky map)
Living Things Prokaryotes The Prokaryotes include all prokaryotic organisms, including the Monera, Archaebacteria, Bacteria, and blue-green algae. Also included in this superType are viruses and prions.
Protists or Fungus This is the remaining cluster of eukaryotic organisms, specifically including the fungus and the protista (protozoans and slime molds).
Plants This superType includes all plant types and flora, including flowering plants, algae, non-flowering plants, gymnosperms, cycads, and plant parts and body types. Note that all Plant Parts are also included.
Animals This large superType includes all animal types, including specific animal types and vertebrates, invertebrates, insects, crustaceans, fish, reptiles, amphibia, birds, mammals, and animal body parts. Animal parts are specifically included. Also, groupings of such animals are included. Humans, as an animal, are included (versus as an individual Person). Diseases are specifically excluded.
Diseases Diseases are atypical or unusual or unhealthy conditions for (mostly human) living things, generally known as conditions, disorders, infections, diseases or syndromes. Diseases only affect living things and sometimes are caused by living things. This superType also includes impairments, disease vectors, wounds and injuries, and poisoning
Person Types The appropriate superType for all named, individual human beings. This superType also includes the assignment of formal, honorific or cultural titles given to specific human individuals. It further includes names given to humans who conduct specific jobs or activities (the latter case is known as an avocation). Examples include steelworker, waitress, lawyer, plumber, artisan. Ethnic groups are specifically included.
Human Activities Organizations Organization is a broad superType and includes formal collections of humans, sometimes by legal means, charter, agreement or some mode of formal understanding. Examples include geopolitical entities such as nations, municipalities or countries; or companies, institutes, governments, universities, militaries, political parties, game groups, international organizations, trade associations, etc. All institutions, for example, are organizations. Also included are informal collections of humans. Informal or less defined groupings of humans may result from ethnicity or tribes or nationality or from shared interests (such as social networks or mailing lists) or expertise (“communities of practice”). This dimension also includes the notion of identifiable human groups with set members at any given point in time. Examples include music groups, cast members of a play, directors on a corporate Board, TV show members, gangs, mobs, juries, generations, minorities, etc. Finally, Organizations contain the concepts of Industries and Programs and Communities.
Finance & Economy This superType pertains to all things financial and with respect to the economy, including chartable company performance, stock index entities, money, local currencies, taxes, incomes, accounts and accounting, mortgages and property.
Culture, Issues, Beliefs This category includes concepts related to political systems, laws, rules or cultural mores governing societal or community behavior, or doctrinal, faith or religious bases or entities (such as gods, angels, totems) governing spiritual human matters. Culture, Issues, beliefs and various activisms (most -isms) are included
Activities These are ongoing activities that result (mostly) from human effort, often conducted by organizations to assist other organizations or individuals (in which case they are known as services, such as medicine, law, printing, consulting or teaching) or individual or group efforts for leisure, fun, sports, games or personal interests (activities)
Human Works Products This is the largest superType and includes any instance offered for sale or performed as a commercial service. These are often physical objects made by humans that are not conceptual works or facilities, such as vehicles, cars, trains, aircraft, spaceships, ships, foods, beverages, clothes, drugs, weapons. Products also include the concept of ‘state’ (e.g., on/off)
Food or Drink This superType is any edible substance grown, made or harvested by humans. The category also specifically includes the concept of cuisines
Drugs This superType is a drug, medication or addictive substance
Facilities Facilities are physical places or buildings constructed by humans, such as schools, public institutions, markets, museums, amusement parks, worship places, stations, airports, ports, carstops, lines, railroads, roads, waterways, tunnels, bridges, parks, sport facilities, monuments. All can be geospatially located. Facilities also include animal pens and enclosures and general human “activity” areas (golf course, archeology sites, etc.). Importantly, Facilities include infrastructure systems such as roadways and physical networks. Facilities also include the component parts that go into making them (such as foundations, doors, windows, roofs, etc.)
Information Chemistry (n.o.c) This superType is a residual category (n.o.c., not otherwise categorized) for chemical bonds, chemical composition groupings, and the like. It is formed by what is not a natural substance or living thing (organic) substance.
Audio Info This superType is for any audio-only human work. Examples include live music performances, record albums, or radio shows or individual radio broadcasts
Visual Info This superType includes any still image or picture or streaming video human work, with or without audio. Examples include graphics, pictures, movies, TV shows, individual shows from a TV show, etc.
Written Info This superType includes any general material written by humans, including books, blogs, articles, and manuscripts, as well as any other written information conveyed via text.
Structured Info This information superType is for all kinds of structured information and datasets, including computer programs, databases, files, Web pages and structured data that can be presented in tabular form
Notations & References Akin to conceptual works, these are codified means of human expression. Examples range from human languages themselves, to more domain-specific cases such as chemical symbols, genetic code (A-G-C-T), protocols, and computer languages, mathematical and set notations, etc. Identifiers (numeric or alphanumeric identifiers for objects, often in a highly patterned way, such as phone numbers, URLs, zip and postal codes, SKUs, product codes, etc.), Units (any of the various ways in which measurement, space, volume, weight, speed, intensity, temperature, calories, seismic intensity or other quantitative descriptions of phenomena can be made) and key reference types are also included in this superType
Numbers This unique superType is for any abstract representation of numbers and numerics
Human Places Geopolitical Named places that have some informal or formal political (authorized) component. Important subcollections include Country, IndependentCountry, State_Geopolitical, City, and Province.
Workplaces, etc. These are various workplaces and areas of human activities, ranging from single person workstations to large aggregations of people (but which are not formal political entities)
Time-related Events These are nameable occasions, games, sports events, conferences, natural phenomena, natural disasters, wars, incidents, anniversaries, holidays, or notable moments or periods in time
Time This superType is for specific time or date or period (such as eras, or days, weeks, months type intervals) references in various formats
Descriptive Attributes This general superType category is for descriptive attributes of all kinds. Think of the specific attributes in Wikipedia “infoboxes” to understand the purpose and coverage of this superType. It includes colors, shapes, sizes, or other descriptive characteristics about an object
Classificatory Abstract-level This general superType category is largely composed of former AbstractConcepts, and represents some of the more abstract upper-level nodes for connecting the UMBEL structure together. This superType also includes theories or processes or methods for humans to do stuff or any human technology
Topics/Categories This largely subject-oriented superType is a means for using controlled vocabularies and classification schemes for characterizing what content “is about”. The key constituents of this category are Types, Classifications, Concepts, Topics, and controlled vocabularies
Markets & Industries This superType is a specialized classificatory system for markets and industries. It could be combined with the superType above, but is kept separate in order to provide a separate, economy-oriented system.

These may undergo some further refinement prior to release of UMBEL v 0.80, and some of the definitions will be tightened up.

(Note: It should also be mentioned that some of these superTypes further lend themselves to further splits and analysis. The Product superType, for example, is ripe for such treatment.)

Distribution of superTypes

The following diagram shows the distribution of these 20,000 UMBEL concepts across major areas. By far the largest superType is Products, even with further splits into Food and Drinks and Pharmaceuticals. The next largest categories are Person and Places and Events superTypes, with Organizations and Animals not far behind:

Even in its generic state, UMBEL provides a very rich vocabulary for describing things or for tying in more detailed external ontologies. There are nearly 5,000 concepts across products of all types, for example.

Possible Overlaps (non-disjoint) between superTypes

You may recall that our analysis showed 29 of the superTypes to be “mostly disjoint.”  This is because there are some concepts — say, MusicPerformingAgent — that can apply to either a person or a group (band or orchestra, for example). Thus, for this concept alone, we have a bit of overlap between the normally disjoint Person and Organization superTypes.

The following shows the resulting interaction matrix where there may be some overlap between superTypes:

This kind of interaction diagram is also useful for further analyzing the concept graph structure.
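
For readers who want a feel for how such an interaction matrix might be derived, here is a small sketch that computes the pairwise overlaps between superType member sets. Apart from MusicPerformingAgent, which is mentioned above, the member concepts are hypothetical, and the function name is an assumption for this example.

```python
from itertools import combinations

# Toy superType membership; only MusicPerformingAgent comes from the text.
SUPERTYPES = {
    "Person": {"Musician", "Lawyer", "MusicPerformingAgent"},
    "Organization": {"Band", "Company", "MusicPerformingAgent"},
    "Animal": {"Frog", "Beetle"},
}


def overlap_matrix(supertypes):
    """Return only the superType pairs that share at least one concept."""
    matrix = {}
    for a, b in combinations(sorted(supertypes), 2):
        shared = supertypes[a] & supertypes[b]
        if shared:
            matrix[(a, b)] = shared
    return matrix


for pair, shared in overlap_matrix(SUPERTYPES).items():
    print(pair, sorted(shared))
```

Pairs that never appear in the result are fully disjoint, which is the common case described here.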

Even Where Overlaps Occur, They are Minor

Of the 29 “mostly” disjoint superTypes, only a relatively few show potential interactions, and then only in minor ways. We can illustrate this (drawn to scale) for the interaction between the Product, Food & Drink and Drug (Pharmaceuticals) superTypes, with the fully disjoint Organization superType thrown in for comparison:

Example superTypes Overlap

Across all 20,000 concepts, then, fully 85% are disjoint from one another (5% is lost due to overlaps between “mostly” disjoint superTypes). This is a surprisingly high percentage, with even better likelihood to deliver the benefits previously noted.

Interim Conclusions and Observations

These are exciting findings that bode well for UMBEL’s ongoing role and usefulness. Also, the very detailed analysis that has led to these interim findings very much reaffirms the wisdom of basing UMBEL on Cyc. Cyc showed itself to be admirably coherent and remarkably complete. (It also appears that the first versions of UMBEL were extracted well in terms of good coverage.)

This approach now gives us an understandable and defensible basis for logical segmentation of UMBEL. It also provides a much-desired alternative to the earlier Abstract Concepts, which will now be dropped entirely as a schema concept.

One area deserving further attention is in the Attribute superType. We are in the process, for example, of analyzing attributes across Wikipedia and need to look through a slightly different lens at this superType [6]. This area is further important in its strong interaction with the Instance Record Vocabulary that is accompanying this effort on the entity side.

Another lesson for us has been to back away from the terminology of named entity, introduced at MUC-6. The expansion of that idea into other “nameable” things has caused us to embrace the “instance” nomenclature, as evidenced by our emerging IRV.

It is rewarding to prepare this next iteration release of UMBEL with its new mindset of logical segmentation and disjointedness. But — what is also clear — there are many treasures left to mine still hidden in the inherent structure of UMBEL and its Cyc parent.


[1] The original labels were ENAMEX for entity named expression and NUMEX for numeric expression. The markup format specified was also SGML. For an interesting history of this MUC-6 watershed, see Ralph Grishman and Beth Sundheim, 1996. Message Understanding Conference – 6: A Brief History, in Proceedings of the 16th International Conference on Computational Linguistics (COLING), I, Copenhagen, 1996, 466–471.

[2] In a named entity, the word named applies to entities that have a “rigid designator” as defined by Kripke for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind terms like biological species and substances. Sekine’s extended hierarchy proposed in 2002 is made up of 200 subtypes, with 32 larger clusters within that. Here is the top level of the Sekine type system:

Name-Other      Title       Timex           Frequency
Person          Unit        Periodx         Rank
Organization    Vocation    Numex-Other     Age
Location        Disease     Money           School Age
Facility        God         Stock Index     Latitude Longitude
Product         ID Number   Point           Measurement
Event           Color       Percent         Countx
Natural Object  Time-Other  Multiplication  Ordinal Number

Though developed separately and for different purposes, the BBN categories, also proposed in 2002, consist of 29 types and 64 subtypes. Here are the BBN types (Note: BBN claims 29 types because there are double entries or considerations for the first five entries):

Person                     Time      Animal
NORP (adjectival GPEs)     Percent   Substance
Facility                   Money     Disease
Organization               Quantity  Work of Art
GPE (geopolitical places)  Ordinal   Law
Location                   Cardinal  Language
Product                    Events    Contact Info
Date                       Plant     Game

Of course, other entity extraction systems have similar clusterings and approaches. Though less formal in the sense of a hierarchy or purported complete entity coverage, here for example is the listing of entity types within Calais:

Anniversary              FaxNumber         NaturalFeature       RadioProgram
City                     Holiday           OperatingSystem      RadioStation
Company                  IndustryTerm      Organization         Region
Continent                MarketIndex       Person               SportsEvent
Country                  MedicalCondition  PhoneNumber          SportsGame
Currency                 Movie             Position             SportsLeague
EmailAddress             MusicAlbum        Product              Technology
EntertainmentAwardEvent  MusicGroup        ProgrammingLanguage  TVShow
Facility                 NaturalDisaster   ProvinceOrState      TVStation
                                           PublishedMedium      URL

See further the Wikipedia entry on named entity recognition.

[3] We use the reference to “TBox” in accordance with our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[4] UMBEL also provides a SKOS-based vocabulary extension for describing other domains and mappings between classes and instances. This purpose, however, is outside of the scope of this current article.
[5] As a reference roadmap, UMBEL was specifically designed not to include meronymous (part of) relationships (see further this reference). Thus, all “part of” type concepts were assigned to the whole superType category for which they are a part. Thus, “animal parts” are assigned to the superType Animal; “car parts” to the superType Product.
[6] For a general discussion of attributes and their relation to entities, see Satoshi Sekine, 2008. Extended Named Entity Ontology with Attribute Information, in Proceedings of the 6th edition of the Language Resources and Evaluation Conference (LREC 2008). Marrakech, Morocco. See http://www.lrec-conf.org/proceedings/lrec2008/pdf/21_paper.pdf.
Posted: August 17, 2009

Structured Dynamics LLC

Ontology Best Practices for Data-driven Applications: Part 4

The earlier portions of this occasional series have set the groundwork for the role of ontologies in data-driven applications. In this part, I address many of the current misconceptions of what ontologies do or do not do. For, as practiced by Structured Dynamics, our adaptive TBox-level ontologies [1] are definitely not your grandfather’s Oldsmobile.

To share the punch line early, these modern ontologies are fast to develop, easy to change, adaptive to new knowledge and perceptions, robust and flexible. Indeed, it is the structure and nature of these adaptive ontologies that is the heart and secret of data-driven applications.

Any knowledge worker can understand and refine the organization and relationship of information via these structures. And, most importantly, the resulting ontologies are sufficient to drive the generic applications that are based on them. Focusing on data and structure now becomes the emphasis. We can now remove prior bottlenecks arising from the need to customize applications, configure report writers, or wait for IT to generate SQL queries.

But not all ontologies are created equal, and not all practitioners explain or see them in the same way. The purpose of this Part 4 in our series is to present many of the misconceptions, offering a score of takeaway messages for how properly considered and constructed ontologies can achieve these benefits.

Misconception: No ‘Big Bang’ Needed

To be sure, there are many very large and comprehensive ontologies. Some are focused on specific applications or domains; some are general; and some are the result of large and well-funded projects [2]. I am not arguing that such efforts do not have their role and place. But when viewed as exemplars or notable cases, these complex and comprehensive ontologies can create a misconception that such a scope is an imperative of proper ontology design.

I believe quite the opposite to be true.

An incredible strength of RDF and OWL ontologies is that they can be built incrementally. So long as additions are coherent, with some degree of self-consistency in terms of the world view they represent, any of an ontology’s constituent concepts, predicates, entities or datasets can be added and enhanced as needed. This makes ontologies a very different cat from relational schema, which are notoriously brittle and require expensive re-architecting any time the scope or schema changes.
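
As a minimal sketch of what incremental growth means in practice, the Python below treats a graph as a plain set of subject-predicate-object statements. All identifiers (ex:Customer, ex:Supplier, ex:acme and so on) are made up for illustration, and a real system would use an RDF library and store rather than a Python set.

```python
# A graph is treated here as a set of (subject, predicate, object) statements.
# Every name below is hypothetical and exists only for this illustration.
graph = set()

# Day one: a tiny ontology with two concepts and one relationship.
graph |= {
    ("ex:Organization", "rdf:type", "owl:Class"),
    ("ex:Customer", "rdfs:subClassOf", "ex:Organization"),
}


def subclasses_of(graph, parent):
    """Return the direct subclasses asserted for `parent`, sorted by name."""
    return sorted(
        s for (s, p, o) in graph if p == "rdfs:subClassOf" and o == parent
    )


before = subclasses_of(graph, "ex:Organization")

# Later: new scope arrives. Statements are only added; nothing that already
# exists is altered or migrated, and the earlier query runs unchanged.
graph |= {
    ("ex:Supplier", "rdfs:subClassOf", "ex:Organization"),
    ("ex:acme", "rdf:type", "ex:Supplier"),
}

after = subclasses_of(graph, "ex:Organization")

print(before)
print(after)
```

The point of the sketch is only that the second batch of statements required no change to the first batch or to the query function, which is the contrast with a relational schema change.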

Enterprise consultants that advocate “big” upfront ontology development efforts are doing their clients a massive disservice. They are also cynically playing on the experience with relational schema. As soon as the marketplace begins to realize that ontologies are incredibly plastic and malleable, this huge advantage of ontologies over the relational model for data federation will ring clear.

Takeaway Message #1: Ontologies can (and should!) start small.

Takeaway Message #2: Ontologies can (and should!) grow incrementally.

Misconception: No ‘One Ring to Rule Them All’

As a practitioner, two of the most boring arguments I hear are: Ontology X is better than other ontologies and here is why; and, Use of some reference or upper ontology reduces choice and freedom. Both arguments are somewhat grounded in the ‘one ring to rule them all’ mindset — though coming from opposing perspectives — that I think fundamentally misreads the role and purpose of ontologies.

Ontologies provide an organizing context for relating disparate information together and for making meaningful inferences. Without such a framework these purposes cannot be achieved. But the framework itself is a function of the world view, context and domain scope at hand. As a result, there is only context, and not some single, universal “truth.” As they say, it all depends.

The trick, then, to properly designed ontologies is to maintain internal coherence and self-consistency [3]. When done, it is then possible to relate disparate information and data to other data and to make intelligent business inferences.

So, the use of an ontology does not limit freedom. It sets the context for making connections and setting relations. And, as long as it is coherent, the “correct” ontology is the one that best captures the scope and domain at hand. Arguing for one ontology versus another is wasted energy. Just get on with it.

Takeaway Message #3: There is no single “truth”, only coherence and relevant context.

Misconception: No Such Thing as an ‘Ontological Commitment’

One of the more pernicious ideas promoted by some practitioners or advocates is the idea of ‘ontological commitment.’ Though some definitions are relatively benign, such as the one offered by the Stanford Knowledge Systems Laboratory (KSL) [4], the unfortunate use of the term “commitment” implies permanence and immutability. (In fact, most definitions of this phrase affirm this interpretation.)

This is really unfortunate, as it again tends to reinforce the inaccurate analogies with brittle and inflexible relational schema.

A much better way to view ontologies is not as a “commitment,” but as a vehicle for developing a common world view within the enterprise. Under this viewpoint, ontology development is somewhat analogous to master data management (MDM) or corporate taxonomies [5]. In this broader sense, then, ontology development can become a means for developing and refining a common language within the enterprise through consensual or community processes.

For the reasons as noted above, as language or conceptual relationships or understandings change, so can the vocabulary or structural character of the ontology change. There is no “lock in”; there is no “commitment”. As long as it is coherent, the ontology can morph to reflect the scope and understandings of the current snapshot in time.

This flexibility results from the fact that the ontologies, properly constructed, can drive a generic set of tools and applications that express themselves based on the underlying structure and vocabulary within those ontologies. The ontologies can thus change at will without any adverse effects whatsoever on the applications based on them.

This data-driven aspect, as noted throughout this series, is quite different from any prior paradigm. So, under this view ontologies have considerably more focus and importance than even some of the strongest ontology advocates claim, yet paradoxically without the theoretical bloat or heaviness many purport. Like human languages, our language and concepts within ontologies change as our world and perceptions change.

Takeaway Message #4: There is no “lock-in” with ontologies; they may be modified and changed at will.

Takeaway Message #5: Like corporate taxonomies or MDM, ontologies provide a framework for enterprises to develop internally consistent common languages or vocabularies.

Takeaway Message #6: Unlike corporate taxonomies or MDM, ontologies can directly drive generic tools and applications.

Misconception: No Need for Completeness or Comprehensiveness

Ontology development is not some imperative for conceptual “truth”; rather, it is a very adaptable means for stating, testing and refining stuff. Like agile development for software, this refining approach can and should proceed incrementally. Too often ontology efforts get caught like deer in the headlights awaiting some “completeness” threshold before release.

One means to promote this approach is to tackle single datasets or data stores individually before moving on. Having a sense of the eventual scope is useful, of course. But it is also quite acceptable to only fill out those portions of the structure with data available at hand.

These observations reflect a prejudice to action and release, rather than theory. If mistakes are made, fine: simply correct them.

Takeaway Message #7: Understand the full scope, but only build out for the data in hand.

Misconception: No Need for Predicate Bloat

It is advisable to keep relationships (predicates) simple at first. As with human languages, keeping the verbs simple until fluency is gained is another best practice.

While all of us can see nuances and subtleties heading into a project, trying to accommodate those predicates (relationships) at the outset can introduce unnecessary complexity. This is not in any way an argument for inaccurate predicates, but rather for erring on the side of the general and broader at first.

For organizations familiar with taxonomies, the SKOS vocabulary is a good focus, and there are some other standard starting ontologies that provide a good starting base of predicates [6]. Then, as you work with your data and its requirements, you can later expand to more sophisticated relationships.

In taking this approach you will still see immediate benefits due to the value of connected data through the Linked Data Law [7]. But, at the same time, you will be embracing a simpler language to start and then gain fluency.

Takeaway Message #8: Use simple, well-defined and documented predicates (properties or attributes).

Takeaway Message #9: You are building a common language for the enterprise; do so purposefully.

Misconception: No Need for Expensive Up-front Engineering

All of these observations lead to the conclusion that upfront ontology development need not be expensive. Any consultant selling six-figure ontology development to businesses ought to be seriously challenged. Start small and focused. Frankly, a simple spreadsheet taxonomy or quick conversion of existing XML or metadata or vocabulary standards is A-OK to get started.

Takeaway Message #10: Start small with stakeholders to build acceptance and best practices.

Takeaway Message #11: Start immediately to organize and federate existing information.

Misconception: No Need to Reinvent the Wheel

While it is true that the usefulness of ontologies as advocated by Structured Dynamics is greater than other constructs, these ontologies still just represent a more capable representation of knowledge structures that have been around in various other forms for years. For decades enterprises have created schema, taxonomies, controlled vocabularies, standards, and other knowledge structures that represent untold time, dollars and effort. It would be a waste to not fully leverage these sunk investments.

Further, many ontologies and interoperable structures also exist external to the enterprise, many open source and freely available. And, even if not all are already in proper ontological form, like internal structures these other constructs can be relatively easily leveraged and turned into ontology-ready form.

So, what we are doing with adaptive ontologies is not creating new structures or new representations from scratch, but leveraging the expressions of our current world views. These have been hard-earned, codified over years of effort, and are legacy expressions of the enterprise’s knowledge base.

In this vein, then, there is already much richness available to any organization upon which to embark on their ontology efforts. Use them, and gain great leverage.

Takeaway Message #12: Aggressively mine and re-use existing knowledge and structure.

Takeaway Message #13: Leverage and re-use appropriate portions of the “best” existing, external ontologies.

Misconception: No Requirement to Displace Existing Assets

Continuing in this same spirit, it is a mistake to see adaptive ontologies and the associated systems advocated by Structured Dynamics as a replacement for existing data assets. Rather, the idea and advantage is to keep data records in situ as much as possible. These are already performing investments that can be left largely as is. The role of the adaptive ontologies is to act as a federation layer that bridges across these existing assets.

This leverage of existing data assets can occur via the architecture of the system (generally Web-oriented architecture [8]) and a design of the data system and structures providing proper allocation between the ABox and TBox [1].

All of this maintaining of existing assets is aided by the ability to convert in-place data to ontology-ready RDF form. This is a separate topic in its own right and one I discuss elsewhere [9]. There is also a need to make sure that the attributes of the underlying instance records (generally, the columns within a relational table) are also properly modeled within the adaptive ontology. This is part of the best practices guidelines.
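
To make the idea of leaving records in place concrete, here is a minimal sketch that reads rows from a relational table and expresses them as instance (ABox) statements, using a small column-to-property mapping as the link to the ontology. The table, the base URI, the class ex:Customer and the property ex:locatedIn are all assumptions invented for this example; it is not a description of any particular conversion tool.

```python
import sqlite3

# Hypothetical legacy table. It stays where it is and is only read from.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme Corp', 'Iowa City')")

# Hypothetical mapping from table columns to ontology properties.
COLUMN_TO_PROPERTY = {"name": "foaf:name", "city": "ex:locatedIn"}
RECORD_CLASS = "ex:Customer"                    # assumed TBox class for these rows
BASE_URI = "http://example.com/customer/"       # assumed URI pattern


def rows_to_triples(conn):
    """Express each row as (subject, predicate, object) instance statements."""
    cursor = conn.execute("SELECT id, name, city FROM customers")
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{BASE_URI}{record['id']}"
        # Class membership ties the instance record to the TBox concept.
        triples.append((subject, "rdf:type", RECORD_CLASS))
        for column, prop in COLUMN_TO_PROPERTY.items():
            if record[column] is not None:      # skip NULLs rather than assert them
                triples.append((subject, prop, record[column]))
    return triples


triples = rows_to_triples(conn)
for triple in triples:
    print(triple)
```

The source table is never modified; only the mapping dictionary would need to change if the ontology's vocabulary changed.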

Of course, how much of the existing assets can be leveraged “as is” and what degree of modification or conversion might be necessary needs to be evaluated on a case-by-case basis. Generally, however, these mappings can be pretty straightforward and leave in place all existing hardware, software and administration procedures.

Takeaway Message #14: Leverage your existing databases as rich sources of instance records (“ABox”).

Takeaway Message #15: Explicitly design your TBox ontologies to be an interoperability layer over these existing record stores.

Takeaway Message #16: Reconcile the semantics across the enterprise’s data stores at this interoperable TBox layer.

Misconception: No Closed World Assumptions

A closed world assumption holds that any statement that is not known to be true is false. Most enterprise database and transaction systems are based on this premise. It works well where there is complete coverage of the entities within a knowledge base, such as the enumeration of all customers or all products of an enterprise.

Yet, in the real (“open”) world there is no guarantee or likelihood of complete coverage. Thus, under an open world assumption the absence of a given assertion or fact does not imply that the possible assertion is either true or false: it simply is not known.

An open world assumption is one of the key factors for enabling adaptive ontologies to grow incrementally. It is also the basis for enabling linkage to external (and surely incomplete) datasets.

In fact, systems designed around the open world assumption can still achieve closed world reasoning where the circumstances and completeness of the knowledge base permit. But, rather than being a logical outcome of the framework, such completeness axioms need to be explicitly stated. Thus, open world systems can achieve the same ends as closed ones where applicable, but with greater flexibility and extensibility.
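
The difference between the two assumptions, and the role of an explicitly stated completeness axiom, can be sketched in a few lines. The facts and predicate names below are invented for illustration, and a real reasoner is far richer than this three-valued lookup.

```python
# Hypothetical knowledge base of (subject, predicate, object) facts.
KNOWN_FACTS = {("acme", "isCustomerOf", "widgetco")}

# A stand-in for an explicit completeness axiom: predicates for which we
# declare that the knowledge base lists every true statement.
CLOSED_PREDICATES = frozenset({"isCustomerOf"})


def closed_world(fact):
    """Closed world: anything not known to be true is taken to be false."""
    return fact in KNOWN_FACTS


def open_world(fact, closed_predicates=frozenset()):
    """Open world: return True, False, or None (meaning 'not known')."""
    if fact in KNOWN_FACTS:
        return True
    if fact[1] in closed_predicates:
        # Completeness was explicitly declared, so absence means false.
        return False
    return None


missing = ("acme", "supplies", "widgetco")
print(closed_world(missing))   # the closed world treats the gap as falsehood
print(open_world(missing))     # the open world leaves it undetermined
print(open_world(("globex", "isCustomerOf", "widgetco"), CLOSED_PREDICATES))
```

Closed world behavior here is something that is opted into per predicate, rather than being a property of the whole framework.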

Takeaway Message #17: No enterprise is an island; design according to the open world assumption.

Misconception: No Restriction to a Dedicated Priesthood

Consultants often make their money, and academics their reputations, by making things more obscure and jargon-laden than they need be. Ontologies (heck, even the name itself) are no exception.

But what we have laid out as general guidelines herein and their reduction to practice does not require a priesthood. Sure, there are some things to learn and some practices to follow, but these are certainly easier to understand and master than, say, a programming or scripting language. Adaptive ontologies done right can be a participatory activity within most any organization.

Some guidance and mentoring would certainly be helpful. Make sure to pick the right individuals that truly embrace these perspectives.

Also helpful would be the assistance of groups skilled in team building and group participation [10].

Takeaway Message #18: Engage all knowledge stakeholders in ontology creation, review and refinement.

Takeaway Message #19: Use selected ontology engineers to help ensure consistency, but not necessarily structure.

Design for Data-driven Apps

The above addresses misconceptions related to how the market perceives current ontologies or how some advocates push the concept. But there are some unique perspectives that Structured Dynamics brings to ontology development specific to the purpose of data-driven applications. From a best practices standpoint, these considerations should also be included.

In order to properly “drive” applications and user interfaces and reports, specific design attention needs to be given to:

  • Linked data, and the use and accessibility of URIs as resource identifiers
  • Context- and instance-sensitive data display, including templates, and
  • Driving user interfaces via the inclusion of preferred and alternate labels in the ontology.

Of course, there are other considerations that come to bear. But these lend themselves to some rather simple checklist guidelines during ontology development and maintenance.
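
As one illustration of the third consideration, the sketch below shows how preferred and alternate labels carried in an ontology might drive both what a user interface displays and what a user may type to find a concept. The concept URIs and label values are hypothetical, and the dictionary merely stands in for labels that would normally be read from the ontology itself (for example, SKOS prefLabel and altLabel values).

```python
# Hypothetical labels, standing in for values read from the ontology.
ONTOLOGY_LABELS = {
    "ex:Automobile": {"prefLabel": "Automobile", "altLabels": ["Car", "Motor car"]},
    "ex:Organization": {"prefLabel": "Organization", "altLabels": ["Company", "Org"]},
}


def display_label(uri):
    """Label to show in the interface; fall back to the URI if none is known."""
    entry = ONTOLOGY_LABELS.get(uri)
    return entry["prefLabel"] if entry else uri


def find_concepts(user_text):
    """Match user input against preferred and alternate labels, ignoring case."""
    needle = user_text.strip().lower()
    matches = []
    for uri, entry in ONTOLOGY_LABELS.items():
        labels = [entry["prefLabel"], *entry["altLabels"]]
        if any(needle == label.lower() for label in labels):
            matches.append(uri)
    return matches


print(display_label("ex:Automobile"))
print(find_concepts(" car "))
```

Because the interface reads its labels from the data, renaming a concept or adding a synonym is an ontology edit rather than an application change.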

Takeaway Message #20: Follow some relatively straightforward best practices to gain all of the advantages of adaptive ontologies.

This post is part of an occasional AI3 series on ontology best practices.

[1] We use the reference to “TBox” in accordance with our working definition for description logics:

"Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts."
[2] Chemicals, petroleum and pharmaceuticals are renowned for large-scale, vertical ontologies. Examples of general or upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc, BFO (Basic Formal Ontology) and UMBEL (Upper Mapping and Binding Exchange Layer). Many of the large exemplar ontology projects are funded under EU auspices; see write-ups for the 7th ICT (Information and Communications Technologies) program for the EU and prior ICT projects for more information.
[3] See, for example, my posting on When is Content Coherent? from about one year ago.
[4] See, for example, the Stanford KSL discussion on What is an Ontology? One part of that document explains ontological commitments as “agreements to use the shared vocabulary in a coherent and consistent manner,” which is benign enough. But other discussions and venues imply much more viz. the “commitment” term. This same Stanford source is also useful for general philosophical discussions of ontologies.
[5] With respect to corporate taxonomies, see for example, Trish O’Kane, “United by a Common Language: Developing a Corporate Taxonomy”. Information Management Journal. FindArticles.com. 15 Aug, 2009. http://findarticles.com/p/articles/mi_qa3937/is_200607/ai_n17176092/.
[6] Some of the standard starting vocabularies that Structured Dynamics recommends include many of the ones listed on this useful ontology table from Freebase, and specifically include Dublin Core, Friend-Of-A-Friend (FOAF), GeoNames, SIOC, SKOS, RDF Schema, XML Schema, OWL, UMBEL, and BIBO. These are typically supplemented with domain-specific ontologies appropriate to the scope at hand.
[7] The Linked Data Law states that the value of a linked data network is proportional to the square of the number of links between data objects. It is a derivative of Metcalfe's law, which states that the value of a telecommunications network is proportional to the square of the number of users of the system (n²), where the linkages between users (nodes) exist by definition. For information bases, the data objects are the nodes. Linked data works to add the connections between the nodes. This concept was first presented in What is Linked Data? and then formalized in [9].
[8] In WOA, discrete functions are packaged into modular and shareable elements (services), then made available in a distributed and loosely coupled manner using Representational State Transfer. REST provides principles for how resources are defined and used with simple interfaces without additional messaging layers. REST is a foundation to the HTTP protocol and a key reason for the success and scalability of the Web.
[9] See further my posting, Structure the World.
[10] As a matter of full disclosure, Structured Dynamics has neither expertise nor strengths in these areas.
Posted:August 3, 2009

The "Blue Marble": The Earth seen from Apollo 17 (image from Wikipedia.org)

Multiple Techniques and Data Structs can Make the Vision a Reality

Linked data and subject and domain ontologies provide the organizing framework. Techniques for converting, tagging and authoring structure provide the content. In combination, we now have in hand the necessary pieces to enable all of us to “structure the World.”

In this vision, the nature of the links or connections between data need not be complicated to gain tremendous benefit. Similar to Metcalfe’s Law for the increasing value of networks as more nodes (users) get added, adding connections to existing data is a powerful force multiplier.

We can call this the Linked Data Law: the value of a linked data network is proportional to the square of the number of links between data objects [1]. Further, if we are purposeful to include connective links where appropriate as we add more data (that is, nodes), this multiplier effect becomes even stronger.

Structured Dynamics is dedicated to help make this prospect real. Meaningful progress in doing so requires only a relatively few moving parts or techniques. Yet, because we sometimes bounce from talking or focusing on one part versus the others, we can lose context or sight of the overarching vision. The purpose of this article is to re-set and calibrate that overall vision.

The Vision: Data Federation of Any Desired Content

The vision is to get all data and information to interoperate, regardless of legacy or form. Much of this data is already structured, either from databases or simpler forms of data structs. Some of this information is unstructured or semi-structured, requiring extraction and tagging techniques. And new information is being constantly generated, which warrants better means to author and stage for interchange and interoperability.

No matter the provenance, all information has context and scope. As a chunk from here, and a piece from there, gets added to our linked data mix, having means to characterize what that data is about and how it can be meaningfully inter-related becomes crucial. Sometimes these contexts are informed by existing schema; sometimes they are not. But, in any case, it is the role of ontologies to both position these datasets into an “aboutness” framework and to help guide how the data can be described and related to other data. This part of the vision invokes semantics and coherent structures (schema or ontologies) for positioning and mapping datasets to one another.

As both the means for representing any extant data format and as the means for describing these conceptual relationships or schema, RDF provides the canonical data model. A single target representation and common data model also means we can develop and design a smaller universe of tools to operate and provide functionality over all of this data. Indeed, because our RDF data model and its ontologies are so richly structured, we can design our tools with generic functionality, the specific operation and expression of which is based on the inherent structure within the data and its relationships. This vision of data-driven apps leads to extreme leverage, incredible flexibility, and inherent “meshup” capabilities for tools.

Further, because we use Web identifiers (URIs) for our data and concepts and because we expose and access this linked data via the Web, we use the proven and scalable architectures of the Web itself for how we design our systems. This Web-oriented architecture (WOA) provides a completely decentralized and loosely coupled deployment model that can work ranging from public and open to private and proprietary, applicable to data and participants alike.

From the outset, it is essential to recognize that thousands of contributors are enabling this vision. So, while Structured Dynamics naturally uses its own tools and techniques to flesh out the various parts of this vision below, realize there are many players and many tools from which to choose [2]. For that is another aspect of this vision that is quite powerful: providing choice and avoiding lock-in.

RDF: The Canonical Data Model

The core construct — or fulcrum, if you will — of the vision is the RDF (Resource Description Framework) data model [3]. I have written elsewhere on the Advantages and Myths of RDF, which explains more precisely the advantages of that model. RDF provides a common data model to which any external format or schema can be converted and represented. It also provides a logic model and basis for building vocabularies that can inform and drive generic tools.

In the context of data interoperability, a critical premise is that a single, canonical data model is highly desirable. Why?

Simply because of 2N versus N². That is, a single reference (“canon”) structure means that fewer tool variants and converters need be developed to talk to the myriad of data formats in the wild. With a canonical data model, talking to external sources and formats (N) only requires converters to and from the canonical form (2N). Without a canonical model, the combinatorial explosion of required format converters becomes N² [4].
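
The arithmetic is easy to check. The sketch below counts one-directional converters under each approach; the function names are invented for this example. Note that the exact count without a canonical model is N × (N − 1), one converter for each ordered pair of distinct formats, which grows on the order of the N² cited above.

```python
def converters_with_hub(n_formats):
    """One converter into and one out of the canonical form, per format."""
    return 2 * n_formats


def converters_pairwise(n_formats):
    """One converter for every ordered pair of distinct formats."""
    return n_formats * (n_formats - 1)


for n in (3, 10, 150):
    print(n, converters_with_hub(n), converters_pairwise(n))
```

At three formats the two approaches cost the same (six converters each); the gap opens quickly after that, reaching 300 versus 22,350 at 150 formats.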

Note, in general, such a canonical data model merely represents the agreed-upon internal representation. It need not affect data transfer formats. Indeed, in many cases, data systems employ quite different internal data models from what is used for data exchange. Many, in fact, have two or three favored flavors of data exchange such as XML, JSON or the like. More on this is discussed in a section below.

As this diagram shows, then, we have a single internal representation that is the target for all data and format converters and upon which all tools operate. These tools are themselves expressed as Web services so that they may be distributed and conform to general WOA guidelines. In addition, there may be multiple external “hubs” that represent alternative data models or formats or schema conversions (say, for relational databases). So long as we have converters between these alternate “hubs” and our canonical RDF form we can allow a thousand flowers to bloom:

Other canonical forms could be advocated. Yet RDF has the logical basis to represent any data form and any schema or conceptual structure. It is based on a robust set of open standards and languages and tools. It may be serialized in many formats. It can be grounded in description logics and, in appropriate forms, reasoned over and expressed in vocabularies and schema suitable for the most complex of conceptual structures and semantics. RDF is the data model explicitly designed for the Web, the clear global information basis for the foreseeable future.

For more than 30 years — since the widespread adoption of electronic information systems by enterprises — the Holy Grail has been complete, integrated access to all data. With the canonical RDF data model, that promise is now at hand.

Conversion: So Many Structs, So Little Time

Diversity is a truism of human communications as captured by the biblical Tower of Babel and the many thousands of current human languages. Diversity in data formats, serializations, notations and languages is a similar truism. We term the expression of each of these varied forms of data a struct.

While an internal canonical representation of data makes sense for the reasons noted above, pragmatic information systems must recognize the inherent diversity and chaos of data in the real world. Attempts to find single representations or to impose standards via fiat have singularly failed. That will continue to be so due in part to inertia and legacy, sunk investments, existing infrastructure, and the purposes for the data.

In pursuing a vision of data interoperability, then, conversion is an essential glue for cementing understanding with what exists and will exist.

RDB-to-RDF

Arguably the largest source of structured data is enterprise and government information systems, with the predominant data representation being the relational data model managed by relational schema. Much of this data is also cleaner and more mission-critical than other sources in the wild. Fortunately, there are many logical and conceptual affinities between the relational model and the one for RDF [5].

Just as there are many RDFizers for simpler forms of data structs (see next), there are also nice ways to convert relational schema to RDF automatically. Given these overall conceptual and logical affinities the W3C is also in the process of graduating an incubator group to an official work group, RDB2RDF, focused on methods and specifications for mapping relational schema to RDF.

Amongst all techniques covered in this paper, Structured Dynamics views the layering of RDF ontologies over existing relational data stores as one of the most promising and important. Given the advantages of RDF for interoperability, this area should be a major emphasis of current and new vendors and service providers.

RDFizers

Much data, however, resides in much smaller datasets and often for less formal purposes than what is found in enterprise databases. Some of this data is geared for exchange or standardization; much is emerging from Web and Internet applications and uses; and much might be local or personal in nature, such as simple lists or spreadsheets.

RDF is well suited to convert (“RDFize”) these simpler and more naïve data formats. In my original census about 18 months ago, as reported in ‘Structs’: Naïve Data Formats and the ABox, I listed about 90 converters. My most recent update now lists nearly double that number, with about 150 converters [6]:

URN handlers (in addition to IRI and URI):

  • DOI
  • LSID
  • OAI

RDF

  • Serialization formats:
    • N3
    • RDF/XML
    • Turtle
  • Languages and ontologies:
    • AB Meta
    • Annotea
    • APML
    • AtomOWL
    • Bibliographic Ontology
    • Creative Commons
    • EXIF
    • FOAF
    • Java
    • Javadoc
    • MARC/MODS
    • Meta Standards
    • Music Ontology
    • Natural Language
    • Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
    • Open Geospatial
    • OWL
    • SIOC
    • SIOCT
    • SKOS
    • UMBEL
    • vCard
    • XML
    • Others
  • (X)HTML pages
  • Embedded Microformats and GRDDL [7]:
    • DC
    • eRDF
    • geoURL
    • Google Base
    • hAudio
    • hCalendar
    • hCard
    • hListing
    • hResume
    • hReview
    • HR-XML
    • Ning
    • RDFa
    • relLicense
    • SVG
    • XBRL
    • XFN
    • xFolk
    • XR-XML
    • XSLT
  • Syndication Formats:
    • Atom
    • OPML
    • OCS
    • RSS 1.1
    • RSS 2.0
    • XBEL (for bookmarks)
  • REST-style Web service APIs:
    • Amazon
    • Apple
    • Calais
    • CrunchBase
    • Del.icio.us
    • Digg
    • Discogs
    • Disqus
    • eBay
    • Facebook
    • Flickr
    • Freebase (MQL)
    • FriendFeed
    • Garmin
    • Get Satisfaction
    • Google
    • Hoover’s
    • HTTP (raw)
    • ISBN DB
    • Last.fm
    • Library Thing
    • Magnolia
    • Meetup
    • MusicBrainz
    • New York Times
    • New York Times Campaign Finance (NYTCF)
    • New York Times tags
    • Open Library
    • Open Social
    • Open Street
    • OpenLink (facets)
    • O’Reilly
    • Picasa
    • Radio Pop (BBC)
    • Rhapsody
    • Salesforce
    • Slideshare
    • Slidy
    • Technorati
    • They Work For You
    • Twine
    • Twitter
    • Weather
    • Wikipedia
    • World Bank
    • Yahoo! Finance
    • Yahoo! Maps
    • Yahoo! Weather
    • YouTube
    • Zemanta
  • Files (multitude of file formats and MIME types, including):

Many of the sources above come from new and emerging Web-based APIs, which are also huge sources of content growth. Also note that alternative formats to RDF (e.g., microformats) or leading serializations and encodings (e.g., XML, JSON) also have many converter options.

For many typical naïve data structs, the data is represented as attribute-value pairs, which easily lend themselves to conversion to RDF as instance records [8]. See further the Authoring section below.

Tagging: The 80% Solution

An apocryphal statistic is that 80% to 85% of all information resides in unstructured text [9]. Besides lacking recent validation, this claim from a decade ago, often attributed to Merrill Lynch, also predates much of the Internet and the emergence of metadata and tagging. Nevertheless, what is true is that written text content is ubiquitous and the majority of it remains untagged or uncharacterized by any form of metadata.

While such information can be searched, a search succeeds only when exact terms match. This means that related information, particularly in the form of conceptual relationships and inferencing, cannot be applied to untagged text content.

While information extraction (the basis by which tags for entities and concepts can be obtained) has been an active topic of research for two decades, it is only recently that we have begun to see Web-scale extractors appear. Examples include Yahoo’s term extractor, Thomson Reuters’ Calais, or Google’s Squared, to name but a few.

In Structured Dynamics’ case we have been working on the scones (Subject Concepts Or Named EntitieS) extractor for quite a while. scones uses rather simple natural language processing (NLP) methods as informed by concept ontologies and named entity (instance record) dictionaries to help guide the extraction process. The co-occurrence of matches between concepts and entities also aids the disambiguation task (though additional modules may be invoked with alternative disambiguation methods). In prototype forms, the resulting tags can be managed separately or fed to user interfaces or re-injected back into the original content as RDFa.

There are literally dozens of such extractors and services presently available on the Web and many that are available as open source or commercial products. Some are mostly algorithm-based, using machine-learning techniques or statistics, while others are gazetteer- or dictionary-driven.
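
To make the dictionary-driven approach concrete, here is a toy sketch of gazetteer lookup with co-occurrence used for disambiguation. It is not the scones implementation; the two dictionaries, the `ex:` identifiers and the `tag` function are all invented for illustration.

```python
import re

# Hypothetical named entity dictionary: surface form -> candidate
# (entity id, concept the entity belongs to).
GAZETTEER = {
    "jaguar": [("ex:Jaguar_Cars", "automobile"), ("ex:Jaguar_Animal", "animal")],
    "paris": [("ex:Paris_France", "city")],
}

# Hypothetical concept dictionary: surface form -> concept.
CONCEPTS = {"car": "automobile", "engine": "automobile", "zoo": "animal", "capital": "city"}


def tag(text):
    """Return (surface form, entity id) pairs found in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    concepts_found = {CONCEPTS[t] for t in tokens if t in CONCEPTS}
    tags = []
    for token in tokens:
        candidates = GAZETTEER.get(token)
        if not candidates:
            continue
        # Prefer a candidate whose concept also occurs in the text;
        # otherwise fall back to the first dictionary entry.
        supported = [c for c in candidates if c[1] in concepts_found]
        chosen = supported[0] if supported else candidates[0]
        tags.append((token, chosen[0]))
    return tags


print(tag("The jaguar escaped from the zoo"))
print(tag("The Jaguar has a new engine"))
```

The same surface form resolves to different entities depending on which concepts appear alongside it, which is the essence of using concept matches to guide entity disambiguation.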

These systems will lead to rapid tagging of existing content and the removal of some of the early “chicken-and-egg” challenges associated with the semantic Web. These systems will also be combined with the many existing bookmarking and tagging services.

So, just as we will see federation and interoperability of conventional data, we will also see linkages to relevant and supporting text content accompanying it. This combination, in turn, will also lead to richer browsing and discovery experiences.

Authoring: The Neglected Third Leg of the Stool

In addition to conversion and tagging, authoring is the third leg of the stool for exposing structured data. It is a neglected leg, yet an important one for making it easier to expose datasets as RDF linked data.

One of the reasons for the proliferation of data structs has been the interest in finding notations and conventions for easier reading and authoring of small datasets. There have literally been hundreds of various formats proposed over decades for conveying lightweight data structures. Most have been proprietary or limited to specific domains or users. Some, such as fielded text, structured text, simple declarative language (SDL), or more recently YAML or its simpler cousin JSON, have become more widely adopted and supported by formal specifications, tools or APIs. JSON, especially, is a preferred form for Web 2.0 applications.

What has been less clear or intuitive in these forms, again mostly based on an attribute-value pair orientation, is how to adequately relate them to a more capable data model, such as RDF. In JSON or YAML, for example, the notations include the concepts of objects, arrays and datatypes (among other conventions). Other structures lack even these constructs.

To take the case of JSON as it might relate to RDF, there are a couple of efforts, from Talis and GBV, to define representation conventions for serializing RDF. An idea was floated for an RDF version of JSON called RDFON, which has now evolved into the TURF approach. JDIL (JSON data integration layer) instructs how to add namespaces to JSON to enable encoding RDF. Jim Ley, Kanzaki Masahide and Dave Beckett (likely among others) have written simple and straightforward RDF and Turtle parsers and converters for JSON. Still further examples are Beckett’s Triplr and Sören Auer’s AKSW Triplify, lightweight conversion services involving many different formats.

Because JSON is easily readable, can drive many Web 2.0 applications and widgets, and lends itself to fast conversions and tools in various scripting languages, Structured Dynamics was commissioned by the Bibliographic Knowledge Network (BKN) to formalize a BibJSON specification suitable for BibTeX-like data records and citations with an extensible schema to be converted to RDF.

The emerging result of that BibJSON effort will be published shortly. The specification includes conventions and vocabularies for creating bibliographic and citation instance records, for specifying structural schema, and for creating linkage files between the attributes in the record files and existing and new schema. BibJSON is itself grounded in IRON, an instance record and object notation developed by Structured Dynamics that can be serialized as JSON (called irJSON), XML (called irXML) or comma-separated values (CSV or comma-delimited files, called commON).

The purpose of these notations and serializations is to provide easier authoring environments and scripting support to RDF-ready datasets. This approach has the advantage of shielding most users from the nuances or lengthiness of RDF (though the N3 serialization also works well).

The design and development of commON was especially geared to using spreadsheets as authoring environments that would enable easy creation of instance record tables or simple hierarchical or outline structures. For example, here is a sample portion of Sweet Tools specified in a spreadsheet using the commON notation:

Once the philosophy and role of naïve data structs is embraced — with an appreciation of the many converters now available or easily written for translating to RDF — it becomes easier to determine data forms appropriate to the tools and natural work flow of the users and tasks at hand. Under this mindset, the role of RDF is to be the eventual conversion target, but not necessarily what is used for intermediate work tasks, and in particular not for authoring.

Getting it All Organized

OK, so now all of this stuff is converted, tagged or authored. How does it relate? What is the relation of one dataset to another dataset? Is there a context or framework for laying out these conceptual roadmaps?

Two years ago, as we looked at the state of RDF and the incipient semantic Web as promised via linked data, we saw that such a specific framework was lacking. (Though there were existing higher-level ontologies, either their complexity or their design was not well suited to these purposes.) It was at that time that Frédérick Giasson and I began to formulate the UMBEL (Upper Mapping and Binding Exchange Layer) ontology, which eventually led to our more formal business partnership and Structured Dynamics.

What we sought to achieve with UMBEL was a coherent reference framework of about 20,000 subject concepts, connected and acting like constellations in the information sky for orienting content and new datasets. At the same time, we wanted to create a general vocabulary and approach that would lend themselves to creation of domain-specific ontologies, which would also naturally tie in and inter-relate to the more general UMBEL structure.

This objective was achieved, though UMBEL deserves an upgrade to OWL 2 and some other pending improvements. A number of domain ontologies have been created and now relate to UMBEL. So, rather than being an end in itself, UMBEL was one of the necessary infrastructure pieces to help make the vision herein a reality.

Similar approaches may be taken by others with new domain ontologies based on the UMBEL vocabulary with tie-in as appropriate to existing subject concepts, or by mapping to the existing UMBEL structure.

Of course, UMBEL is not an absolute condition to the vision herein. However, insofar as users desire to see multiple datasets inter-related, including the use of existing public Web data, something akin to UMBEL and related domain ontologies will be necessary to provide a similar roadmap.

Making it All Available

The parts and techniques discussed so far pertain almost exclusively to data and content. But, these structures so created now can inform data-driven applications which also now must be deployed. To do so, Structured Dynamics is committed to what is known as a Web-oriented architecture (WOA):

WOA = SOA + WWW + REST

WOA is a subset of the service-oriented architectural style, wherein discrete functions are packaged into modular and shareable elements (“services”) that are made available in a distributed and loosely coupled manner. WOA generally uses the representational state transfer (REST) architectural style defined by Roy Fielding in his 2000 doctoral thesis; Fielding is also one of the principal authors of the Hypertext Transfer Protocol (HTTP) specification.

REST provides principles for how resources are defined and used and addressed with simple interfaces without additional messaging layers such as SOAP or RPC. The principles are couched within the framework of a generalized architectural style and are not limited to the Web, though they are a foundation to it.

Within this design we need a suite of generic functions and tools that are driven by the structure of the available datasets. The deployment vehicle and design we have implemented to provide this WOA design is structWSF [10].

structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies). The master or controlling Web service in the framework is the module for granting access and use rights to datasets based on permissions.

The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import. More services can readily be added to the system.

All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and a document of resultsets (if the query result is not null). Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.
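
The request-and-response pattern just described (a status plus an optional results document in a requested serialization) might be sketched as follows. This is not the structWSF API: the `respond` helper, the choice of status codes, the XML element names and the media types are assumptions for illustration only.

```python
from xml.etree import ElementTree as ET


def respond(results, accept="application/xml"):
    """Return (HTTP status, body) for a list of (subject, predicate, object) results.

    Hypothetical helper: status codes and element names are assumptions,
    not taken from structWSF.
    """
    if not results:
        # Assumption: an empty result yields a status but no document.
        return 204, ""
    if accept == "application/xml":
        root = ET.Element("resultset")
        for subject, predicate, obj in results:
            item = ET.SubElement(root, "result", subject=subject, predicate=predicate)
            item.text = obj
        return 200, ET.tostring(root, encoding="unicode")
    if accept == "application/n-triples":
        # Literal escaping is omitted to keep the sketch short.
        lines = [f'<{s}> <{p}> "{o}" .' for s, p, o in results]
        return 200, "\n".join(lines)
    # The requested serialization is not supported.
    return 406, ""


results = [("http://example.org/exhibit", "http://example.org/name", "Exhibit")]
print(respond(results))
print(respond(results, "application/n-triples"))
```

Keeping the resultset independent of its serialization is what lets one service answer in RDF or plain XML without changing the query logic.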

In initial release, structWSF has direct interfaces to the Virtuoso RDF triple store (via ODBC, and later HTTP) and the Solr faceted, full-text search engine (via HTTP). However, structWSF has been designed to be fully platform-independent. The framework is open source (Apache 2 license) and designed for extensibility.

No End in Sight

Like all visions, there are many aspects and many improvements possible. This vision is definitely a work-in-progress with no end in sight.

But, meaningful movement embracing the full scope of this vision is doable today. Structured Dynamics welcomes inquiries regarding any of these aspects, improvements to them, or application to your specific needs and problems.

We also welcome you to come back and visit our blogs (Fred’s is found here). We try to speak on various aspects of this vision in all of our posts and are pleased to share our experience and insights as gained.


[1] Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of users of the system (n²), where the linkages between users (nodes) exist by definition. For information bases, the data objects are the nodes. Linked data works to add the connections between the nodes. We can thus modify the original sense to become the Linked Data Law: the value of a linked data network is proportional to the square of the number of links between the data objects. I first presented this formulation about a year ago in What is Linked Data?
[2] This piece introduces for the first time a couple of efforts-in-progress by Structured Dynamics. For a general tools listing, see my own Sweet Tools listing of about 800 semantic Web and -related tools.
[3] As quoted in The Lever: “Archimedes, however, in writing to King Hiero, whose friend and near relation he was, had stated that given the force, any given weight might be moved, and even boasted, we are told, relying on the strength of demonstration, that if there were another earth, by going into it he could remove this.” From Plutarch (c. 45-120 AD) in the Life of Marcellus, as translated by John Dryden (1631-1700).
[4] The canonical data model is especially prevalent in enterprise application integration. An interesting animated visualization of the canonical data model may be found at: http://soa-eda.blogspot.com/2008/03/canonical-data-model-visualized.html.
[5] An excellent piece on those relations was written by Andrew Newman a bit over a year ago; see Andrew Newman, 2007. “A Relational View of the Semantic Web,” published on XML.com, March 14, 2007; http://www.xml.com/pub/a/2007/03/14/a-relational-view-of-the-semantic-web.html. RDF can be modeled relationally as a single table with three columns corresponding to the subject-predicate-object triple. Conversely, a relational table can be modeled in RDF with the subject IRI derived from the primary key or a blank node; the predicate from the column identifier; and the object from the cell value. Because of these affinities, it is also possible to store RDF data models in existing relational databases. (In fact, most RDF “triple stores” are RDBM systems with a tweak, sometimes as “quad stores” where the fourth tuple is the graph.) Moreover, these affinities also mean that RDF stored in this manner can also take advantage of the historical learnings around RDBMS and SQL query optimizations.
[6] The largest source for RDFizers, which it calls Sponger cartridges, is from OpenLink Software in relation to its Virtuoso universal server. Most of its converters use XSLT stylesheets to translate to RDF, but the system has other conversion capabilities as well. Two additional OpenLink resources are a clickable diagram of converters and relationships with links and an online storehouse of available XSLT converters. In addition, two other sources — the W3C’s Semantic Web wiki with converter listings and MIT’s Simile program and listing of RDFizers — have a rich set of listings. Note that many of the categories shown on the table also have multiple sources of converters, so that the absolute number of converters has also grown faster than the unique formats supported.
[7] GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a W3C markup format for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT. GRDDL accommodates a wide variety of dialects (see one listing) and can be combined with arbitrary transformation mechanisms (though currently mostly based on XSLTs).

[8] We characterize instance records as representing the “ABox”, in accordance with our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[9] One of the more recent discussions of this percentage is by Seth Grimes, Unstructured Data and the 80 Percent Rule, 2009.
[10] structWSF is also designed to integrate with third-party apps and content management systems (CMSs) to provide the user interfaces to these functions. The first implementation of this design is conStruct SCS, a structured content system that extends the basic Drupal content management framework. conStruct enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces.
Posted:June 30, 2009

Interoperable Naïve Data Structs, Datasets and Canonical RDF

(Random Colour Swirl photo courtesy of PD Nathan at Photobucket)

As I noted in my review of SemTech 2009, one of the key themes of the conference was data federation. Unfortunately, data federation has been a term a bit out of vogue for a while. (Though I still think it best captures the space.)

The current vernacular has been pushing forward an alternative: data mixing. One of the larger product pushes at the conference was by Zepheira for its new Freemix service and product. Freemix is a hosted service largely built around the Exhibit data display application, aided by some tools to make creating an exhibit easier. Exhibit is an attractive presentation system; for nearly three years AI3’s own Sweet Tools dataset listing of semantic Web and -related tools has been presented via Exhibit.

Freemix looks promising and is now being offered in beta. But one thing caught my ear when listening to the company’s announcement: they are not yet able and ready to show the “data mixing” part of the system. Its release is apparently being delayed until later this year because of the difficulties encountered.

This post coincides with the release of the alpha version of the structWSF code on the OpenStructs Web site. It is available for download under Apache 2 license.
We’ll be blogging a few more times in the coming days regarding other possible uses and applications for this platform-independent Web services framework.

What is Data Mixing and Why is it So Hard?

As a new term there is no “official” definition of data mixing. However, I think we can consider it as generally equivalent to the older data federation concept.

Data federation is the bringing together of data from heterogeneous and often physically distributed data sources into a single, coherent view. Sometimes this is the result of searching across multiple sources, in which case it is called federated search. But it is not limited to search. Data federation is a key concept in business intelligence and data warehousing and a driver behind master data management (MDM).

As I first wrote about data federation about five years ago [1]:

Data federation first became a research emphasis within the biology and computer science communities in the 1980s. At that time, extreme diversity in physical hardware, operating systems, databases, software and immature networking protocols hampered the sharing of data.
Yet it is easy to overlook the massive strides in overcoming these obstacles in the past two decades.

The Internet and its TCP/IP and Web HTTP protocols and XML standards in particular, have been major contributors to overcoming respective physical and syntactical and data exchange heterogeneities. The current challenge is to resolve differences in meaning, or semantics, between disparate data sources. Your “glad” may be someone else’s “happy” and you may organize the world into countries while others organize by regions or cultures.

Resolving semantic heterogeneities is also called semantic mediation or data mediation. Though it displays as a small portion of the pyramid above, resolving semantics is a complicated task and may involve structural conflicts (such as naming, generalization, aggregation), domain conflicts (such as schemas or units), or data conflicts (such as synonyms or missing values). Researchers have identified nearly 40 distinct types of possible semantic heterogeneities [2].

Ontologies provide a means to define and describe these different worldviews. Referentially integral languages such as RDF (Resource Description Framework) and its schema implementation (RDF-S) or the Web ontological description language OWL are leading standards among other emerging ones for machine-readable means to communicate the semantics of data.

Fortunately, we have climbed most of this data federation pyramid. The stumbling block now is the semantics. This is made all the harder when we place too much burden on the data transmission or “packet” itself. In other words, does exchange also carry with it the burden of meaning? The rest of this post tries to explain what I mean by this and how it relates to our new structWSF Web services framework.

Is it Apples or Oranges?

Not to pick on any one thing or any individuals, but three recent threads on semantic Web-related mailing lists help illustrate in various ways some interesting mindsets. While there is much on each of these threads of other value, I’m only focusing on a narrow topic from each based on my thesis at hand.

And, what is that thesis? It is simply that we too often mix instance record and attribute assertions with schema representations and world views. And, when we do, we sometimes make mountains out of molehills (or mix apples and oranges to completely mix metaphors).

Example 1: Squeezing RDF into JSON

JSON (JavaScript Object Notation) is a data notation or syntax, easily created and widely used for current Web apps. It has a rather simple syntax for representing attribute-value pairs. Many useful tools and parsers for the serialization exist.

In keeping with his general and broad criticisms of how the semantic Web standards and approaches have been promulgated by the W3C to date, John Sowa most recently expressed his ideas in a posting to the ontolog-forum mailing list under the heading of ‘Semantic Systems’ [3]. In this thread, John proposes:

1. The recommended exchange form for RDF will become JSON. Any JSON documents that are limited to triples can use the old XML-based RDF form, but they can also use the more compact and more general full JSON.

Then, in a subsequent posting to that thread he notes:

5. The W3C made a major blunder with a one-size-fits-all approach that tried to use a document tagging language as a knowledge representation language. The result was the *worst* notation for logic ever invented.

Finally, he goes on to note in a further post:

JSON could be used as an alternative to XML for the syntax, but the lack of a standard semantics for JSON means that it could *not* be used as a replacement for RDF *unless* an official standard were adopted for mapping RDF to and from a particular subset of JSON whose semantics was defined in Common Logic.

All of this John proposes in the spirit of:

The goal of my proposal is nothing less than a total *integration* of the Semantic Web methodologies with the methodologies that have been used in the traditional software development community [3].

I find common ground with a couple of the ideas in this proposal. First, accepted formats like JSON should have a prominent place in data exchange. Second, leveraging methodologies used in the traditional community is definitely a good thing.

But John, while suggesting reuse of existing traditions, is also paradoxically recommending a wholesale replacement for RDF. He is also positing a single exchange standard (JSON). And, he stops tantalizingly short of recognizing an important truth that I’m sure he knows: simple instance record assertions and representations — the essence of data exchange — can and should be viewed separately from schema representations.

As I have noted in my earlier naïve data ‘structs’ series, there are in fact scores of existing data transfer formats that have been adopted by their communities — and are likely to remain popular within those communities for some time — that can play a similar role to JSON. So long as the role of data exchange is kept to the assertions (“metadata”) about instances, many formats can play in the sandbox.

The role of RDF may or may not reside with data exchange. To conflate and equate RDF and JSON is to reduce the power of keeping instance record representations separate from schema and world view representations. John’s basic sensibilities, I think, could be more effectively promoted by not posing ‘either-or’ strawmen and recognizing that data exchange formats will ALWAYS be diverse and heterogeneous.

Observation: Existing and emerging data ‘structs’ useful to data exchange will remain manifest in format and diversity; data exchange imperatives are a different matter from schema and knowledge representation.

Example 2: RDFa is Not ‘Expressive’ Enough

Somewhat in contrast to this thread was a different one by Martin Hepp, editor of the excellent Good Relations ontology, on the LOD (linked open data) mailing list [4]. This thread, which sensibly questions how difficult it is for mere mortals to configure an Apache server to support publishing RDF, reached further into the realm of RDFa as a document annotation language.

As Hepp states,

The reason is that, as beautiful the idea is of using RDFa to make a) the human-readable presentation and b) the machine-readable meta-data link to the same literals, the problematic is it in reality once the structure of a) and b) are very different. For very simple property-value pairs, embedding RDFa markup is no problem. But if you have a bit more complexity at the conceptual level and in particular if there are significant differences to the structure of the presentation (e.g. in terms of granularity, ordering of elements, etc.), it gets very, very messy and hard to maintain.

Further discussion in this thread elaborates the interest in having the documents in which the RDFa is embedded carry much more schema-level information.

Like the Sowa case, this raises the question of where to draw the line. Should embedded metadata in documents carry complex schema information as well? So, we now shift the focus from data exchange to schema representation.

I think this is really unnecessary since it is quite easy in RDFa to refer to a separately specified schema. By, in this case, conflating metadata transfer and exchange with schema, the bar has been raised unnecessarily high.

If we need to capture schema and world views, fine, let us do so directly and succinctly. Then, let our document metadata (in this case using RDFa) make attribute assertions about that “payload” simply and cleanly. The Web certainly does not need individual documents carrying with them entire schema representational views of the world.

Observation: Data exchange, even based on RDF (via RDFa), is best kept to the assertions of facts and attributes.

Example 3: Mixing Vocabularies

In a microformats context, Thomas Loertsch posed some questions on mixing vocabularies [5] and how they should be interpreted. This caused an involved discussion of intent and possible implications and best practices, with discussants including Brian Suda, Peter Mika, Ben Ward and others. It also led to the start of a useful wiki page on how objects should be represented in Web pages when multiple microformats can be invoked.

For quite some time microformats, I think, have gotten the “mix” just about right. They have created well-reasoned attributes for distinct instance types and seek to keep their embedding of that information simple in existing documents. Some advocate while others question the rigor of the microformat structure; that is not the topic here.

What is interesting about this thread is that it evolved to discuss the implications and best practices when an author posts a document with more than one microformat. How do these vocabularies relate? How should we, as “consumers” of the document, parse the vocabularies?

Yahoo!’s SearchMonkey service has recognized microformats for some time, and its questions regarding interpretation and best practices in the thread were natural. But the interesting point that seemed to come out of this thread is that users will post microformats as they wish. While care and standards in the design of the microformats can help reduce confusion and conflict, they cannot guarantee their absence. The final responsibility for proper ingest and processing likely resides with the aggregators and publishers that consume such data.

So, here, too, we have another case of asserting metadata and embedding for data exchange in a slightly different native format than RDF. Huzzah!

Observation: Standards setters and consuming agents (often aggregators, publishers or search engines) should take lead responsibility for best practices and processing attribute data, realizing that original authors and developers may not fully comply.

Revisiting the ABox and TBox Split

These examples are a bit of a long way around the barn to reinforce what we have been arguing for some time: the need for a proper split between the ABox (assertions related to instances) and the TBox (concept relationships, schema and world views) [6]. This has been a pretty constant theme in our writing, ranging from first introductions, to its relation to description logics, relationships to existing data ‘structs’, and explicit discussion of ABox and TBox roles in a four-part series.

One of the key points throughout this writing is that an ABox-TBox mindset provides a context and rigor for looking at questions such as our three examples above. In all three cases, I argue, the seeming conundrums result from lacking this mindset. Once this mindset is applied, the respective roles of various data formats, RDF, schema and the like naturally fall into place.
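
A small sketch may help show why the split matters in practice. Below, instance assertions (the ABox) travel as plain triples, while the class hierarchy (the TBox) lives in a separate structure that is consulted only when reasoning is needed. The `ex:` names, the Kermit record and the `inferred_types` helper are invented for illustration.

```python
# TBox: schema-level statements, here only subclass relations
# (hypothetical class names).
TBOX_SUBCLASS_OF = {
    "ex:Frog": "ex:Amphibian",
    "ex:Amphibian": "ex:Animal",
}

# ABox: instance-level assertions, the part that needs to be exchanged.
ABOX = [
    ("ex:kermit", "rdf:type", "ex:Frog"),
    ("ex:kermit", "ex:name", "Kermit the Frog"),
]


def inferred_types(instance, abox, tbox):
    """List the asserted type(s) of an instance plus all superclasses."""
    types = []
    for subject, predicate, obj in abox:
        if subject == instance and predicate == "rdf:type":
            cls = obj
            # Walk up the hierarchy; stop at the top or on a repeated class.
            while cls is not None and cls not in types:
                types.append(cls)
                cls = tbox.get(cls)
    return types


print(inferred_types("ex:kermit", ABOX, TBOX_SUBCLASS_OF))
```

The ABox here could just as easily have arrived as JSON, a microformat or RDFa. Nothing about the exchange format needs to know the hierarchy, because the schema is applied after the fact.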

Of course, the Web is also a dirty and chaotic place where niceties of design and best practices are routinely ignored or unknown or purposefully rejected. So be it. This is reality. This reality needs to be accommodated. But good design can help overcome it and work to establish resilient, flexible architectures.

Of course, even though this might be good design, there is no ability to enforce such distinctions across the Web. However, insofar as key implementors are concerned (standards writers, major publishers, tools developers, industry experts, and the like) we can put in place better approaches. This mantra is at the heart of all that Structured Dynamics does — including the structWSF Web services framework, just released as open source code.

A General Data Mixing Model

So, now we can finally turn our attention to the structWSF Web services framework, more broadly described here.

There are a number of perspectives and contexts from which to view this structWSF framework. In this posting, we take the boundary conditions of data formats and data exchange [7]. The key question for this perspective is: given the realities noted above, what is an adaptive framework for data mixing on the Web? Our schematic answer to this question is below:

The basic design has two key data considerations. First, all structWSF tools and Web services and schema work from the canonical RDF data model. It is the hub and common denominator for all structWSF installations. We are able to design and optimize generic tools and services (including converters) around this canonical framework.

Second, we assume most everything in the outside world to be non-compliant with this canonical model, with the data representations often naïve and incomplete. Converters (also known as translators or RDFizers) are an essential bridge to this external world, and need to be designed for re-use and extensibility.

Where outside parties are compliant, they either conform to the structWSF APIs or are themselves structWSF installations. In these cases, direct data exchange and access, subject to permission rights, occur at the dataset level (not shown).

The Naïve Part of the Spectrum

Converters are themselves bona fide Web services at the structWSF level. (Only a few are presently included in the alpha release.) While some may be one-off converters (sometimes off-the-shelf RDFizers), and often devoted to large volume external data sources, it is also helpful to emphasize one or more “standard” naïve external formats. A “standard” external format allows for a more sophisticated converter and enables specific tools to be more easily justified around the standard naïve format.

As noted above, this “standard” is often JSON or a derivative of JSON. But, just as readily, the common ‘naïve’ format could be SQL from relational databases or another format common to the community at hand. In many ways, because the emphasis of data exchange is on the ABox and instance records and assertions (and attribute extensions), the actual format and serialization is pretty much immaterial.

Emphasizing one or a few naïve external formats allows more tools and services to be cost-effectively developed for those formats. And, even though the format(s) chosen for this external standard may lack the expressiveness of RDF (and, ultimately, OWL), because the burden is principally related to data exchange, this layer can be readily optimized for the deployment at hand.

Besides import converters it is also important to have export services for the more broadly used naïve external formats. In fact, some structWSF services can be devoted to data cleanup or attribute (property) or object reconciliation (including disambiguation as a possibility). In this manner, structWSF installations could also improve the authority and trustworthiness of standard data in the wild.

Another common service for this naïve data is to give it unique URI identifiers and to make it Web-accessible, thus turning it into linked data.
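As a rough illustration of that last service, the Python sketch below mints a URI for a naïve record and writes one statement about it in N-Triples syntax. The base URL and the function names `mint_uri` and `to_ntriple` are assumptions made for this example; they are not structWSF endpoints, and actually making the data Web-accessible (serving something at those URIs) is outside what the sketch covers.

```python
from urllib.parse import quote


def mint_uri(dataset, local_id, base="http://example.org/datasets/"):
    """Build a URI from a dataset name and a local identifier (hypothetical scheme)."""
    # Percent-encode both parts so spaces and slashes cannot break the URI.
    return f"{base}{quote(dataset, safe='')}/{quote(str(local_id), safe='')}"


def to_ntriple(subject, predicate, literal):
    """Serialize one triple with a plain string literal as an N-Triples line."""
    # Escape backslashes first, then quotes and newlines, as N-Triples requires.
    escaped = (
        literal.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'<{subject}> <{predicate}> "{escaped}" .'
```

For example, `mint_uri("frogs", "Kermit the Frog")` gives a URI ending in `frogs/Kermit%20the%20Frog`, which can then serve as the subject passed to `to_ntriple`.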

The RDF Canonical Data Model

Such generic services are possible because the “highest common denominator” for the system is the canonical RDF model. Because it is the consistent basis for tools and services, once a converter is available and the external information schema is mapped to the internal structure, all existing tools and services are available for re-use. Moreover, this system and its datasets are now ready for sharing with other structWSF instances, within the enterprise or beyond.
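The re-use argument can be sketched in a few lines of Python. Everything here is hypothetical (the `CONVERTERS` registry, the `converter` decorator, the `http://example.org/` namespace, and the `attribute_usage` tool); none of it is taken from structWSF. The point is only that once each external format has a converter into the common triple form, a tool written once against that form works on data from any source.

```python
import csv
import io
import json
from collections import Counter

BASE = "http://example.org/"  # hypothetical namespace for illustration
CONVERTERS = {}  # external format name -> converter function


def converter(fmt):
    """Register a function as the converter for one external format."""
    def register(fn):
        CONVERTERS[fmt] = fn
        return fn
    return register


def records_to_triples(records):
    """Shared mapping step: flat records with an 'id' field become triples."""
    triples = []
    for rec in records:
        subject = f"{BASE}record/{rec['id']}"
        for key, value in rec.items():
            if key != "id":
                triples.append((subject, f"{BASE}attr/{key}", str(value)))
    return triples


@converter("json")
def from_json(raw):
    return records_to_triples(json.loads(raw))


@converter("csv")
def from_csv(raw):
    return records_to_triples(csv.DictReader(io.StringIO(raw)))


def ingest(fmt, raw):
    """Look up the converter for a format and return canonical triples."""
    return CONVERTERS[fmt](raw)


def attribute_usage(triples):
    """A generic tool: count how often each predicate appears."""
    return Counter(predicate for _, predicate, _ in triples)
```

Triples from `ingest("json", ...)` and `ingest("csv", ...)` can be concatenated and passed to `attribute_usage` unchanged; adding a new external format means registering one more converter, not rewriting the tool.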

Thus, we begin to see a network of canonical “hubs” in a sea of heterogeneity, the interoperation of which is facilitated by a structWSF framework at every network node. This design is discussed more in the next part of this series.

Some, such as Sowa noted above, would prefer a grounding in common logic (CL) as opposed to RDF. Our choice to use RDF is based on the simplicity and understandability of the data model, plus the richness of languages and standards from the W3C that surround the framework.

Even here, however, the RDF basis of structWSF need not be the final word. Because of a keen intent to keep all designs and ontologies used by structWSF firmly grounded in description logics, it is possible for the structWSF basis to be converted to other languages and frameworks such as CL that can be expressed in DL.

Bringing it Back to Data Federation

Data mixing (or, preferably, data federation) has at its heart the premise of heterogeneous and distributed data sources. It implicitly acknowledges differences in syntax, semantics and serializations.

The design and architecture of structWSF is similarly premised. While each of us may prefer one model or one format over others, we must interoperate in the real world. And that world, for many understandable and immutable reasons, will retain its diversity. Accepting this reality is a first step to adaptive design.

So, we control what we can control, and we adapt to what else exists. We have chosen RDF as the canonical data model that we can control and have embedded it in a Web services framework that is Web-based and scalable; in other words, a fully compliant Web-oriented architecture. These are the conceptual foundations to structWSF.

To be sure, structWSF in its current alpha release is quite raw in many areas and incomplete in others. But we will continue to work on it — and invite your participation to do the same — such that it can fulfill its destiny as a data federation framework for the Web.


[1] I first wrote about this while at BrightPlanet; a page is still up on that Web site with the text above. I have recast this material in various ways since.
[2] I have previously written on the “40 sources” of data heterogeneity. See here, for example.
[3] See http://ontolog.cim3.net/forum/ontolog-forum/2009-06/msg00210.html and continue to follow the noted thread.
[4] See the thread, ‘.htaccess a major bottleneck to Semantic Web adoption,’ at http://lists.w3.org/Archives/Public/public-lod/2009Jun/0341.html and continue to follow this thread.
[5] See http://microformats.org/discuss/mail/microformats-discuss/2009-June/012985.html and continue to follow the ‘mixing vocabularies’ thread.

[6] This is our working definition of the ABox and TBox in specific reference to description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[7] For functionality, download, documentation or other direct materials on structWSF, please see OpenStructs.org and its related resources. There is also a Drupal instantiation of the system called conStruct, also available for download.
Posted:June 21, 2009

SemTech 2009

SemTech 2009 is Now the Gold Standard

OK, I admit it. I’m a dweeb and a suck-up. I have just returned from the Semantic Technology Conference in San Jose (CA) and I could not be more impressed. For real semantic Web action, this was the place to be. And, I’m sure, it will continue to be so if its leadership stays intact for some time to come.

I know, it is really not my style to applaud others when I could be patting myself on the back. But, hey, this was such a remarkable meeting in so many ways that I feel I have to break from precedent.

In terms of disclosure, let me be clear: I hope I get invited back to speak again next year (only this time in a bigger forum. Hint. Hint.). But, even if not invited to speak, my company and I will be there with bells on for this simple reason: this is the semantic Web confab that matters.

Unlike others that have noted specific talks, etc., I will not do so. Not because there were not talks that warrant such visibility; indeed, there were a tremendous number. My biggest regret of the meeting is that I could only taste a portion of all of the talks because so much was going on. But, rather than singling out any specific talks, I’d like to comment more thematically.

The conference organizers reported that more than 500 paper suggestions were put forward for what ended up being about 150 speaking slots. This was in addition to many tutorials, keynotes and many rump meetups. My best guess is that there were on the order of 1200 in total attendance. It was a packed five days or so. San Jose and the facilities were excellent.

The Real Message

When tech shows reportedly are down on average 40% or so from prior years, it is pretty remarkable to have one exceed its prior attendance, which SemTech 2009 apparently did. There were real customers, real use cases and real interest at every turn and in every conversation. Many hard problems were brought forward; some without acceptable solutions yet.

The real theme I kept hearing was: data federation, data federation, data federation. The potential for semantic Web technologies via the RDF data model and OWL and ontologies for finally breaking down the barriers between data silos was hammered and probed. The timing, I think, could not have been better: shortly before the conference, PricewaterhouseCoopers released its technology quarterly report on linked data in the enterprise, which I reviewed a couple of weeks back.

I think we can safely say that the advances from linked data in the past couple of years have been huge enablers and eye-openers to these prospects. But I also had the sense that the discourse is now moving beyond linked data as practiced so far. Web identifiers and Web access, I think, have won the argument. It is now time to move on to real data, interoperability and efficient tools and build-out.

Making the Pragmatic Prominent

To be sure there were discussions of more consumer-oriented apps and search. But the major energy and action seemed to center on the enterprise.

The idea is how RDF can bring us leverage, not replace what already exists. After 30 years of frustration, how can we finally solve the data federation problem? How can we remove the historical brittleness of applications and report writers? How can we actually begin to extract business intelligence from the massive data assets we already have at hand?

Asking enterprises to junk what they have for promises and prayers will not cut it. The winning strategy, and the challenge I kept hearing, was: How can we layer on semantic technologies and RDF to bridge our existing data stores? How can we leave our RDBMSs in place while gaining the goodness of ontologies and semantics?

We clearly see all around us the power of open source and the withering of proprietary apps and approaches. But, much data and information will remain private and needs to have access and rights restrictions. What answers do semantic technologies offer in these areas?

Then, as suppliers in this brave new world of open source and low software rents, what is the winning business model? Tom Tague helped articulate the importance of revenue models and options in his keynote; it resonated with already ongoing discussions in the hallways.

I’ve been in this space for more years than I care to admit. My observation from prior years is that some new “big thing” is identified, given blessing and push by the industry analysts (always with a new acronym), and then hyped like hell. Maybe it is the current challenged economic climate, but it feels like those days are over. For good.

Hype will not open wallets anymore. Case studies and real warts will help bring confidence that something truly different is at hand with semantic technologies. Our central challenge as suppliers to this market is to respond to today’s pragmatic imperatives. We must demonstrate more with less and faster. We must emphasize leverage and re-use. We must respect the trillions in already sunk IT assets.

Why This Matters to You

I think this matters much for three different communities.

For enterprises, I think it means that it is time for pilots and engagements. Neither the market nor the suppliers can move this space forward rapidly without meaningful engagement. We’re ready, and it seems like many of you are as well. Push it with your bosses; we’ll deliver.

For the linked data community, where do you go next? I, too, heard some of the criticisms about too much “ontology.” But such discussion risks wasting the gains already achieved. If we do not listen and respond to the market’s imperatives and voice, we will become irrelevant. Let’s accept linked data as a tremendously helpful step in an ongoing progression, but continue to mature.

For some of the more established semantic technology providers, we have to make it simpler and faster. Expensive ontology development, too, will become irrelevant if we are indeed going to replace conventional software development with data-driven apps. Fortunately, I saw much that was exciting in this space and really had my eyes opened to tremendous innovation.

Outside of the venue, I heard from some of my prior Silicon Valley colleagues that this was the most constrained VC situation they have seen in decades. Funds may exist, but capital calls are not being made. What little powder there is, is being kept dry to triage existing investments. It is a good thing capital requirements for new start-ups have declined so much in recent years, because VCs are unlikely to fund the gap. And, aside from some big, prominent initiatives like data.gov or health care digitization, most savvy observers would bet that US and EU funding will also begin drying up in the coming years.

All of this can sound like bad news, but I think it is an opportunity: As technologists and suppliers, we must be relentlessly revenue focused and deliver what the market is demanding: more with less, faster, while preserving existing investments.

Five Stars: The Craft of Conference Organizing

The organizers from Wilshire Conferences and their entire staff did an absolutely tremendous job. Tony Shaw, Eric Franzon, Steve Bastasini and Eric Hoffer (I know Eric, you were only pinch hitting), plus the many on-site staff, were uniformly professional and unobtrusive. Sally Khudairi on PR and the A/V and registration crew were also excellent.

I once had responsibility for an annual technical meeting that averaged more than 2000 attendees and 150 exhibitors and I appreciate how many moving parts there are behind the scenes. Things work when nothing gets noticed. My guess is few noticed any issues or problems at this conference.

The stated aim on the intro slide to each session was to educate, and the agenda certainly achieved that. A/V was professional; time was kept; coffee did not run out; wi-fi glitches were quickly solved.

Sure, like any business, there is some pay-to-play in such conferences. Big sponsors get more slots and visibility. This reality, however, was also well balanced with new voices and innovative presenters. My “to do” notes and contacts resulting from the conference will take quite some time to work through.

One of the things I really appreciated was how the time slots and composition of talks and activities were varied each day. I have not attended a meeting before that did such a good job of mixing the schedule up to keep things feeling fresh over so many intense days.

Many thanks, folks, for a conference exceedingly well done.

Last Thoughts

If I had to note a quibble, I guess it would be to start the conference with more challenges and innovations. While the tutorials are very helpful, the opening talks, I hope, could be less introductory in nature. I think things are maturing fast. But, I could be wrong. First-time attendees from the marketplace should probably guide how such events start ramping up the engine.

As I noted, Structured Dynamics and I will be back next year, and will hopefully contribute in more ways as well. The venue for SemTech 2010 has changed to the Hilton in San Francisco, on June 21 – 25.

So, start saving for your travel budget now. This is “must see” semantic Web. And I look forward to seeing you there in a year!