The New Paradigm of ‘Substantive Marketing’ for Innovative ITThis decade has clearly marked a sea change in the move of enterprise software from proprietary to open source, as I have recently discussed [1]. It is instructive that only a mere six years ago I was in heated fights with my then Board about open source; today, that seems so quaint and dated.![World's Tallest Flagpole; see ref [9] World's Tallest Flagpole; see ref [9]](../wp-content/themes/ai3/images/2011Posts/110815_tallest_flagpole.jpg)
Also during this period many have noted how open source has changed the capital required to begin a new software startup [2]. Open source both provides the tooling and the components for cobbling together specialty apps and extensions. Six and seven and even eight figure startup costs common just a decade ago have now dropped to four or five figures. When we see the explosion of hundreds of thousands of smartphone apps we are seeing the glowing residue of these additional sea changes. Dropping startup costs by one to three orders of magnitude is truly democratizing innovation.
But something else has been going on that is changing the face of enterprise software (besides consolidation, another factor I also recently commented on). And that factor is “marketing”. Much less commentary is made about this change, but it, too, is greatly lowering costs and fundamentally changing market penetration strategies. That topic — and my personal experience with it — is the focus of this article.
This Friday brown bag leftover was first placed into the AI3 refrigerator on August 15, 2011. This reprise is unchanged from its original posting and still describes how Structured Dynamics undertakes its marketing.Besides the few remaining big providers of enterprise software — like IBM, Oracle, HP, SAP — most vendors have totally remade their sales practices of just a few years ago. Large sales forces with big commissions and a year to two year sales cycles can no longer be justified when software license fees and the percentage maintenance annuities that flow from them are dropping rapidly. Today’s mantras are doing more with less and doing it faster, hardly consistent with the traditional enterprise software model. Sure, big enterprises, especially big government and big business, have large sunk costs in legacy systems that will continue to be milked by existing vendors. But the flow is constricting with longer-term trends clear to see. The old enterprise software model is obsolete.
Even if it were not dying, it is hard to square huge investments in sales and marketing when product development has become inexpensive and agile. The proliferation of three-letter marketing acronyms for branding “new” product areas and standard formulas for product hype of just a few years ago also feels old and dated. Cozy relationships with conventional trade press pundits and market analysts seem to be diminishing in importance, possibly because the authoritativeness of their influence is also diminishing. It is harder to justify market firm subscription costs when priority budget items are being cut and new information outlets have emerged.
In response to this, many developers have forsaken the enterprise market for the consumer one. Indeed enterprises themselves are looking more and more to the consumer sector and commodity apps for innovation and answers. But, still, problems unique to enterprises remain and how to effectively reach them in this brave new world is today’s marketing problem for enterprise software vendors.
Most entities today, when opining about these challenges, tend to emphasize the need for “laser focus” and “rifle-shot” targeting of prospects. The advice takes the form of: 1) emphasize well-defined verticals; 2) know your market well; and 3) target and go after your likely prospects. Prospect data mining and targeted ad analysis are the proferred elixirs.
But, there is little evidence such refined methods for prospect identification and targeting are really working. Like politicians doing focus groups and opinion polling to capture the desired “message” of their potential electorates, these are all still “push” models of marketing. Yet we are swamped with pushed messages and marketing everywhere we turn. The model is failing.
Besides message overload, there are two issues with laser targeting. First, despite all that we try to know about ready buyers (for enterprise software), we really don’t know if any particular individual is truly needful, in a position to buy, has the authority to buy, or is the right advocate to make the internal sell. Second, though the idea of “laser” carries with it the image of focus and not flailing, it is in fact expensive to identify the targets and send a focused message their way. Because of these issues, decay rates for laser prospects throughout conventional sales pipelines continue to rise.

There has always been the phenomenon of the “fish jumping into the boat“; that is, the unanticipated inbound inquiry from a previously unknown prospect leading to a surprisingly swift sale. But we have seen this phenomenon increase markedly in recent years. Structured Dynamics‘ current customer base — including recurring customers — comes almost exclusively from this source. As we have noted this trend in comparison with more targeted outreach, we have spent much time trying to understand why it is occurring and how we can leverage what Peter Drucker called the “unexpected success” [3].
What we are seeing, I believe, is a shift from sales to marketing, and within marketing from direct or outbound marketing to a new paradigm of marketing. Others have likened this to inbound marketing [4] or content marketing [5] or permission marketing [6]. What we are seeing at Structured Dynamics bears many resemblances to parts of what is claimed for these other approaches, but not all. And, it is also true that what we are seeing may pertain mostly to innovative IT for emerging enterprise markets, and not a generalized paradigm suitable to other products or markets.
For lack of a better term, what we are seeing we can term “substantive marketing”. By this we mean offering valuable content and solutions-oriented systems for free and without restriction. This shares aspects with content marketing. Then, in keeping with the trend for buyers doing their own research and analysis to fulfill their own needs, similar to the premises of inbound or permission marketing, potential consumers can make their own judgments as to relevance and value of our offerings.
Sometimes, of course, some prospects find our approaches and solutions lacking. Sometimes, they may grab what we have offered for free and use them on their own without compensation to us. But where the match is right — and we need to be honest with both ourselves and the customer when it is not — we can better spend the customer’s limited time and resources to tailor our generic solutions to their specific needs. In doing so, we offer higher value (tailored services) while learning better about another spectrum of consumer need that can virtuously enhance our substantive offerings for the next prospect.
So, let’s decompose these components further to see what they can tell us about this new practice of substantive marketing and how to use it as an engine for moving forward.
The premise of substantive marketing is to offer square-deal value to the marketplace in the form of solutions-based content. Like content marketing that offers “the creation or sharing of content for the purpose of engaging current and potential consumer bases” [5], substantive marketing goes even further. The whole basis and premise of the approach is to provide substantive content, in one of more of these areas, preferably all:
Further, this substantive content is offered without strings, restrictions or customer fill-in forms. The content is not a come on or a teaser. We are not trying to gather leads or prospect names, because we have no intent to dun them with emails or follow-ups.
This substantive content is as complete as can be to enable new users to adopt the information and tools in their current state without further assistance. (In some cases, the information also educates the marketplace in order to prepare future customers for adoption.) Most importantly, this substantive content is offered for free, either open source (for code) or creative commons for documentation and other content. In return, it is fair to request — and we do — attribution when this material is used.
We have previously termed this complete panoply of substantive content a total open solution [7]. Some might find the provision of such robust information crazy: How can we give away the store of our proprietary knowledge and systems?
But we find this kind of thinking old school. In an open source world where so much information is now available online, with a bit of effort customers can find this information anyway. Rather, our mindset is that customers do not want to pay again for what has already been done, but are willing to pay for what can be done with that knowledge for their own specific problems. Offering the complete storehouse of our knowledge in fact signals our interest in only charging the customer for new answers, new value or new formulations. The customers we like to work with feel they are getting an honest, square deal.
Consider your substantive content to be your flag, a unique banner for conveying and packaging your specific brand. It is thus important to find appropriate flagpoles — in the virtual territories that your customers visit — for raising this content high for them to see. Since the role of these flagpoles is to create awareness in potential prospects — who you do not likely know individually or even by group in advance — it makes sense to raise your offerings up on many flagpoles and on the highest flagpoles. Visibility is the object of the approach.
This approach is distinctly not leafletting or cramming links or emails into as many spaces as possible. The idea of substantive marketing is to fly valuable content high enough that desirous potential customers can discover and then inspect the information on their own, and only if they so choose. In this regard, substantive marketing resembles permission marketing [6].
Being visible helps ensure that the needful, questing prospect that you would never have been able to target on your own is able to see and be aware of your offerings. And, since they are seeking information and answers, your collateral needs to be of a similar nature. Solutions and substance are what they are seeking; what you have run up the flagpole should respond to that.
The mindset here is to respect your prospective customers and to allow them to chose to receive and inspect your offerings, but only if they so choose. If flown in the right venues with the right visibility, customers will see your flags and inspect them if they meet their requirements.
Some of the venues at which you can raise your flags include:
The observant reader will have already concluded that each of these venues develops slowly, and therefore raising visibility is generally a slow-and-steady game that requires patience. Start-up vendors backed by venture firms or those looking for quick visibility and cashout will not find this approach suitable. On the other hand, customer prospects looking for answers and self-sustaining solutions are not much interested in flash in the pan vendors, either.
The real drivers for this changing paradigm come from customer prospects. Sophisticated buyers of enterprise IT and instrumental change agents within organizations share most if not all of these characteristics:
More often than not we find our customers to have already installed and used our existing substantive materials for some time before they approach us about further work. They appreciate the tutorial information and have taught themselves much in advance. By the time we engage, both parties are able to cost-effectively focus on what is truly missing and needed and to deliver those answers in a quick way. Re-engagements tend to occur when a next set of gaps or challenges arise.
Though it may sound trite or even unbelievable to those who have not yet experienced such a relationship, the square deal value offered by substantive marketing can really lead to true partnerships and trust between vendor and customer. We experience it daily with our customers, and vice versa. We also think this is the adaptive approach that our new environment demands.
Once prospects learn of our substantive offerings, many may decide independently that what we have is not suitable. Others may simply download and use the information on their own, for which we often never know let alone receive revenue. We are completely fine with this, as shown for three different cases.
First, some of these prospects need no more than what we already have. This increases our user base, increases our visibility and often results in contributions to our forums and documentation.
Then, some of these prospects come to learn they need or want more than what our current offerings provide, leading to two possible forks. In one fork, the second case, they may have sufficient skills internally or with other suppliers to extend the system on their own. Some of this flows back to an improved code base or improved installation or documentation bases.
In the other fork, the third case, they may decide to engage us in tailoring a solution for them. That case is the only one of the three that leads to a direct revenue path.
In all three cases we win, and the customer wins. Maybe enterprise software vendors of decades past rue this reality of lower margins and shared benefits; we agree that the absolute profit potential of substantive marketing is much less. But we gladly accept the more enjoyable work and steady revenue relationships resulting from these changes. We are not engaged in some pollyann-ish altruism here, but in a steely-eyed honest brokering that best serves our own self-interest (and fairly that of the customer, as well).
Great IT product does not come from idle musings or dreamed up functionality. It comes solely and directly from solving customer problems. Only via customers can software be refined and made more broadly usable.
A slipstream of those who have previously become aware and tested our offerings will choose to engage our services. This generally takes the form of an inbound call, where the prospect not only qualifies itself, but also establishes the terms and conditions for the sale. They have chosen to select us; they are fish that have jumped into the boat.
To again quote Peter Drucker, “. . . the aim of marketing is to make selling superfluous. The aim of marketing is to know and understand the customer so well that the product or service fits him and sells itself. Ideally, marketing should result in a customer who is ready to buy. All that should be needed then is to make the product or service available . . .” [8]. This is precisely what I meant earlier about the shift in emphasis from sales to marketing.
Even at this point there may be mismatches in needs and our skills and availabilities. If such is the case, we do not hesitate to say so, and attempt to point the prospect in another direction (from which we also gain invaluable market knowledge). If there is indeed a match, we then proceed to try to find common ground on schedule and budget.
Paradoxically, this square deal and honesty about the readiness and weaknesses of our offerings often leads to forgiveness from our customers. For example, for some time we have lacked automated installation scripts that would make it easier for prospects to install our open semantic framework. But, because of compensating value in other areas, such gaps can be overlooked and tackled later on (indeed, as a current customer is now funding). By not pretending to be everything to everyone, we can offer what we do have without embarrassment and get on with the job of solving problems.
For larger potential engagements, we typically suggest a fixed price initial effort to develop an implementation plan. The interviews and research to support this typical 4- to 6-weeks effort (generally in the $5 K to $10 K range, depending) then result in a detailed fulfillment proposal, with firm tasks, budget and schedule, specific to that customer’s requirements. Just as we respect our prospects’ time and budget, we expect the same and do not conduct these detailed plans without compensation. With respect to fulfillment contracts, we cap contract amount and limit milestone payments to pre-set percentages or time expended, whichever is lower.
This approach ensures we understand the customer’s needs and have budgeted and tasked accordingly. Capped contracts also put the onus on us the contractor to understand our own effort and tasking structures and realities, which leads to better future estimating. For the customer, this approach caps risk and potential exposure, and ensures milestones are being met no matter the time expenditures by us, the contractor. This approach extends our square-deal basis to also embrace risks and payments.
Thus, when customers engage us, they spend almost solely on new functionality specifically tailored to their needs. In doing so, we suggest they agree to release the new developments they fund as open source. We argue — and customers predominantly agree — that they are already benefitting from lower overall costs because other customers have funded sharable, open source before them. We point out that the new customers that follow them will also be independently creating new functionality, to which they will also later benefit.
(This argument does not apply to specific customer data or ontologies, which are naturally proprietary to the customer. Also, if the customer wants to retain intellectual ownership of extensions, we charge higher development fees.)
Once these new developments are completed, they are fed back into a new baseline of valuable content and code. From this new baseline the cycle of substantive marketing can be augmented anew and perpetuated.
All of these points can really be boiled down to three guidelines for how to make substantive marketing effective:
What we are finding — as we continue to refine our understanding of this new paradigm — is that through substantive marketing the fish are finding us and they sometimes jump into the boat. We like our enterprise customers to pre-qualify themselves and already be “sold” once they knock on the door. One never knows when that phone might ring or the email might come in. But when it does, it often results in a collaborative customer as a partner who is a joy to work with to solve exciting new problems.

Ever since I first started to learn in earnest about ontology, something has been gnawing at me. The term seemed to be (shall I say?) an obtuse one whose obscurity was not the result of subtle precision or technicality, but rather one of fuzziness. As I introduced my Intrepid Guide to Ontology two years ago, I noted:
Since then, I have continued to find ontology one of the hardest concepts to communicate to clients and quite a muddled mess even as used by practitioners. I have come to the conclusion that this problem is not because I have failed to grasp some ephemeral nuance, but because the term as used in practice is indeed fuzzy and imprecise.
Even two years ago, I noted more than 40 different types of information structure that have at one time or another been labelled as an example of an “ontology”:
Since then, I could add even more terms to this list.
Lack of precision as to what ontology means has meant that it has been sloppily defined. As I have harped upon many times regarding semantic Web terminology, this is a sad state of affairs for the semWeb endeavor that has meaning at the core of its purpose.
I’m pretty sure that the original intent in embracing the concept of ontology within the realm of knowledge representation was not to see this term so broadly misused or mis-applied. I suspect, as well, that if we could sharpen up our understanding and remove some of the fuzziness that we could improve communications with the lay public across many levels of the semWeb enterprise.
Recently, I have been looking to the semantic Web’s roots in description logics. One of my writings, Thinking ‘Inside the Box’ with Description Logics, looked at the conceptual distinctions between the so-called ‘TBox‘ and ‘ABox‘. That is, a knowledge base is a logical schema of roles and concepts and the relationships between them (the TBox), which is populated by the actual data (instances) asserting memberships and attributes (“facts”) (the ABox).
By analogy, in a conventional relational database system, the database or logical schema would correspond to the TBox; the actual data records or tables would correspond to the ABox. Often, the term ontology is used to cover both ABox and TBox statements (which, I argue, only makes the understanding of the ‘ontology’ concept more difficult).
My recent writing, Back to the Future with Description Logics, discussed at some length the advantages of keeping the TBox and ABox separate. This current article now expands on those thoughts, particularly with respect to the definition and understanding of ontology.
The starting point for this new mindset is to return to the ideas of data records or data tables v. the logical schema that is prevalent in relational databases.
The last time I took a census, about a year ago, there were more than 100 converters of various record and data structure types to RDF [2]. These converters — also sometimes known as translators or ‘RDFizers’ — generally take some input data records with varying formats or serializations and convert them to a form of RDF serialization (such as RDF/XML or N3), often with some ontology matching or characterizations. That last census listed these converters:
|
|
Note that MIT’s SIMILE RDFizers also recognizes these formats:
|
There is a growing list of third-party RDFizers as well:
|
This wealth of formats shows the robustness of the RDF data model to capture structure and data relationships from virtually any input form. This is what makes RDF so exciting as a canonical target for getting data to interoperate.
However — and this is crucial — standard users for decades have preferred simple, text-based and human readable formats for writing and transferring their structured data.
These various forms, sometimes well specified with APIs and sometimes almost ad hoc as in spreadsheet listings, are what we call ‘structs‘. Structs can all be displayed as text and have, at minimum, explicit or inferrable key-value pairs to convey data relationships and attributes, with data types and values often noted by various white space, delimiter or other text conventions.
There is no doubt that the vast majority of extant data is found in such formats, including the results of data or information extraction from unstructured text. Indeed, even HTML and many markup languages with their angle bracket-delimited fields fall into this category.
There have literally been hundreds of various formats proposed over decades for conveying lightweight data structures. Most have been proprietary or limited to specific domains or users. Some, such as fielded text, structured text, simple declarative language (SDL), or more recently YAML or its simpler cousin JSON, have become more widely adopted and supported by formal specifications, tools or APIs. JSON, especially, is a preferred form for Web 2.0 applications.
Some, like microformats or this example BibTeX record below (with some non-standard extensions), rely less on syntax conventions and may use reserved keywords (such as AUTHOR or TITLE as shown) to signal the key type for the key-value pair:
ID_LOCAL arXiv:0711.3808 AUTHOR <a href="#Schramm_O">Oded Schramm</a> BIBTYPE ARTICLE ID arXiv:0711.3808 JOURNAL Electron. Res. Announc. Math. Sci. PAGES 17--23 SUBJECTS geom TITLE Hyperfinite graph limits URL http://www.aimsciences.org/journals/doIpChk.jsp?paperID=3117&mode=full URL http://www.aimsciences.org/journals/displayPapers0.jsp?comments=&pubID=221&journID=14&pubString_num=Volume: 15, 2008 Journal Issue VOLUME 15 YEAR 2008
Some of these simple formats have been more successful than others, though none have achieved market dominance. There also appear to be few universal principles that have emerged as to syntax or format. Nonetheless, any of these various struct forms are easy for casual readers to understand and easy for domain experts to write.
For modeling and interoperability purposes, many of these forms are patently inadequate. That is why many of these simpler forms might be called “naïve”: they achieve their immediate purpose of simple relationships and communication, but require understood or explicit context in order to be meaningfully (semantically) related to other forms or data.
Yet, if we have learned nothing else with the phenomenal success of the Web it is this: simplicity trumps elegance or expressivity.
The RDF (Resource Description Framework) data model is expressed as simple subject-predicate-object “triple” statements. That sounds fancy, but just substitute verb for predicate and noun for subject and object. In other words: Dick sees Jane; or, the ball is round. It may sound like a kindergartner reader, but it is how data can be easily represented and built up into more complex structures and stories.
RDF triples can be applied equally to all structured, semi-structured and unstructured content. RDF is clearly a most capable data model that — through its ability to be extended with further concepts and relationships (vocabulary) — can create elegant and logical structures to represent comprehensive domains and knowledge bases. Finding such a model has been a quest in my professional life; I believe we finally have a winner to facilitate data interoperability using RDF.
But RDF has not achieved the market acceptance that its suitability as a data representation model might suggest. I think there are three reasons for this:
Canonical forms embody all of the specification that the canon guiding them requires. What we may have failed to see in embracing RDF, however, is that getting useful data into the system need not carry all of this burden.
So, what does all of this have to do with my starting diatribe about the term ontology?
Whether a single database or the federation across all information known to human kind, we have data records (structs of instances) and a logical schema (ontology of concepts and relationships) by which we try to relate this information. This is a natural and meaningful split: structure and relationships v. the instances that populate that structure.
Stated this way, particularly for anyone with a relational database background, the split between schema and data is clear and obvious. Yet, the RDF, semantic Web and linked data communities have done an abysmal job of recognizing this fundamental separation of concerns.
We create “ontologies” that mix instances and schema. We insist on simple data record conversions that are burdened with relationship specifications as well. We tout a “linked data” infrastructure that is based solely on the same identity of instances without respect or attention to structure or conceptual relationships. We dismiss communities that work to express their data with useful local structures. We insist on standards and practices up and down the data staging and preparation chain that turns off the general market and makes us seem arrogant and dismissive. Frankly, in so many ways, we just don’t get it [3].
What has struck me personally over the past few months as these realizations have unfolded has been how much our own mindsets and language may be trapping us.
At least for this diatribe, my essential conclusion is that we need to shift the burden of the schema and conceptual relations and (yes) world views to the TBox. We need to skinny down the ABox and make it a warm and welcoming environment by which any structured data (including the most naïve) can join.
So, ultimately, the bottom line is this: the burden of the semantic Web rests on us, not the providers of structured data.
It is time to streamline the ABox to smooth data contributions, assume as publishers the responsibility for the TBox, and keep those concerns separate. As for instance-related stuff, I now intend to refer to them as structs governed by a controlled vocabulary (at most). I intend to reserve ontology as a means to describe a given world view, a TBox, the schema and its relations of the domain at hand. And, frankly, this definition of ontology brings it back in balance with its roots in ontos and the nature of the world.
It’s a good time to lighten up!
This Friday brown bag leftover was first placed into the AI3 refrigerator on January 22, 2009, and is one of the more popular historical posts of this blog. This reprise is unchanged since its original posting, though we have continued to make progress on constructs such as irON to capture this idea. Microdata in HTML5 is also an important contribution, to which we will devote some attention in the near future.

Ontology is one of the more daunting terms for those exposed for the first time to the semantic Web. Not only is the word long and without many common antecedents, but it is also a term that has widely divergent use and understanding within the community. It can be argued that this not-so-little word is one of the barriers to mainstream understanding of the semantic Web.
The root of the term is the Greek ontos, or being or the nature of things. Literally — and in classical philosophy — ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined ontology as a “formal specification of a conceptualization.”
While there have been attempts to strap on more or less formal understandings or machinery around ontology, it still has very much the sense of a world view, a means of viewing and organizing and conceptualizing and defining a domain of interest. As is made clear below, I personally prefer a loose and embracing understanding of the term (consistent with Deborah McGuinness’s 2003 paper, Ontologies Come of Age [1]).
There has been a resurgence of interest in ontologies of late. Two reasons have been the emergence of Web 2.0 and tagging and folksonomies, as well as the nascent emergence of the structured Web. In fact, on April 23-24 one of the noted communities of practice around ontologies, Ontolog, sponsored the Ontology Summit 2007 ,”Ontology, Taxonomy, Folksonomy: Understanding the Distinctions.”
These events have sparked my preparing this guide to ontologies. I have to admit this is a somewhat intrepid endeavor given the wealth of material and diversity of opinions.
This Friday brown bag leftover was first placed into the AI3 refrigerator more than three years ago on May 16, 2007. This reprise is unchanged since its original posting, though there is a more recent executive-level intro to ontologies on the OpenStructs‘ TechWiki.Of course, a fancy name is not sufficient alone to warrant an interest in ontologies. There are reasons why understanding, using and manipulating ontologies can bring practical benefit:
Both structure and formalism are dimensions for classifying ontologies, which combined are often referred to as an ontology’s “expressiveness.” How one describes this structure and formality differs. One recent attempt is this figure from the Ontology Summit 2007‘s wrap-up communique:
Note the bridging role that an ontology plays between a domain and its content. (By its nature, every ontology attempts to “define” and bound a domain.) Also note that the Summit’s 50 or so participants were focused on the trade-off between semantics v. pragmatic considerations. This was a result of the ongoing attempts within the community to understand, embrace and (possibly) legitimize “less formal” Web 2.0 efforts such as tagging and the folksonomies that can result from them.
There is an M.C. Escher-like recursion of the lizard eating its tail when one observes ontologists creating an ontology to describe the ontological domain. The above diagram, which itself would be different with a slight change in Summit participation or editorship, is, of course, but one representative view of the world. Indeed, a tremendous variety of scientific and research disciplines concern themselves with classifying and organizing the “nature of things.” Those disciplines go by such names as logicians, taxonomists, philosophers, information architects, computer scientists, librarians, operations researchers, systematicists, statisticians, historians, and so forth. (In short, given our ontos, every area of human endeavor has the urge to classify, to organize.) In each of these areas not only do their domains differ, but so do the adopted structures and classification schemes often used.
There are at least 40 terms or concepts across these various disciplines, most related to Web and general knowledge content, that have organizational or classificatory aspects that — loosely defined — could be called an “ontology” framework or approach:
Actual domains or subject coverage are then mostly orthogonal to these approaches.
Loosely defined, the number of possible ontologies is therefore close to infinite: domain X perspective X schema. (Just kidding — sort of! In fact, UMBC’s Swoogle ontology search service claims 10,000 ontologies presently on the Web; the actual data from August 2006 ranges from about 16,000 to 92,000 ontologies, depending on how “formal” the definition. These counts are also limited to OWL-based ontologies.)
Many have misunderstood the semantic Web because of this diversity and the slipperiness of the concept of an ontology. This misunderstanding becomes flat wrong when people claim the semantic Web implies one single grand ontology or organizational schema, One Ring to Rule Them All. Human and domain diversities makes this viewpoint patently false.
The choice of an ontological approach to organize Web and structured content can be contentious. Publishers and authors perhaps have too many choices: from straight Atom or RSS feeds and feeds with tags to informal folksonomies and then Outline Processor Markup Language or microformats. From there, the formalism increases further to include the standard RDF ontologies such as SIOC (Semantically-Interlinked Online Communities), SKOS (Simple Knowledge Organizing System), DOAP (Description of a Project), and FOAF (Friend of a Friend) and the still greater formalism of OWL’s various dialects.
Arguing which of these is the theoretical best method is doomed to failure, except possibly in a bounded enterprise environment. We live in the real world, where multiple options will always have their advocates and their applications. All of us should welcome whatever structure we can add to our information base, no matter where it comes from or how it’s done. The sooner we can embrace content in any of these formats and convert it to a canonical form, we can then move on to needed developments in semantic mediation, the threshold condition for the semantic Web.
So, diversity is inevitable and should be accepted. But that observation need not also embrace chaos.
In my early training in biological systematics, Ernst Haeckel’s recapitulation theory that “ontogeny recapitulates phylogeny” (note the same ontos root, the difference from ontology being growth v. study) was losing favor fast. The theory was that the development of an organism through its embryological phases mirrors its evolutionary history. Today, modern biologists recognize numerous connections between ontogeny and phylogeny, explain them using evolutionary theory, or view them as supporting evidence for that theory.
Yet, like the construction of phylogenetic trees, systematicists strive for their classifications of the relatedness of organisms to be “natural”, to reflect the true nature of the relationship. Thus, over time, that understanding of a “natural” system has progressed from appearance → embryology → embryology + detailed morphology → species and interbreeding → DNA. While details continue to be worked out, the degree of genetic relatedness is now widely accepted by biologists as a “natural” basis for organizing the Tree of Life.
It is not unrealistic to also seek “naturalness” in the organization of other knowledge domains, to seek “naturalness” in the organization of their underlying ontologies. Like natural systems in biology, this naturalness should emerge from the shared understandings and perceptions of the domain’s participants. While subject matter expertise and general and domain knowledge are essential to this development, they are not the only factors. As tagging systems on the Web are showing, common usage and broad acceptance by the community at hand is important as well.
While it may appear that a domain such as the biological relatedness of organisms is more empirical than the concepts and ambiguous words in most domains of human endeavor, these attempts at naturalness are still not foolish. The phylogeny example shows that understanding changes over time as knowledge is gained. We now accept DNA over the recapitulation theory.
As the formal SKOS organizational schema for knowledge systems recognizes (see below), the ideas of narrower and broader concepts can be readily embraced, as well as concepts of relatedness and aliases (synonyms). These simple constructs, I would argue, plus the application of knowledge being gained in related domains, will enable tomorrow’s understandings to be more “natural” than today’s, no matter the particular domain at hand.
So, in seeking a “naturalness” within our organizational schema, we can also see that change is a constant. We also see that the tools and ideas underlying the seemingly abstract cause of merging and relating existing ontologies to one another will further a greater “naturalness” within our organizations of the world.
According to the Summit, expressiveness is the extent and ease by which an ontology can describe domain semantics. Structure they define as the degree of organization or hierarchical extent of the ontology. They further define granularity as the level of detail in the ontology. And, as the diagram above alludes, they define other dimensions of use, logical basis, purpose and so forth of an ontology.
The over fifty respondents from 42 communities submitted some 70 different ontologies under about 40 terms to a survey that was used by the Summit to construct their diagram. These submissions included:
I think the simplest spectrum for such distinctions is the formalism of the ontology and its approach (and language or syntax, not further discussed here). More formal ontologies have greater expressiveness and structure and inferential power, less formal ones the opposite. Constructing more formal ontologies is more demanding, and takes more effort and rigor, resulting in an approach that is more powerful but also more rigid and less flexible. Like anything else, there are always trade-offs.
Based on work by Leo Obrst of Mitre as interpreted by Dan McCreary, we can view this as a trade-off as one of semantic clarity v. the time and money required to construct the formalism [12, 13]:
Note this diagram reflects the more conventional, practitioner’s view of the “formal” ontology, which does not include taxonomies or controlled vocabularies (for example) in the definition. This represents the more “closely defined” end of the ontology (semantic) spectrum.
However, since we are speaking here of ontologies and the structured Web or the semantic Web, I believe we need to embrace a concept of ontology aligned to actual practice. Not all content providers can or want to employ ontology engineers to enable formal inferencing of their content. Yet, on the other hand, their content in its various forms does have some meaningful structure, some organization. The trick is to extract this structure for more meaningful use such as data exchange or data merging.
Under such “loosely defined” bases we can thus see a spectrum of ontology approaches on the Web, proceeding from less structure and formalism to more so:
| Type or Schema | Examples | Comments on Structure and Formalism | |
| Standard Web Page | entire Web | General metadata fields in the and internal HTML codes and tags provide minimal, but useful sources of structure; other HTTP and retrieval data can also contribute | |
| Blog / Wiki Page | examples from Technorati, Bloglines, Wikipedia | Provides still greater formalism for the organization and characterization of content (subjects, categories, posts, comments, date/time stamps, etc.). Importantly, with the addition of plug-ins, some of the basic software may also provide other structured characterizations or output (SIOC, FOAF, etc.; highly varied and site-specific given the diversity of publishing systems and plug-ins) | |
| RSS / Atom feeds | most blogs and most news feeds | RSS extends basic XML schema for more robust syndication of content with a tightly controlled vocabulary for feed concepts and their relationships. Because of its ubiquity, this is becoming a useful baseline of structure and formalism; also, the nature of adoption shows much about how ontological structure is an artifact, not driver, for use | |
| RSS / Atom feeds with tags or OPML | Grazr, most newsfeed aggregators can import and export OPML lists of RSS feeds | The OPML specification defines an outline as a hierarchical, ordered list of arbitrary elements. The specification is fairly open which makes it suitable for many types of list data. See also OML and XOXO | |
| Hierarchical Faceted Metadata | XFML, Flamenco | These and related efforts from the information architecture (IA) community are geared more to library science. However, they directly contribute to faceted browsing, which is one of the first practical instantiations of the semantic Web | |
| Folksonomies | Flickr, del.icio.us | Based on user-generated tags and informal organizations of the same; not linked to any “standard” Web protocols. Both tags and hierarchical structure are arbitrary, but some researchers now believe over large enough participant sets that structural consensus and value does emerge | |
| Microformats | Example formats include hAtom, hCalendar, hCard, hReview, hResume, rel-directory, xFolk, XFN and XOXO | A microformat is HTML mark up to express semantics with strictly controlled vocabularies. This markup is embedded using specific HTML attributes such as class, rel, and rev. This method is easy to implement and understand, but is not free-form | |
| Embedded RDF | RDFa, eRDF | An embedded format, like microformats, but free-form, and not subject to the approval strictures associated with microformats | |
| Topic Maps | Infoloom, Topic Maps Search Engine | A topic map can represent information using topics (representing any concept, from people, countries, and organizations to software modules, individual files, and events), associations (which represent the relationships between them), and occurrences (which represent relationships between topics and information resources relevant to them) | |
| RDF | Many; DBpedia, etc. | RDF has become the canonical data model since it represents a “universal” conversion format | |
| RDF Schema | SKOS, SIOC, DOAP, FOAF | RDFS or RDF Schema is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources. This becomes the canonical ontology common meeting ground | |
| OWL Lite | Here are some existing OWL ontologies; also see Swoogle for OWL search facilities | The Web Ontology Language (OWL) is a language for defining and instantiating Web ontologies. An OWL ontology may include descriptions of classes, along with their related properties and instances. OWL is designed for use by applications that need to process the content of information instead of just presenting information to humans. It facilitates greater machine interpretability of Web content than that supported by XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. The three language versions are in order of increasing expressiveness | |
| OWL DL | |||
| OWL Full | |||
| Higher-order “formal” and “upper-level” ontologies | SUMO, DOLCE, PROTON, BFO, Cyc, OpenCyc | These provide comprehensive ontologies and often related knowledge bases, with the goal of enabling AI applications to perform human-like reasoning. Their reasoning languages often use higher-order logics |
As a rule of thumb, items that are less “formal” can be converted to a more formal expression, but the most formal forms can generally not be expressed in less formal forms.
As latter sections elaborate, I see RDF as the universal data model for representing this structure into a common, canonical format, with RDF Schema (specifically SKOS, but also supplemented by FOAF, DOAP and SIOC) as the organizing ontology knowledge representation language (KRL).
This is not to say that the various dialects of OWL should be neglected. In bounded environments, they can provide superior reasoning power and are warranted if they can be sufficiently mandated or enforced. But the RDF and RDF-S systems represent the most tractable “meeting place” or “middle ground,” IMHO.
As if the formalism dimension were not complicated enough, there is also the practice within the ontology community to characterize ontologies by “levels”, specifically upper, middle and lower levels. For example, chances are that you have heard particularly of “upper-level” ontologies.
The following figure helps illustrate this “level” dimension. This diagram is also from Leo Obrst of Mitre [12], and was also used in another 2006 talk by Jack Park and Patrick Durusau (discussed further below for other reasons):

Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus — that is, real ontos stuff) than to “generic common knowledge.” Most all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak [2].
The above diagram conveys a sense of how multiple ontologies can relate to one another both in terms of narrower and broader topic matter and at the same “levels” of generalization. Such “meta-structure” (if you will) can provide a reference structure for relating multiple ontologies to one another.
It resides exactly in such bindings or relationships that we can foresee the promise of querying and relating multiple endpoints on the Web with accurate semantics in order to connect dots and combine knowledge bases. Thus, the understanding of the relationships and mappings amongst ontologies becomes a critical infrastructural component of the semantic Web.
We can better understand these mapping and inter-relationship concepts by using a concrete example with a formal ontology. We’ll choose to use the Suggested Upper Merged Ontology simply because it is one of the best known. We could have also selected another upper-level system such as PROTON [3] or Cyc [4] or one of the many with narrower concept or subject coverage.
SUMO is one of the formal ontologies that has been mapped to the WordNet lexicon, which adds to its semantic richness. SUMO is written in the SUO-KIF language. SUMO is free and owned by the IEEE. The ontologies that extend SUMO are available under GNU General Public License.
The abstract, conceptual organization of SUMO is shown by this diagram, which also points to its related MILO (MId-Level Ontology), which is being developed as a bridge between the abstract content of the SUMO and the richer detail of various domain ontologies:

At this level, the structure is quite abstract. But one can easily browse the SUMO structure. A nifty tool to do so is the KSMSA (Knowledge Support for Modeling and Simulation) ontology browser. Using a hierarchical tree representation, you can navigate through SUMO, MILO, WordNet, and (with the locally installed version) Wikipedia.
The figure below shows the upper-level entity concept on the left; the right-hand panel shows a drill-down into the example atom entity:
These views may be a bit misleading because the actual underlying structure, while it has hierarchical aspects as shown here, really is in the form of a directed acyclic graph (showing other relatedness options, not just hierarchical ones). So, alternate visualizations include traditional network graphs.
The other thing to note is that the “things” covered in the ontology, the entities, are also fairly abstract. That is because the intention of a standard “upper-level” ontology is to cover all relevant knowledge aspects of each entity’s domain. This approach results in a subject and topic coverage that feels less “concrete” than the coverage in, say, an encyclopedia, directory or card catalog.
According to Park and Durusau, upper ontologies are diverse, middle ontologies are even more diverse, and lower ontologies are more diverse still. A key observation is that ontological diversity is a given and increases as we approach real user interaction levels. Moreover, because of the “loose” nature of ontologies on the Web (now and into the future), diversity of approach is a further key factor.
Recall the initial discussion on the role and objectives of ontologies. About half of those roles involve effectively accessing or querying more than one ontology. The objective of “upper-level” ontologies, many with their own binding layers, is also expressly geared to ontology integration or federation. So, what are the possible mechanisms for such binding or integration?
A fundamental distinction within mechanisms to combine ontologies is whether it is a unified or centralized approach (often imposed or required by some party) or whether it is a schema mapping or binding approach. We can term this distinction centralized v. federated.
Centralized approaches can take a number of forms. At the most extreme, adherence to a centralized approach can be contractual. At the other end are reference models or standards. For example, illustrative reference models include:
Though I have argued that One Ring to Rule them All is not appropriate to the general Web, there may be cases within certain enterprises or where through funding clout (such as government contracts), some form of centralized approach could be imposed [5]. And, frankly, even where compliance can not be assured, there are advantages in economy, efficiency and interoperability to attempt central ontologies. Certain industries — notably pharmaceuticals and petrochemicals — and certain disciplines — such as some areas of biology among others — have through trade associations or community consensus done admirable jobs in adopting centralized approaches.
However, combining ontologies in the context of the broader Internet is more likely through federated approaches. (Though federated approaches can also be improved when there are consensual standards within specific communities.) The key aspect of a federated approach is to acknowledge that multiple schema need to be brought together, and that each contributing data set and its schema will not be altered directly and will likely remain in place.
Thus, the key distinctions within this category are the mechanisms by which those linkages may take place An important goal in any federated approach is to achieve interoperability at the data or instance level without unacceptable loss of information or corruption of the semantics. Numerous specific approaches are possible, but three example areas in RDF-topic map interoperability, the use of “subject maps”, and binding layers can illustrate some of the issues at hand.
In 2006 the W3C set up a working group to look at the issue of RDF and topic maps interoperability. Topic maps have been embraced by the library and information architecture community for some time, and have standards that have been adopted under ISO. Somewhat later but also in parallel was the development of the RDF standard by W3C. The interesting thing was that the conceptual underpinnings and objectives between these two efforts were quite similar. Also, because of the substantive thrust of topic maps and the substantive needs of its community, quite a few topic maps had been developed and implemented.
One of the first efforts of the W3C work group was to evaluate and compare five or six extant proposals for how to relate RDF and topic maps [6]. That report is very interesting reading for any one desirous of learning more about specific issues in combining ontologies and their interoperability. The result of that evaluation then led to some guidelines for best practices in how to complete this mapping [7]. Evaluations such as these provide confidence that interoperability can be achieved between relatively formal schema definitions without unacceptable loss in meaning.
A different, “looser” approach, but one which also grew out of the topic map community, is the idea of “subject maps.” This effort, backed by Park and Durusau noted above, but also with the support of other topic map experts such as Steve Newcomb and Robert Barta via their proposed Topic Maps Reference Model (ISO 13250-5), seems to be one of the best attempts I’ve seen that both respects the reality of the actual Web while proposing a workable, effective scheme for federation.
The basic idea of a subject map is built around a set of subject “proxies.” A subject proxy is a computer representation of a subject that can be implemented as an object, must have an identity, and must be addressable (this point provides the URI connector to RDF). Each contributing schema thus defines its own subjects, with the mappings becoming meta-objects. These, in turn, would benefit from having some accepted subject reference schema (not specifically addressed by the proponents) to reduce the breadth of the ultimate mapped proxy “space.”
I don’t have the expertise to judge further the specifics, but I find the presentation and papers by Park and Durusau, Avoiding Hobson’s Choice In Choosing An Ontology and Towards Subject-centric Merging of Ontologies to be worthwhile reading in any case. I highly recommend these papers for further background and clarity.
As the third example, “binding layers” are a comparatively newer concept. Leading upper-level ontologies such as SUMO or PROTON propose their own binding protocols to their “lower” domains, but that approach takes place within the construct of the parent upper ontology and language. Such designs are not yet generalized solutions. By far the most promising generalized binding solution is the SKOS (Simple Knowledge Organization System). Because of its importance, the next section is devoted to it.
Finally, with respect to federated approaches, there are quite a few software tools that have been developed to aid or promote some of these specific methods. For, example, about twenty of the software applications in my Sweet Tools listing of 500+ semantic Web and -related tools could be interpreted as aiding ontology mapping or creation. You may want to check out some of these specific tools depending on your preferred approach [8].
SKOS, or the Simple Knowledge Organization System, is a formal language and schema designed to represent such structured information domains as thesauri, classification schemes, taxonomies, subject-heading systems, controlled vocabularies, or others; in short, most all of the “loosely defined” ontology approaches discussed herein. It is a W3C initiative more fully defined in its SKOS Core Guide [9].
SKOS is built upon the RDF data model of the subject-predicate-object “triple.” The subjects and objects are akin to nouns, the predicate a verb, in a simple Dick-sees-Jane sentence. Subjects and predicates by convention are related to a URI that provides the definitive reference to the item. Objects may be either a URI resource or a literal (in which case it might be some indexed text, an actual image, number to be used in a calculation, etc.).
Being an RDF Schema simply means that SKOS adds some language and defined relationships to this RDF baseline. This is a bit of recursive understanding, since RDFS is itself defined in RDF by virtue of adding some controlled vocabulary and relations. The power, though, is that these schema additions are also easily expressed and referenced.
This RDFS combination can thus be shown as a standard RDF triple graph, but with the addition of the extended vocabulary and relations:

The power of the approach arises from the ability of the triple to express virtually any concept, further extended via the RDFS language defined for SKOS. SKOS includes concepts such as “broader” and “narrower”, which enable hierarchical relations to be modeled, as well as “related” and “member” to support networks and arrays, respectively [9].
We can visualize this transforming power by looking at how an “ontology” in a totally foreign scheme can be related to the canonical SKOS scheme. In the figure below the left-hand portion shows the native hierarchical taxonomy structure of the UK Archival Thesaurus (UKAT), next as converted to SKOS on the right (with the overlap of categories shown in dark purple). Note the hierarchical relationships visualize better via a taxonomy, but that the RDF graph model used by SKOS allows a richer set of additional relationships including related and alternative names:
SKOS also has a rich set of annotation and labeling properties to enhance human readability of schema developed in it [9]. There is also a useful draft schema that the W3C’s SWEO (Semantic Web Education and Outreach) group is developing to organize semantic Web-related information [10].
Combined, these constructs provide powerful mechanisms for giving contributory ontologies a common conceptualization. When added to other sibling RDF schema such as FOAF or SIOC or DOAP, still additional concepts can be collated.
While not addressed directly in this piece, it is obviously of first importance to have content with structure before the questions of connecting that information can even arise. Then, that structure must also be available in a form suitable for merging or connection.
At that point, the subjects of this posting come into play.
We see that the daily Web has a diversity of schema or ontologies “loosely defined” for representing the structure of the content. These representations can be transferred to more complex schema, but not in the opposite direction. Moreover, the semantic basis for how to make these mappings also needs some common referents.
RDF provides the canonical data model for the data transfers and representations. RDFS, especially in the form of SKOS, appears to form one basis for the syntax and language for these transformations. And SKOS, with other schema, also appears to offer much of the appropriate “middle ground” for data relationships mapping.
However, lacking in this story is a referential structure for subject relationships [11]. (Also lacking are the ultimately critical domain specifics required for actual implementation.)
Abstract concepts of interest to philosophers and deep thinkers have been given much attention. Sadly, to date, concrete subject structures in which tangible things and tangible actions can be shared, is still very, very weak. We are stubbing our toes on the rocks while we gaze at the heavens.
Yet, despite this, simple and powerful infrastructures are well in-hand to address all foreseeable syntactic and semantic issues. There appear to be no substantive limits to needed next steps.
Lastly, many valuable resources for further reading and learning may be found within the Ontolog Community, W3C, TagCommons and Topics Maps groups. Enjoy! And be wary of ontology no longer.

For a dozen years, my career has been centered on Internet search, dynamic content and the deep Web. For the past few years, I have been somewhat obsessed by two topics.
The first topic, a conviction really, is that implicit structure needs to be extracted from Web content to enable it to be disambiguated, organized, shared and re-purposed. The second topic, more an open question as a former academic married to a professor, is what might replace editorial selections and peer review to establish the authoritativeness of content. These topics naturally steer one to the semantic Web.
The semantic Web, by whatever name it comes to be called, is an inevitability. History tells us that as information content grows, so do the mechanisms for organizing and managing it. Over human history, innovations such as writing systems, alphabetization, pagination, tables of contents, indexes, concordances, reference look-ups, classification systems, tables, figures, and statistics have emerged in parallel with content growth [19].
When the Lycos search engine, one of the first profitable Internet ventures, was publicly released in 1994, it indexed a mere 54,000 pages [1]. When Google wowed us with its page-ranking algorithm in 1998, it soon replaced my then favorite search engine, AltaVista. Now, tens of billions of indexed documents later, I often find Google’s results to be overwhelming dross — unfortunately true again for all of the major search engines. Faceted browsing, vertical search, and Web 2.0′s tagging and folksonomies demonstrate humanity’s natural penchant to fight this entropy, efforts that will next continue with the semantic Web and then mechanisms unforeseen to manage the chaos of burgeoning content.
An awful lot of hot air has been expelled over the false dichotomy of whether the semantic Web will fail or is on the verge of nirvana. Arguments extend from the epistemological versus ontological (classically defined) to Web 3.0 versus SemWeb or Web services (WS*) versus REST (Representational State Transfer). My RSS feed reader points to at least one such dust up every week.
Some set the difficulties of resolving semantic heterogeneities as absolutes, leading to an illogical and false rejection of semantic Web objectives. In contrast, some advocates set equally divisive arguments for semantic Web purity by insisting on formal ontologies and descriptive logics. Meanwhile, studied leaks about “stealth” semantic Web ventures mean you should grab your wallet while simultaneously shaking your head.
My mental image of the semantic Web is a road from here to some achievable destination — say, Detroit. Parts of the road are well paved; indeed, portions are already superhighways with controlled on-ramps and off-ramps. Other portions are two lanes, some with way too many traffic lights and some with dangerous intersections. A few small portions remain unpaved gravel and rough going.
A lack of perspective makes things appear either too close or too far away. The automobile isn’t yet a century old as a mass-produced item. It wasn’t until 1919 that the US Army Transcontinental Motor Convoy made the first automobile trip across the United States.
The 3,200 mile route roughly followed today’s Lincoln Highway, US 30, from Washington, D.C. to San Francisco. The convoy took 62 days and 250 recorded accidents to complete the trip (see figure), half on dirt roads at an average speed of 6 miles per hour. A tank officer on that trip later observed Germany’s autobahns during World War II. When he subsequently became President Dwight D. Eisenhower, he proposed and then signed the Interstate Highway Act.
That was 50 years ago. Today, the US is crisscrossed with 50,000 miles of interstates, which have completely remade the nation’s economy and culture [2].
Like the interstate system in its early years, today’s semantic Web lets you link together a complete trip, but the going isn’t as smooth or as fast as it could be. Nevertheless, making the trip is doable and keeps improving day by day, month by month.
My view of what’s required to smooth the road begins with extracting structure and meaningful information according to understandable schema from mostly uncharacterized content. Then we store the now-structured content as RDF triples that can be further managed and manipulated at scale. By necessity, the journey embraces tools and requirements that, individually, might not constitute semantic Web technology as some strictly define it. These tools and requirements are nonetheless integral to reaching the destination. We are well into that journey’s first leg, what I and others are calling the structured Web.
For the past six months or so I have been researching and assembling as many semantic Web and related tools as I can find [3]. That Sweet Tools listing now exceeds 500 tools [4] (with its presentation using the nifty lightweight Exhibit publication system from MIT’s Simile program [5]). I’ve come to understand the importance of many ancillary tool sets to the entire semantic Web highway, such as natural language processing and information extraction. I’ve also found new categories of pragmatic tools that embody semantic Web and data mediation processes but don’t label themselves as such.
In its entirety, the Sweet Tools listing provides a pretty good picture of the semantic Web’s state. It’s a surprisingly robust picture — though with some notable potholes — and includes impressive open source options in all categories. Content publishing, indexing, and retrieval at massive scales are largely solved problems. We also have the infrastructure, languages, and (yes!) standards for tying this content together meaningfully at the data and object levels.
I also think a degree of consensus has emerged on RDF as the canonical data model for semantic information. RDF triple stores are rapidly improving toward industrial strength, and RESTful designs enable massive scalability, as terabyte- and petabyte-scale full-text indexes prove.
Powerful and flexible middleware options, such as those from OpenLink [6], can transform and integrate diverse file formats with a variety of back ends. The World Wide Web Consortium’s GRDDL standard [7] and related tools, plus various “RDF-izers” from Massachusetts Institute of Technology and elsewhere [8], largely provide the conversion infrastructure for getting Web data into that canonical RDF form. Sure, some of these converters are still research-grade, but getting them to operational capabilities at scale now appears trivial.
Things start getting shakier when trying to structure information into a semantic formalism. Controlled vocabularies and ontologies range broadly and remain a contentious area. Publishers and authors perhaps have too many choices: from straight Atom or RSS feeds and feeds with tags to informal folksonomies and then Outline Processor Markup Language [9] or microformats [10]. From there, the formalism increases further to include the standard RDF ontologies such as SIOC (Semantically-Interlinked Online Communities), SKOS (Simple Knowledge Organizing System), DOAP (Description of a Project), and FOAF (Friend of a Friend) [11] and the still greater formalism of OWL’s various dialects [12].
| If we compare the semantic Web to the US interstate highway system, we’re still in the early stages of a journey that will remake our economy and culture. |
| Many potholes on the road to the semantic Web exist. |
| One ready task is to transform existing structure to RDF. Another priority is to refine tools to extract structure and meaningful information from uncharacterized content. |
Arguing which of these is the theoretical best method is doomed to failure, except possibly in a bounded enterprise environment. We live in the real world, where multiple options will always have their advocates and their applications.
All of us should welcome whatever structure we can add to our information base, no matter where it comes from or how it’s done. The sooner we can embrace content in any of these formats and convert it into canonical RDF form, we can then move on to needed developments in semantic mediation, some of the roughest road on the journey.
Semantic mediation requires appropriate structured content. Many potholes on the road to the semantic Web exist because the content lacks structured markup; others arise because existing structure requires transformation. We need improved ways to address both problems. We also need more intuitive means for applying schema to structure. Some have referred to these issues as “who pays the tax.”
Recent experience with social software and collaboration proves that a portion of the Internet user community is willing to tag and characterize content. Furthermore, we can readily leverage that resulting structure, and free riders are welcomed. The real pothole is the lack of easy — even fun — data extractors and “structurizers.” But we’re tantalizingly close.
Tools such as Solvent and Sifter from MIT’s Simile program [13] and Marmite from Carnegie Mellon University [14] are showing the way to match DOM (document object model) inspectors with automated structure extractors. DBpedia, the alpha version of Freebase, and System One now provide large-scale, open Web data sets in RDF [15], including all of Wikipedia. Browser extensions such as Zotero [16] are showing how to integrate structure management into acceptable user interfaces, as are services such as Zoominfo [17]. Yet we still lack easy means to design the differing structures suitable for a plenitude of destinations.
Amazingly, a compelling road map for how all these pieces could truly fit together is also incomplete. How do we actually get from here to Detroit? Within specific components, architectural understandings are sometimes OK (although documentation is usually awful for open source projects, as most of the current tools are). Until our community better documents that vision, attracting new contributors will be needlessly slower, thus delaying the benefits of network effects.
So, let’s create a road map and get on with paving the gaps and filling the potholes. It’s not a matter of standards or technology — we have those in abundance. Let’s stop the silly squabbles and commit to the journey in earnest. The structured Web‘s ability to reach Hyperland [18], Douglas Adam’s prescient 1990 forecast of the semantic Web, now looks to be no further away than Detroit.
This Friday brown bag leftover was first placed into the AI3 refrigerator about three years ago on May 3, 2007. The piece was my answer to a request by Jim Hendler to pen some thoughts on the semantic Web, based on I believe what he thought might be a pragmatic perspective combining Internet business with Web science. The formal piece appeared as a guest editorial in the May/June 2007 issue of IEEE Intelligent Systems. What appears above is unaltered from my original posting (aside from some minor formatting clean-up and — sorry to say — some of the projects are now defunct).


In 2002 Joel Mokyr, an economic historian from Northwestern University, wrote a book that should be read by anyone interested in knowledge and its role in economic growth. The Gifts of Athena : Historical Origins of the Knowledge Economy is a sweeping and comprehensive account of the period from 1760 (in what Mokyr calls the “Industrial Enlightenment”) through the Industrial Revolution beginning roughly in 1820 and then continuing through the end of the 19th century.
The book (and related expansions by Mokyr available as separate PDFs on the Internet) should be considered as the definitive reference on this topic to date. The book contains 40 pages of references to all of the leading papers and writers on diverse technologies from mining to manufacturing to health and the household. The scope of subject coverage, granted mostly focused on western Europe and America, is truly impressive.
Mokyr deals with ‘useful knowledge,’ as he acknowledges Simon Kuznets‘ phrase. Mokyr argues that the growth of recent centuries was driven by the accumulation of knowledge and the declining costs of access to it. Mokyr helps to break past logjams that have attempted to link single factors such as the growth in science or the growth in certain technologies (such as the steam engine or electricity) as the key drivers of the massive increases in economic growth that coincided with the era now known as the Industrial Revolution.
Mokyr cracks some of these prior impasses by picking up on ideas first articulated through Michael Polanyi‘s “tacit knowing” (among other recent philosophers interested in the nature and definition of knowledge). Mokyr’s own schema posits propositional knowledge, which he defines as the science, beliefs or the epistemic base of knowledge, which he labels omega (Ω), in combination with prescriptive knowledge, which are the techniques (“recipes”), and which he also labels lambda (λ). Mokyr notes that an addition to omega (Ω) is a discovery, an addition to lambda (λ) is an invention.
One of Mokyr’s key points is that both knowledge types reinforce one another and, of course, the Industrial Revolution was a period of unprecedented growth in such knowledge. Another key point, easily overlooked when “discoveries” are seemingly more noteworthy, is that techniques and practical applications of knowledge can provide a multiplier effect and are equivalently important. For example, in addition to his main case studies of the factory, health and the household, he says:
The inventions of writing, paper, and printing not only greatly reduced access costs but also materially
affected human cognition, including the way people thought about their environment.
Mokyr also correctly notes how the accumulation of knowledge in science and the epistemic base promotes productivity and more still-more efficient discovery mechanisms:
The range of experimentation possibilities that needs to be searched over is far larger if the searcher knows nothing about the natural principles at work. To paraphrase Pasteur’s famous aphorism once more, fortune may sometimes favor unprepared minds, but only for a short while. It is in this respect that the width of the epistemic base makes the big difference.
In my own opinion, I think Mokyr starts to get closer to the mark when he discusses knowledge “storage”, access costs and multiplier effects from basic knowledge-based technologies or techniques. Like some other recent writers, he also tries to find analogies with evolutionary biology. For example:
Much like DNA, useful knowledge does not exist by itself; it has to be “carried” by people or in storage
devices. Unlike DNA, however, carriers can acquire and shed knowledge so that the selection process is quite different. This difference raises the question of how it is transmitted over time, and whether it can actually shrink as well as expand.
One of the real advantages of this book is to move forward a re-think of the “great man” or “great event” approach to history. There are indeed complicated forces at work. I think Mokyr summarizes well this transition when he states:
A century ago, historians of technology felt that individual inventors were the main actors that brought about
the Industrial Revolution. Such heroic interpretations were discarded in favor of views that emphasized deeper economic and social factors such as institutions, incentives, demand, and factor prices. It seems, however, that the crucial elements were neither brilliant individuals nor the impersonal forces governing the masses, but a small group of at most a few thousand peopled who formed a creative community based on the exchange of knowledge. Engineers, mechanics, chemists, physicians, and natural philosophers formed circles in which access to knowledge was the primary objective. Paired with the appreciation that such knowledge could be the base of ever-expanding prosperity, these elite networks were indispensible, even if individual members were not. Theories that link education and human capital of technological progress need to stress the importance of these small creative communities jointly with wider phenomena such as literacy rates and universal schooling.
There is so much to like and to be impressed with this book and even later Mokyr writings. My two criticisms are that, first, I found the pseudo-science of his knowledge labels confusing (I kept having to mentally translate the omega symbol) and I disliked the naming distinctions between propositional and prescriptive, even though I think the concepts are spot on.
My second criticism, a more major one, is that Mokyr notes, but does not adequately pursue, “In the decades after 1815, a veritable explosion of technical literature took place. Comprehensive technical compendia appeared in every industrial field.” Statements such as these, and there are many in the book, hint at perhaps some fundamental drivers.
Mokyr has provided the raw grist for answering his starting question of why such massive economic growth occurred in conjunction with the era of the Industrial Revolution. He has made many insights and posited new factors to explain this salutary discontinuity from all prior human history. But, in this reviewer’s opinion, he still leaves the why tantalizingly close but still unanswered. The fixity of information and growing storehouses because of declining production and access costs remain too poorly explored.
This Friday brown bag leftover was first placed into the AI3 refrigerator about four years ago on July 6, 2006. It was part of a series of book reviews I was doing at that time getting at the importance of bulk paper production as a key enabler of economic growth. No changes have been made to the original posting.