Posted: November 16, 2012

Brown Bag Lunch: Of Flagpoles and Fishes

The New Paradigm of ‘Substantive Marketing’ for Innovative IT

This decade has clearly marked a sea change in the move of enterprise software from proprietary to open source, as I have recently discussed [1]. It is instructive that a mere six years ago I was in heated fights with my then Board about open source; today, that seems so quaint and dated.

Also during this period many have noted how open source has changed the capital required to begin a new software startup [2]. Open source provides both the tooling and the components for cobbling together specialty apps and extensions. Six-, seven- and even eight-figure startup costs, common just a decade ago, have now dropped to four or five figures. When we see the explosion of hundreds of thousands of smartphone apps we are seeing the glowing residue of these additional sea changes. Dropping startup costs by one to three orders of magnitude is truly democratizing innovation.

But something else has been going on that is changing the face of enterprise software (besides consolidation, another factor I also recently commented on). And that factor is “marketing”. Much less commentary is made about this change, but it, too, is greatly lowering costs and fundamentally changing market penetration strategies. That topic — and my personal experience with it — is the focus of this article.

This Friday brown bag leftover was first placed into the AI3 refrigerator on August 15, 2011. This reprise is unchanged from its original posting and still describes how Structured Dynamics undertakes its marketing.

The Obsolete Recent Past

Besides the few remaining big providers of enterprise software (like IBM, Oracle, HP and SAP), most vendors have totally remade their sales practices of just a few years ago. Large sales forces with big commissions and one- to two-year sales cycles can no longer be justified when software license fees and the percentage maintenance annuities that flow from them are dropping rapidly. Today’s mantras are doing more with less and doing it faster, hardly consistent with the traditional enterprise software model. Sure, big enterprises, especially big government and big business, have large sunk costs in legacy systems that will continue to be milked by existing vendors. But the flow is constricting, with the longer-term trends clear to see. The old enterprise software model is obsolete.

Even if it were not dying, it is hard to square huge investments in sales and marketing when product development has become inexpensive and agile. The proliferation of three-letter marketing acronyms for branding “new” product areas and standard formulas for product hype of just a few years ago also feels old and dated. Cozy relationships with conventional trade press pundits and market analysts seem to be diminishing in importance, possibly because the authoritativeness of their influence is also diminishing. It is harder to justify market firm subscription costs when priority budget items are being cut and new information outlets have emerged.

In response to this, many developers have forsaken the enterprise market for the consumer one. Indeed enterprises themselves are looking more and more to the consumer sector and commodity apps for innovation and answers. But, still, problems unique to enterprises remain and how to effectively reach them in this brave new world is today’s marketing problem for enterprise software vendors.

Most entities today, when opining about these challenges, tend to emphasize the need for “laser focus” and “rifle-shot” targeting of prospects. The advice takes the form of: 1) emphasize well-defined verticals; 2) know your market well; and 3) target and go after your likely prospects. Prospect data mining and targeted ad analysis are the proffered elixirs.

But, there is little evidence such refined methods for prospect identification and targeting are really working. Like politicians doing focus groups and opinion polling to capture the desired “message” of their potential electorates, these are all still “push” models of marketing. Yet we are swamped with pushed messages and marketing everywhere we turn. The model is failing.

Besides message overload, there are two issues with laser targeting. First, despite all that we try to know about ready buyers (for enterprise software), we really don’t know if any particular individual is truly needful, in a position to buy, has the authority to buy, or is the right advocate to make the internal sell. Second, though the idea of “laser” carries with it the image of focus and not flailing, it is in fact expensive to identify the targets and send a focused message their way. Because of these issues, decay rates for laser prospects throughout conventional sales pipelines continue to rise.

A New Marketing Paradigm

There has always been the phenomenon of the “fish jumping into the boat”; that is, the unanticipated inbound inquiry from a previously unknown prospect leading to a surprisingly swift sale. But we have seen this phenomenon increase markedly in recent years. Structured Dynamics’ current customer base, including recurring customers, comes almost exclusively from this source. As we have noted this trend in comparison with more targeted outreach, we have spent much time trying to understand why it is occurring and how we can leverage what Peter Drucker called the “unexpected success” [3].

What we are seeing, I believe, is a shift from sales to marketing, and within marketing from direct or outbound marketing to a new paradigm of marketing. Others have likened this to inbound marketing [4] or content marketing [5] or permission marketing [6]. What we are seeing at Structured Dynamics bears many resemblances to parts of what is claimed for these other approaches, but not all. And, it is also true that what we are seeing may pertain mostly to innovative IT for emerging enterprise markets, and not a generalized paradigm suitable to other products or markets.

For lack of a better term, what we are seeing we can term “substantive marketing”. By this we mean offering valuable content and solutions-oriented systems for free and without restriction. This shares aspects with content marketing. Then, in keeping with the trend for buyers doing their own research and analysis to fulfill their own needs, similar to the premises of inbound or permission marketing, potential consumers can make their own judgments as to relevance and value of our offerings.

Sometimes, of course, some prospects find our approaches and solutions lacking. Sometimes, they may grab what we have offered for free and use them on their own without compensation to us. But where the match is right — and we need to be honest with both ourselves and the customer when it is not — we can better spend the customer’s limited time and resources to tailor our generic solutions to their specific needs. In doing so, we offer higher value (tailored services) while learning better about another spectrum of consumer need that can virtuously enhance our substantive offerings for the next prospect.

So, let’s decompose these components further to see what they can tell us about this new practice of substantive marketing and how to use it as an engine for moving forward.

The Virtuous Cycle Begins with Substantive Solutions

The premise of substantive marketing is to offer square-deal value to the marketplace in the form of solutions-based content. Like content marketing that offers “the creation or sharing of content for the purpose of engaging current and potential consumer bases” [5], substantive marketing goes even further. The whole basis and premise of the approach is to provide substantive content, in one or more of these areas, preferably all:

Knowledge — this substantive area includes papers, commentary, survey results or listings of tools and references useful to the target market
Analysis — this content area includes unique analysis of market trends, data, technologies or reviews that pertain to the target market
Code — this area relates to the provision of open source code and tools, preferably under licenses that allow users to use the software without restriction (two examples are the Apache 2 license and the MIT license)
Documentation — a critical substantive area is the documentation on how to install, use, modify or customize these tools, with an emphasis on APIs and tutorial information
Methodologies, workflows and best practices — it is important to also discuss how to properly operate and utilize these tools and information. Taking care to document lessons learned and best practices also helps the user community avoid common mistakes and to speed adoption and utility, and
Demos — this area involves setting up (and sharing code and procedures for same) demos that show how the code and its methods actually work. Demos also become first use cases to aid the new user in learning and setting up the code bases.

Further, this substantive content is offered without strings, restrictions or customer fill-in forms. The content is not a come on or a teaser. We are not trying to gather leads or prospect names, because we have no intent to dun them with emails or follow-ups.

This substantive content is as complete as can be to enable new users to adopt the information and tools in their current state without further assistance. (In some cases, the information also educates the marketplace in order to prepare future customers for adoption.) Most importantly, this substantive content is offered for free, either as open source (for code) or under Creative Commons licenses (for documentation and other content). In return, it is fair to request attribution when this material is used, and we do.

We have previously termed this complete panoply of substantive content a total open solution [7]. Some might find the provision of such robust information crazy: How can we give away the store of our proprietary knowledge and systems?

But we find this kind of thinking old school. In an open source world where so much information is now available online, with a bit of effort customers can find this information anyway. Rather, our mindset is that customers do not want to pay again for what has already been done, but are willing to pay for what can be done with that knowledge for their own specific problems. Offering the complete storehouse of our knowledge in fact signals our interest in only charging the customer for new answers, new value or new formulations. The customers we like to work with feel they are getting an honest, square deal.

Flagpole Venues Help Increase Awareness

Consider your substantive content to be your flag, a unique banner for conveying and packaging your specific brand. It is thus important to find appropriate flagpoles — in the virtual territories that your customers visit — for raising this content high for them to see. Since the role of these flagpoles is to create awareness in potential prospects — who you do not likely know individually or even by group in advance — it makes sense to raise your offerings up on many flagpoles and on the highest flagpoles. Visibility is the object of the approach.

This approach is distinctly not leafletting or cramming links or emails into as many spaces as possible. The idea of substantive marketing is to fly valuable content high enough that desirous potential customers can discover and then inspect the information on their own, and only if they so choose. In this regard, substantive marketing resembles permission marketing [6].

Being visible helps ensure that the needful, questing prospect that you would never have been able to target on your own is able to see and be aware of your offerings. And, since they are seeking information and answers, your collateral needs to be of a similar nature. Solutions and substance are what they are seeking; what you have run up the flagpole should respond to that.

The mindset here is to respect your prospective customers and to allow them to receive and inspect your offerings, but only if they so choose. If flown in the right venues with the right visibility, customers will see your flags and inspect them if they meet their requirements.

Some of the venues at which you can raise your flags include:

Blogs — this venue is especially helpful, since you have complete control over content, message, voice and packaging
Social networks — the value of social networks is now accepted, and should be a core component of any visibility strategy. However, it is also important to make sure that your contributions are driven by substance and value and do not become part of the cacophonous background noise
Vertical media — there are always existing outlets well-read and -respected by your customer prospects. Establishing relationships and value with these third-party outlets can extend your reach
Web sites — this venue includes your standard Web sites, of course. But, you should also consider setting up specific project-related sites or sites dedicated to documentation (c.f., our TechWiki site of 300+ technical articles) or to methodologies (the excellent MIKE2.0 site is one great example) or to other ways by which particular content (such as tools with the Sweet Tools site) can raise another flag
User forums — user discussion groups and forums also become their own attractants for like-interested prospects, and
Conferences and tradeshows — while potentially valuable, presence at conferences and tradeshows must be carefully evaluated. Since participation and opportunity costs are high, the venues should be clearly relevant to your market space with likely decision makers in attendance.

The observant reader will have already concluded that each of these venues develops slowly, and therefore raising visibility is generally a slow-and-steady game that requires patience. Start-up vendors backed by venture firms or those looking for quick visibility and cashout will not find this approach suitable. On the other hand, customer prospects looking for answers and self-sustaining solutions are not much interested in flash in the pan vendors, either.

A Model Responsive to the Changing Nature of Customer Prospects

The real drivers for this changing paradigm come from customer prospects. Sophisticated buyers of enterprise IT and instrumental change agents within organizations share most if not all of these characteristics:

They are inundated with marketing messages and jaded about hype and “pushed” messages
They are generally knowledgeable about their needs and problem spaces and about appropriate technologies. They are eager to learn independently and know that their recommendations affect their personal reputations and standing within their enterprises
With the many volatile external and internal changes, including staff reductions and fluid assignments, leadership for new technology adoption can come from many different and unknown corners of the organization; it is extremely difficult to identify and target prospects
The economic and competitive environment places a premium on affordability and low-risk evaluations of new technologies
Lock-ins of any kind — be it to specific vendors or technologies — are understood as inherently risky. This understanding is raising the importance of open and standards-based approaches
Being the subject of a pushy sales effort is distasteful and a negative to an eventual sale. Education and learning, however, are respected
Because of all that is at stake, honesty with no bullshit is highly appreciated. If you as a vendor do not offer an appropriate solution or have fulfillment weaknesses, tell the prospect so. Further, tell them who can supply the solution. One never knows when and where the next problem may arise, and providing trustworthy advice can lead to later engagements.

More often than not we find our customers to have already installed and used our existing substantive materials for some time before they approach us about further work. They appreciate the tutorial information and have taught themselves much in advance. By the time we engage, both parties are able to cost-effectively focus on what is truly missing and needed and to deliver those answers quickly. Re-engagements tend to occur when a next set of gaps or challenges arises.

Though it may sound trite or even unbelievable to those who have not yet experienced such a relationship, the square deal value offered by substantive marketing can really lead to true partnerships and trust between vendor and customer. We experience it daily with our customers, and vice versa. We also think this is the adaptive approach that our new environment demands.

The Free Path to Open Source and Solutions

Once prospects learn of our substantive offerings, many may decide independently that what we have is not suitable. Others may simply download and use the information on their own, often without our ever knowing, let alone receiving revenue. We are completely fine with this, as shown for three different cases.

First, some of these prospects need no more than what we already have. This increases our user base, increases our visibility and often results in contributions to our forums and documentation.

Then, some of these prospects come to learn they need or want more than what our current offerings provide, leading to two possible forks. In one fork, the second case, they may have sufficient skills internally or with other suppliers to extend the system on their own. Some of this flows back to an improved code base or improved installation or documentation bases.

In the other fork, the third case, they may decide to engage us in tailoring a solution for them. That case is the only one of the three that leads to a direct revenue path.

In all three cases we win, and the customer wins. Maybe enterprise software vendors of decades past rue this reality of lower margins and shared benefits; we agree that the absolute profit potential of substantive marketing is much less. But we gladly accept the more enjoyable work and steady revenue relationships resulting from these changes. We are not engaged in some Pollyannaish altruism here, but in a steely-eyed honest brokering that best serves our own self-interest (and fairly that of the customer, as well).

A Square Deal Baseline for Tailored Services

Great IT product does not come from idle musings or dreamed up functionality. It comes solely and directly from solving customer problems. Only via customers can software be refined and made more broadly usable.

A slipstream of those who have previously become aware of and tested our offerings will choose to engage our services. This generally takes the form of an inbound call, where the prospect not only qualifies itself, but also establishes the terms and conditions for the sale. They have chosen to select us; they are fish that have jumped into the boat.

To again quote Peter Drucker, “. . . the aim of marketing is to make selling superfluous. The aim of marketing is to know and understand the customer so well that the product or service fits him and sells itself. Ideally, marketing should result in a customer who is ready to buy. All that should be needed then is to make the product or service available . . .” [8]. This is precisely what I meant earlier about the shift in emphasis from sales to marketing.

Even at this point there may be mismatches in needs and our skills and availabilities. If such is the case, we do not hesitate to say so, and attempt to point the prospect in another direction (from which we also gain invaluable market knowledge). If there is indeed a match, we then proceed to try to find common ground on schedule and budget.

Paradoxically, this square deal and honesty about the readiness and weaknesses of our offerings often leads to forgiveness from our customers. For example, for some time we have lacked automated installation scripts that would make it easier for prospects to install our open semantic framework. But, because of compensating value in other areas, such gaps can be overlooked and tackled later on (indeed, as a current customer is now funding). By not pretending to be everything to everyone, we can offer what we do have without embarrassment and get on with the job of solving problems.

For larger potential engagements, we typically suggest a fixed-price initial effort to develop an implementation plan. The interviews and research to support this typical 4- to 6-week effort (generally in the $5K to $10K range, depending on scope) then result in a detailed fulfillment proposal, with firm tasks, budget and schedule, specific to that customer’s requirements. Just as we respect our prospects’ time and budget, we expect the same and do not conduct these detailed plans without compensation. With respect to fulfillment contracts, we cap the contract amount and limit milestone payments to pre-set percentages or time expended, whichever is lower.

This approach ensures we understand the customer’s needs and have budgeted and tasked accordingly. Capped contracts also put the onus on us the contractor to understand our own effort and tasking structures and realities, which leads to better future estimating. For the customer, this approach caps risk and potential exposure, and ensures milestones are being met no matter the time expenditures by us, the contractor. This approach extends our square-deal basis to also embrace risks and payments.
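To make the "whichever is lower" rule concrete, here is a minimal sketch of how such a capped milestone payment might be computed. The names (Milestone, milestone_payment, total_paid), the hourly rate and the example figures are all hypothetical illustrations, not our actual contract terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Milestone:
    name: str
    percent_of_cap: float  # pre-set share of the contract cap (0.25 means 25%)
    hours_expended: float  # contractor time logged against this milestone


def milestone_payment(m: Milestone, contract_cap: float, hourly_rate: float) -> float:
    """Pay the lower of the pre-set percentage or the value of time expended."""
    by_percentage = m.percent_of_cap * contract_cap
    by_time = m.hours_expended * hourly_rate  # hourly_rate is an assumed billing basis
    return min(by_percentage, by_time)


def total_paid(milestones, contract_cap: float, hourly_rate: float) -> float:
    """Sum the milestone payments, never exceeding the overall contract cap."""
    total = sum(milestone_payment(m, contract_cap, hourly_rate) for m in milestones)
    return min(total, contract_cap)


if __name__ == "__main__":
    # Hypothetical engagement: $40,000 cap, $150 per hour.
    plan = [
        Milestone("design", 0.25, 80),    # time exceeds the share, so the share applies
        Milestone("build", 0.50, 100),    # time is under the share, so time applies
        Milestone("delivery", 0.25, 70),  # time exceeds the share, so the share applies
    ]
    for m in plan:
        print(m.name, milestone_payment(m, 40_000, 150))
    print("total", total_paid(plan, 40_000, 150))
```

In this made-up plan the customer never pays more than the pre-set share for a milestone that ran long, and pays less than the share when the work took less time than budgeted.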

New (and Open Source) Developments Fuel the Substance Pipeline

Thus, when customers engage us, they spend almost solely on new functionality specifically tailored to their needs. In doing so, we suggest they agree to release the new developments they fund as open source. We argue, and customers predominantly agree, that they are already benefitting from lower overall costs because other customers have funded sharable, open source before them. We point out that the new customers that follow them will also be independently creating new functionality, from which they will also later benefit.

(This argument does not apply to specific customer data or ontologies, which are naturally proprietary to the customer. Also, if the customer wants to retain intellectual ownership of extensions, we charge higher development fees.)

Once these new developments are completed, they are fed back into a new baseline of valuable content and code. From this new baseline the cycle of substantive marketing can be augmented anew and perpetuated.

Three Guidelines to Leverage Substantive Marketing

All of these points can really be boiled down to three guidelines for how to make substantive marketing effective:

First, whatever your domain or market, provide useful and substantive content. The content you offer is indeed your marketing collateral. Prospective customers can gauge from it directly whether it meets their needs, appears sound and workable, and has value. If you have little of substance to offer, this paradigm is not for you
Second, plant many flagpoles and raise your flags high in territories your market prospects are likely to visit. This is a process that requires thoughtfulness and patience. Thoughtfulness, because that is how you determine where to plant your flags. If you yourself are a consumer of what you offer, it is easier to find those venues. And patience, because it takes time to stack valuable content upon valuable content in order to raise visibility
And, third, be honest and respectful. Help your prospect work within available budget to achieve the most possible at lowest risk. And help them find others, if need be, who might be better able than you to truly solve their problems.

What we are finding — as we continue to refine our understanding of this new paradigm — is that through substantive marketing the fish are finding us and they sometimes jump into the boat. We like our enterprise customers to pre-qualify themselves and already be “sold” once they knock on the door. One never knows when that phone might ring or the email might come in. But when it does, it often results in a collaborative customer as a partner who is a joy to work with to solve exciting new problems.

[1] M.K. Bergman, 2011. “Declining IT Innovation in the Enterprise,” in AI3:::Adaptive Information blog, January 17, 2011. See https://www.mkbergman.com/940/declining-it-innovation-in-the-enterprise/.

[2] Paul Graham has been the most prominent observer of this scene; see P. Graham, 2008. “Why There Aren’t Any More Googles,” April 2008 (see http://www.paulgraham.com/googles.html) and subsequent articles.

[3] See esp. Peter F. Drucker, 1985. Innovation and Entrepreneurship: Practice and Principles, Harper & Row, New York, NY, 277 pp.

[4] Inbound marketing is a marketing strategy that focuses on getting found by customers. According to David Meerman Scott, inbound marketers “earn their way in” (via publishing helpful information on a blog etc.) in contrast to outbound marketing where they used to have to “buy, beg, or bug their way in” (via paid advertisements, issuing press releases in the hope they get picked up by the trade press, or paying commissioned sales people, respectively). Brian Halligan, cofounder and CEO of HubSpot, claims he first coined the term of inbound marketing.

[5] Content marketing is an umbrella term encompassing all marketing formats that involve the creation or sharing of content for the purpose of engaging current and potential consumer bases. In contrast to traditional marketing methods that aim to increase sales or awareness through interruption techniques, content marketing subscribes to the notion that delivering high-quality, relevant and valuable information to prospects and customers drives profitable consumer action. See also Holger Schulze, 2011. B2B Content Marketing Trends slideshow, see http://www.slideshare.net/hschulze/b2b-content-marketing-report.

[6] Seth Godin coined the term permission marketing wherein marketers obtain permission before advancing to the next step in the purchasing process. It is mostly used by online marketers, notably email marketers and search marketers, as well as certain direct marketers who send a catalog in response to a request. Godin contrasts this approach to traditional “interruption marketing” where messages are sent without prior permission.

[7] See the three-part series, M.K. Bergman, 2010. “Listening to the Enterprise: Total Open Solutions,” “Part 1,” “Part 2” and “Part 3,” AI3:::Adaptive Information blog, May 12 – 31, 2010.

[8] Peter F. Drucker, 1974. Management: Tasks, Responsibilities, Practices, Harper & Row, New York, NY, 864 pp. ISBN 0-06-011092-9.

[9] The intro photo is of the world’s tallest flagpole (at 165 m), in Dushanbe, Tajikistan. The photo is courtesy of CentralAsiaOnline.com.

Posted: September 10, 2012

We Are an Open World

The Foundation of Knowledge Applications Should Reflect Their Nature

Every couple of months I return to the idea of the open world assumption (OWA) [1] and its fundamental importance to knowledge applications. What it is that makes us human — in health and in sickness — is but a further line of evidence for the importance of an open world viewpoint. I’ll use three personal anecdotes to make this case.

Cell Symbionts

Believe it or not, Alfred Wegener’s theory of continental drift was only becoming accepted by mainstream scientists in my high school years. I experienced déjà vu regarding a science revolution while a botany major at Pomona College in the early 1970s. A young American biologist at that time, Lynn Margulis, was postulating the theory of endosymbiosis; that is, that certain cell organelles originated from initially free-living bacteria.

This idea of longstanding symbionts in the cell — indeed, even forming what was our overall conception of cells and their parts — was truly revolutionary. It was revolutionary because of the implications for the nature and potential degree of symbiosis. And it was revolutionary in adding a different arrow in the quiver of biotic change over time than classical Darwinian evolution.

Margulis’ theory is now widely accepted and is understood to embrace cell organelles from mitochondria to chloroplasts and ribosomes. The seemingly fundamental unit of all organisms, the cell, is itself an amalgam of archaic symbionts and bacteria-like lifeforms. Truly remarkable.

The Vanishing Ulcer

In the early 1990s, my oldest child, Erin, then in elementary school, had been going through a debilitating bout of periodic and severe stomach upsets. I sort of thought this might be inherited, since my paternal grandmother had suffered from ulcers for many decades (as did many at that time).

We were good friends with our pediatrician in our small town and knew him to be a thoughtful and well-informed MD. His counsel was that Erin was likely suffering from an ulcer and we began taking great care about her diet. But Erin’s symptoms did not seem to improve.

My wife, Wendy, is a biomedical researcher and began to investigate this problem on her own. She discovered some early findings implicating a gastrointestinal (gut) bacterium in similar symptoms and brought this research to our doctor’s attention. He, too, was intrigued, and prescribed a rather straightforward antibiotic regimen for Erin. Her symptoms immediately ceased, and she has been clear of further symptoms in the twenty years since.

The nearly universal role of the Helicobacter bacteria in ulcers is now widely understood. The understanding of peptic ulcers that had stood for centuries no longer applies in most cases. Though ulcers may arise from many other conditions, because of these new understandings the prevalence and discussion of ulcers has nearly fallen off the radar screen.

Humans as Walking Ecosystems

A few years back I began to show symptoms of rosacea, a facial skin condition characterized by redness. My local dermatologist recommended a daily dose of antibiotics as the preferred course of action. I was initially reluctant to follow this advice. I knew about the growing problem of bacterial resistance, and did not think that my constant use of tetracycline would help that issue. I also knew some about the controversial use of antibiotics in animal feeds, and had hesitations for that reason as well.

Nonetheless, I took the doctor’s advice. I rarely take any kind of medicine and immediately began to notice GI problems. My digestive regularity was immediately thrown out of kilter with other adverse effects as well. I immediately stopped using the antibiotics, and soon returned to (largely) my pre-regime conditions. (I also switched doctors.)

Over the past five years, due to a revolution in DNA sequencing [2], we are now beginning to understand the why of my observed reactions to antibiotics. Because we can now analyze skin and fecal samples for foreign DNA, we are coming to realize that humans (as is likely true for all higher organisms) are walking, teeming ecosystems of thousands of different species, mostly bacteria [3].

While there are some 23,000 genes in the native human genome, there are more than 3 million estimated as arising from these fellow travelers. While we are still learning much, and rapidly, we know that our ecosystem of bacteria is involved in nutrition and digestion, contributing perhaps as much as 15% of the energy value we get from food. We also know that imbalances of various sorts in our walking ecosystem can lead to diseases and other chronic conditions.

Though the degree and nature is still quite uncertain, our “microbiome” of symbiotic bacteria has been implicated in heart disease, Type II diabetes, obesity, malnutrition, multiple sclerosis, other auto-immune diseases, asthma, eczema, liver disease, bowel cancer and autism, among others. The breadth and extent of implications on well-being is staggering, especially since all of these implications have been learned over the past five years.

There are considerable differences between different human populations and cultures, too, in terms of differing compositions of the microbiome. And these effects are not limited to the gut. Skin and orifices to the outside world have their own denizens as well, likely also involved with both health and disease. Humans are not just complicated beasts, but a world of other species unique unto ourselves.

We Are Not Yet an Open Book, But We Are an Open World

Each of these three anecdotes (and there are many others) points to phenomenal changes in our understanding of the human organism. This new knowledge has also arisen over a remarkably short period. Who knows when the pace of these insights might slow, if ever?

These anecdotes are exemplary of the fundamental nature of knowledge: it is constantly expanding with new connections and heretofore unforeseen relationships constantly emerging. These anecdotes also point to the fact that most knowledge problems are systems problems, intimately involved with the connections and inter-relationships among a diversity of players and factors.

It makes sense that how we choose to organize and analyze the information that constitutes our knowledge should have a structure and underlying logic premise consistent with expansion and new relationships. This premise is the central feature of the open world assumption and semantic Web technologies.

Fixed, closed, brittle schema of transaction systems and relational databases are a clear mismatch with knowledge problems and knowledge applications. We need systems where schema and structure can evolve with new information and knowledge. The foundational importance of open world approaches to understanding and modeling knowledge problems continues to be the elephant in the room.

It is perhaps not surprising that one of the fields most aggressive in embracing ontologies and semantic technologies is the life sciences. Practitioners in this field experience daily the explosion in new knowledge and understandings. Knowledge workers in other fields would be well-advised to follow the lead of the life sciences in re-thinking their own foundations for knowledge representation and management. It is good to remember that if your world is not open, then your understanding of it is closed.

[1] See M. K. Bergman, 2009. The Open World Assumption: Elephant in the Room, December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems.

OWA is a formal logic assumption that the truth-value of a statement is independent of whether or not it is known by any single observer or agent to be true. OWA is used in knowledge representation to codify the informal notion that in general no single agent or observer has complete knowledge, and therefore cannot make the closed world assumption. The OWA limits the kinds of inference and deductions an agent can make to those that follow from statements that are known to the agent to be true. OWA is useful when we represent knowledge within a system as we discover it, and where we cannot guarantee that we have discovered or will discover complete information. In the OWA, statements about knowledge that are not included in or inferred from the knowledge explicitly recorded in the system may be considered unknown, rather than wrong or false. Semantic Web languages such as OWL make the open world assumption.

Also, you can search on OWA on this blog.

[2] Automatic DNA sequencing machines now allow direct samples to be sequenced without the need to grow up cultures of organisms. This advance has freed up the ability to take direct samples — such as from soil, seawater, skin, feces or secretions — to identify all DNA present. DNA not matching a host organism or which matches patterns for other known organisms then allows the presence of foreign organisms to be identified.

[3] An excellent piece for lay readers providing more background on this topic may be found in “The Human Microbiome: Me, Myself, Us,” in The Economist, August 18, 2012, pp. 69-72.

Posted:August 12, 2012

The Age of the Graph

The Transition from Transactions to Connections

Virtually everywhere one looks we are in the midst of a transition for how we organize and manage information, indeed even relationships. Social networks and online communities are changing how we live and interact. NoSQL and graph databases — married to their near cousin Big Data — are changing how we organize and store information and data. Semantic technologies, backed by their ontologies and RDF data model, are showing the way for how we can connect and interoperate disparate information in ways only dreamed about a decade ago. And all of this, of course, is being built upon the infrastructure of the Internet and the Web, a global, distributed network of devices and information that is undoubtedly one of the most important technological developments in human history.

There is a shared structure across all of these developments — the graph. Graphs are proving to be the new universal paradigm for how we organize and manage information. Graphs have an inherently expandable nature, and one which can also capture any existing structure. So, as we see all of the networks, connections, relationships and links — both physical and informational — grow around us, it is useful to step back a bit and contemplate the universal graph structure at the core of these developments.

Understanding that we now live in the Age of the Graph means we can begin studying and using the concept of the graph itself to better analyze and manage our interconnected world. Whether we are trying to understand the physical networks of supply chains and infrastructure or the information relationships within ontologies or knowledge graphs, the various concepts underlying graphs and graph theory, themselves expressed through a rich vocabulary of terms, provide the keys for unlocking still further treasures hidden in the structure of graphs.

Graphs as a Concept

The use of “graph” as a mathematical concept is not much more than 100 years old. The beginning explication of the various classes of problems that can be addressed by graph theory probably is no older than 300 years. The use of graphs for expressing logic structures probably is not much older than 100 years, with the intellectual roots beginning with Charles Sanders Peirce [1]. Though trade routes, with their affiliated roads and primitive transportation or nomadic infrastructures, were perhaps the first expressions of physical networks, the emergence and prevalence of networks is a fairly recent phenomenon. The Internet and the Web are surely the catalyzing development that has brought graphs and networks to the forefront.

In mathematics, a graph is an abstract representation of a set of objects where pairs of the objects are connected. The objects are most often known as nodes or vertices; the connections between the objects are called edges. Typically, a graph is depicted in diagrammatic form as a set of dots or bubbles for the nodes, joined by lines or curves for the edges. If the relationship between connected nodes runs in one direction, from one node to the other, the edge is directed, and the graph is known as a directed graph. Various structures or topologies can be expressed through this conceptual graph framework. Graphs are one of the principal focuses of study in discrete mathematics [2]. The word “graph” was first used in the sense of a mathematical structure by J.J. Sylvester in 1878 [3].
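As a minimal sketch of this definition in Python, a graph can be held as a list of node pairs and turned into an adjacency mapping. The node names and the `build_adjacency` helper are illustrative assumptions, not part of any particular library.

```python
from collections import defaultdict

def build_adjacency(edges, directed=False):
    """Return a dict mapping each node to the set of nodes it connects to."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target]  # touch the key so every node appears in the result
        if not directed:
            # An undirected edge can be followed in both directions.
            adjacency[target].add(source)
    return dict(adjacency)

# Three hypothetical nodes joined by three edges.
edges = [("A", "B"), ("B", "C"), ("A", "C")]
undirected = build_adjacency(edges)
directed = build_adjacency(edges, directed=True)
```

In the directed version, node "C" has incoming edges only, so its set of outgoing neighbors is empty; in the undirected version it is connected to both "A" and "B".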

As representative of various data models, particularly in our company’s own interests in the Resource Description Framework (RDF) model, the nodes can represent “nouns” or subjects or objects (depending on the direction of the links) or attributes. The edges or connections represent “verbs” or relationships, properties or predicates. Thus, the simple “triple” of the basic statement in RDF (consisting of subject – predicate – object) is one of the constituent barbells that make up what becomes the eventual graph structure.
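The barbell picture can be sketched in plain Python without assuming any RDF library. The example statements and the `triples_to_graph` helper below are invented for illustration; each triple contributes two nodes and one labeled edge, and the barbells join wherever the object of one statement is the subject of another.

```python
# Each statement is a (subject, predicate, object) tuple; values are illustrative.
triples = [
    ("Euler", "wrote", "Seven Bridges of Königsberg"),
    ("Seven Bridges of Königsberg", "writtenIn", "1736"),
    ("Seven Bridges of Königsberg", "isAbout", "graph theory"),
]

def triples_to_graph(statements):
    """Collect the nodes and labeled, directed edges implied by the triples."""
    nodes = set()
    edges = []
    for subject, predicate, obj in statements:
        nodes.update([subject, obj])            # subjects and objects become nodes
        edges.append((subject, obj, predicate))  # the predicate labels the edge
    return nodes, edges

nodes, labeled_edges = triples_to_graph(triples)
```

Three statements yield four nodes here, because the paper appears once as an object and twice as a subject.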

The manipulation and analysis of graph structures comes under the rubric of graph theory. The first recognized paper in that field is the Seven Bridges of Königsberg, written by Leonhard Euler in 1736. The objective of the paper was to find a walking path through the city that would cross each bridge once and only once. Euler proved that the problem has no solution:

Seven Bridges of Königsberg; from Wikipedia

Seven Bridges of Königsberg graph; from Wikipedia

Euler’s approach represented the path problem as a graph, treating the land masses as nodes and the bridges as edges. His proof observed that if every bridge is traversed exactly once, then for each land mass (except those chosen for the start and finish) the number of bridges touching it must be even, because each arrival must be paired with a departure over a different bridge. (We now call the number of connections to a node its “degree”.) Since all four land masses in Königsberg touch an odd number of bridges, there is no solution. Other researchers, including Leibniz, Cauchy and L’Huillier, worked on related problems, leading to the origin of the field of topology.
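Euler's degree argument is easy to check in a few lines of Python. This is a sketch under stated assumptions: the land-mass labels A through D are arbitrary (A is the central island), and the `has_euler_walk` helper assumes the graph is connected, which holds for Königsberg.

```python
from collections import Counter

# Each pair is one bridge between two land masses; repeated pairs are
# separate bridges joining the same two land masses.
bridges = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

def degrees(edges):
    """Count how many edges touch each node."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree

def has_euler_walk(edges):
    """True if a connected graph has a walk using every edge exactly once.

    Such a walk needs either zero odd-degree nodes (start and finish
    coincide) or exactly two (the start and the finish).
    """
    odd_nodes = [node for node, d in degrees(edges).items() if d % 2 == 1]
    return len(odd_nodes) in (0, 2)

bridge_degrees = degrees(bridges)
solvable = has_euler_walk(bridges)
```

The central island touches five bridges and the other three land masses touch three each, so all four degrees are odd and no such walk exists.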

Later, Cayley broadened the approach to study tree structures, which have many implications in theoretical chemistry. By the 20th century, the fusion of ideas coming from mathematics with those coming from chemistry formed the origin of much of the standard terminology of graph theory.

The Theory of Graphs

Graph theory forms the core of network science, the applied study of graph structures and networks. Besides graph theory, the field draws on methods including statistical mechanics from physics, data mining and information visualization from computer science, inferential modeling from statistics, and social structure from sociology. Classical problems embraced by this realm include the four color problem of maps, the traveling salesman problem, and the six degrees of Kevin Bacon.
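The "six degrees" style of question is a shortest-path problem on an unweighted graph, which breadth-first search answers directly. The sketch below uses a made-up acquaintance graph and a hypothetical `degrees_of_separation` helper; it is one standard way to do this, not a reference to any specific tool.

```python
from collections import deque

def degrees_of_separation(adjacency, start, goal):
    """Return the fewest hops from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == goal:
                return distance + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None

# Hypothetical "worked with" relationships; Eve has no connections.
worked_with = {
    "Ann": ["Bob"],
    "Bob": ["Ann", "Cy"],
    "Cy": ["Bob", "Dee"],
    "Dee": ["Cy"],
    "Eve": [],
}
```

Because breadth-first search explores nodes in order of distance, the first time it reaches the goal is guaranteed to be along a shortest chain.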

Graph theory and network science are the suitable disciplines for a variety of information structures and many additional classes of problems. This table lists many of these applicable areas, most with links to still further information from Wikipedia:

Graph Structures

Data structures
Tree structures
List structures
Matrix structures
Path structures
Networks
Logic structures
Random graphs
Weighted graphs
Sparse/dense graphs

Graph Problems

Enumeration

graphical enumeration

Subgraphs, induced subgraphs, and minors

Search and navigation

Graph coloring

four-color theorem
strong perfect graph theorem
Erdős–Faber–Lovász conjecture (unsolved)
total coloring conjecture (unsolved)
list coloring conjecture (unsolved)
Hadwiger conjecture (unsolved)

Subsumption and unification

operations between graphs
unification of graphs
automatic theorem proving
modeling linguistic structure

Route (path) problems

Hamiltonian path and cycle problems
minimum spanning tree
route inspection problem (also called the “Chinese Postman Problem”)
Seven Bridges of Königsberg
shortest path problem
Steiner tree
three-cottage problem
traveling salesman problem
critical path analysis
percolation

Matrix manipulations (many)

Network flow

max flow min cut theorem

Visibility graph problems

museum guard problem

Covering problems

Graph structure

Graph classes

enumerating the members of a class
characterizing a class in terms of forbidden substructures
ascertaining relationships among classes
deciding membership in a class
finding representations for members of a class

Graphs are among the most ubiquitous models of both natural and human-made structures. They can be used to model many types of relations and process dynamics in physical, biological and social systems. Many problems of practical interest can be represented by graphs. This breadth of applicability makes network science and graph theory two of the most critical analytical areas for study and breakthroughs for the foreseeable future. I touch on this more in the concluding section.

Graphs as Physical Networks

Surely the first examples of graph structures were early trade and nomadic routes. Here, for example, are the trade routes of the Radhanites dating from about 870 AD [4]:

Trade network of the Radhanites, c. 870 CE; from Wikipedia

It is not surprising that routes such as these, or other physical networks as exemplified by the bridges of Königsberg, were the stimulus for early mathematics and analysis related to efficient use of networks. Minimizing the time to complete a trade circuit or visiting multiple markets efficiently has clear benefits. These economic rationales apply to a wide variety of modern, physical networks, including:

Of course, included in the latter category is the Internet itself. It is the largest graph in existence, with an estimated 2.2 billion users and their devices all connected in one way or another in all parts of the globe [5].

Graphs as Natural Systems

Graphs and graph theory also have broad applicability to natural systems. For example, graph theory is used extensively to study molecular structures in chemistry and physics. A graph makes a natural model for a molecule, where vertices represent atoms and edges bonds. Similarly, in biology or ecology, graphs can readily express such systems as species networks, ecological relationships, migration paths, or the spread of diseases. Graphs are also proper structures for modeling biological and chemical pathways.

Some of the exemplar natural systems that lend themselves to graph structures include:

As with physical networks, a graph representation for natural systems provides real benefits in computer processing and analysis. Once expressed as a graph, all graph algorithms and perspectives from graph theory and network science can be brought to bear. Statistical methods are particularly applicable to representing connections between interacting parts of a system, as well as to representing the physical dynamics of natural systems.

Graphs as Social Networks

Parallel with the growth of the Internet and Web has been the growth of social networks. Social network analysis (SNA) has arguably been the single most important driver for advances in graph theory and analysis algorithms in recent years. New and interesting problems and challenges — from influence to communities to conflicts — are now being elucidated through techniques pioneered for SNA.

Second in size only to the Internet is the graph of interactions arising from Facebook. Facebook had about 900 million users as of May 2012, half of whom accessed the service via mobile devices [6]. Facebook famously embraced the graph with its own Open Graph protocol, which makes it easy for users to access and tie into Facebook’s social network. A representation of the Facebook social graph as of December 2010 is shown in this well-known figure:

Facebook Users, December 2010; from The Future Buzz (http://thefuturebuzz.com)

The suitability of the graph structure to capture relationships has been a real boon to better understanding of social and community dynamics. Many new concepts have been introduced as the result of SNA, including such things as influence, diversity, centrality, cliques and so forth. (The opening diagram to this article, for example, models centrality, with blue the maximum and red the minimum.)
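Centrality has several competing definitions. As a minimal sketch, the Python below computes normalized degree centrality (a node's number of connections divided by the maximum possible, n - 1). This is only one common measure and is not necessarily the one used for the diagram mentioned above; the star-shaped network and the `degree_centrality` helper are illustrative assumptions.

```python
def degree_centrality(adjacency):
    """Map each node to its degree divided by (number of nodes - 1)."""
    node_count = len(adjacency)
    if node_count < 2:
        return {node: 0.0 for node in adjacency}
    return {
        node: len(neighbors) / (node_count - 1)
        for node, neighbors in adjacency.items()
    }

# A hypothetical star network: one hub connected to three otherwise
# unconnected members.
star = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
}
centrality = degree_centrality(star)
```

The hub scores 1.0 because it is connected to every other node, while each outer member scores one third.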

Particular areas of social interaction that lend themselves to SNA include:

Entirely new insights have arisen from SNA including finding terrorist leaders, analyzing prestige, or identifying keystone vendors or suppliers in business ecosystems.

Graphs as Information Representations

Given the ubiquity of graphs as representations of real systems and networks, it is certainly not surprising to see their use in computer science as a means for information representation. We already saw in the table above the many data structures that can be represented as graphs, but the paradigm has even broader applicability.

The critical breakthroughs have come through using the graph as a basis for data models and logic models. These, in turn, provide the basis for crafting entire graph-based vocabularies and languages. Once such structures are embraced, it is natural to extend the mindset to graph databases as well.

Some of the notable information representations that have a graph as their basis include:

Graph-structured data
Graph databases
RDF data model
OWL ontology language
Common logic
Conceptual graphs, and
Peirce’s symbolic or relational logic, as well as his use of triadic relations [1].

Graphs as Knowledge Representations

A key point of graphs noted earlier was their inherent extensibility. Once graphs are understood as a great basis for representing both logic and data structures, it is a logical next step to see their applicability extend to knowledge representations and knowledge bases as well.

Graph-theoretic methods have proven particularly useful in linguistics, since natural language often lends itself well to discrete structure. So, not only can graphs represent syntactic and compositional structure, but they can also capture the interrelationships of terms and concepts within those languages. The usefulness of graph theory to linguistics is shown by the various knowledge bases such as WordNet (in various languages) and VerbNet.

Domain ontologies are similar structures, capturing the relationships amongst concepts within a given knowledge domain. These are also known as knowledge graphs, and Google has famously just released its graph of entities to the world [7]. Semantic networks and neural networks are similar knowledge representations.

What all of these examples show is the nearly universal applicability of graphs, from the abstract to the physical, from the small to the large, and every gradation between. We also see how basic graph structures and concepts can be built upon with more structure. This breadth points to the many synergies and innovations that may be transferred from diverse fields to advance the usefulness of graph theories.

Graphs as a Guiding Paradigm

Despite the many advances that have occurred in graph theory and the increased attention from social network analysis, many, many graph problems remain some of the hardest in computation. Optimizations, partitioning, mapping, inferencing, traversing and graph structure comparisons remain challenging. And, some of these challenges are only growing due to the growth in the size of networks and graphs.

Applying the lessons of the Internet in such areas as non-relational databases, distributed processing, and big data and MapReduce-oriented approaches will help some in this regard. We’re learning how to divide and conquer big problems, and we are discovering data and processing architectures more amenable to graph-based problems.

The fact that we have now entered the Age of the Graph also suggests that further scrutiny and attention will lead to more analytic breakthroughs and innovation. We may be in an era of Big Data, but the structure underlying all of that is the graph. And that reality, I predict, will result in accelerated advances in graph theory.

[1] For a fairly broad discussion of Peirce in relation to these topics, see M.K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?,” in AI3:::Adaptive Innovation blog, January 24, 2012. See https://www.mkbergman.com/994/give-me-a-sign-what-do-things-mean-on-the-semantic-web/.

[2] Topics in discrete mathematics, which are all applicable to graphing techniques and theory, include theoretical computer science, information theory, logic, set theory, combinatorics, probability, number theory, algebra, geometry, topology, discrete calculus or discrete analysis, operations research, game theory, decision theory, utility theory, social choice theory, and all discrete analogues of continuous mathematics.

[3] See reference 1 in the Wikipedia entry on graph theory.

[4] According to Wikipedia, the Radhanites were medieval Jewish merchants involved in trade between the Christian and Islamic worlds during the early Middle Ages (approx. 500–1000 AD). Many trade routes previously established under the Roman Empire continued to function during that period largely through their efforts. Their trade network covered much of Europe, North Africa, the Middle East, Central Asia and parts of India and China.

[5] See the article on the Internet in Wikipedia for various size estimates.

[6] See the article on Facebook in Wikipedia for various size estimates.

[7] For my discussion of the Google Knowledge Graph, see M.K. Bergman, 2012. “Deconstructing the Google Knowledge Graph,” in AI3:::Adaptive Innovation blog, May 18, 2012. See https://www.mkbergman.com/1009/deconstructing-the-google-knowledge-graph/.

[8] UMBEL (the Upper Mapping and Binding Exchange Layer) is designed to help content interoperate on the Web. It provides two functions: a) it is a broad, general reference structure of 25,000 concepts, which provides a scaffolding to link and interoperate other datasets and domain vocabularies, and b) it is a base vocabulary for the construction of other concept-based domain ontologies, also designed for interoperation.

Posted:July 9, 2012

Glossary of Semantic Technology Terms

Abrogans; earliest glossary (from Wikipedia)

There are many semantic technology terms relevant to the context of a semantic technology installation [1]. Some of these are general terms related to language standards, as well as to ontologies or the dataset concept.

ABox: An ABox (for assertions, the basis for A in ABox) is an “assertion component”; that is, a fact associated with a terminological vocabulary within a knowledge base. ABox assertions are TBox-compliant statements about instances belonging to the concepts of an ontology.

Adaptive ontology: An adaptive ontology is a conventional knowledge representational ontology that has added to it a number of specific best practices, including modeling the ABox and TBox constructs separately; information that relates specific types to different and appropriate display templates or visualization components; use of preferred labels for user interfaces, as well as alternative labels and hidden labels; defined concepts; and a design that adheres to the open world assumption.

Administrative ontology: Administrative ontologies govern internal application use and user interface interactions.

Annotation: An annotation, specifically as an annotation property, is a way to provide metadata or to describe vocabularies and properties used within an ontology. Annotations do not participate in reasoning or coherency testing for ontologies.

Atom: The name Atom applies to a pair of related standards. The Atom Syndication Format is an XML language used for web feeds, while the Atom Publishing Protocol (APP for short) is a simple HTTP-based protocol for creating and updating Web resources.

Attributes: These are the aspects, properties, features, characteristics, or parameters that objects (and classes) may have. They are the descriptive characteristics of a thing. Key-value pairs match an attribute with a value; the value may be a reference to another object, an actual value or a descriptive label or string. In an RDF statement, an attribute is expressed as a property (or predicate or relation). In intensional logic, all attributes or characteristics of similarly classifiable items define the membership in that set.

Axiom: An axiom is a premise or starting point of reasoning. In an ontology, each statement (assertion) is an axiom.

Binding: Binding is the creation of a simple reference to something that is larger and more complicated and used frequently. The simple reference can be used instead of having to repeat the larger thing.

Class: A class is a collection of sets or instances (or sometimes other mathematical objects) which can be unambiguously defined by a property that all of its members share. In ontologies, classes may also be known as sets, collections, concepts, types of objects, or kinds of things.

Closed World Assumption: CWA is the presumption that what is not currently known to be true, is false. CWA also has a logical formalization. CWA is the most common logic applied to relational database systems, and is particularly useful for transaction-type systems. In knowledge management, the closed world assumption is used in at least two situations: 1) when the knowledge base is known to be complete (e.g., a corporate database containing records for every employee), and 2) when the knowledge base is known to be incomplete but a “best” definite answer must be derived from incomplete information. Contrast this with the open world assumption.
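The practical difference between the two assumptions shows up when a query asks about something the knowledge base does not mention. The following Python sketch is a toy illustration only: the facts, the explicit set of negated statements, and both `ask_*` helpers are assumptions made for this example, not how any particular reasoner works.

```python
# Statements known to be true, and statements explicitly known to be false.
known_true = {("Alice", "worksFor", "Acme")}
known_false = {("Bob", "worksFor", "Acme")}

def ask_closed_world(facts, statement):
    """Closed world: anything not known to be true is treated as false."""
    return statement in facts

def ask_open_world(facts, negated_facts, statement):
    """Open world: a missing statement is unknown (None), not false."""
    if statement in facts:
        return True
    if statement in negated_facts:
        return False
    return None

query = ("Carol", "worksFor", "Acme")
closed_answer = ask_closed_world(known_true, query)
open_answer = ask_open_world(known_true, known_false, query)
```

Nothing is recorded about Carol, so the closed world query answers False while the open world query answers None, meaning simply not known.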

Data Space: A data space may be personal, collective or topical, and is a virtual “container” for related information irrespective of storage location, schema or structure.

Dataset: An aggregation of similar kinds of things or items, mostly comprised of instance records.

DBpedia: A project that extracts structured content from Wikipedia, and then makes that data available as linked data. There are millions of entities characterized by DBpedia in this way. As such, DBpedia is one of the largest — and most central — hubs for linked data on the Web.

DOAP: DOAP (Description Of A Project) is an RDF schema and XML vocabulary to describe open-source projects.

Description logics: Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.

Domain ontology: Domain (or content) ontologies embody more of the traditional ontology functions such as information interoperability, inferencing, reasoning and conceptual and knowledge capture of the applicable domain.

Entity: An individual object or member of a class; when affixed with a proper name or label is also known as a named entity (thus, named entities are a subset of all entities).

Entity–attribute–value model: EAV is a data model to describe entities where the number of attributes (properties, parameters) that can be used to describe them is potentially vast, but the number that will actually apply to a given entity is relatively modest. In the EAV data model, each attribute-value pair is a fact describing an entity. EAV systems trade off simplicity in the physical and logical structure of the data for complexity in their metadata, which, among other things, plays the role that database constraints and referential integrity do in standard database designs.
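A small Python sketch can show why the EAV layout suits sparse attributes. The rows, the attribute names and the `rows_to_records` helper are hypothetical; each row is one attribute-value fact about an entity, and entities carry only the attributes that actually apply to them.

```python
from collections import defaultdict

# Each row is (entity, attribute, value); data is invented for illustration.
eav_rows = [
    ("patient-1", "age", 42),
    ("patient-1", "blood_type", "O+"),
    ("patient-2", "age", 37),
    ("patient-2", "allergy", "penicillin"),
]

def rows_to_records(rows):
    """Group EAV rows into one attribute dictionary per entity."""
    records = defaultdict(dict)
    for entity, attribute, value in rows:
        records[entity][attribute] = value
    return dict(records)

records = rows_to_records(eav_rows)
```

No row is stored for an attribute that does not apply, so the first patient simply has no "allergy" entry rather than an empty column.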

Extensional: The extension of a class, concept, idea, or sign consists of the things to which it applies, in contrast with its intension. For example, the extension of the word “dog” is the set of all (past, present and future) dogs in the world. The extension is most akin to the attributes or characteristics of the instances in a set defining its class membership.

FOAF: FOAF (Friend of a Friend) is an RDF schema for machine-readable modeling of homepage-like profiles and social networks.

Folksonomy: A folksonomy is a user-generated set of open-ended labels called tags organized in some manner and used to categorize and retrieve Web content such as Web pages, photographs, and Web links.

GeoNames: GeoNames integrates geographical data such as names of places in various languages, elevation, population and others from various sources.

GRDDL: GRDDL is a markup format for Gleaning Resource Descriptions from Dialects of Languages; that is, for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT.

High-level Subject: A high-level subject is both a subject proxy and category label used in a hierarchical subject classification scheme (taxonomy). Higher-level subjects are classes for more atomic subjects, with the height of the level representing broader or more aggregate classes.

Individual: See Instance.

Inferencing: Inference is the act or process of deriving logical conclusions from premises known or assumed to be true. The logic within and between statements in an ontology is the basis for inferring new conclusions from it, using software applications known as inference engines or reasoners.
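As a toy illustration of inference, the Python below derives class memberships that were never stated directly by following subclass links. The class names, the single-parent hierarchy and the `inferred_classes` helper are simplifying assumptions; real reasoners handle far richer logic.

```python
# Schema statements: each class maps to its (single) parent class.
subclass_of = {"Dog": "Mammal", "Mammal": "Animal"}

# Instance statements: each individual maps to its asserted class.
asserted_type = {"Rex": "Dog"}

def inferred_classes(individual, types, hierarchy):
    """Return the asserted class followed by every ancestor class."""
    classes = []
    seen = set()
    current = types.get(individual)
    # Walk up the hierarchy; the seen set guards against accidental cycles.
    while current is not None and current not in seen:
        classes.append(current)
        seen.add(current)
        current = hierarchy.get(current)
    return classes

rex_classes = inferred_classes("Rex", asserted_type, subclass_of)
```

Only "Rex is a Dog" was asserted, yet the hierarchy lets us conclude that Rex is also a Mammal and an Animal.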

Instance: Instances are the basic, “ground level” components of an ontology. An instance is an individual member of a class, also used synonymously with entity. The instances in an ontology may include concrete objects such as people, animals, tables, automobiles, molecules, and planets, as well as abstract instances such as numbers and words. An instance is also known as an individual, with member and entity also used somewhat interchangeably.

Instance record: An instance with one or more attributes also provided.

irON: irON (instance record and Object Notation) is an abstract notation and associated vocabulary for specifying RDF (Resource Description Framework) triples and schema in non-RDF forms. Its purpose is to allow users and tools in non-RDF formats to stage interoperable datasets using RDF.

Intensional: The intension of a class is what is intended as a definition of what characteristics its members should have; it is akin to a definition of a concept and what is intended for a class to contain. It is therefore like the schema aspects (or TBox) in an ontology.

Key-value pair: Also known as a name–value pair or attribute–value pair, a key-value pair is a fundamental, open-ended data representation. All or part of the data model may be expressed as a collection of tuples <attribute name, value> where each element is a key-value pair. The key is the defined attribute and the value may be a reference to another object or a literal string or value. In RDF triple terms, the subject is implied in a key-value pair by nature of the instance record at hand.
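The point about the implied subject can be made concrete with a short Python sketch. The record contents, the identifiers and the `record_to_triples` helper are hypothetical; the idea is simply that supplying the record's identifier turns each key-value pair into a full subject, predicate, object statement.

```python
def record_to_triples(subject, record):
    """Expand an instance record's key-value pairs into triples."""
    return [(subject, key, value) for key, value in record.items()]

# A hypothetical instance record; "owner" refers to another object,
# while "name" holds a literal string.
record = {"name": "Rex", "type": "Dog", "owner": "person-7"}
record_triples = record_to_triples("dog-1", record)
```

Each pair gains the same subject, "dog-1", which the record itself never had to repeat.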

Kind: Used synonymously herein with class.

Knowledge base: A knowledge base (abbreviated KB or kb) is a special kind of database for knowledge management. A knowledge base provides a means for information to be collected, organized, shared, searched and utilized. Formally, the combination of a TBox and ABox is a knowledge base.

Linkage: A specification that relates an object or attribute name to its full URI (as required in the RDF language).

Linked data: Linked data is a set of best practices for publishing and deploying instance and class data using the RDF data model, and uses uniform resource identifiers (URIs) to name the data objects. The approach exposes the data for access via the HTTP protocol, while emphasizing data interconnections, interrelationships and context useful to both humans and machine agents.

Mapping: A considered correlation of objects in two different sources to one another, with the relation between the objects defined via a specific property. Linkage is a subset of possible mappings.

Member: Used synonymously herein with instance.

Metadata: Metadata (metacontent) is supplementary data that provides information about one or more aspects of the content at hand such as means of creation, purpose, when created or modified, author or provenance, where located, topic or subject matter, standards used, or other annotation characteristics. It is “data about data”, or the means by which data objects or aggregations can be described. Contrasted to an attribute, which is an individual characteristic intrinsic to a data object or instance, metadata is a description about that data, such as how or when created or by whom.

Metamodeling: Metamodeling is the analysis, construction and development of the frames, rules, constraints, models and theories applicable and useful for modeling a predefined class of problems.

Microdata: Microdata is a proposed specification used to nest semantics within existing content on web pages. Microdata is an attempt to provide a simpler way of annotating HTML elements with machine-readable tags than the similar approaches of using RDFa or microformats.

Microformats: A microformat (sometimes abbreviated μF or uF) is a piece of mark up that allows expression of semantics in an HTML (or XHTML) web page. Programs can extract meaning from a web page that is marked up with one or more microformats.

Natural language processing: NLP is the process of a computer extracting meaningful information from natural language input and/or producing natural language output. NLP is one method for assigning structured data characterizations to text content for use in semantic technologies. (Hand assignment is another method.) Some of the specific NLP techniques and applications relevant to semantic technologies include automatic summarization, coreference resolution, machine translation, named entity recognition (NER), question answering, relationship extraction, topic segmentation and recognition, word segmentation, and word sense disambiguation, among others.

OBIE: Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. Ontology-based information extraction (OBIE) is the use of an ontology to inform a “tagger” or information extraction program when doing natural language processing. Input ontologies thus become the basis for generating metadata tags when tagging text or documents.

Ontology: An ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. Loosely defined, ontologies on the Web can have a broad range of formalism, or expressiveness or reasoning power.

Ontology-driven application: Ontology-driven applications (or ODapps) are modular, generic software applications designed to operate in accordance with the specifications contained in one or more ontologies. The relationships and structure of the information driving these applications are based on the standard functions and roles of ontologies (namely as domain ontologies), as supplemented by UI and instruction sets and validations and rules.

Open Semantic Framework: The open semantic framework, or OSF, is a combination of a layered architecture and an open-source, modular software stack. The stack combines many leading third-party software packages with open source semantic technology developments from Structured Dynamics.

Open World Assumption: OWA is a formal logic assumption that the truth-value of a statement is independent of whether or not it is known by any single observer or agent to be true. OWA is used in knowledge representation to codify the informal notion that in general no single agent or observer has complete knowledge, and therefore cannot make the closed world assumption. The OWA limits the kinds of inference and deductions an agent can make to those that follow from statements that are known to the agent to be true. OWA is useful when we represent knowledge within a system as we discover it, and where we cannot guarantee that we have discovered or will discover complete information. In the OWA, statements about knowledge that are not included in or inferred from the knowledge explicitly recorded in the system may be considered unknown, rather than wrong or false. Semantic Web languages such as OWL make the open world assumption. Contrast with the closed world assumption.

OPML: OPML (Outline Processor Markup Language) is an XML format for outlines, and is commonly used to exchange lists of web feeds between web feed aggregators.

OWL: The Web Ontology Language (OWL) is designed for defining and instantiating formal Web ontologies. An OWL ontology may include descriptions of classes, along with their related properties and instances. There are also a variety of OWL dialects.

Predicate: See Property.

Property: Properties are the ways in which classes and instances can be related to one another. Properties are thus a relationship, and are also known as predicates. Properties are used to define an attribute relation for an instance.

Punning: In computer science, punning refers to a programming technique that subverts or circumvents the type system of a programming language, by allowing a value of a certain type to be manipulated as a value of a different type. When used for ontologies, it means to treat a thing as both a class and an instance, with the use depending on context.

RDF: Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata model but which has come to be used as a general method of modeling information, through a variety of syntax formats. The RDF metadata model is based upon the idea of making statements about resources in the form of subject-predicate-object expressions, called triples in RDF terminology. The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object.

RDFa: RDFa 1.0 is a set of extensions to XHTML that is a W3C Recommendation. RDFa uses attributes from meta and link elements, and generalizes them so that they are usable on all elements allowing annotation markup with semantics. A W3C Working draft is presently underway that expands RDFa into version 1.1 with HTML5 and SVG support, among other changes.

RDF Schema: RDFS or RDF Schema is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources.

Reasoner: A semantic reasoner, reasoning engine, rules engine, or simply a reasoner, is a piece of software able to infer logical consequences from a set of asserted facts or axioms. The notion of a semantic reasoner generalizes that of an inference engine, by providing a richer set of mechanisms.

Reasoning: Reasoning is one of many logical tests using inference rules as commonly specified by means of an ontology language, and often a description language. Many reasoners use first-order predicate logic to perform reasoning; inference commonly proceeds by forward chaining or backward chaining.

Record: As used herein, a shorthand reference to an instance record.

Relation: Used synonymously herein with attribute.

RSS: RSS (an acronym for Really Simple Syndication) is a family of web feed formats used to publish frequently updated digital content, such as blogs, news feeds or podcasts.

schema.org: schema.org is an initiative launched by the major search engines Bing, Google and Yahoo!, and later joined by Yandex, to create and support a common set of schemas for structured data markup on web pages. schema.org provides a starter set of schemas and extension mechanisms for adding to them. schema.org supports markup in microdata, microformat and RDFa formats.

Semantic enterprise: An organization that uses semantic technologies and the languages and standards of the semantic Web, including RDF, RDFS, OWL, SPARQL and others to integrate existing information assets, using the best practices of linked data and the open world assumption, and targeting knowledge management applications.

Semantic technology: Semantic technologies are a combination of software and semantic specifications that encodes meanings separately from data and content files and separately from application code. This approach enables machines as well as people to understand, share and reason with data and specifications separately. With semantic technologies, adding, changing and implementing new relationships or interconnecting programs in a different way can be as simple as changing the external model that these programs share. New data can also be brought into the system and visualized or worked upon based on the existing schema. Semantic technologies provide an abstraction layer above existing IT technologies that enables bridging and interconnection of data, content, and processes.

Semantic Web: The Semantic Web is a collaborative movement led by the World Wide Web Consortium (W3C) that promotes common formats for data on the World Wide Web. By encouraging the inclusion of semantic content in web pages, the Semantic Web aims at converting the current web of unstructured documents into a “web of data”. It builds on the W3C’s Resource Description Framework (RDF).

Semset: A semset is the use of a series of alternate labels and terms to describe a concept or entity. These alternatives include true synonyms, but may also be more expansive and include jargon, slang, acronyms or alternative terms that usage suggests refers to the same concept.
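A semset can be sketched as a simple lookup structure. In the example below, the concept identifiers, the labels and the `lookup` helper are assumptions chosen for illustration; a real semset would be drawn from the ontology's own alternative labels.

```python
# Hypothetical semsets: each concept carries synonyms, acronyms and slang.
SEMSETS = {
    "ex:Automobile": {"automobile", "car", "auto", "motorcar", "wheels"},
    "ex:UnitedStates": {"united states", "usa", "u.s.", "america", "the states"},
}

def build_index(semsets):
    """Invert the semsets so that any label points back to its concept(s)."""
    index = {}
    for concept, labels in semsets.items():
        for label in labels:
            index.setdefault(label.lower(), set()).add(concept)
    return index

INDEX = build_index(SEMSETS)

def lookup(term):
    """Return the set of concepts a term may refer to (empty if none)."""
    return INDEX.get(term.strip().lower(), set())

print(lookup("Car"))
print(lookup("USA"))
print(lookup("boat"))
```

Returning a set rather than a single concept leaves room for ambiguous labels, which is where disambiguation by context would come in.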

SIOC: Semantically-Interlinked Online Communities Project (SIOC) is based on RDF and is an ontology defined using RDFS for interconnecting discussion methods such as blogs, forums and mailing lists to each other.

SKOS: SKOS or Simple Knowledge Organisation System is a family of formal languages designed for representation of thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary; it is built upon RDF and RDFS.

SKSI: Semantic Knowledge Source Integration provides a declarative mapping language and API between external sources of structured knowledge and the Cyc knowledge base.

SPARQL: SPARQL (pronounced “sparkle”) is an RDF query language; its name is a recursive acronym that stands for SPARQL Protocol and RDF Query Language.
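The core idea behind an RDF query, matching triple patterns that contain variables, can be sketched in plain Python. This is not SPARQL itself and not any library's API; the `match` helper and all identifiers are hypothetical, with `None` standing in for a query variable.

```python
# A tiny in-memory set of triples; all identifiers are made up.
TRIPLES = [
    ("ex:jane", "rdf:type", "ex:Person"),
    ("ex:jane", "ex:worksFor", "ex:acme"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:acme", "rdf:type", "ex:Company"),
]

def match(pattern, triples=TRIPLES):
    """Return triples matching an (s, p, o) pattern; None is a wildcard."""
    return [
        triple for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]

# Roughly what "SELECT ?s WHERE { ?s rdf:type ex:Person }" asks for.
people = [s for s, _, _ in match((None, "rdf:type", "ex:Person"))]
print(people)  # ['ex:jane', 'ex:bob']
```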

Statement: A statement is a “triple” in an ontology, which consists of a subject – predicate – object (S-P-O) assertion. By definition, each statement is a “fact” or axiom within an ontology.

Subject: A subject is always a noun or compound noun and is a reference or definition to a particular object, thing or topic, or groups of such items. Subjects are also often referred to as concepts or topics.

Subject extraction: Subject extraction is an automatic process for retrieving and selecting subject names from existing knowledge bases or data sets. Extraction methods involve parsing and tokenization, and then generally the application of one or more information extraction techniques or algorithms.

Subject proxy: A subject proxy is a canonical name or label for a particular object; other terms or controlled vocabularies may be mapped to this label to assist disambiguation. A subject proxy is always representative of its object but is not the object itself.

Tag: A tag is a keyword or term associated with or assigned to a piece of information (e.g., a picture, article, or video clip), thus describing the item and enabling keyword-based classification of information. Tags are usually chosen informally by either the creator or consumer of the item.

TBox: A TBox (for terminological knowledge, the basis for T in TBox) is a “terminological component”; that is, a conceptualization associated with a set of facts. TBox statements describe a conceptualization, a set of concepts and properties for these concepts. The TBox is sufficient to describe an ontology (best practice often suggests keeping a split between the instance records, or ABox, and the TBox schema).

Taxonomy: In the context of knowledge systems, taxonomy is the hierarchical classification of entities of interest of an enterprise, organization or administration, used to classify documents, digital assets and other information. Taxonomies can cover virtually any type of physical or conceptual entities (products, processes, knowledge fields, human groups, etc.) at any level of granularity.

Topic: The topic (or theme) is the part of the proposition that is being talked about (predicated). In topic maps, the topic may represent any concept, from people, countries, and organizations to software modules, individual files, and events. Topics and subjects are closely related.

Topic Map: Topic maps are an ISO standard for the representation and interchange of knowledge. A topic map represents information using topics, associations (similar to a predicate relationship), and occurrences (which represent relationships between topics and information resources relevant to them), quite similar in concept to the RDF triple.

Triple: A basic statement in the RDF language, which is composed of a subject – property – object construct, with the subject and property (and object optionally) referenced by URIs.

Type: Used synonymously herein with class.

UMBEL: UMBEL, short for Upper Mapping and Binding Exchange Layer, is an upper ontology of about 28,000 reference concepts, designed to provide common mapping points for relating different ontologies or schema to one another, and a vocabulary for aiding that ontology mapping, including expressions of likelihood relationships distinct from exact identity or equivalence. This vocabulary is also designed for interoperable domain ontologies.

Upper ontology: An upper ontology (also known as a top-level ontology or foundation ontology) is an ontology that describes very general concepts that are the same across all knowledge domains. An important function of an upper ontology is to support very broad semantic interoperability between a large number of ontologies that rank “under” this upper ontology.

Vocabulary: A vocabulary in the sense of knowledge systems or ontologies is a controlled vocabulary. Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, taxonomies and other forms of knowledge organization systems.

WordNet: WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The database and software tools can be downloaded and used freely. Multiple language versions exist, and WordNet is a frequent reference structure for semantic applications.

YAGO: “Yet another great ontology” is a WordNet structure placed on top of Wikipedia.

[1] This glossary is based on the one provided on the OSF TechWiki. For the latest version, please refer to this link.

Posted:July 2, 2012

The Rationale for Semantic Technologies

Conventional IT Systems are Poorly Suited to Knowledge Applications

Frequently customers ask me why semantic technologies should be used instead of conventional information technologies. In the areas of knowledge representation (KR) and knowledge management (KM), there are compelling reasons and benefits for selecting semantic technologies over conventional approaches. This article attempts to summarize these rationales from a layperson perspective.

It is important to recognize that semantic technologies are orthogonal to the buzz around some other current technologies, including cloud computing and big data. Semantic technologies are also not limited to open data: they are equivalently useful to private or proprietary data. It is also important to note that semantic technologies do not imply some grand, shared schema for organizing all information. Semantic technologies are not “one ring to rule them all,” but rather a way to capture the world views of particular domains and groups of stakeholders. Lastly, semantic technologies done properly are not a replacement for existing information technologies, but rather an added layer that can leverage those assets for interoperability and to overcome the semantic barriers between existing information silos.

Nature of the World

The world is a messy place. Not only is it complicated and richly diverse, but our ways of describing and understanding it are made more complex by differences in language and culture.

We also know the world to be interconnected and interdependent. Effects of one change can propagate into subtle and unforeseen effects. And, not only is the world constantly changing, but so is our understanding of what exists in the world and how it affects and is affected by everything else.

This means we are always uncertain to a degree about how the world works and the dynamics of its working. Through education and research we continually strive to learn more about the world, but often in that process find what we thought was true is no longer so and even our own human existence is modifying our world in manifest ways.

Knowledge is very similar to this nature of the world. We find that knowledge is never complete and it can be found anywhere and everywhere. We capture and codify knowledge in structured, semi-structured and unstructured forms, ranging from “soft” to “hard” information. We find that the structure of knowledge evolves with the incorporation of more information.

We often see that knowledge is not absolute, but contextual. That does not mean that there is no such thing as truth, but that knowledge should be coherent, to reflect a logical consistency and structure that comports with our observations about the physical world. Knowledge, like the world, is constantly changing; we thus must constantly adapt to what we observe and learn.

Knowledge Representation, Not Transactions

These observations about the world and knowledge are not platitudes but important guideposts for how we should organize and manage information, the field known as “information technology.” For IT to truly serve the knowledge function, its logical bases should be consistent with the inherent nature of the world and knowledge.

By knowledge functions we mean those areas of various computer applications that come under the rubrics of search, business intelligence, competitive intelligence, planning, forecasting, data federation, data warehousing, knowledge management, enterprise information integration, master data management, knowledge representation, and so forth. These applications are distinctly different from the earliest and traditional concerns of IT systems: accounting and transactions.

A transaction system — such as calculating revenue based on seats on a plane, the plane’s occupancy, and various rate classes — is a closed system. We can count the seats, we know the number of customers on board, and we know their rate classes and payments. Much can be done with this information, including yield and profitability analysis and other conventional ways of accounting for costs or revenues or optimizations.

But, as noted, neither the world nor knowledge is a closed system. Trying to apply legacy IT approaches to knowledge problems is fraught with difficulties. That is the reason that for more than four decades enterprises have seen massive cost overruns and failed projects in applying conventional IT approaches to knowledge problems: traditional IT is fundamentally mismatched to the nature of the problems at hand.

What works efficiently for transactions and accounting is a miserable failure applied to knowledge problems. Traditional relational databases work best with structured data; are inflexible and fragile when the nature (schema) of the world changes; and thus require constant (and expensive) re-architecting in the face of new knowledge or new relationships.

Of course, often knowledge problems do consider fixed entities with fixed attributes to describe them. In these cases, relational data systems can continue to act as valuable contributors and data managers of entities and their attributes. But, in the role of organizing across schema or dealing with semantics and differences of definition and scope – that is, the common types of knowledge questions – a much different integration layer with a much different logic basis is demanded.

The New Open World Paradigm

The first change that is demanded is to shift the logic paradigm of how knowledge and the world are modeled. In contrast to the closed-world approach of transaction systems, IT systems based on the logical premise of the open world assumption (OWA) mean:

Lack of a given assertion does not imply whether it is true or false; it simply is not known
A lack of knowledge does not imply falsity
Everything is permitted until it is prohibited
Schema can be incremental without re-architecting prior schema (“extensible”), and
Information at various levels of incompleteness can be combined.
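The practical difference between the two assumptions can be sketched with a toy query function. The facts, the explicit `DENIED` set and the function names below are hypothetical; real OWA reasoners are far richer, but the three-valued answer is the essential point.

```python
# Made-up facts for illustration only.
FACTS = {("ex:jane", "ex:worksFor", "ex:acme")}
DENIED = {("ex:jane", "ex:worksFor", "ex:globex")}  # explicitly asserted as false

def ask_closed(triple):
    """Closed world: anything not recorded is treated as false."""
    return triple in FACTS

def ask_open(triple):
    """Open world: a missing assertion is simply not known."""
    if triple in FACTS:
        return "true"
    if triple in DENIED:
        return "false"
    return "unknown"

question = ("ex:bob", "ex:worksFor", "ex:acme")
print(ask_closed(question))  # False
print(ask_open(question))    # unknown

# New knowledge is added incrementally; nothing already recorded is reworked.
FACTS.add(question)
print(ask_open(question))    # true
```

Under the closed-world reading, the first answer would have been a wrong "no" rather than an honest "not known yet".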

Much more can be said about OWA, including formal definitions of the logics underlying it [1], but even from the statements above, we can see that the right logic for most knowledge representation (KR) problems is the open world approach.

This logic mismatch is perhaps the most fundamental cause of failures, cost overruns, and disappointing deliverables for KM and KR projects over the years. But, like the fingertip between the eyes that cannot be seen because it is too close at hand, the importance of this logic mismatch strangely continues to be overlooked.

Integrating All Forms of Information

Data exists in many forms and of many natures. As one classification scheme, there are:

Structured data — information presented according to a defined data model, often found in relational databases or other forms of tabular data
Semi-structured data — does not conform to the formal structure of data models, but contains tags or other markers to denote fields within the content. Markup languages embedded in text are a common form of such sources
Unstructured data — information content, generally oriented to text, that lacks an explicit data model or schema; structured information can be obtained from it via data mining or information extraction.

Further, these types of data may be “soft”, such as social information or opinion, or “hard”, more akin to measurable facts or quantities.

These various forms may also be serialized in a variety of data formats or data transfer protocols, some using plain text with a myriad of syntaxes or markup vocabularies, others using scripts or encoded or binary forms.

Still further, any of these data forms may be organized according to a separate schema that describes the semantics and relationships within the data.

These variations further complicate the inherently diverse nature of the world and knowledge of it. A suitable data model for knowledge representation must therefore have the power to be able to capture the form, format, serialization or schema of any existing data within the diversity of these options.

The Resource Description Framework (RDF) data model has such capabilities [2]. Any extant data form or schema (from the simple to the complex) can be converted to the RDF data model. This capability enables RDF to act as a “universal solvent” for all information.
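To make the “universal solvent” idea concrete, here is a minimal sketch that reduces a tabular source and a semi-structured source to the same subject-predicate-object form. The sample data, the `ex:` prefixes and both converter functions are assumptions for illustration; they are not the converters of any particular framework.

```python
import csv
import io
import json

# Made-up sample sources: one structured (CSV), one semi-structured (JSON).
CSV_TEXT = "id,name,city\n1,Jane,Oslo\n2,Bob,Lima\n"
JSON_TEXT = '{"id": "3", "name": "Ana", "tags": ["runner", "chess"]}'

def csv_to_triples(text, base="ex:person/"):
    """Each row becomes a subject; each other column becomes a predicate."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = base + row["id"]
        for column, value in row.items():
            if column != "id":
                triples.append((subject, "ex:" + column, value))
    return triples

def json_to_triples(text, base="ex:person/"):
    """Each key becomes a predicate; list values yield one triple per item."""
    doc = json.loads(text)
    subject = base + doc["id"]
    triples = []
    for key, value in doc.items():
        if key == "id":
            continue
        values = value if isinstance(value, list) else [value]
        triples.extend((subject, "ex:" + key, item) for item in values)
    return triples

graph = csv_to_triples(CSV_TEXT) + json_to_triples(JSON_TEXT)
for triple in graph:
    print(triple)
```

Once both sources share one shape, a single set of downstream tools can work over the combined result.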

Once converted to this “canonical” form, RDF can then act as a single representation around which to design applications and other converters (for “round-tripping” to legacy systems, for example), as illustrated by this diagram:

Generic tools can then be driven by the RDF data model, which leads to fewer applications required and lower overall development costs.

Lastly, RDF can represent everything from simple assertions (“Jane runs fast”) to complex vocabularies and languages. It is in this latter role that RDF can begin to represent the complexity of an entire domain via what is called an “ontology” or “knowledge graph.”

Connections Create Graphs

When representing knowledge, more things and concepts get drawn into consideration. In turn, the relationships of these things lead to connections between them to capture the inherent interdependence and linkages of the world. As still more things get considered, more connections are made and proliferate.

This process naturally leads to a graph structure, with the things in the graphs represented as nodes and the relationships between them represented as connecting edges. More things and more connections lead to more structure. Insofar as this structure and its connections are coherent, the natural structure of the knowledge graph itself can help lead to more knowledge and understanding.
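A small sketch shows how a graph falls out of a list of assertions: subjects and objects become nodes, and each assertion becomes a labeled edge. The `ex:` identifiers and the `build_graph` helper are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Illustrative assertions; the identifiers are made up for this sketch.
TRIPLES = [
    ("ex:MarieCurie", "ex:won", "ex:NobelPrizePhysics"),
    ("ex:MarieCurie", "ex:won", "ex:NobelPrizeChemistry"),
    ("ex:MarieCurie", "ex:spouseOf", "ex:PierreCurie"),
    ("ex:PierreCurie", "ex:won", "ex:NobelPrizePhysics"),
]

def build_graph(triples):
    """Map each node to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
        graph.setdefault(obj, [])  # object-only nodes must appear too
    return graph

graph = build_graph(TRIPLES)
print(sorted(graph))                                # the four nodes
print(sum(len(edges) for edges in graph.values()))  # 4 edges

# One more assertion adds a node and an edge; nothing else is restructured.
bigger = build_graph(TRIPLES + [("ex:IreneJoliotCurie", "ex:childOf", "ex:MarieCurie")])
print(len(bigger) - len(graph))                     # 1
```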

How one such graph may emerge is shown by this portion of the recently announced Google Knowledge Graph [3], showing female Nobel prize winners:

Unlike traditional data tables, graphs have a number of inherent benefits, particularly for knowledge representations. They provide:

A coherent way to navigate the knowledge space
Flexible entry points for each user to access that knowledge (since every node is a potential starting point)
Inferencing and reasoning structures about the space
Connections to related information
Ability to connect to any form of information
Concept mapping, and thus the ability to integrate external content
A framework to disambiguate concepts based on relations and context, and
A common vocabulary to drive content “tagging”.

Graphs are the natural structures for knowledge domains.

Network Analysis is the New Algebra

Once built, graphs offer some analytical capabilities not available through traditional means of information structure. Graph analysis is a rapidly emerging field, but it is already possible to gauge some unique measures of knowledge domains:

Influence
Relatedness
Proximity
Centrality
Inference
Clustering
Shortest paths
Diffusion.
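Two of the measures listed above, shortest paths and centrality, can be sketched with nothing more than the standard library. The toy edge list is made up, and degree centrality is only one of several centrality definitions; it is used here because it is the simplest to state.

```python
from collections import deque

# A made-up undirected graph, given as a list of edges.
EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("E", "D"), ("B", "E")]

def neighbors_of(edges):
    """Build an adjacency map for an undirected graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def shortest_path(adj, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(adj[node]):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

ADJ = neighbors_of(EDGES)
print(shortest_path(ADJ, "A", "D"))  # ['A', 'E', 'D']
print(degree_centrality(ADJ)["B"])   # 0.75
```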

As science is coming to appreciate, graphs can represent any extant structure or schema. This gives graphs a universal character in terms of analytic tools. Further, many structures can only be represented by graphs.

Information and Interaction is Distributed

The nature of knowledge is such that relevant information is everywhere. Further, because of the interconnectedness of things, we can also appreciate that external information needs to be integrated with internal information. Meanwhile, the nature of the world is such that users and stakeholders may be anywhere.

These observations suggest a knowledge representation architecture that needs to be truly distributed. Both sources and users may be found in multiple locations.

In order to preserve existing information assets as much as possible (see further below) and to codify the earlier observation regarding the broad diversity of data formats, the resulting knowledge architecture should also attempt to put in place a thin layer or protocol that provides uniform access to any source or target node on the physical network. A thin, uniform abstraction layer – with appropriate access rights and security considerations – means knowledge networks may grow and expand at will at acceptable costs with minimal central coordination or overhead.

Properly designed, then, such architectures are not only necessary to represent the distributed nature of users and knowledge, but can also facilitate and contribute to knowledge development and exchange.

The Web is the Perfect Medium

The items above suggest the Web as an appropriate protocol for distributed access and information exchange. When combined with the following considerations, it becomes clear that the Web is the perfect medium for knowledge networks:

Potentially, all information may be accessed via the Web
All information may be given unique Web identifiers (URIs)
All Web tools are available for use and integration
All Web information may be integrated
Web-oriented architectures (WOA) have proven:
Scalability
Robustness
Substitutability
Most Web technologies are open source.

It is not surprising that the largest extant knowledge networks on the globe – such as Google, Wikipedia, Amazon and Facebook – are Web-based. These pioneers have demonstrated the wisdom of WOA for cost-effective scalability and universal access.

The combination of RDF with Web identifiers also means that any and all information from a given knowledge repository may be exposed and made available to others as linked data. This approach makes the Web a global, universal database. And it is in keeping with the general benefits of integrating external information sources.

Leveraging – Not Replacing – Existing IT Assets

Existing IT assets represent massive sunk costs, legacy knowledge and expertise, and (often) stakeholder consensus. Yet, these systems are still largely stovepiped.

Strategies that counsel replacement of existing IT systems risk wasting existing assets and are therefore unlikely to be adopted. Ways must be found to leverage the value already embodied in these systems, while promoting interoperability and integration.

The beauty of semantic technologies – properly designed and deployed in a Web-oriented architecture – is that a thin interoperability layer may be placed over existing IT assets to achieve these aims. The knowledge graph structure may be used to provide the semantic mappings between schema, while the Web service framework that is part of the WOA provides the source conversion to the canonical RDF data model.
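The thin layer described here can be sketched as a pair of declarative mappings applied to untouched legacy records. Everything in the example is hypothetical: the two source systems, their field names, the `ex:` predicates and the `to_triples` helper.

```python
# Two made-up legacy sources that describe the same customer differently.
CRM_ROWS = [{"cust_id": "17", "full_name": "Jane Roe", "town": "Oslo"}]
BILLING_ROWS = [{"customer": "17", "name": "Jane Roe", "balance": "120.50"}]

# Mappings from each source's local schema to shared predicates.
CRM_MAP = {"id": "cust_id", "fields": {"full_name": "ex:name", "town": "ex:city"}}
BILLING_MAP = {"id": "customer", "fields": {"name": "ex:name", "balance": "ex:balance"}}

def to_triples(rows, mapping, base="ex:customer/"):
    """Convert source rows to canonical triples without altering the source."""
    triples = set()
    for row in rows:
        subject = base + row[mapping["id"]]
        for field, predicate in mapping["fields"].items():
            if field in row:
                triples.add((subject, predicate, row[field]))
    return triples

merged = to_triples(CRM_ROWS, CRM_MAP) | to_triples(BILLING_ROWS, BILLING_MAP)
for triple in sorted(merged):
    print(triple)
```

The legacy rows stay where they are; only the mappings are new, and the name that both systems agree on collapses into a single statement.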

Via these approaches, prior investments in knowledge, information and IT assets may be preserved while enabling interoperability. The existing systems can continue to provide the functionality for which they were originally designed and deployed. Meanwhile, the KR-related aspects may be exposed and integrated with other knowledge assets on the physical network.

Democratizing the Knowledge Function

These kinds of approaches represent a fundamental shift in power and roles with respect to IT in the enterprise. IT departments and their bottlenecks in writing queries and bespoke application development can now be bypassed; the departments may be relegated to more appropriate support roles. Developers and consultants can now devote more of their time to developing generic applications driven by graph structures [4].

In turn, the consumers of knowledge applications – namely subject matter experts, employees, partners and stakeholders – now become the active contributors to the graphs themselves, focusing on reconciling terminology and ensuring adequate entity and concept coverage. Knowledge graphs are relatively straightforward structures to build and maintain. Those who rely on them can also be those who take the lead role in building and maintaining them.

Thus, graph-driven applications can be made generic by function with broader and more diverse information visualization capabilities. Simple instructions in the graphs can indicate what types of information can be displayed with what kind of widget. Graph-driven applications also mean that those closest to the knowledge problems will also be those directly augmenting the graphs. These changes act to democratize the knowledge function, and lower overall IT costs and risks.
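As a rough sketch of what “simple instructions in the graphs” might look like, the example below stores display hints as ordinary assertions and lets a generic function pick a widget for each attribute. The `ex:displayWith` predicate, the widget names and the sample values are all invented for illustration.

```python
# Made-up graph: the first two assertions are display instructions,
# the rest are data about one hypothetical entity.
TRIPLES = [
    ("ex:population", "ex:displayWith", "bar_chart"),
    ("ex:location", "ex:displayWith", "map"),
    ("ex:townA", "ex:population", 1200),
    ("ex:townA", "ex:location", "59.9,10.7"),
    ("ex:townA", "ex:motto", "Onward"),
]

def widgets_for(subject, triples):
    """Pair each attribute of a subject with the widget the graph suggests."""
    display = {s: o for s, p, o in triples if p == "ex:displayWith"}
    return {
        p: (display.get(p, "text"), o)  # fall back to plain text
        for s, p, o in triples
        if s == subject and p != "ex:displayWith"
    }

for attribute, (widget, value) in widgets_for("ex:townA", TRIPLES).items():
    print(attribute, widget, value)
```

Changing how an attribute is displayed then means editing one assertion in the graph rather than the application code.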

Seven Pillars of the Semantic Enterprise

Seven Pillars of an Open Semantic Enterprise

Elsewhere we have discussed the specific components that go into enabling the development of a semantic enterprise, what we have termed the seven pillars [5]. Most of these points have been covered to one degree or another in the discussion above.

There are off-the-shelf starter kits for enterprises to embrace to begin this process. The major starting requirements are to develop appropriate knowledge graphs (ontologies) for the given domain and to convert existing information assets into appropriate interoperable RDF form.

Beyond that, enterprise staff may be readily trained in the use and growth of the graphs, and in the staging and conversion of data. With an appropriate technology transfer component, these semantic technology systems can be maintained solely by the enterprise itself without further outside assistance.

Summary of Semantic Technology Benefits

Unlike conventional IT systems with their closed-world approach, semantic technologies that adhere to these guidelines can be deployed incrementally at lower cost and with lower risk. Further, we have seen that semantic technologies offer an excellent integration approach, with no need to redo schemas when circumstances change. The approach further leverages existing information assets and brings the responsibility for the knowledge function more directly to its users and consumers.

Semantic technologies are thus well-suited for knowledge applications. With their graph structures and the ability to capture semantic differences and meanings, these technologies can also accommodate multiple viewpoints and stakeholders. They also have excellent capabilities for bringing all available information – from documents and images and metadata to tables and databases – onto a common footing.

These advantages will immediately accrue through better integration and interoperability of diverse information assets. But, for early adopters, perhaps the most immediate benefit will come from visible leadership in embracing these enabling technologies in advance of what will surely become the preferred approach to knowledge problems.

[1] For more on the open world assumption (OWA), see the various entries on this topic on Michael Bergman’s AI3:::Adaptive Information blog. A search of that blog for ‘open world assumption’ is a good way to discover more.

[2] M.K. Bergman, 2009. Advantages and Myths of RDF, white paper from Structured Dynamics LLC, April 22, 2009, 13 pp. See https://www.mkbergman.com/wp-content/themes/ai3v2/files/2009Posts/Advantages_Myths_RDF_090422.pdf.

[3] Google Knowledge Graph; see http://www.google.com/insidesearch/features/search/knowledge.html.

[4] For the most comprehensive discussion of graph-driven apps, see M.K. Bergman, 2011. “Ontology-Driven Apps Using Generic Applications,” posted on the AI3:::Adaptive Information blog, March 7, 2011. You may also search on that blog for ‘ODapps’ to see related content.

[5] M.K. Bergman, 2010. “Seven Pillars of the Open Semantic Enterprise,” in AI3:::Adaptive Information blog, January 12, 2010; see https://www.mkbergman.com/859/seven-pillars-of-the-open-semantic-enterprise/.

Author: Mike Bergman