Posted: February 11, 2013

The Semantic Enterprise Part 5 in the Enterprise-scale Semantic Systems Series

We become such captives to our language and what we think we are saying. Many basic words or concepts, such as “search,” seem to fit into that mould. A weird thing with “search” is that twenty years ago the term and its prominence were quite different. Today, “search” is ubiquitous and for many its embodiment is Google such that “to google” is often the shorthand for “to search”.

When we actually do “search”, we submit a query. The same extension that has inveigled its tendrils into search has also caused the idea of a query to become synonymous with standard text (Google) search.

But, there’s more, much more, both within the meaning and the execution of “to search”.

Enterprises, familiar with structured query language (SQL), have understood for quite some time that queries and search are more than text searches submitted to search engines. Semantic technologies have their own structured query approach, SPARQL. Enterprises know the value of search from discovery to research and navigation. And, they also intuitively know that they waste much time and don’t often get what they want from search. U.S. businesses alone could derive $33 billion in annual benefits from being better able to re-find previously discovered Internet information, an amount equal to $10 million per firm for the 1,000 largest businesses [1]. And those benefits, of course, are only for Internet searches. There are much larger costs arising from unnecessary duplicated effort because of weaknesses in internal search [1].

The thing that’s great about semantic search, done right, is that it combines conventional text search with structured search, adds more goodies, and basically overcomes most current search limitations.

Many Kinds of Search

The Webster definition of “search” is “to look into or over carefully or thoroughly in an effort to find or discover something.”

There are two telling aspects to this definition. First, search may be either casual or careful, from “looking” into something to being “thorough”. Second, search may have as its purpose finding or discovery. Finding, again, implies purpose or research. Discovery can range from serendipity to broadening one’s understanding or horizons given a starting topic.

Prior to relational systems, network databases represented the state-of-the-art. One of E.F. Codd’s stated reasons in developing the relational approach and its accompanying SQL query language was to shift the orientation of databases from links and relationships (the network approach) to query and focused search [2]. By virtue of the technology design put forward, relational databases shifted the premise to structured information and direct search queries. Yet, as noted, this only represents the purposeful end of the search spectrum; navigation and discovery now become secondary.

Text search and (text) search engines then came to the fore, offering a still-different model of indexing and search. Each term became a basis for document retrieval, leading to term-based means of scoring (the famous Salton TF/IDF statistical model), but with no actual understanding of the semantic structure or meaning of the document. Other term-based retrieval bases, such as latent semantic indexing, were put forward, but these were based on the statistical relationships between terms in documents, and not the actual meaning of the text or natural language within the documents.

What we see in the early evolution of “search” is kind of a fragmented mess. Structured search swung from navigation to purposeful queries. Text search showed itself to be term-based and reliant on Boolean logic. Each approach and information store thus had its own way to represent or index the data and a different kind of search function to access it. Web search, with its renewal of links and relationships, further shifted the locus back to the network model.

State-of-the-art semantic search, as practiced by Structured Dynamics, has found a way to combine these various underlying retrieval engines with the descriptive power of the graph and semantic technologies to provide a universal search mechanism across all types of information stores. We describe this basis more fully below, but what is important to emphasize at the outset is that this approach fundamentally addresses all aspects of search within the enterprise. As a compelling rationale for trying and then adopting semantic technologies, semantic search is the primary first interest for most enterprises.

Unique Advantages to Semantic Search

The first advantage of semantic search is that all content within the organization can be combined and searched at once. Structured stuff . . . documents . . . image metadata . . . databases . . . can now all be characterized and put on an equivalent search footing. As we just discussed in text as a first class citizen, this power of indexing all content types is the real dynamo underneath semantic search.

The universality of search means that being able to search all available content is awesome. But, being able to add the dimensions of relationships between things means that the semantic graph takes information exploration to a totally new level.

The simplest way to understand semantic search is to de-construct the basic RDF triple down to its fundamentals. This first tells us that the RDF data model is able to represent any thing, that is, an object or idea. And, we can represent that object in virtually any way that any viewer would care to describe it, in any language. Do we want it to be big, small? blue, green? meaningful, silly? smart, stupid? The data model allows this and more. We can capture how diverse users describe the same thing in diverse ways.

But, now that I have my world populated with things and descriptions of them, how do they connect? What are the relationships between these things? It is the linkages — the connections, the relationships — between things that give us context, the basis for classifying, and as a result, the means to ascertain the similarity or adjacency of those things. These sorts of adjacencies then enable us to understand the “pattern” of the thing, which is ultimately the real basis for organizing our world.

The rich brew of things (“nouns”) and the connections between them (“verbs”) starts to give our computers a basis for describing the world more akin to our real language. It is not perfect, and even if it were, it would still suffer from the communication challenges that occur between all of us as humans. Language itself is another codified way of transmitting messages, which will always suffer some degree of loss [3]. But in this comment we can also glean a truth: humans interacting with their computer servants will be more effective the more “human” their interfaces are. And this truth can also give us some insight into what search must do.

First, we are interested in classifying and organizing things. The idea of “facets”, the arrangement of search results into categories based on indexed terms, is not a new one in search. In conventional approaches, “facets” are approached as a kind of dimension, one that is purposefully organized, sometimes hierarchically. In Web interfaces, facets most often appear as a listing in a left-hand column from which one or more of these dimensions might be selected, sometimes with a count number of potential results after the facet or sometimes with a checkbox or such by which multiple of these facets might be combined. In essence, these facets act as structural or classificatory “filters” for the content at hand. This is made potentially more powerful when also combined with basic keyword search.

In semantic search, facets may be derived from not only what types of things exist in the search space, but also what kinds of attributes (or properties) connect them. And, this all comes for free. Unlike conventional faceting, no one needs to decide what the important “dimensions” are. With semantic search, the very basis of describing the domain at hand creates an organization of all things in the space. As a result, this combination of entities and properties leads to what could be called “global faceting”. The structure of how the domain is described is the sole basis required to gain these facets and to make them universal to the information space.
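To make the idea of “global faceting” concrete, here is a minimal sketch in Python. It assumes the information space is just a list of (subject, predicate, object) triples; the identifiers, property names and the `rdf:type` shorthand are all illustrative and not taken from any actual deployment. The point is only that the facets fall out of the descriptions themselves, with no one choosing dimensions in advance.

```python
from collections import Counter

# Hypothetical triples (subject, predicate, object); every name here is illustrative.
TRIPLES = [
    ("doc1", "rdf:type", "Report"),
    ("doc1", "author", "alice"),
    ("doc1", "topic", "search"),
    ("doc2", "rdf:type", "Report"),
    ("doc2", "topic", "ontologies"),
    ("db_row7", "rdf:type", "Customer"),
    ("db_row7", "region", "EMEA"),
]

def global_facets(triples):
    """Count things per type and statements per property, using only the triples."""
    type_facets = Counter()
    property_facets = Counter()
    for _subject, predicate, obj in triples:
        if predicate == "rdf:type":
            type_facets[obj] += 1
        else:
            property_facets[predicate] += 1
    return type_facets, property_facets

def filter_by(triples, predicate, obj):
    """Return the subjects that carry the given predicate/object pair."""
    return {s for s, p, o in triples if p == predicate and o == obj}

type_facets, property_facets = global_facets(TRIPLES)
reports = filter_by(TRIPLES, "rdf:type", "Report")
```

Selecting a facet is then just another filter over the same triples, which is why documents and database records can share one set of facets.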

Whoa! How did that happen? All we did is describe our information space, but now we have all of this rich structure. This is when the first important enterprise realization sets in: how we describe the information in our domain is the driving, critical factor. Semantic search is but easy pickings from this baseline. What is totally cool about the nature of semantic search is that slicing-and-dicing would put a sushi restaurant to shame. Every property represents a different pathway; and every entry (node) is an entry point.

Second, because we have based all of this on an underlying logic model in description logics, we gain a huge Archimedes’ lever over our information space. We do not need to state all of the relationships and organizations in our information space. We can infer them from the assertions already made. Two parents have a child? That child has a sibling? Then, we can infer the second child also has the same parents. The “facts” that one might assume about a given domain can grow by 10x or more when inference is included.
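The parents-and-siblings example can be sketched as a single forward-chaining rule. This is a toy in plain Python, not how a description logic reasoner works internally; the names and the `hasParent` / `hasSibling` predicates are assumptions for illustration, and the rule itself is the simplified one from the paragraph above (it ignores cases such as half-siblings).

```python
def infer_parents(facts):
    """Forward-chain one rule until no new facts appear:
    (x hasSibling y) and (x hasParent p)  =>  (y hasParent p)."""
    facts = set(facts)
    while True:
        new = {
            (y, "hasParent", p)
            for (x, rel1, y) in facts if rel1 == "hasSibling"
            for (x2, rel2, p) in facts if rel2 == "hasParent" and x2 == x
        }
        if new <= facts:  # nothing new was derived, so we have reached a fixpoint
            return facts
        facts |= new

# Asserted facts only: three statements about ann, nothing about bob's parents.
ASSERTED = {
    ("ann", "hasParent", "mary"),
    ("ann", "hasParent", "john"),
    ("ann", "hasSibling", "bob"),
}

inferred = infer_parents(ASSERTED) - ASSERTED
```

Three asserted statements yield two more that nobody had to enter, which is the kind of growth in “facts” the paragraph describes.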

Now we can begin to see where the benefits and return from semantic search become evident. Semantic search also enables a qualitatively different content enrichment: we can use these broad understandings of our content to do better targeting, tagging, highlighting or relating concepts to one another. The fact that semantic search is simply a foundation to semantic publishing is noteworthy. We will discuss this topic in a later part of this series.

SD’s Approach: RDF Triple Store + Solr + OWLAPI

In recognition of the primacy of search, we at Structured Dynamics were one of the first in the semantic Web community to add Solr (based on Lucene) full-text indexing to the structured search of an RDF triple store [4]. We later added the OWL API to gain even more power in our structured queries [5]. These three components give us the best of unstructured and structured search, and enable us to handle all kinds of search with additional flexibility at scale. Since we historically combined RDF and Solr first, let’s discuss it first.

We first adopted Solr because traditional text search of RDF triple stores is not sufficiently performant and makes it difficult to retrieve logical (user) labels in place of the URIs used in semantic technologies. While RDF and its graph model provide manifest benefits (see below), text search is a relatively mature technology and Solr provided commercial-grade features and performance in an open source option.

In our design, the triple store is the data orchestrator. The RDF data model and its triple store are used to populate the Solr schema index. The structural specifications (schema) in the triple store guide the development of facets and dynamic fields within Solr. These fields and facets in Solr give us the ability to gain Solr advantages such as aggregates, autocompletion, spell checkers and the like. We also are able to capture the full text if the item is a document, enabling standard text search to be combined with the structural aspects orchestrated from the RDF. On the RDF side, we can leverage the schema of the underlying ontologies to also do inferencing (via forward chaining). This combination gives us an optimal search platform to do full-text search, aggregates and filtering.
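As a rough illustration of the triple store acting as data orchestrator, the sketch below flattens triples into one flat document per subject, the shape a full-text index expects. It is not Structured Dynamics’ actual indexing code. The `_ss` and `_txt` suffixes are an assumed dynamic-field naming pattern that would have to be declared in the Solr schema, and the `fullText` property name is hypothetical.

```python
from collections import defaultdict

def triples_to_docs(triples, text_properties=("fullText",)):
    """Group triples by subject into flat documents for a full-text index.

    Properties listed in text_properties become single text fields (suffix _txt);
    every other property becomes a multi-valued string field (suffix _ss).
    Both suffixes are assumptions that the index schema would need to define.
    """
    docs = defaultdict(dict)
    for subject, predicate, obj in triples:
        doc = docs[subject]
        doc["id"] = subject
        if predicate in text_properties:
            doc[predicate + "_txt"] = obj
        else:
            doc.setdefault(predicate + "_ss", []).append(obj)
    return list(docs.values())

docs = triples_to_docs([
    ("doc1", "type", "Report"),
    ("doc1", "topic", "search"),
    ("doc1", "topic", "ontologies"),
    ("doc1", "fullText", "Semantic search combines text and structure."),
])
```

Because the field names are derived from the properties in the triples, a new property in the ontology shows up as a new indexed field without anyone editing a fixed list of columns.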

Since our initial adoption of Solr, and with Solr’s own continued growth, we have been able to (more-or-less) seamlessly embrace geo-locational based search, time-based search, the use of multiple search profiles and ranking and scoring approaches (using Solr’s powerful extended DisMax, or edismax, parser) and other advantages. We now have nearly five years of experience of the RDF + Solr combination. We continue to discover new functionality and power in this combination. We are extremely pleased with this choice.

On the structured data side, RDF and its graph model have many inherent advantages, as earlier described. One of those advantages is the graph structure itself:

[Figure: Example Taxonomy Structure and Example Ontology Structure]
A distinguishing characteristic of ontologies compared to conventional hierarchical structures is their degree of connectedness, their ability to model coherent, linked relationships.

Another advantage over conventional structured search (SQL) with relational databases is performance. For example, as Rik Van Bruggen recently explained [6], RDBMS searches that need to obtain information from more than one table require a “join.” The indexes in all applicable tables need to be scanned recursively to find all the data elements fitting the query criteria. Conversely, in a graph database, the index needs only be accessed once to find the starting point in the graph, after which the relationships in the graph are “walked” to traverse the graph to find the next applicable data elements. The need for complete scans is what makes “joins” expensive computationally. Graph queries are incredibly fast because index lookups are hugely reduced.

Queries that experienced DBAs with relational databases would never attempt because of the excessive need for joins are trivial in a graph search.
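The “walk” can be illustrated with a small breadth-first traversal. In this sketch a Python dictionary stands in for the stored adjacency of a graph database, so it only mimics the access pattern (find the start node, then follow relationships hop by hop) and says nothing about real performance. The node and relationship names are made up.

```python
from collections import deque

# Hypothetical adjacency index: node -> list of (relationship, neighbor).
GRAPH = {
    "alice": [("worksFor", "acme")],
    "acme": [("locatedIn", "berlin"), ("supplies", "globex")],
    "globex": [("locatedIn", "paris")],
    "berlin": [],
    "paris": [],
}

def walk(graph, start, max_hops):
    """Breadth-first walk from start; returns {node: hops from start}."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for _relationship, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen

reachable = walk(GRAPH, "alice", max_hops=2)
```

Each extra hop here is one more step along stored relationships, whereas the relational equivalent would add another join per hop.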

Various graph databases provide canned means for traversing or doing graph-based operations. And that brings us to the second addition we made to the RDF triple store: inclusion of the OWL API. While it is true that our standard triple store, Virtuoso, has support for simple inferencing and forward chaining, the fact that our semantic technologies are based on OWL 2 means that we can bring more power to bear with an ontology-specific API, including reasoners. The OWL API allows all or portions of the ontology specification to be manipulated separately, with a variety of serializations. Changes made to the ontology can also be tested for validity. Most leading reasoners can interact directly with the API. Protégé 4 also interacts directly with the API, as can various rules engines. Additionally, other existing APIs, notably the Alignment API with its own mapping tools and links to other tools such as S-Match, can interact with the OWL API.

Thus, besides the advantages of RDF and graph-based search, we can now reason over and manipulate the ontologies themselves to bring even more search power to the system. Because of the existing integrations between the triple store and Solr, these same retrieval options can also be used to inform Solr query retrievals.

Shaking Hands with the Enterprise

On the face of it, a search infrastructure based on three components (triple store + Solr + OWL API) appears more complex than a single solution. But, enterprises already have search provided in many different guises involving text or SQL-based queries. Structured Dynamics now has nearly five years of experience with this combined search configuration. Each deployment results in better installation and deployment procedures, including scripting and testing automation. The fact there are three components to the search stack is not really the challenge for enterprise adoption.

This combined approach to search really poses two classes of challenges to the enterprise. The first, and more fundamental one, is the new mindset that semantic search requires. Facets need to be understood and widely embraced; graphs and graph traversals are quite new concepts; full incorporation of tagging to make text a first-class citizen with structured search needs to be embraced; and, the pivotal role of ontologies in driving the whole structural understanding of the domain and all the various ways to describe it means a shift in thinking from dedicated applications for specific purposes to generic ontology-driven applications. These new mindsets require concerted knowledge transfer and training. Many of the new implementers are now the subject matter experts and content editors within the enterprise, rather than developers. Dedicated effort is also necessary — and needs to be continually applied — to enable ontologies to properly and adaptively capture the enterprise’s understanding of its applicable domain.

These are people-oriented aspects that require documentation, training materials, tools and work processes. These topics, actually some of the most critical to our own services, are discussed in later parts to this ESSS series.

The second challenge is in the greater variability and diversity of the “dials and knobs” now available to the enterprise to govern how these search capabilities actually work. The ranking of search results can now embrace many fields and attributes; many different types of content; and potentially different contexts. Weights (or “boosts” in Solr terms) can be applied to every single field involved in a search. Fields may be included or excluded in searches, thereby acting as filters. Different processors or parsers may be applied to handle such things as text case (upper or lower), stemming for dealing with plurals and variants, spelling variants such as between British and American English, whether or not to invoke synonyms, handling multiple languages, and the like.
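A small sketch of what a few of these “dials and knobs” look like in practice: the function below assembles request parameters for Solr’s edismax parser, using the standard `defType`, `q`, `qf` (per-field boosts) and `fq` (filter query) parameters. The field names (`prefLabel`, `altLabel`, `fullText`, `type`) and the boost values are assumptions chosen for illustration; a real deployment would use whatever fields its own schema defines.

```python
from urllib.parse import urlencode

def build_search_params(text, boosts, filters=None):
    """Assemble query parameters for Solr's edismax parser.

    boosts  maps field name -> weight, rendered into the qf parameter.
    filters maps field name -> required value, rendered as fq parameters.
    """
    params = [
        ("defType", "edismax"),
        ("q", text),
        ("qf", " ".join(f"{field}^{weight}" for field, weight in boosts.items())),
    ]
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return params

# Hypothetical search profile: preferred labels count most, full text least.
params = build_search_params(
    "semantic search",
    boosts={"prefLabel": 5, "altLabel": 2, "fullText": 1},
    filters={"type": "Report"},
)
query_string = urlencode(params)
```

Keeping the boosts in a plain mapping like this is one way to let a manager compare two search profiles by changing numbers rather than code.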

This level of control means that purposeful means and frameworks must be put in place that enable responsible managers in the enterprise to decide such settings. Understanding of these “dials and knobs” must therefore also be transferred to the enterprise. Then, easily used interfaces for changing and selecting options and then comparing the results of those changes must be embedded in tools and transferred. (This latter area is quite exciting and one area of innovation SD will be reporting on in the near future.)

The Productivity Benefits

There are actually many public Web sites that are doing fantastic and admirable jobs of bringing broad, complicated, structured search to users, all without much if any semantic technologies in the back end. Some quick examples that come to mind are Trulia in real estate; Fidelity in financial products; Amazon in general retail, etc. One difficulty that semantic search has in comparison to the alternatives is that first-blush inspection of Web sites may not show many large differences.

The real advantages from semantic search come in its productivity and flexibility. Semantic search frameworks are easier to construct, easier to extend, easier to modify and cheaper to build. Semantic search frameworks are inherently robust. Adding entirely new domains of scope (say, moving from a department level to the entire enterprise or accommodating a new acquisition) can be implemented in a fraction of the time without the need for rework.

It will be necessary to document the use case experience of early adopting enterprises to quantify these productivity and flexibility benefits. From Structured Dynamics’ experience, however, these advantages are in the range of one to two orders of magnitude in reduced deployment and maintenance costs compared to RDBMS-based approaches.

The Tie-in with Semantic Publication

Another hot topic of late has been “semantic publishing,” which is of keen interest to media and content-intensive sites on the Web. What is interesting about semantic publishing, however, is that it is completely founded on semantic search. All of the presentation or publishing of content in the interface (or in an exported form) is the result of search. Remember, due to Structured Dynamics’ semantic technology design with its structWSF interfaces, all interaction with the underlying engines and system occurs via queries.

We will be talking much about semantic publishing toward the conclusion of this series. We will cover content enrichment, new kinds of products such as topic pages and semantic forms and widgets, and the fact that semantic publishing is available almost for “free” when your stack is based on semantic technologies with semantic search, SD-style.

NOTE: This is part of an ongoing series on enterprise-scale semantic systems (ESSS), which has its own category on this blog. Simply click on that category link to see other articles in this series.

[1] M.K. Bergman, 2004. “Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents,” BrightPlanet Corporation White Paper, December 2004, 41 pp. Published on this blog at
[2] See, for instance, the Wikipedia entry on the historical development of databases.
[3] M.K. Bergman, 2012. “What is Structure?,” AI3:::Adaptive Information blog, May 28, 2012.
[4] F. Giasson, 2009. “RDF Aggregates and Full Text Search on Steroids with Solr,” Fred Giasson’s blog, April 9, 2009.
[5] M.K. Bergman, 2010. “A New Landscape in Ontology Development Tools,” AI3:::Adaptive Information blog, September 7, 2010.
[6] See, for example, Rik Van Bruggen, 2013. “Demining the ‘Join Bomb’ with Graph Queries,” Neo4J blog, January 28, 2013.

Posted: January 28, 2013

The Semantic Enterprise Part 4 in the Enterprise-scale Semantic Systems Series

Text, text everywhere, but no information to link!

For at least a quarter of a century the amount of information within an enterprise embedded in text documents has been understood to be on the order of 80%; more recent estimates put that contribution at 90%. But, whatever the number, or no matter how you slice it, the percentage of information in documents has been overwhelming for enterprises.

The first documentation systems, Documentum being a notable pioneer, helped keep track of versions and characterized their document stores with some rather crude metadata. As document management systems evolved, and enterprise search became a go-to application in its own right, full-text indexing and search were added to characterize the document store. Search allowed better access and retrieval of those documents, but still kept documents as a separate information store from the true first citizens of information in enterprises: structured databases.

That is now changing — and fast. Particularly with semantic technologies, it is now possible to “tag” or characterize documents not only in terms of administrative and manually assigned tags, but with concepts and terminology appropriate to the enterprise domain.

Early systems tagged with taxonomies or thesauri of controlled vocabulary specific to the domain. Larger enterprises also often employ MDM (master data management) to help ensure that these vocabularies are germane across the enterprise. Yet, even still, such systems rarely interoperate with the enterprises’ structured data assets.

Semantic technologies offer a huge leverage point to bridge these gaps. Being able to incorporate text as a first-class citizen into the enterprise’s knowledge base is a major rationale for semantic technologies.

Explaining the Basis

Let’s start with a couple of semantic givens. First, as I have explained many times on this blog, ontologies (that is, knowledge graphs) can capture the rich relationships between things for any given domain. Second, this structure can be more fully expressed via expanded synonyms, acronyms, alternative terms, alternative spellings and misspellings, all in multiple languages, to describe the concepts and things represented in this graph (a construct we have called “semsets”). That means that different people talking about the same thing with different terminology can communicate. This capability is an outcome from following SKOS-based best practices in ontology construction.

Then, we take these two semantic givens and stir in two further ingredients from NLP. We first prepare the unstructured document text with parsing and other standard text processing. These steps are also a precursor to search; they provide the means for natural language processing to obtain the “chunks” of information in documents as structured data. Then, using the ontologies with their expanded SKOS labels, we add the next ingredient of OBIE (ontology-based information extraction) to automatically “tag” candidate items in the source text.
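The tagging step can be sketched with nothing more than label matching. Real OBIE pipelines do considerably more (tokenization, disambiguation, scoring), so treat this as a minimal illustration only. The concept identifiers and labels in `SEMSETS` are invented for the example; they stand in for the expanded SKOS labels attached to concepts in the ontology.

```python
import re

# Hypothetical semsets: concept identifier -> alternative labels for that concept.
SEMSETS = {
    "concept:MasterDataManagement": ["master data management", "MDM"],
    "concept:Ontology": ["ontology", "ontologies", "knowledge graph"],
}

def candidate_tags(text, semsets):
    """Return {concept: sorted labels found in text}, matching whole words or phrases."""
    found = {}
    for concept, labels in semsets.items():
        hits = [
            label for label in labels
            if re.search(r"\b" + re.escape(label) + r"\b", text, flags=re.IGNORECASE)
        ]
        if hits:
            found[concept] = sorted(hits)
    return found

text = "Larger enterprises use MDM, but ontologies link text to structured data."
tags = candidate_tags(text, SEMSETS)
```

The output is a set of candidates keyed by concept rather than by surface string, so “MDM” and “master data management” would both lead to the same tag.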

Editors are presented these candidates to accept or not, plus to add others, in review interfaces as part of the workflow. The result is the final subject “tags” assignment. Because it is important to tag both subject concepts and named entities in the candidate text, Structured Dynamics calls this approach “scones”. We have reusable structures and common terminology and syntax (irON) as canonical representations of these objects.

Add Conventional Metadata

Of course, what a document is about is not the only descriptive information you would want to assign to it. Much other structural information describing the document goes beyond its subject matter.

Some of this information relates to what the document is: its size, its format, its encoding. Some of this information relates to provenance: who wrote it? who published it? when? when was it revised? And, some of this information relates to other descriptive relationships: where to download it? a picture of it; other formats of it. Of course, any additional information useful to describe the document can be also tagged on at this point.

This latter category is quite familiar to enterprise information architects. These metadata characterizations have been what is common for standard document management systems reaching back for three decades or more now.

So, naturally, this information has stood the test of time and also must have a pathway for getting assigned to documents. What is different is that all of this information can now be linked into a coherent knowledge graph of the domain.

Some Interface and Workflow Considerations

What we are seeking is a framework and workflow that naturally allows all existing and new documents to be presented through a pipeline that extends from authoring and review to metadata assignments. This workflow and the user interface screens associated with it are the more difficult aspects of the challenge. It is relatively straightforward to configure and set up a tagger (though, of course, better accuracy and suitability of the candidate tags can speed overall processing time). Making final assignments for subject tags from the candidates and then ensuring all other metadata are properly assigned can be either eased or impeded by the actual workflows and interfaces.

The trick to such semi-automatic processes is to get these steps right. There is a need for manual overrides when the suggested, candidate tags are not right. Sometimes new terms and semset entries are found when reviewing the processed documents; these need to be entered and then placed into the overall domain graph structure as discovered. The process of working through steps on the tag processing screens should be natural and logical. Some activities benefit from very focused, bespoke functionality, rather than calling up a complicated or comprehensive app.

In enterprise settings these steps need to be recorded, subject to reviews and approvals, and with auditing capabilities should anything go awry. This means there needs to be a workflow engine underneath the entire system, recording steps and approvals and enabling things to be picked up at any intermediate, suspended point. These support requirements tend to be unique to each enterprise; thus, an underlying workflow system that can be readily modified and tailored — perhaps through scripting or configuration interfaces — is favored. Since Drupal is our standard content and user interface framework, we tend to favor workflow engines like State Machine over more narrow, out-of-the-box setups such as the Workflow module.
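The requirements in the paragraph above (recorded steps, approvals, an audit trail, the ability to resume from any intermediate state) map naturally onto a small state machine. The sketch below is generic Python, not the API of the Drupal State Machine or Workflow modules; the states, actions and user names are hypothetical, and a real engine would persist the state and the log.

```python
# Hypothetical states and actions; an enterprise would tailor this table.
TRANSITIONS = {
    ("draft", "submit"): "in_review",
    ("in_review", "approve"): "approved",
    ("in_review", "reject"): "draft",
}

class TaggingWorkflow:
    """Tracks one document through tag review and records every step."""

    def __init__(self, doc_id):
        self.doc_id = doc_id
        self.state = "draft"
        self.audit_log = []  # entries are (user, action, from_state, to_state)

    def apply(self, user, action):
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"{action!r} is not allowed from state {self.state!r}")
        new_state = TRANSITIONS[key]
        self.audit_log.append((user, action, self.state, new_state))
        self.state = new_state
        return new_state

wf = TaggingWorkflow("doc1")
wf.apply("editor", "submit")
wf.apply("reviewer", "reject")   # sent back for corrected tags
wf.apply("editor", "submit")
wf.apply("reviewer", "approve")
```

Because the allowed transitions live in a plain table, tailoring the process for a given enterprise means editing configuration rather than rewriting the engine.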

These screens and workflows are not integral to the actual semantic framework that governs tagging, but are essential complements to it. It is but another example of how the semantic technologies in an enterprise need to be embedded and integrated into a non-semantic environment (see the prior architecture piece in this series).

But, Also Some Caveats

Yet, what we have described above is the technology and process of assigning structured information to documents so that they can interoperate with other data in the enterprise. Once linked into the domain’s knowledge graph and once characterized by the standard descriptive metadata, there is now the ability to search, slice, filter, navigate or discover text content just as if it were structured data. The semantic graph is the enabler of this integration.

Thus, the entire ability of this system to work derives from the graph structure itself. Creating, populating and maintaining these graph structures can be accomplished by users and subject matter experts from within the enterprise, but that requires new training and new skills. It is impossible to realize the benefits of semantic technologies without knowledgeable editors to maintain these structures. Because of its importance, a later part in this series deals directly with ontology management.

While ontology development and management are activities that do not require programming skills or any particular degrees, they do not happen by magic. Concepts need to be taught; tools need to be mastered; and responsibilities need to be assigned and overseen to ensure the enterprise’s needs are being met. It is exciting to see text become a first-class information citizen in the enterprise, but like any purposeful human activity, success ultimately depends on the people involved.


Posted: January 14, 2013

The Semantic Enterprise Part 2 in the Enterprise-scale Semantic Systems Series

Those involved with the semantic Web are passionate as to why they are involved. This passion and the articulateness behind it are notable factors in why there is indeed a ‘semantic Web community.’ Like few other fields — perhaps including genomics or 3D manufacturing — semantic technologies tend to attract exceptionally smart, committed and passionate people.

Across this spectrum of advocates there are thousands of pages of PDFs and academic treatises as to semantic this or semantic that. There is gold in these hills, and much to mine. But, both in grants and in approaching customers, it always comes down to the questions of: What is the argument for semantic technologies? What are the advantages of a semantic approach? What is the compelling reason for spending time and money on semantics as opposed to alternatives?

Fred Giasson and I at Structured Dynamics feel we have done a pretty fair job of answering these questions. Of course, it is always hard to prove a negative — how do the arguments we make stack up against those we have not? We will never know.

Yet, on the other hand, we have found dedicated customers and steady and growing support from the arguments we do make. At least we know we are not scaring potential customers away. Frankly, we suspect our market arguments are pretty compelling. While we discuss many aspects of semantic technologies in our various writings and communications, we have also tended to continually hone and polish our messages. We keep trying to focus. Fewer points are better than more and points that resonate with the market — that address the “pain points” in common parlance — have the greatest impact.

It is also obvious that the arguments an academic needs to make to a funding agency or commission are much different than what is desired by commercial customers. (Not to mention the US intelligence community, which is the largest — yet silent — funder of semantic technologies.) Much of what one can gain from the literature is more of this academic nature, as are most discussions on mailing lists and community fora. We distinctly do not have the academic perspective. Our viewpoint is that of the enterprise, profit-making or non-profit. Theory takes a back seat to pragmatics when there are real problems to solve.

Our three main selling points to this enterprise market relate to data integration and interoperability; search and discovery; and leveraging existing information assets with low risk. How we paint a compelling picture around these topics is discussed for each point below. We conclude with some thoughts about how we communicate these arguments, perhaps providing some background that others might find useful in making such arguments themselves.

“Semantic Technologies Enable Data Integration and Interoperability”

As I have experienced first hand and have argued many times [1], the Holy Grail of enterprise information technology over the past thirty years has been achieving true data integration and interoperability. It is, I believe, the primary motivating interest for almost all IT efforts not directly related to conventional transaction systems. Yet, because of this longstanding and abiding interest, enterprise IT managers react with justifiable skepticism every time new advances in interoperability are claimed.

The claims for semantic technologies are not an exception. But, even in its positioning, there is something in the descriptive phrasing of “semantic technologies” that resonates with the market. Moreover, to overcome the initial skepticism, we also tend to emphasize two bolstering arguments promoting interoperability:

  1. Semantic technologies matched with natural language processing (NLP) techniques work to integrate unstructured data, finally incorporating the 80% of enterprise information locked up in documents and overcoming the limitations of manually assigned tags, and
  2. The RDF data model is capable of capturing any existing data relationship, and ontologies are capable of capturing any existing information schema.

Since these are two of the core aspects of data integration that have heretofore been limited under conventional approaches, and since they can be demonstrated rather quickly, trust can be placed in the ultimate interoperability argument.
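
To make the second point concrete, a record from a conventional relational table can be restated as subject-predicate-object statements, the basic unit of the RDF data model. What follows is only a minimal sketch under stated assumptions: the `row_to_triples` helper, the `ex:` prefix, and the sample employee record are all hypothetical, and plain Python tuples stand in for a real RDF library.

```python
Triple = tuple[str, str, str]


def row_to_triples(table: str, key: str, row: dict[str, object]) -> list[Triple]:
    """Restate one relational row as subject-predicate-object triples.

    Hypothetical helper for illustration; the "ex:" prefix is an assumption.
    """
    subject = f"ex:{table}/{row[key]}"
    # The table name becomes the type of the thing described.
    triples: list[Triple] = [(subject, "rdf:type", f"ex:{table.capitalize()}")]
    for column, value in row.items():
        # The key is already in the subject; NULL columns assert nothing.
        if column == key or value is None:
            continue
        triples.append((subject, f"ex:{column}", str(value)))
    return triples


# Hypothetical sample record.
employee = {"id": 42, "name": "Ada", "dept": "R&D", "manager": None}
triples = row_to_triples("employee", "id", employee)
for triple in triples:
    print(triple)
```

Note that the empty `manager` column simply produces no statement, so nothing has to be declared up front about which attributes a record must carry.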

In the end, the ability of semantic technologies to promote rather complete data integration and interoperability will prove to be their most compelling rationale. Yet, achieving this with semantic technologies will require more time and broader scope than what has been instituted to date. By starting smaller and simpler, a more credible entry argument can be made that is also on the direct pathway to interoperability benefits.

“Semantic Technologies Improve Search and Discovery”

On the face of it, search engines and the search function are nearly ubiquitous. Further, search is generally effective in eventually finding information of interest, though sometimes the process of getting there is lengthy and painful.

This inefficiency results because search has three abiding problems. One, there is too much ambiguity in what kind of thing is being requested; disambiguation to the context at hand is lacking. Second, there is a relative lack of richness in the kinds of relationships between things that are presented. We are learning through Web innovations like Wikipedia or the Google Knowledge Graph that there are many attributes that can be related to the things we search. The natural desire is to now see such relationships in enterprise search as well, including some of this public, external content. And, third, because of these two factors, search is not yet an adequate means for discovering new insights and knowledge. We see the benefits of serendipitous discovery, but we have not yet learned how to do this with purpose or in a repeatable way.

More often than not customers see search, with better display of results, at the heart of the budget rationale for semantic projects. The graph structures of semantic schema mean that any node can become an entry point to the knowledge space for discovery. The traversal of information relationships occurs from the selection of predicates or properties that create this graph structure in the first place. This richness of characterization of objects also means we can query or traverse this space in multiple languages or via the full spectrum by which we describe or characterize things. Semantic-based knowledge graphs are potentially an explosion of richness in characterization and how those characterizations get made and referred to by any stakeholder. Search structure need not be preordained by some group of designers or information architects, but can actually be a reflection of its user community. It should not be surprising that search offers the quickest and most visible path to conveying the benefits of semantic technologies.
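
The idea that any node can serve as an entry point can be sketched in a few lines. The example below is an illustration only, not how any particular product does it: the `TRIPLES` sample data, the `ex:` names, and the `build_index` and `discover` helpers are hypothetical. It indexes each statement in both directions and then walks outward from whatever node the user happens to start at.

```python
from collections import defaultdict, deque

# Hypothetical sample statements (subject, predicate, object).
TRIPLES = [
    ("ex:Acme", "ex:makes", "ex:Widget"),
    ("ex:Widget", "ex:usedIn", "ex:Gadget"),
    ("ex:Bob", "ex:worksFor", "ex:Acme"),
]


def build_index(triples):
    """Index every statement from both ends so objects are entry points too."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[s].append((p, o))
        index[o].append((f"^{p}", s))  # "^" marks the inverse direction
    return index


def discover(index, start, max_hops=2):
    """Breadth-first walk from any node; returns {node: hops from start}."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for _predicate, neighbor in index[node]:
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen


index = build_index(TRIPLES)
from_gadget = discover(index, "ex:Gadget", max_hops=2)
from_bob = discover(index, "ex:Bob", max_hops=1)
print(from_gadget)
print(from_bob)
```

Starting from a product, a person, or a company uses exactly the same code path; only the starting node changes.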

These arguments, too, are a relatively quick win. We can rapidly put in place the semantic structures that make improved search benefits evident. There are two nice things about this argument. First, it is not necessary to comprehensively capture the full knowledge domain of the customer’s interests to show these benefits. Relatively bounded projects or subsets of the domain are sufficient to show the compelling advantages. And, second, as this initial foothold gets expanded, the basis for the next argument also becomes evident.

“Semantic Technologies Leverage Existing Assets with Low Risk”

I have often spoken about the incremental nature of how semantic technologies might be adopted and the inherent benefits of the open world mindset. This argument is less straightforward to make since it requires the market to contemplate assumptions they did not even know they had.

But, one thing the market does know is the brittleness and (often) high failure rates of knowledge-based internal IT projects. An explication of these causes of failure can help, via the inverse, to make the case for semantic technologies.

We know (or strongly suspect), for example, that these are typically the causes of knowledge-based IT failures:

  • Too broad a scope or the need to embrace too much of the information basis of the domain
  • Changing knowledge and circumstances that cause initial design imperatives to change over the course of a project
  • High visibility for multiple audiences and stakeholders, and no workable means for finding a common view or consensus as to objectives (let alone terminology) for the project amongst these stakeholders.

Getting recognition for these types of failures or challenges creates the opening for discussing the logical underpinnings of conventional IT approaches. The conventional closed-world approach, which is an artifact of using information systems developed for transaction and accounting purposes, is unsuited to open-ended knowledge purposes. The argument and justification for semantic technologies for knowledge systems is that simple.

The attentive reader will have seen that the first two arguments presented above already reify this open world imperative. The integration argument shows the incorporation of non-structured content as a first-class citizen into the information space. The search argument shows increased scale and richness of relationships as new topics and entities get added to the search function, all without adversely impacting any of the prior work or schema. For both arguments, we have expanded our scope and schema alike without needing to re-architect any of the semantic work that preceded it. This is tangible evidence for the open world argument in the context of semantic technologies applied to knowledge problems.
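
A toy contrast may help here. The sketch below is a simplification under stated assumptions: the `ask_closed` and `ask_open` helpers and the `ex:` facts are hypothetical, and a Python set stands in for a real triple store. Under a closed-world reading a missing fact is false; under an open-world reading it is merely unknown, so a fact with a brand-new predicate can be added later without disturbing earlier answers.

```python
from typing import Optional

Triple = tuple[str, str, str]


def ask_closed(store: set[Triple], fact: Triple) -> bool:
    """Closed world: whatever is not recorded is treated as false."""
    return fact in store


def ask_open(store: set[Triple], fact: Triple) -> Optional[bool]:
    """Open world (simplified): whatever is not recorded is unknown (None)."""
    return True if fact in store else None


# Hypothetical facts.
makes_widget = ("ex:Acme", "ex:makes", "ex:Widget")
located_in = ("ex:Acme", "ex:locatedIn", "ex:Iowa")

store: set[Triple] = {makes_widget}

before = ask_open(store, makes_widget)
closed_answer = ask_closed(store, located_in)  # False: asserted as not so
open_answer = ask_open(store, located_in)      # None: simply not yet known

# Later, a fact using a new predicate arrives. Nothing is migrated or redesigned.
store.add(located_in)
after = ask_open(store, makes_widget)

print(before, closed_answer, open_answer, after)
```

The earlier question returns the same answer before and after the addition, which is the property the low-risk argument relies on.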

This evidence, plus the fact that we have been increasingly incorporating more sources of information with varied structure, most of which already exist within the enterprise’s information assets, shows that semantic technologies can leverage benefits from existing assets at low risk. At this point, if we have told our story well, it should be evident that the semantic approach can be expanded at whatever pace and scope the enterprise finds beneficial, all without impacting what has been previously implemented.

Actually, the argument that semantic technologies leverage existing assets with low risk is perhaps the most revolutionary of the three. Most prior initiatives in the enterprise knowledge space have required wholesale changes or swapping out of existing systems. The unique contribution of semantic technologies is that they can achieve their benefits as a capability layered over existing assets, all without disruption to their existing systems and infrastructure. The degree to which this layering takes place can be driven solely by available budgets with minimal risk to the enterprise.

Ambassadors and Archivists, as well as Entrepreneurs

There are, of course, other messages that can be made, and we ourselves have made them in other circumstances and articles. The three main arguments listed herein, however, are the ones we feel are most useful at the time of early engagement with the customer.

Our messages and arguments gain credibility because we are not just trying to “sell” something. We understand that semantic technologies and the mindsets behind them are not yet commonplace. We need to be ambassadors for our passion and work to explain these salient differences to our potential markets. As later parts in this series will discuss, with semantic technologies, one needs to constantly make the sale.

The best semantic technology vendors understand that market education is a core component to commercial success. Once one gets beyond the initial sale, it is a constant requirement to educate the customer with the next set of nuances, opportunities and technologies.

We acknowledge that vendors have other ways to generate “buzz” and “hotness.” We certainly see the consumer space filled with all sorts of silliness and bad business models. But our pragmatic approach is to back up our messaging with full documentation and market outreach. We write much and contribute much, all of which we document on vehicles such as our blogs, commercial Web site, or TechWiki knowledge base. New market participants need to learn and need to be armed with material and arguments for their own internal constituencies. Insofar as we are the agents making these arguments, we also get perceived as knowledgeable subject matter experts in the semantic technology space.

I have talked in my Of Flagpoles and Fishes article of the challenges of marketing to a nascent market where most early sales prospects remain hidden. At this stage in the market, our best approach is to share and communicate with new market prospects in a credible and helpful way. Then, we hope that some of those seeking more information are also in a position to commission real work. If we are at all instrumental in those early investigations, we are likely to be considered as a potential vendor to fulfill the commercial need.

Of course, each new engagement in the marketplace means new lessons and new applications. Thus, too, it is important that we become archivists as well. We need to capture those lessons and feed them back to the marketplace in a virtuous circle of learning, sharing, and further market expansion. Targeted messages delivered by credible messengers are the keys to unlocking the semantic technologies market.

NOTE: This is part of an ongoing series on enterprise-scale semantic systems (ESSS), which has its own category on this blog. Simply click on that category link to see other articles in this series.

[1] Simply conduct a search on to see how frequently this topic is a focus of my articles.
Posted:January 1, 2010

(And Wishing All a Healthy and Prosperous 2010!)

According to iProspect, about 56 percent of users use search engines every day, based on a population of which more than 70 percent use the Internet more than 10 hours per week.[1] The average knowledge worker spends 2.3 hrs per day — or about 25% of work time — searching for critical job information.[2] IDC estimates that enterprises employing 1,000 knowledge workers may waste well over $6 million per year each in searching for information that does not exist, failing to find information that does, or recreating information that could have been found but was not.[3]

Vendors and customers often use time savings by knowledge workers as a key rationale for justifying a document or content initiative. This comes about because many studies over the years have noted that white collar employees spend a consistent 20% to 25% of their time seeking information. The premise is that more effective search will save time and drop these percentages. For example, EDS has suggested that improvements of 50 percent in the time spent searching for data can be achieved through improved consolidation and access to data.[4]

Using these premises, consultants often calculate that every 1% reduction in the total work time devoted to search yields, on a fully burdened basis, a sizable cost savings per employee. Illustratively:

$50,000 (base salary) * 1.8 (burden rate) * 1.0% = $900/employee
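
The same illustrative arithmetic can be written out so the inputs are easy to vary. This is a sketch of the consultants' calculation only, not an endorsement of it; the function name `search_savings_per_employee` and the 1,000-person headcount used for scaling are assumptions added for illustration.

```python
def search_savings_per_employee(
    base_salary: float, burden_rate: float, reduction_fraction: float
) -> float:
    """Fully burdened cost of the work time supposedly freed by better search."""
    return base_salary * burden_rate * reduction_fraction


# Inputs taken from the illustrative formula: $50,000 salary, 1.8 burden, 1% reduction.
per_employee = search_savings_per_employee(50_000, 1.8, 0.01)

# Assumed headcount, only to show how the per-employee figure gets scaled up.
headcount = 1_000
total = per_employee * headcount

print(f"${per_employee:,.0f} per employee")
print(f"${total:,.0f} across {headcount:,} employees")
```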

Beware such facile analysis!

The fact that many studies over the years have noted white collar employees spend a consistent 20% to 25% of their time devoted to search suggests it is the “satisficing” allocation of time to information search. (In other words, knowledge workers are willing to devote a quarter of their time to finding relevant information; the remainder goes to analysis and documentation.)

Thus, while better tools to aid better discovery may lead to finding better information and making better decisions more productively (an important justification in itself), a strict time or labor savings may not result from more efficient search.[5] Be careful of justifying project expenditures based on “time savings” related to search. Search is likely to remain the “25% solution.” The more relevant question is whether the time that is spent on search produces better information or not.

This Friday brown bag leftover was first placed into the AI3 refrigerator on September 14, 2005. No changes have been made to the original posting.

[1] iProspect Corporation, iProspect Search Engine User Attitudes, April/May 2004, 28 pp.
[2] Delphi Group, “Taxonomy & Content Classification Market Milestone Report,” Delphi Group White Paper, 2002.
[3] C. Sherman and S. Feldman, “The High Cost of Not Finding Information,” International Data Corporation Report #29127, 11 pp., April 2003.
[4] M. Doyle, S. Garmon, and T. Hoglund, “Make Your Portal Deliver: Building the Business Case and Maximizing Returns,” EDS White Paper, 10 pp., 2003.
[5] M.E.D. Koenig, “Time Saved: a Misleading Justification for KM,” KMWorld Magazine, Vol 11, Issue 5, May 2002.

Posted by AI3's author, Mike Bergman Posted on January 1, 2010 at 11:47 am in Brown Bag Lunch, Searching | Comments (1)
Posted:April 21, 2009


SearchMonkey’s Recommended Vocabularies a Useful Resource

I am pleased to report that UMBEL is now included as one of the recommended vocabularies for the Yahoo! SearchMonkey service. Using SearchMonkey, developers and site owners can use structured data to enhance the value of standard Yahoo! search results and customize their presentation, including through “infobars”. SearchMonkey is integral to a concerted effort by Yahoo! to embrace structured data, RDF and the semantic Web.

SearchMonkey was first announced in February 2008, with a beta release in April and then a public release in May with 28 supported vocabularies. Then, last October, an additional set of common, external vocabularies was recommended for the system, including DBpedia, Freebase, GoodRelations and SIOC. At the same time, some further internal Yahoo! vocabularies and standard Web languages (e.g., OWL, XHTML) were also added.

This is the first vocabulary update since then. Besides UMBEL, the AB Meta and Semantic Tags vocabularies have also been added to this latest revision. (There have also been a few deprecations over time.)

A recommended vocabulary means that its namespace prefix is recognized by SearchMonkey. The namespaces for the recommended vocabularies are reserved. Though site owners may customize and add new SearchMonkey structure, such additions must be explicitly defined in specific DataRSS feeds.

Structured data may be included in Yahoo! search results from these sources:

  • Yahoo! Index — the core Yahoo! search data with limited structure such as the page’s title, summary, file size, MIME type, etc. This structure is only provided by Yahoo!
  • Semantic Web Data — including microformats and RDF data embedded in the host page
  • Data Feed — A feed of Yahoo! native DataRSS provided by a third party site
  • Custom Data Service — Any data extracted from an (X)HTML page or web service and represented within SearchMonkey as DataRSS.

As a recommended vocabulary, UMBEL namespace references can now be embedded and recognized (and then presented) in Yahoo! search results.

The Current Vocabulary Set

Here are the 34 current vocabularies (plus five deprecated) recognized by the system:

Prefix      Vocabulary
abmeta      AB Meta
action      SearchMonkey Actions
assert      SearchMonkey Assertions (deprecated)
cc          Creative Commons
commerce    SearchMonkey Commerce
context     SearchMonkey Context (deprecated)
country     SearchMonkey Country Datatypes
currency    SearchMonkey Currency Datatypes
dc          Dublin Core
feed        SearchMonkey Feed
finance     SearchMonkey Finance
job         SearchMonkey Jobs
media       SearchMonkey Media
news        SearchMonkey News
owl         OWL ontology language
page        SearchMonkey Page (deprecated)
product     SearchMonkey Product
rdfs        RDF Schema
reference   SearchMonkey Reference
rel         SearchMonkey Relations (deprecated)
resume      SearchMonkey Resume
social      SearchMonkey Social
stag        Semantic Tags
tagspace    SearchMonkey Tagspace (deprecated)
use         SearchMonkey Use Datatypes
xsd         XML Schema Datatypes

In addition, there are a number of standard datatypes recognized by SearchMonkey, mostly a superset of XSD (XML Schema datatypes).

What is emerging from this Yahoo! initiative is a very useful set of structured data definitions and vocabularies. These same resources can be great starting points for non-SearchMonkey applications as well.

For More Information

There is quite a bit of online material now available for SearchMonkey, with new expansions and revisions also accompanying this most recent release. As some starting points, I recommend: