It actually is a dimmer memory than I would like: the decision to begin a blog eight years ago, nearly to the day. Since then, every month, and more often many times per month, I have posted the results of my investigations or ramblings, mostly like clockwork. But, in a creeping realization, I see that I have not posted any new thoughts on this venue for more than two months! Until that hiatus, I had been biologically consistent.
Maybe, as some say, and I don’t disagree, the high-water mark of the blog has passed. Certainly blog activity has dropped dramatically. The emergence of ‘snippet communications’ now appears dominant based on messages and bandwidth. I don’t loathe it, nor fear it, but I find a world dominated by 140 characters and instant babbling mostly a bore.
From a data mining perspective — similar to peristalsis or the wave in a sports stadium — there is worth in the “crowd” coherence/incoherence and spontaneity. We can see the waves, but most are transient. I actually think that broad scrutiny helps promote separating the wheat from chaff. We can expose free radicals to the sanitizing effect of sunlight. Yet these are waves, only very rarely trends, and most generally not truths. That truth stuff is some really slippery stuff.
But, that is really not what is happening for me. (Though I really live to chaw on truth.) Mostly, I just had nothing interesting to say, so there was no reason to blog about it. And, now, as I look at why I broke my disciplined approach to blogging and why it has gone on hiatus, I still am a bit left scratching my head as to why my pontifications stalled.
Two obvious reasons are that our business is doing gangbusters, and it is hard to sneak away from company good-fortune. Another is that my family and children have been joyously demanding.
Yet all of that deflects from the more relevant explanation. The real reason, I think, that I have not been writing more recently actually relates to the circumstance of semantic technologies. Yes, progress is being made, some instances are notable, but the general “semantic web” or “linked data” enterprise is stalled. The narrative for these things – let alone their expression and relevance – needs to change substantially.
I feel we are in the midst of this intense interstice, but the framing perspective for the next discussions has yet to emerge.
The strange thing about that statement is not the basis in semantic technologies, which are now understood and powerful, but the incorporation of these advantages into enterprise practices and environments. In this sense, semantic technologies are now growing up. Their logic and role are clear and explainable, but how they fit into corporate practice with acceptable maintenance costs is still being discovered.
A popular post on this blog has been the Seven Pillars of the Open Semantic Enterprise. That article described the building blocks – or foundations – to a semantic enterprise, at least from my own perspective. But it has always felt that the reason why anyone should even be interested in this semantic enterprise business deserved its own discussion.
This current article riffs off of that earlier blog to provide the seven rationales or arguments for why pursuing a semantic enterprise makes sense, especially in contrast to conventional or traditional approaches. This riff extends to even re-presenting the seven-spoke wheel from that original article:
Each of these bubbles deserves some discussion.
The first advantage with semantic technologies is that all kinds of information are unified. No matter what information you consider, any content type may become a ‘first-class citizen’. For really the first time, all kinds of information ranging from traditional databases and spreadsheets (“structured”), to markup, Web pages, XML and data messages (“semi-structured”), and then on to documents and text (“unstructured”) or multimedia (via metadata) can be put on a level playing field.
These data, now all treated on an equal footing, can be searched and retrieved by a variety of techniques. These range from SQL to standard text search to SPARQL, depending on content type. This unique combination enables all of the aspects of findability – find, discover, navigate – to be fulfilled. Because of the diversity of search options available, search results can be varied and “dialed in” depending upon circumstance and needs.
Because all content is represented either as a type of thing (“node” or noun) or the relationships between those things (“predicate”, “property”, “attribute” or characteristic or verb), any and all of those classifiers may be used for faceting or grouping. Further, the relationships put all things in context, useful to ensure results are relevant and disambiguated.
In all cases, these ways of describing things and their relationships to one another are based on the “idea of the thing” and are not bound or restricted to keywords. This means that all the various ways that things can be described – alternative terms, synonyms, acronyms or jargon – including in multiple languages, can be used to find or match these ideas or concepts.
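To make the "idea of the thing" a bit more concrete, here is a minimal sketch in Python. It assumes a toy in-memory store; the identifiers (such as `concept:heart_attack`), the labels and the function names are all hypothetical and only illustrate how many labels in many languages can point at one concept.

```python
# Hypothetical concept records: one identifier, many labels (illustrative only).
CONCEPTS = {
    "concept:heart_attack": {
        "en": ["heart attack", "myocardial infarction", "MI"],
        "fr": ["crise cardiaque", "infarctus du myocarde"],
    },
    "concept:camera": {
        "en": ["camera", "digital camera"],
        "de": ["Kamera"],
    },
}


def build_label_index(concepts):
    """Map every lower-cased label, in any language, to its concept identifiers."""
    index = {}
    for concept_id, labels_by_lang in concepts.items():
        for labels in labels_by_lang.values():
            for label in labels:
                index.setdefault(label.lower(), set()).add(concept_id)
    return index


def find_concepts(term, index):
    """Return the concepts whose labels match the term, ignoring case."""
    return index.get(term.strip().lower(), set())


LABEL_INDEX = build_label_index(CONCEPTS)
```

A search for the acronym, the French phrase or the plain English term all resolve to the same concept, which is the point of matching on ideas rather than on keywords.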
When combined with the ability to infer relationships between things – even if not explicitly asserted – search and discovery with semantic technologies literally blow away any and all alternative approaches to search.
The classic information architecture (IA) diagram relates users to content and context. It is at the nexus of these ideas that actions and actionable information occur:
Semantic technologies are superior in terms of the ability to capture all forms of content (structured, semi-structured and unstructured) as first-class citizens and to represent it through knowledge graphs (ontologies). Further, the ability to describe this content with multiple labels, languages and descriptors means the “idea of things” is much better captured than via keywords alone.
The explicit accounting of relationships between things with semantic technologies means the ability to capture better context, important for navigation and the reduction of ambiguity. The richness of relationships also means that the way things relate to one another can also be used for grouping, classifying, filtering or finding things.
Users can be better characterized and related to this content and its contexts because of this ability to match metadata to things and relationships. This leads to richer user experiences and the separation of content from presentation, giving the content more power.
In all cases, because of these basic information architecture advantages, the actions that can be taken upon content are far superior in comparison to any alternatives. Unique actions brought by semantic technologies include analysis, graph traversals and answering systems.
But, the ability to do more extends beyond content and context.
The ability of semantic technologies – specifically the RDF data model – to represent all content forms and any possible schema means that any existing content can be represented through a single representation. This makes RDF a form of “universal solvent” in which any content form can be distilled. This has huge implications and advantages.
One advantage is in providing data interoperability. The RDF data model enables any content form or schema to be represented, and the fact that the meanings of things can be mapped to agreed concepts and relationships also means that disparate information sources can be adequately related to one another. The fact that all data has a unique URI identifier means that any information accessible via HTTP can be included. This model with its marriage to ontology graphs leads to an excellent framework for interoperating data.
This same robustness for mapping different data leads to a second advantage, namely, semantic annotation. Concepts may be matched through so-called ontology-based information extraction (OBIE) and entities or things may be matched with named entity recognition (NER). The tags that result from these recognitions can then be placed back into documents via what is known as semantic injection. The result of all of these activities is that the very same content can now be equivalently understood by machines or humans.
The graph orientation of the system and its logic next means that the information structure is computable with unique analysis capabilities. The relationships between things can be understood and inferred, and the graph structures themselves may lead to unique traversal mechanisms and network analysis such as influence, clusters, neighborhoods, connectedness and so forth. No other information structure provides these unique advantages.
The graph structure also means that finding and relating stuff only need access a single index, after which relations can be traced and computed. Conventional relational database systems require joins and multiple index lookups to even approximate a portion of this ability, which can quickly run out of steam with complex requests or queries. (Also, recall that RDBMSs cannot accommodate the content or schema flexibilities that semantic technologies can, either.)
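As a rough illustration of tracing relations from a single index, the sketch below builds one adjacency index from a handful of made-up statements and then walks it breadth-first to find a node's neighborhood. The data and the function names are assumptions for illustration, not part of any particular triple store.

```python
from collections import defaultdict, deque

# Made-up statements; each is (subject, predicate, object).
EDGES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "worksFor", "acme"),
    ("dave", "knows", "alice"),
]


def build_index(edges):
    """Build a single adjacency index, treating each relation as traversable both ways."""
    index = defaultdict(set)
    for subject, _predicate, obj in edges:
        index[subject].add(obj)
        index[obj].add(subject)
    return index


def neighborhood(index, start, max_hops):
    """Breadth-first traversal: map each node within max_hops of start to its distance."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the requested radius
        for neighbor in index[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


INDEX = build_index(EDGES)
```

Every hop is a lookup in the same index, which is the contrast with chaining joins across several tables.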
These advantages – combined with some of the other advantages discussed in next sections – also enable semantic publishing using these technologies. Semantic publishing offers new ways to let data and its characteristics drive how information gets presented.
Besides the RDF data model, the other pivotal aspect of semantic technologies is the knowledge graph, the ontology that captures the logical schema of the problem domain at hand. The knowledge graph is based on logic and is built from simple statements, or assertions.
The so-called “triples” that are these basic assertion statements in the semantic model are like a sentence of subject – predicate – object. The object of one statement can be the subject of another. In this manner, these “triple” barbells get connected together, growing in a graph-like structure as more statements are added. These basic building blocks are easy to understand and easy to correct if problems are found. Because each node (a subject or object) is the “idea of a thing” and not limited by individual labels or language, each of these things can be described in multiple ways with multiple terms or synonyms. Different people using different language to describe the same thing can thus communicate. Further, how these things relate to one another can be as diverse as how things relate to one another in the real world. The knowledge graph is phenomenally capable of describing the relevant world at hand.
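The "barbell" picture can be sketched in a few lines of Python. This is only a toy, assuming plain tuples for triples and an invented `ex:` vocabulary; a real system would use an RDF library and proper URIs.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) triple; the vocabulary is invented.
TRIPLES = [
    ("ex:camera", "ex:isKindOf", "ex:device"),
    ("ex:device", "ex:isKindOf", "ex:product"),
    ("ex:camera", "ex:prefLabel", "camera"),
    ("ex:camera", "ex:altLabel", "Kamera"),
]


def build_graph(triples):
    """Group statements by subject so each node lists its outgoing (predicate, object) pairs."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def connecting_nodes(triples):
    """Nodes that are the object of one statement and the subject of another."""
    subjects = {s for s, _p, _o in triples}
    objects = {o for _s, _p, o in triples}
    return subjects & objects


def describe(node, graph):
    """Return everything asserted about a node, sorted for stable display."""
    return sorted(graph.get(node, []))


GRAPH = build_graph(TRIPLES)
```

Here `ex:device` is the joint between two barbells, and adding one more tuple to the list is all it takes to grow the graph.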
All of these components themselves are based on basic first-order logic, which means these graph structures can be reasoned over, including being able to infer what is not strictly asserted, and to test that the assertions that are made make logical sense. This logical sense is what we term coherence. Because of this logical structure, and because of its graph nature, semantic technologies offer unique ways to find things and to analyze them. Increasingly over time we will see graph analysis become a more important aspect of how we analyze and solve problems.
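A small sketch may help show what inferring the unasserted and testing for coherence can look like. It assumes just two simplified rules, a transitive "is a kind of" relation and declared disjointness between kinds; actual reasoners support far richer logic, and every name below is hypothetical.

```python
def closure(assertions):
    """Extend (narrower, broader) pairs with everything implied by transitivity."""
    result = set(assertions)
    changed = True
    while changed:
        changed = False
        for a, b in list(result):
            for c, d in list(result):
                if b == c and (a, d) not in result:
                    result.add((a, d))
                    changed = True
    return result


def inferred(assertions):
    """Return only the pairs that follow logically but were never asserted."""
    return closure(assertions) - set(assertions)


def incoherent(assertions, disjoint):
    """Return things that end up as a kind of two kinds declared disjoint."""
    full = closure(assertions)
    problems = set()
    for x, y in disjoint:
        kinds_of_x = {a for a, b in full if b == x}
        kinds_of_y = {a for a, b in full if b == y}
        problems |= kinds_of_x & kinds_of_y
    return problems


ASSERTED = {("camera", "device"), ("device", "product")}
DISJOINT = {("product", "animal")}
```

Nobody stated that a camera is a product, yet it falls out of the two assertions; and if someone later asserts that a camera is an animal, the disjointness check flags it.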
How all of this affects the business is fundamental. Because of the characterization of data and the structure of how it interacts together – and because the basic nature of these structures is relatively easy to understand – semantic technologies bring a tectonic shift to the enterprise. Control of how it works now shifts to those who need to consume and manage that information; that is, knowledge workers, managers, and subject-matter experts.
This content is separated from the presentation and the applications that use it. Since so much information is contained in the structure and relationships of the content, these patterns can inform how to present and use the information. For example, the fact that some information contains geo-locational attributes means that it can be mapped; or, we can know that cameras are a kind of device or product. This embedded knowledge can be used to inform how generic applications need to respond and display the info.
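As a hedged sketch of how embedded knowledge might drive a generic application, the Python below picks display widgets purely from a record's attributes and its place in a small "kind of" hierarchy. The attribute names (`ex:lat`, `ex:long`), the widget names and the hierarchy are invented for illustration.

```python
# Invented "is a kind of" hierarchy: child -> parent.
KIND_OF = {"ex:camera": "ex:device", "ex:device": "ex:product"}

# A hypothetical record with a type and a few attributes.
RECORD = {"type": "ex:camera", "ex:lat": 41.66, "ex:long": -91.53, "ex:price": 299.0}


def kinds(type_id, kind_of):
    """Return the type followed by all of its broader kinds."""
    chain, seen = [type_id], {type_id}
    while type_id in kind_of and kind_of[type_id] not in seen:
        type_id = kind_of[type_id]
        chain.append(type_id)
        seen.add(type_id)
    return chain


def choose_widgets(record, kind_of):
    """Let the data decide: geo attributes imply a map, products imply a product card."""
    widgets = []
    if "ex:lat" in record and "ex:long" in record:
        widgets.append("map")
    if "ex:product" in kinds(record["type"], kind_of):
        widgets.append("product_card")
    return widgets
```

The application code never mentions cameras; it only reacts to what the structure says about the record.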
Thus, we see that the nexus of control around knowledge management can now shift to those who need and consume that information. The role of information technology moves to the background to provide the infrastructures and tools that can be driven from these information structures. We term these types of applications, “ontology-driven applications”, or ODapps.
We are only now at the very beginnings of this transition. ODapp tools are still few and not mature, and few organizations have even made the cultural transition to shift this locus of control. But, embracing semantic technologies and their innate power to bring information management directly into the hands of those who need it will definitely disrupt the enterprise.
Because these structures and the data model behind them are a natural fit with the nature of information, semantic technologies prove to be both adaptive and robust. The data model is easily extended and modified without affecting the schemas already in place, a circumstance of having an open world logic. An abiding constant of relational technology – the basis for enterprise IT systems over the past few decades – has been its rigidity and difficulty in changing its structure or organization. Such a framework is perhaps the best for transactional systems, but is a poor choice for knowledge systems where the amount of content and its relationships are constantly changing.
A huge lever arising from these underlying semantic technologies is the ability to integrate across the different characteristics of information – its syntax, its structure, and its meaning. Units of measure, or different languages, or different ways to describe the same thing can all be boiled down to a common representation.
As the attention shifts to how we describe our domains and their concepts and instances, we can segregate off the questions of how we present and interact with that information. That means our human-computer interfaces can become more effective. It means that HCI itself can focus more on the channel or device. We are seeing it now with widgets and mobile apps, but our information will increasingly be presented through known interactions no matter what device we use. Semantic technologies are a natural and superior means for this adaptivity.
All of these benefits in productivity, responsiveness, and adaptivity translate into much reduced costs. These reductions come both in lower set-up and deployment costs and in lower maintenance and scope costs. Experience is that these functions can be undertaken, on average, at costs one or two orders of magnitude lower with semantic technologies than with traditional approaches.
These reductions come about because we can leverage our existing information stores and schema into a single, “canonical” representation against which we can tool and present. The fact that new sources can be integrated into the system without re-architecting what already exists is another huge win, and a matter that almost always overruns budgets with conventional approaches.
An area little documented is the high cost of errors. Errors are ubiquitous in our current information systems, yet their cost is hidden and huge. Because semantic technologies can help in putting information into context, can help resolve ambiguities, and can be tested for integrity and coherence, the chance of identifying errors before they are introduced into the system is great. These benefits are in addition to the measurable deployment and maintenance advantages.
Every domain has its own rationale and arguments for why semantic technologies make sense. In this case, we use a biomedical example. It is particularly suitable because health care and biomedical knowledge, as indeed for all of biology, is a rich domain for semantics.
As the very aspects of life get scrutinized and dissected with our modern technologies and approaches, we are seeing a veritable explosion in both the amount and nature of biomedical information. Many ideas and concepts not known five to ten or twenty years ago now define this dynamic domain. New and technical terminology keeps arising, but also because the relations are to life and health, these need to be expressed in human terms and at varying levels of sophistication.
This is a perfect example of the relevance of semantic technologies. It is no wonder that more than 250 ontologies now characterize this space, with growth in semantic solutions rapidly occurring because leading institutions and funding agencies are aggressively exploiting and promoting semantic technologies.
For a slide presentation of some of these points, you may see:
The fulcrum by which semantic technologies work within the enterprise is the dataset. A dataset refers to a named grouping of records, best designed as similar in record types and intended access rights (though technically a dataset is any named grouping of records).
Datasets play a central role in the organization of information in Structured Dynamics’ (SD) open semantic framework (OSF). Datasets are one of the three major access dimensions to the OSF (the other two being users/groups and tools/endpoints). In combination, these three dimensions – datasets, users/groups and tools/endpoints – can also result in a powerful set of profiles that govern overall access to content.
Specific security aspects of the semantic enterprise stack are discussed in another part of this Enterprise-scale Semantic Systems (ESSS) series, but the interplay of those aspects with datasets is fundamental. As such, how datasets are bounded and organized (and, then, named) is a critical management consideration for enterprises that adopt a semantic technology stack based on an architecture like OSF. This role of datasets, how to organize them, how to manage them, and also some best practices for how to use them, are the focus of this part in our series.
To briefly recall the architectural discussion in this series, SD’s semantic technology stack involves a Web services layer (structWSF) used to access specific functional endpoints, all via HTTP queries. Some of these endpoints access complete applications in such areas as tagging, imports/exports, search and the like. Other endpoints individually provide (or not) access to CRUD (create – read – update – delete) rights to interact with either individual records, full datasets, or the ontologies that are the “schema” overlying this information. The net result, at present, is more than 20 individual Web service endpoints to query and interact with the system:
|Auth Registrar: Access||X||X|
|Auth Registrar: WS||X||X|
This structWSF Web services layer has a three-dimensional design that is used to govern access:
A “user” may extend from an individual to an entire class or group of users, say, unregistered visitors to a given portal. Tools refer to each of the structWSF endpoints, each with its own URI.
What this means is that a given user may be granted access or not — and various rights or not from reading to the creation or deletion of information — in relation to specific datasets. Stated another way, it is in the nexus of user type and dataset that access control is established for the semantic system.
In an enterprise context, a given individual (“user”) may have different access rights depending on circumstance. A worker in a department may be able to see and do different things for departmental information than for enterprise information. A manager may be able to view budget information that is not readable by support personnel. A visitor to a different Web site or portal may see different information than visitors to other Web sites. Supervisors might be able to see and modify salary data for certain employees that is not viewable by others.
The user role or persona thus becomes the access identifier to the system. What information and what tools they might use in relation to that information is defined in relation to the datasets for which they have access.
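The nexus of role and dataset can be sketched as a simple lookup. This is an illustrative toy only: the role names, dataset names and the `can` function are hypothetical and are not the structWSF API.

```python
# Hypothetical grants: (role, dataset) -> set of rights. Absence means no access.
PERMISSIONS = {
    ("support_staff", "dept_x_local"): {"read"},
    ("manager", "dept_x_local"): {"read", "update"},
    ("manager", "dept_x_budget"): {"read"},
    ("supervisor", "dept_x_salaries"): {"read", "update"},
}


def can(role, dataset, right, permissions=PERMISSIONS):
    """Return True only if this role holds this right on this dataset."""
    return right in permissions.get((role, dataset), set())
```

The manager can read the budget dataset while support personnel cannot, which mirrors the departmental examples above: the same person may hold different rights depending on which dataset is in play.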
So, let’s say, a given enterprise has two major information stores, #1 and #2, and also has some domain (or departmental or other such boundary) information for X, Y and Z, some of which is local (perhaps for the local branch) and the rest global (for that line of business). Further, let’s also suppose that those same departments also have sensitive, internal information related to either internal matters (such as salaries) or support matters (such as qualified vendors). This basic scenario is laid out in A of the diagram below:
Now, depending, different individuals (most often assigned to different access groups, but that is not required) need to have different access to this information. In one case, a general user with access to mostly public stuff exists for domain B; another for domain C. Then still, a supervisor or someone internally may have responsibilities in the Y domain; that could be case D.
Any of the same variations above could result in a different use case; A – D above is merely illustrative.
It is fairly easy to see that the combination of datasets x tools x roles can lead to many access permutations. With, say, the current 20-odd tools in the OSF, five different roles and just ten different datasets, we already have about 1,000 permutations. As portals and dataset numbers grow, this combinatorial explosion gets even worse. Of course, not all combinations of datasets, tools and roles make sense. In fact, only a relatively small number of patterns likely covers 95% or more of all likely access options.
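The arithmetic is easy to check with a few lines of Python. The tool, role and dataset names below are placeholders; only the counts (20, 5 and 10) come from the text.

```python
from itertools import product

# Placeholder names; only the counts matter here.
tools = [f"tool_{i}" for i in range(20)]
roles = [f"role_{i}" for i in range(5)]
datasets = [f"dataset_{i}" for i in range(10)]

# Every (dataset, tool, role) combination is a potential access setting.
combinations = list(product(datasets, tools, roles))
print(len(combinations))  # 10 x 20 x 5 = 1000

# Each additional dataset adds another tools x roles block of settings.
per_new_dataset = len(tools) * len(roles)
print(per_new_dataset)  # 100
```

Each new dataset adds another 100 potential settings, which is why some form of templating becomes attractive quickly.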
Because access rights are highly patterned, these theoretical combinations can in fact be boiled down to a small number of practical templates — called profiles — to which a newly registered dataset can be assigned. (Of course, the enterprise could also tweak any of the standard profiles to meet any of the combinatorial options for a specific, unusual individual, such as for a tax auditor.) Experience, in fact, shows the number of actual profiles to be surprisingly small.
For instance, consider these possible profile patterns:
Profiles may, of course, be applied to any permutation.
This profile concept can now be expanded to incorporate user type. Four categories of users can illustrate this dimension:
Further, of course, with a multitude of groups, there are potentially many more than four categories (“roles”) of users as well.
To illustrate how we can collapse this combinatorial space into something more manageable, let’s look at how one of the profile cases noted above – that is, the Public profile – can now be expressed as a pattern or template. In this example, the Public profile means that owners and some groups may curate the data, but everyone can see and access the data. Also note that export is a special case, which could warrant a sub-profile.
We also need to relate this Public profile to a specific dataset. For this dataset, we can characterize our “possible” assignments as described above as to whether a specific user category (O, G, R and P as noted above) has available a given function (open dot), gets permission rights to that function by virtue of the assigned profile (solid dot), or whether that function may also be limited to a specific group or groups (half-filled dot) or not.
Thus, we can now see this example profile matrix for the Public profile for an example dataset with respect to the available structWSF Web services:
Note, of course, that these options and categories and assignments are purely arbitrary for our illustrative discussion. Actual needs and circumstances may vary wildly from this example.
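One way to picture a profile as a template is the sketch below. It is an assumption-laden toy: the category letters O, G, R and P come from the text, but which rights each one receives under the Public profile (beyond everyone being able to read, and owners and groups being able to curate) is invented here, as are the right names and the `apply_profile` function.

```python
# Category letters from the text; the specific rights per category are assumptions.
PUBLIC_PROFILE = {
    "read": {"O", "G", "R", "P"},   # everyone can see and access the data
    "create": {"O", "G"},           # owners and some groups may curate
    "update": {"O", "G"},
    "delete": {"O"},                # assumed: deletion reserved to owners
}


def apply_profile(profile, dataset_name):
    """Expand a profile template into explicit (dataset, category, right) grants."""
    return {
        (dataset_name, category, right)
        for right, categories in profile.items()
        for category in categories
    }


GRANTS = apply_profile(PUBLIC_PROFILE, "example_dataset")
```

Assigning the profile to a newly registered dataset is then a single call rather than a cell-by-cell decision, and any unusual case can still be handled by editing the expanded grants.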
Matrices such as this seem complex, but that is why profiles can collapse and simplify the potential assignments into a manageable number of discrete options. If the pre-packaged profiles need to be tweaked or adjusted for a particular circumstance, provisions through the CMS enable all assignments to be accessed in individual detail. Via this design, knowledge and collaboration networks can be deployed that support an unlimited number of configurations and options, all in a scalable, Web-accessible manner. The data that is accessed is automatically expressed as linked data. This same framework can be layered over in situ existing data assets to provide data federation and interoperable functionality, all responsive to standard enterprise concerns regarding data access, rights and permissions.
Datasets are clearly one of the fundamental dimensions for organizing content within this OSF semantic enterprise design. Some of the best practices in bounding these structures:
Any of these differences may warrant creating a separate dataset. There are no limits to the number of datasets that may be managed by a given OSF instance.
Once such boundaries are set, thought should then be given to common attributes or metadata. Still further, datasets and their records (as all decision or information artifacts in an enterprise) go through natural work stages or progressions. Even the lowliest written document needs to be drafted, reviewed, characterized, approved, and then possibly revised. Whatever such workflow steps may be, including versioning, each may warrant consideration as belonging to a different dataset.
Lastly, whatever the operational mode devised, finding naming conventions to reflect these variations is essential to manage the dataset files. Which goes to show: datasets are meaningful information artifacts in and of themselves.
We become such captives to our language and what we think we are saying. Many basic words or concepts, such as “search,” seem to fit into that mould. A weird thing with “search” is that twenty years ago the term and its prominence were quite different. Today, “search” is ubiquitous and for many its embodiment is Google such that “to google” is often the shorthand for “to search”.
When we actually do “search”, we submit a query. The same extension that has inveigled its tendrils into search has also caused the idea of a query to become synonymous with standard text (Google) search.
But, there’s more, much more, both within the meaning and the execution of “to search”.
Enterprises, familiar with structured query language (SQL), have understood for quite some time that queries and search were more than text searches to search engines. Semantic technologies have their own structured query approach, SPARQL. Enterprises know the value of search from discovery to research and navigation. And, they also intuitively know that they waste much time and don’t often get what they want from search. U.S. businesses alone could derive $33 billion in annual benefits from being better able to re-find previously discovered Internet information, an amount equal to $10 million per firm for the 1,000 largest businesses. And those benefits, of course, are only for Internet searches. There are much larger costs arising from unnecessary duplicated effort because of weaknesses in internal search.
The thing that’s great about semantic search — done right — is it combines conventional text search with structured search, adds more goodies, and basically overcomes most current search limitations.
The Webster definition of “search” is to, “look into or over carefully or thoroughly in an effort to find or discover something.”
There are two telling aspects to this definition. One, search may be either casual or careful, from “looking” into something to being “thorough”. Second, search may have as its purpose finding or discovery. Finding, again, implies purpose or research. Discovery can range from serendipity to broadening one’s understanding or horizons given a starting topic.
Prior to the relational systems, network databases represented the state-of-the-art. One of E.F. Codd’s stated reasons in developing the relational approach and its accompanying SQL query language was to shift the orientation of databases from links and relationships (the network approach) to query and focused search. By virtue of the technology design put forward, relational databases shifted the premise to structured information and direct search queries. Yet, as noted, this only represents the purposeful end of the search spectrum; navigation and discovery now become secondary.
Text search and (text) search engines then came to the fore, offering a still-different model of indexing and search. Each term became a basis for document retrieval, leading to term-based means of scoring (the famous Salton TF/IDF statistical model), but with actually no understanding of the semantic structure or meaning of the document. Other term-based retrieval bases, such as latent semantic indexing, were put forward, but these were based on the statistical relationships between terms in documents, and not the actual meaning of the text or natural language within the documents.
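For readers who have not seen term-based scoring, here is a minimal sketch of one common TF/IDF variant (term frequency times the natural log of inverse document frequency). The three tiny "documents" are made up, and real engines add normalization and smoothing that are omitted here.

```python
import math
from collections import Counter

# Three made-up documents, already lower-cased and free of punctuation.
DOCS = {
    "d1": "semantic search combines text search and structured search",
    "d2": "relational databases use structured queries",
    "d3": "text search engines index terms",
}


def tf_idf(docs):
    """Score each term per document as (count / doc length) * ln(N / document frequency)."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


SCORES = tf_idf(DOCS)
```

Note that the scores come entirely from counting: a term that is rare across the collection is weighted up, but nothing in the calculation knows what any term means.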
What we see in the early evolution of “search” is kind of a fragmented mess. Structured search swung from navigation to purposeful queries. Text search showed itself to be term-based and reliant on Boolean logic. Each approach and information store thus had its own way to represent or index the data and a different kind of search function to access it. Web search, with its renewal of links and relationships, further shifted the locus back to the network model.
State-of-the-art semantic search, as practiced by Structured Dynamics, has found a way to combine these various underlying retrieval engines with the descriptive power of the graph and semantic technologies to provide a universal search mechanism across all types of information stores. We describe this basis more fully below, but what is important to emphasize at the outset is that this approach fundamentally addresses all aspects of search within the enterprise. As a compelling rationale for trying and then adopting semantic technologies, semantic search is the primary first interest for most enterprises.
The first advantage of semantic search is that all content within the organization can be combined and searched at once. Structured stuff . . . documents . . . image metadata . . . databases . . . can now all be characterized and put on an equivalent search footing. As we just discussed in text as a first class citizen, this power of indexing all content types is the real dynamo underneath semantic search.
Being able to search all available content is awesome. But, being able to add the dimensions of relationships between things means that the semantic graph takes information exploration to a totally new level.
The simplest way to understand semantic search is to de-construct the basic RDF triple down to its fundamentals. This first tells us that the RDF data model is able to represent any thing, that is, an object or idea. And, we can represent that object in virtually any way that any viewer would care to describe it, in any language. Do we want it to be big, small? blue, green? meaningful, silly? smart, stupid? The data model allows this and more. We can capture how diverse users describe the same thing in diverse ways.
But, now that I have my world populated with things and descriptions of them, how do they connect? What are the relationships between these things? It is the linkages — the connections, the relationships — between things that give us context, the basis for classifying, and as a result, the means to ascertain the similarity or adjacency of those things. These sorts of adjacencies then enable us to understand the “pattern” of the thing, which is ultimately the real basis for organizing our world.
The rich brew of things (“nouns”) and the connections between them (“verbs”) starts to give our computers a basis for describing the world more akin to our real language. It is not perfect, and even if it were, it would still suffer from the communication challenges that occur between all of us as humans. Language itself is another codified way of transmitting messages, which will always suffer some degree of loss. But in this comment we can also glean a truth: humans interacting with their computer servants will be more effective the more “human” their interfaces are. And this truth can also give us some insight into what search must do.
First, we are interested in classifying and organizing things. The idea of “facets”, the arrangement of search results into categories based on indexed terms, is not a new one in search. In conventional approaches, “facets” are approached as a kind of dimension, one that is purposefully organized, sometimes hierarchically. In Web interfaces, facets most often appear as a listing in a left-hand column from which one or more of these dimensions might be selected, sometimes with a count number of potential results after the facet or sometimes with a checkbox or such by which multiple of these facets might be combined. In essence, these facets act as structural or classificatory “filters” for the content at hand. This is made potentially more powerful when also combined with basic keyword search.
In semantic search, facets may be derived from not only what types of things exist in the search space, but also what kinds of attributes (or properties) connect them. And, this all comes for free. Unlike conventional faceting, no one needs to decide what are the important “dimensions” or any such. With semantic search, the very basis of describing the domain at hand creates an organization of all things in the space. As a result of semantic search, this combination of entities and properties leads to what could be called “global faceting”. The structure of how the domain is described is the sole basis required to gain — and make universal to the information space — these facets.
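To make “global faceting” concrete, here is a minimal sketch in Python. It assumes a toy list of (subject, predicate, object) triples. The identifiers and property names are invented for illustration and do not come from any real ontology or product. The point is only that every property already present in the description becomes a facet, with no one having to pick the dimensions in advance.

```python
from collections import Counter, defaultdict

# Hypothetical toy triples (subject, predicate, object); all names are illustrative.
TRIPLES = [
    ("doc1", "rdf:type", "Report"),
    ("doc1", "author", "alice"),
    ("doc1", "topic", "energy"),
    ("doc2", "rdf:type", "Report"),
    ("doc2", "author", "bob"),
    ("doc2", "topic", "energy"),
    ("img1", "rdf:type", "Image"),
    ("img1", "topic", "water"),
]

def global_facets(triples):
    """Count distinct subjects per (predicate, value) pair.

    Every predicate found in the data becomes a facet; nothing is chosen up front.
    """
    subjects_by_pair = defaultdict(set)
    for s, p, o in triples:
        subjects_by_pair[(p, o)].add(s)
    facets = defaultdict(Counter)
    for (p, o), subjects in subjects_by_pair.items():
        facets[p][o] = len(subjects)
    return facets

def filter_by(triples, predicate, value):
    """Return the subjects that carry the selected facet value."""
    return sorted({s for s, p, o in triples if p == predicate and o == value})

facets = global_facets(TRIPLES)
energy_items = filter_by(TRIPLES, "topic", "energy")
```

In this sketch, adding a new property to the descriptions adds a new facet automatically, which is the behavior the paragraph above attributes to semantic search.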
Whoa! How did that happen? All we did is describe our information space, but now we have all of this rich structure. This is when the first important enterprise realization sets in: how we describe the information in our domain is the driving, critical factor. Semantic search is but easy pickings from this baseline. What is totally cool about the nature of semantic search is that slicing-and-dicing would put a sushi restaurant to shame. Every property represents a different pathway, and every entity (node) is an entry point.
Second, because we have based all of this on an underlying logic model in description logics, we gain a huge Archimedes’ lever over our information space. We do not need to state all of the relationships and organizations in our information space. We can infer them from the assertions already made. Two parents have a child? That child has a sibling? Then, we can infer the second child also has the same parents. The “facts” that one might assume about a given domain can grow by 10x or more when inference is included.
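The parent-and-sibling example can be sketched as a single forward-chaining rule. This is a toy illustration in Python, not a description logic reasoner. The relation names `hasParent` and `hasSibling` and the people are assumptions made up for the example.

```python
def infer_parents(facts):
    """Forward-chain one rule until no new facts appear:

    (x hasSibling y) and (x hasParent p)  =>  (y hasParent p)
    """
    facts = set(facts)
    while True:
        new = {
            (y, "hasParent", p)
            for (x, r1, y) in facts if r1 == "hasSibling"
            for (x2, r2, p) in facts if r2 == "hasParent" and x2 == x
        }
        if new <= facts:  # fixpoint reached: nothing left to add
            return facts
        facts |= new

# Hypothetical asserted facts: only one child has parents stated.
asserted = {
    ("ann", "hasParent", "mary"),
    ("ann", "hasParent", "tom"),
    ("ann", "hasSibling", "ben"),
}
inferred = infer_parents(asserted) - asserted
```

Three asserted facts yield two more that nobody had to state. A real reasoner applies many such rules at once, which is where the multiplication of “facts” comes from.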
Now we can begin to see where the benefits and return from semantic search become evident. Semantic search also enables a qualitatively different content enrichment: we can use these broad understandings of our content to do better targeting, tagging, highlighting or relating concepts to one another. The fact that semantic search is simply a foundation to semantic publishing is noteworthy. We will discuss this topic in a later part of this series.
In recognition of the primacy of search, we at Structured Dynamics were one of the first in the semantic Web community to add Solr (based on Lucene) full-text indexing to the structured search of an RDF triple store. We later added the OWL API to gain even more power in our structured queries. These three components give us the best of unstructured and structured search, and enable us to handle all kinds of search with additional flexibility at scale. Since we historically combined RDF and Solr first, let’s discuss it first.
We first adopted Solr because traditional text search of RDF triple stores is not sufficiently performant and makes it difficult to retrieve logical (user) labels in place of the URIs used in semantic technologies. While RDF and its graph model provide manifest benefits (see below), text search is a relatively mature technology and Solr provided commercial-grade features and performance in an open source option.
In our design, the triple store is the data orchestrator. The RDF data model and its triple store are used to populate the Solr schema index. The structural specifications (schema) in the triple store guide the development of facets and dynamic fields within Solr. These fields and facets in Solr give us the ability to gain Solr advantages such as aggregates, autocompletion, spell checkers and the like. We also are able to capture the full text if the item is a document, enabling standard text search to be combined with the structural aspects orchestrated from the RDF. On the RDF side, we can leverage the schema of the underlying ontologies to also do inferencing (via forward chaining). This combination gives us an optimal search platform to do full-text search, aggregates and filtering.
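As a rough picture of how structured descriptions might be flattened into an index document, consider the sketch below. It is not Structured Dynamics’ actual code and makes no Solr calls. The `_ss` and `_t` suffixes assume dynamic-field patterns of the kind commonly configured in a Solr schema, and the identifiers, property names and labels are invented for the example.

```python
from collections import defaultdict

def to_index_doc(subject, triples, labels, full_text=None):
    """Flatten the triples about one subject into a flat, Solr-style document.

    Each predicate becomes a multi-valued field (suffix `_ss` is an assumed
    dynamic-field convention), and URIs are swapped for human-readable labels.
    """
    fields = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            fields[f"{p}_ss"].append(labels.get(o, o))
    doc = {"id": subject}
    doc.update({name: sorted(values) for name, values in fields.items()})
    if full_text is not None:
        doc["content_t"] = full_text  # assumed full-text field name
    return doc

# Hypothetical data: URIs in the store, labels for display and search.
triples = [
    ("ex:doc1", "type", "ex:Report"),
    ("ex:doc1", "topic", "ex:energy"),
    ("ex:doc1", "topic", "ex:water"),
    ("ex:doc2", "topic", "ex:energy"),
]
labels = {"ex:Report": "Report", "ex:energy": "Energy", "ex:water": "Water"}
doc = to_index_doc("ex:doc1", triples, labels, full_text="Annual energy report")
```

The structure of the document comes entirely from the descriptions in the store, so a new property in the ontology shows up as a new field without any change to the indexing code.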
Since our initial adoption of Solr, and Solr’s own continued growth, we have been able to (more-or-less) seamlessly embrace geo-locational based search, time-based search, the use of multiple search profiles and ranking and scoring approaches (using Solr’s powerful Extended DisMax, or edismax, parser) and other advantages. We now have nearly five years of experience of the RDF + Solr combination. We continue to discover new functionality and power in this combination. We are extremely pleased with this choice.
On the structured data side, RDF and its graph model have many inherent advantages, as earlier described. One of those advantages is the graph structure itself:
A distinguishing characteristic of ontologies compared to conventional hierarchical structures is their degree of connectedness, their ability to model coherent, linked relationships.
Another advantage over conventional structured search (SQL) with relational databases is performance. For example, as Rik Van Bruggen recently explained, RDBMS searches that need to obtain information from more than one table require a “join.” The indexes in all applicable tables need to be scanned recursively to find all the data elements fitting the query criteria. Conversely, in a graph database, the index needs only be accessed once to find the starting point in the graph, after which the relationships in the graph are “walked” to traverse the graph to find the next applicable data elements. The need for complete scans is what makes “joins” expensive computationally. Graph queries are incredibly fast because index lookups are hugely reduced.
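The “look up once, then walk” idea can be sketched with a small in-memory adjacency map. This is only a toy in Python and says nothing about how any particular graph database or triple store is built internally. The node names are invented.

```python
from collections import deque

# Hypothetical graph: each node maps directly to the nodes it links to.
GRAPH = {
    "alice": ["acme"],
    "acme": ["berlin", "widgets"],
    "berlin": ["germany"],
    "widgets": [],
    "germany": [],
}

def walk(graph, start, max_hops):
    """Breadth-first walk from `start`, following stored edges up to `max_hops`.

    Returns a dict mapping each reached node to its hop distance.
    """
    distance = {start: 0}  # the single lookup that finds the starting point
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

reached = walk(GRAPH, "alice", max_hops=2)
```

Each step touches only the neighbors of the current node, so the cost follows the size of the neighborhood being explored rather than the size of every table involved.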
Various graph databases provide canned means for traversing or doing graph-based operations. And that brings us to the second addition we made to the RDF triple store: inclusion of the OWL API. While it is true that our standard triple store, Virtuoso, has support for simple inferencing and forward chaining, the fact that our semantic technologies are based on OWL 2 means that we can bring more power to bear with an ontology-specific API, including reasoners. The OWL API allows all or portions of the ontology specification to be manipulated separately, with a variety of serializations. Changes made to the ontology can also be tested for validity. Most leading reasoners can interact directly with the API. Protégé 4 also interacts directly with the API, as can various rules engines. Additionally, other existing APIs, notably the Alignment API with its own mapping tools and links to other tools such as S-Match, can interact with the OWL API.
Thus, besides the advantages of RDF and graph-based search, we can now reason over and manipulate the ontologies themselves to bring even more search power to the system. Because of the existing integrations between the triple store and Solr, these same retrieval options can also be used to inform Solr query retrievals.
On the face of it, a search infrastructure based on three components (triple store + Solr + OWL API) appears more complex than a single solution. But, enterprises already have search provided in many different guises involving text or SQL-based queries. Structured Dynamics now has nearly five years of experience with this combined search configuration. Each deployment results in better installation and deployment procedures, including scripting and testing automation. The fact there are three components to the search stack is not really the challenge for enterprise adoption.
This combined approach to search really poses two classes of challenges to the enterprise. The first, and more fundamental one, is the new mindset that semantic search requires. Facets need to be understood and widely embraced; graphs and graph traversals are quite new concepts; full incorporation of tagging to make text a first-class citizen with structured search needs to be embraced; and, the pivotal role of ontologies in driving the whole structural understanding of the domain and all the various ways to describe it means a shift in thinking from dedicated applications for specific purposes to generic ontology-driven applications. These new mindsets require concerted knowledge transfer and training. Many of the new implementers are now the subject matter experts and content editors within the enterprise, rather than developers. Dedicated effort is also necessary — and needs to be continually applied — to enable ontologies to properly and adaptively capture the enterprise’s understanding of its applicable domain.
These are people-oriented aspects that require documentation, training materials, tools and work processes. These topics, actually some of the most critical to our own services, are discussed in later parts to this ESSS series.
The second challenge is in the greater variability and diversity of the “dials and knobs” now available to the enterprise to govern how these search capabilities actually work. The ranking of search results can now embrace many fields and attributes; many different types of content; and potentially different contexts. Weights (or “boosts” in Solr terms) can be applied to every single field involved in a search. Fields may be included or excluded in searches, thereby acting as filters. Different processors or parsers may be applied to handle such things as text case (upper or lower), stemming for dealing with plurals and variants, spelling variants such as between British and American English, invoking or not synonyms, handling multiple languages, and the like.
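A few of these “dials and knobs” can be shown as request parameters. The sketch below only builds a parameter list in Python and does not contact a server. The parameter names (`q`, `defType`, `qf`, `fq`, `facet`, `facet.field`) are standard Solr request parameters, while the field names, weights and filter values are hypothetical.

```python
from urllib.parse import urlencode

def build_search_params(text, boosts, filters, facet_fields):
    """Assemble Solr-style query parameters as a list of (name, value) pairs.

    boosts:       {field: weight}  -> per-field weights in `qf`
    filters:      {field: value}   -> one `fq` filter per entry
    facet_fields: [field, ...]     -> fields to facet on
    """
    params = [
        ("q", text),
        ("defType", "edismax"),
        ("qf", " ".join(f"{field}^{weight}" for field, weight in boosts.items())),
    ]
    params += [("fq", f'{field}:"{value}"') for field, value in filters.items()]
    if facet_fields:
        params.append(("facet", "true"))
        params += [("facet.field", name) for name in facet_fields]
    return params

# Hypothetical settings: titles count three times as much as body text.
params = build_search_params(
    "wind turbines",
    boosts={"title_t": 3, "content_t": 1},
    filters={"type_ss": "Report"},
    facet_fields=["topic_ss"],
)
query_string = urlencode(params)
```

Keeping these settings in plain data, rather than buried in code, is one way to let a responsible manager change a weight and compare results without a developer in the loop.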
This level of control means that purposeful means and frameworks must be put in place that enable responsible managers in the enterprise to decide such settings. Understanding of these “dials and knobs” must therefore also be transferred to the enterprise. Then, easily used interfaces for changing and selecting options and then comparing the results of those changes must be embedded in tools and transferred. (This latter area is quite exciting and one area of innovation SD will be reporting on in the near future.)
There are actually many public Web sites that are doing fantastic and admirable jobs of bringing broad, complicated, structured search to users, all without much if any semantic technologies in the back end. Some quick examples that come to mind are Trulia in real estate; Fidelity in financial products; Amazon in general retail, etc. One difficulty that semantic search has in comparison to the alternatives is that first-blush inspection of Web sites may not show many large differences.
The real advantages from semantic search come in its productivity and flexibility. Semantic search frameworks are easier to construct, easier to extend, easier to modify and cheaper to build. Semantic search frameworks are inherently robust. Adding entirely new domains of scope (say, moving from a department level to the entire enterprise, or accommodating a new acquisition) can be implemented in a fraction of the time without the need for rework.
It will be necessary to document the use case experience of early adopting enterprises to quantify these productivity and flexibility benefits. From Structured Dynamics’ experience, however, these advantages are in the range of one to two orders of magnitude in reduced deployment and maintenance costs compared to RDBMS-based approaches.
Another hot topic of late has been “semantic publishing” that is of keen interest to media and content-intensive sites on the Web. What is interesting about semantic publishing, however, is that it is completely founded on semantic search. All of the presentation or publishing of content in the interface (or in an exported form) is the result of search. Remember, due to Structured Dynamics’ semantic technology design with its structWSF interfaces, all interaction with the underlying engines and system occurs via queries.
We will be talking much about semantic publishing toward the conclusion of this series. We will cover content enrichment, new kinds of products such as topic pages and semantic forms and widgets, and the fact that semantic publishing is available almost for “free” when your stack is based on semantic technologies with semantic search, SD-style.
Text, text everywhere, but no information to link!
For at least a quarter of a century the amount of information within an enterprise embedded in text documents has been understood to be on the order of 80%; more recent estimates put that contribution at 90%. But, whatever the number, or no matter how you slice it, the percentage of information in documents has been overwhelming for enterprises.
The first documentation systems, Documentum being a notable pioneer, helped keep track of versions and characterized their document stores with some rather crude metadata. As document management systems evolved, and enterprise search became a go-to application in its own right, full-text indexing and search were added to characterize the document store. Search allowed better access and retrieval of those documents, but still kept documents as a separate information store from the true first-class citizens of information in enterprises: structured databases.
That is now changing — and fast. Particularly with semantic technologies, it is now possible to “tag” or characterize documents not only in terms of administrative and manually assigned tags, but with concepts and terminology appropriate to the enterprise domain.
Early systems tagged with taxonomies or thesauri of controlled vocabulary specific to the domain. Larger enterprises also often employ MDM (master data management) to help ensure that these vocabularies are germane across the enterprise. Yet, even still, such systems rarely interoperate with the enterprises’ structured data assets.
Semantic technologies offer a huge leverage point to bridge these gaps. Being able to incorporate text as a first-class citizen into the enterprise’s knowledge base is a major rationale for semantic technologies.
Let’s start with a couple of semantic givens. First, as I have explained many times on this blog, ontologies (that is, knowledge graphs) can capture the rich relationships between things for any given domain. Second, this structure can be more fully expressed via expanded synonyms, acronyms, alternative terms, alternative spellings and misspellings, all in multiple languages, to describe the concepts and things represented in this graph (a construct we have called “semsets”). That means that different people talking about the same thing with different terminology can communicate. This capability is an outcome of following SKOS-based best practices in ontology construction.
Then, we take these two semantic givens and stir in two further ingredients from NLP. We first prepare the unstructured document text with parsing and other standard text processing. These steps are also a precursor to search; they provide the means for natural language processing to obtain the “chunks” of information in documents as structured data. Then, using the ontologies with their expanded SKOS labels, we add the next ingredient of OBIE (ontology-based information extraction) to automatically “tag” candidate items in the source text.
Editors are presented these candidates to accept or not, plus to add others, in review interfaces as part of the workflow. The result is the final subject “tags” assignment. Because it is important to tag both subject concepts and named entities in the candidate text, Structured Dynamics calls this approach “scones”. We have reusable structures and common terminology and syntax (irON) as canonical representations of these objects.
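The candidate-tagging step can be pictured with a deliberately simple sketch in Python. It is not the scones implementation and skips the parsing and other NLP steps described above: it only matches a concept’s alternative labels against the text. The concept identifiers and labels are invented for illustration.

```python
import re

# Hypothetical semsets: each concept carries its alternative labels.
SEMSETS = {
    "concept:WindPower": ["wind power", "wind energy", "windpower"],
    "concept:SolarPower": ["solar power", "solar energy", "photovoltaics", "PV"],
}

def tag_candidates(text, semsets):
    """Return {concept: sorted matched labels} for whole-word,
    case-insensitive matches of any label in the concept's semset."""
    found = {}
    for concept, labels in semsets.items():
        hits = [
            label
            for label in labels
            if re.search(rf"\b{re.escape(label)}\b", text, flags=re.IGNORECASE)
        ]
        if hits:
            found[concept] = sorted(hits)
    return found

text = "The utility compared wind energy with photovoltaics in its annual review."
candidates = tag_candidates(text, SEMSETS)
```

The output is a list of candidates only. In the workflow described here, an editor still accepts, rejects or adds to them before the tags become final.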
Of course, not all descriptive information you would want to assign to a document is only what it is about. Much other structural information describing the document goes beyond what it is about.
Some of this information relates to what the document is: its size, its format, its encoding. Some of this information relates to provenance: who wrote it? who published it? when? when was it revised? And, some of this information relates to other descriptive relationships: where to download it? a picture of it; other formats of it. Of course, any additional information useful to describe the document can be also tagged on at this point.
This latter category is quite familiar to enterprise information architects. These metadata characterizations have been what is common for standard document management systems reaching back for three decades or more now.
So, naturally, this information has stood the test of time and also must have a pathway for getting assigned to documents. What is different is that all of this information can now be linked into a coherent knowledge graph of the domain.
What we are seeking is a framework and workflow that naturally allows all existing and new documents to be presented through a pipeline that extends from authoring and review to metadata assignments. This workflow and the user interface screens associated with it are the more difficult aspects of the challenge. It is relatively straightforward to configure and set up a tagger (though, of course, better accuracy and suitability of the candidate tags can speed overall processing time). Making final assignments for subject tags from the candidates and then ensuring all other metadata are properly assigned can be either eased or impeded by the actual workflows and interfaces.
The trick to such semi-automatic processes is to get these steps right. There are the needs for manual overrides when the suggested, candidate tags are not right. Sometimes new terms and semset entries are found when reviewing the processed documents; these need to be entered and then placed into the overall domain graph structure as discovered. The process of working through steps on the tag processing screens should be natural and logical. Some activities benefit from very focused, bespoke functionality, rather than calling up a complicated or comprehensive app.
In enterprise settings these steps need to be recorded, subject to reviews and approvals, and with auditing capabilities should anything go awry. This means there needs to be a workflow engine underneath the entire system, recording steps and approvals and enabling things to be picked up at any intermediate, suspended point. These support requirements tend to be unique to each enterprise; thus, an underlying workflow system that can be readily modified and tailored — perhaps through scripting or configuration interfaces — is favored. Since Drupal is our standard content and user interface framework, we tend to favor workflow engines like State Machine over more narrow, out-of-the-box setups such as the Workflow module.
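To show what “recording steps and approvals” might look like underneath, here is a minimal state-machine sketch in Python. It is not the Drupal State Machine module or any other product. The states, actions and user names are assumptions chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical review workflow: (current state, action) -> next state.
TRANSITIONS = {
    ("draft", "submit"): "in_review",
    ("in_review", "approve"): "published",
    ("in_review", "reject"): "draft",
}

@dataclass
class TaggingWorkflow:
    """Tracks one document through review and keeps an audit trail."""
    doc_id: str
    state: str = "draft"
    audit_log: list = field(default_factory=list)

    def apply(self, action, user):
        """Move to the next state if the action is allowed, and record who did it."""
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"{action!r} is not allowed from state {self.state!r}")
        previous, self.state = self.state, TRANSITIONS[key]
        self.audit_log.append((user, action, previous, self.state))
        return self.state

workflow = TaggingWorkflow("doc1")
workflow.apply("submit", "editor")
workflow.apply("reject", "reviewer")
```

Because the transitions live in a plain table, tailoring the workflow to a given enterprise is a matter of editing data rather than rewriting the engine, which is the kind of flexibility argued for above.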
These screens and workflows are not integral to the actual semantic framework that governs tagging, but are essential complements to it. It is but another example of how the semantic technologies in an enterprise need to be embedded and integrated into a non-semantic environment (see the prior architecture piece in this series).
Yet, what we have described above is the technology and process of assigning structured information to documents so that they can interoperate with other data in the enterprise. Once linked into the domain’s knowledge graph and once characterized by the standard descriptive metadata, there is now the ability to search, slice, filter, navigate or discover text content just as if it were structured data. The semantic graph is the enabler of this integration.
Thus, the entire ability of this system to work derives from the graph structure itself. Creating, populating and maintaining these graph structures can be accomplished by users and subject matter experts from within the enterprise, but that requires new training and new skills. It is impossible to realize the benefits of semantic technologies without knowledgeable editors to maintain these structures. Because of its importance, a later part in this series deals directly with ontology management.
While ontology development and management are activities that do not require programming skills or any particular degrees, they do not happen by magic. Concepts need to be taught; tools need to be mastered; and responsibilities need to be assigned and overseen to ensure the enterprise’s needs are being met. It is exciting to see text become a first-class information citizen in the enterprise, but like any purposeful human activity, success ultimately depends on the people involved.