Posted:September 26, 2011

OWL - Web Ontology LanguageDocumenting the Emerging Ecosystem Around OWL 2

We have been touting the importance of OWL 2 as the language of choice for federating and reasoning over RDF and ontologies. An absolutely essential enabler of the OWL 2 language is version 3 of the OWL API (actually, version 3.2.4 at the time of this writing), a Java-based framework for accessing and managing the language. Protégé 4, the most popular open source ontology editor and integrated development environment (IDE), for example, is built around the OWL API.

As we laid out a bit more than a year ago, now codified on our TechWiki as the Normative Landscape of Ontology Tools (especially the second figure), we see the OWL API as the essential pivot point for all forms of ontology tools moving forward.

We have attempted to assemble a definitive and comprehensive list of all known tools presently based around version 3 of the OWL API. (We have surely missed some and welcome comments to this post that identify missing ones; we promise to add them and keep tracking them.) Herein is a listing of the 30 or so known OWL API-based tools:

  • Protégé 4 is a free, open source ontology editor and knowledge-base framework based on OWL 2 and centered on the OWL API
  • CEL, FaCT++, HermiT, Pellet, and Racer Pro reasoners provide OWL API wrappers and are also available as reasoner plugins to Protégé 4
  • There is also a FaCT++ port to Java that is also implementing the OWLReasoner and is available as a plugin for Protégé 4.1; it is at version 0.9 with user feedback welcomed
  • structOntology is an open source ontology editor and manager supporting Structured DynamicsconStruct implementation of the Open Semantic Framework (OSF) in Drupal; more information is provided here
  • TrOWL is a Tractable reasoning infrastructure for OWL 2. TrOWL supports both standard TBox and ABox reasoning, as well as conjunctive query answering
  • SKOSEd is a SKOS editor for Protege; just recently made compatible with Protégé 4.1
Please let us know of any missing OWL API tools that should be added to this list by submitting a comment to this post. We will keep this listing current.
  • Populus is a semantic spreadsheet framework using RightField and OPPL for creating OWL ontologies
  • Bubastis is a tool for detecting asserted logical differences between two ontologies, such as between versions. A stand alone version of the tool is also available for download from the EFO tools page. Bubastis is powered by the OWL API
  • Tab2OWL and its download is a Java tool for importing classes into an already existing OWL file. The script uses the OWL API to read in a tab delimited file of class details and create OWL classes from these rows, adding them to an existing ontology
  • S-Match is a semantic matching framework, which provides several semantic matching algorithms and facilities for developing new ones. Currently S-Match contains implementations of the original S-Match semantic matching algorithm, as well as minimal semantic matching algorithm and structure preserving semantic matching algorithm
  • The Alignment API is an API and implementation for expressing and sharing ontology alignments. It uses an RDF format for expressing alignments in a uniform way. Its four main interfaces (Alignment, Cell, Relation and Evaluator) provides these services: storing, finding, and sharing alignments; piping alignment algorithms (improving an existing alignment); manipulating (thresholding and hardening); generating processing output; and comparing alignments
  • The OWLlink API is a Java interface and implementation of the OWLlink protocol on top of the Java-based OWL API. The OWLlink API enables OWL API-based applications to access remote reasoners (so-called OWLlink servers), and it turns any OWL API aware reasoner into an OWLlink server
  • OPPL2 (ontology pre-processing language) is an abstract formalism that allows for manipulating ontologies written in OWL. It is 100% based on the Manchester OWL Syntax; a query language based on OWL (logical) axioms and variables; a scripting language that allows the addition/removal of OWL (logical) axioms. It is available as an Protégé 4.1 plug-in
  • OPPL Patterns It is available as an Protégé 4.1 plug-in
  • Posh (Prolog OWL Shell) is a command line utility that wraps the Thea OWL library to allow for advanced querying and processing of ontologies, combining the power of Prolog and OWL reasoning
  • POPL (Prolog Ontology Processing Language) allows you to write expressive ontology rewrite rules in a high-level declarative fashion using a syntax similar to Manchester syntax
  • OWLTools (aka OWL2LS – OWL2 Life Sciences) is a convenience Java API on top of the OWL API. Code is available here
  • LexOWL is a plug-in for Protégé 4. In order to add more powerful functionality (e.g., inferencing, editing) to the existing infrastructure and align LexGrid more closely with various Semantic Web technologies, the LexOWL plugin for Protégé 4 provides a way for representing the ontologies modeled within the LexGrid environment in OWL. A source for downloading this tool has not been found
  • Apero, a Protégé plug-in that is an ontology debugging tool based on the use of anti-patterns; see http://www.emcl-study.eu/fileadmin/master_theses/thesis_tahwil.pdf
  • DReW is a prototype DL reasoner over LDL+ ontologies and a prototype reasoner for dl-programs over LDL+ ontologies under well-founded semantics. It is not well developed or documented; it can be downloaded here
  • The LingInfo, LexOnto, LexInfo and LMF ontologies are available from the project website, as well as a corresponding Java API with an implementation for the commonly used OWL API
  • Thea2 is a Prolog library that provides complete support for querying and processing OWL 2 ontologies directly from within Prolog programs. Thea2 also offers additional capabilities including a bridge to the Java OWL API and translation of ontologies to Description Logic programs
  • GLOW is a visualization for OWL ontologies, based on Hierarchical Edge Bundles. Hierarchical Edge Bundles is a new visually attractive technique for displaying adjacency relations in hierarchical data, such as concept structures formed by `subclass-of’ and `type-of’ relations. The displayed adjacency relations can be selected from an ontology using a set of common configurations, allowing for intuitive discovery of information. It is a visualization library based on OWL API, as well as a plug-in for Protégé
  • ROWLKit is a simple GUI to reason and query over ontologies written in the OWL 2 QL profile of OWL
  • OBDA Plugin (Ontology-based data access) is an add-on for the Protégé ontology editor aimed at transforming Protégé into a fully fledged OBDA model editor. It provides data source and mapping editors, as well as querying facilities that, in conjunction with an OBDA-enabled reasoner, allows you to design and test every aspect of an OBDA system
  • OntoCAT provides high level abstraction for interacting with ontology resources including local ontology files in standard OWL and OBO formats (via OWL API)
  • SemaRule Navigator is an Eclipse-based toolkit of multiple semWeb tools, built around the OWL API, organized into a pipeline-like system (appears quite complicated)
  • OWLDB (alias Mnemosyne) is a storage system based on object-relational mappings utilising the OWL-API for the W3C Web Ontology Language OWL
  • Finally, for a periodically updated list of “official” extensions, see https://owlapi.svn.sourceforge.net/svnroot/owlapi/v3/branches/owlextensions/.

Addendum

Ignazio Palmisano also graciously suggested these additional sources:

which also further leads to this additional listing:

It is not clear if all of these offer OWL 2 support, let along work with the current OWL API.

Posted by AI3's author, Mike Bergman Posted on September 26, 2011 at 2:52 am in Ontologies, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/977/thirty-owl-api-tools/
The URI to trackback this post is: https://www.mkbergman.com/977/thirty-owl-api-tools/trackback/
Posted:September 12, 2011

Judgment for Semantic TechnologiesFive Unique Advantages for the Enterprise

There have been some notable attempts of late to make elevator pitches [1] for semantic technologies, as well as Lee Feigenbaum’s recent series on Are We Asking the Wrong Question? about semantic technologies [2]. Some have attempted to downplay semantic Web connotations entirely and to replace the pitch with Linked Data (capitalized). These are part of a history of various ways to try to make a business case around semantic approaches [3].

What all of these attempts have in common is a view — an angst, if you will — that somehow semantic approaches have not fulfilled their promise. Marketing has failed semantic approaches. Killer apps have not appeared. The public has not embraced the semantic Web consonant with its destiny. Academics and researchers can not make the semantic argument like entrepreneurs can.

Such hand wringing, I believe, is misplaced on two grounds. First, if one looks to end user apps that solely distinguish themselves by the sizzle they offer, semantic technologies are clearly not essential. There are very effective mash-up and data-intensive sites such as many of the investment sites (Fidelity, TDAmeritrade, Morningstar, among many), real estate sites (Trulia, Zillow, among many), community data sites (American FactFinder, CensusScope, City-Data.com, among many), shopping sites (Amazon, Kayak, among many), data visualization sites (Tableau, Factual, among many), etc. , etc., that work well, are intuitive and integrate much disparate information. For the most part, these sites rely on conventional relational database backends and have little semantic grounding. Effective data-intensive sites do not require semantics per se [4].

Second, despite common perceptions, semantics are in fact becoming pervasive components of many common and conventional Web sites. We see natural language processing (NLP) and extraction technologies becoming common for most search services. Google and Bing sprinkle semantic results and characterizations across their standard search results. Recommendation engines and targeted ad technologies now routinely use semantic approaches. Ontologies are creeping into the commercial spaces once occupied by taxonomies and controlled vocabularies. Semantics-based suggestion systems are now the common technology used. A surprising number of smartphone apps have semantics at their core.

So, I agree with Lee Feigenbaum that we are asking the wrong question. But I would also add that we are not even looking in the right places when we try to understand the role and place of semantic technologies.

The unwise attempt to supplant the idea of semantic technologies with linked data is only furthering this confusion. Linked data is merely a means for publishing and exposing structured data. While linked data can lead to easier automatic consumption of data, it is not necessary to effective semantic approaches and is actually a burden on data publishers [5]. While that burden may be willingly taken by publishers because of its consumption advantages, linked data is by no means an essential precursor to semantic approaches. None of the unique advantages for semantic technologies noted below rely on or need to be preceded by linked data. In semantic speak, linked data is not the same as semantic technologies.

The essential thing to know about semantic technologies is that they are a conceptual and logical foundation to how information is modeled and interrelated. In these senses, semantic technologies are infrastructural and groundings, not applications per se. There is a mindset and worldview associated with the use of semantic technologies that is far more essential to understand than linked data techniques and is certainly more fundamental than elevator pitches or “killer apps.”

Five Unique Advantages

Thus, the argument for semantic technologies needs to be grounded in their foundations. It is within the five unique advantages of semantic technologies described below that the benefits to enterprises ultimately reside.

#1: Modern, Back-end Data Federation

The RDF data model — and its ability to represent the simplest of data up through complicated domain schema and vocabularies via the OWL ontology language — means that any existing schema or structure can be represented. Because of this expressiveness and flexibility, any extant data source or schema can be represented via RDF and its extensions. This breadth means that a common representation for any existing schema may be expressed. That expressiveness, in turn, means that any and all data representations can be described in a canonical way.

A shared, canonical representation of all existing schema and data types means that all of that information can now be federated and interrelated. The canonical means of federating information via the RDF data model is the foundational benefit of semantic technologies. Further, the practice of giving URIs as unique identifiers to all of the constituent items in this approach makes it perfectly suitable to today’s reality of distributed data accessible via the Web [6].

#2: Universal Solvent for Structure

I have stated many times that I have not met a form of structured data I did not like [7]. Any extant data structure or format can be represented as RDF. RDF can readily express information contained within structured (conventional databases), semi-structured (Web page or XML data streams), or unstructured (documents and images) information sources. Indeed, the use of ontologies and entity instance records in RDF is a powerful basis for driving the extraction systems now common for tagging unstructured sources.

(One of the disservices perpetuated by an insistence on linked data is to undercut this representational flexibility of RDF. Since most linked data is merely communicating value-attribute pairs for instance data, virtually any common data format can be used as the transmittal form.)

The ease of representing any existing data format or structure and the ability to extract meaningful structure from unstructured sources makes RDF a “universal solvent” for any and all information. Thus, with only minor conversion or extraction penalties, all information in its extant form can be staged and related together via RDF.

#3: Adaptive, Resilient Schema

A singular difference between semantic technologies (as we practice them) and conventional relational data systems is the use of an open world approach [8]. The relational model is a paradigm where the information must be complete and it must be described by a schema defined in advance. The relational model assumes that the only objects and relationships that exist in the domain are those that are explicitly represented in the database. This makes the closed world of relational systems a very poor choice when attempting to combine information from multiple sources, to deal with uncertainty or incompleteness in the world, or to try to integrate internal, proprietary information with external data.

Semantic technologies, on the other hand, allow domains to be captured and modeled in an incremental manner. As new knowledge is gained or new integrations occur, the underlying schema can be added to and modified without affecting the information that already exists in the system. This adaptability is generally the biggest source of economic benefits to the enterprise from semantic technologies. It is also a benefit that enables experimentation and lowers risk.

#4: Unmatched Productivity

Having all information in a canonical form means that generic tools and applications can be designed to work against that form. That, in turn, leads to user productivity and developer productivity. New datasets, structure and relationships can be added at any time to the system, but how the tools that manipulate that information behave remains unchanged.

User productivity arises from only needing to learn and master a limited number of toolsets. The relationships in the constituent datasets are modeled at the schema (that is, ontology) level. Since manipulation of the information at the user interface level consists of generic paradigms regarding the selection, view or modification of the simple constructs of datasets, types and instances, adding or changing out new data does not change the interface behavior whatsoever. The same bases for manipulating information can be applied no matter the datasets, the types of things within them, or the relationships between things. The behavior of semantic technology applications is very much akin to having generic mashups.

Developer productivity results from leveraging generic interfaces and APIs and not bespoke ones that change every time a new dataset is added to the system. In this regard, ontology-driven applications [9] arising from a properly designed semantic technology framework also work on the simple constructs of datasets, types and instances. The resulting generalization enables the developer to focus on creating logical “packages” of functionality (mapping, viewing, editing, filtering, etc.) designed to operate at the construct level, and not the level of the atomic data.

#5: Natural, Connected Knowledge Systems

All of these factors combine to enable more and disparate information to be assembled and related to one another. That, in turn, supports the idea of capturing entire knowledge domains, which can then be expanded and shifted in direction and emphasis at will. These combinations begin to finally achieve knowledge capture and representation in its desired form.

Any kind of information, any relationship between information, and any perspective on that information can be captured and modeled. When done, the information remains amenable to inspection and manipulation through a set of generic tools. Rather simple and direct converters can move that canonical information to other external forms for use by existing external tools. Similarly, external information in its various forms can be readily converted to the internal canonical form.

These capabilities are the direct opposite to today’s information silos. From its very foundations, semantic technologies are perfectly suited to capture the natural connections and nature of relevant knowledge systems.

A Summary of Advantages Greater than the Parts

There are no other IT approaches available to the enterprise that can come close to matching these unique advantages. The ideal of total information integration, both public and private, with the potential for incremental changes to how that information is captured, manipulated and combined, is exciting. And, it is achievable today.

With semantic technologies, more can be done with less and done faster. It can be done with less risk. And, it can be implemented on a pay-as-you-benefit basis [10] responsive to the current economic climate.

But awareness of this reality is not yet widespread. This lack of awareness is the result of a couple of factors. One factor is that semantic technologies are relatively new and embody a different mindset. Enterprises are only beginning to get acquainted with these potentials. Semantic technologies require both new concepts to be learned, and old prejudices and practices to be questioned.

A second factor is the semantic community itself. The early idea of autonomic agents and the heavy AI emphasis of the initial semantic Web advocacy now feels dated and premature at best. Then, the community hardly improved matters with its shift in emphasis to linked data, which is merely a technique and which completely overlooks the advantages noted above.

However, none of this likely matters. The five unique advantages for enterprises from semantic technologies are real and demonstrable today. While my crystal ball is cloudy as to how fast these realities will become understood and widely embraced, I have no question they will be. The foundational benefits of semantic technologies are compelling.

I think I’ll take this to the bank while others ride the elevator.


[1] This series was called for by Eric Franzon of SemanticWeb.com. Contributions to date have been provided by Sandro Hawke, David Wood, and Mark Montgomery.
[2] See Lee Feigenbaum, 2011. “Why Semantic Web Technologies: Are We Asking the Wrong Question?,” TechnicaLee Speaking blog, August 22, 2011; see http://www.thefigtrees.net/lee/blog/2011/08/why_semantic_web_technologies.html, and its follow up on “The Magic Crank,” August 29, 2011; see http://www.thefigtrees.net/lee/blog/2011/08/the_magic_crank.html. For a further perspective on this issue from Lee’s firm, Cambridge Semantics, see Sean Martin, 2010. “Taking the Tech Out of SemTech,” presentation at the 2010 Semantic Technology Conference, June 23, 2010. See http://www.slideshare.net/LeeFeigenbaum/taking-the-tech-out-of-semtech.
[3] See, for example, Jeff Pollock, 2008. “A Semantic Web Business Case,” Oracle Corporation; see http://www.w3.org/2001/sw/sweo/public/BusinessCase/BusinessCase.pdf.
[4] Indeed, many semantics-based sites are disappointingly ugly with data and triples and URIs shoved in the user’s face rather than sizzle.
[5] Linked data and its linking predicates are also all too often misused or misapplied, leading to poor quality of integrations. See, for example, M.K. Bergman and F. Giasson, 2009. “When Linked Data Rules Fail,” AI3:::Adaptive Innovation blog, November 16, 2009. See https://www.mkbergman.com/846/when-linked-data-rules-fail/.
[6] Greater elaboration on all of these advantages is provided in M. K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Innovation blog, April 8, 2009. See https://www.mkbergman.com/483/advantages-and-myths-of-rdf/.
[7] See M.K. Bergman, 2009. “‘Structs’: Naïve Data Formats and the ABox,” AI3:::Adaptive Innovation blog, January 22, 2009. See https://www.mkbergman.com/471/structs-naive-data-formats-and-the-abox/.
[8] A considerable expansion on this theme is provided in M.K. Bergman, 2009. “‘The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Innovation blog, December 21, 2009. See https://www.mkbergman.com/852/the-open-world-assumption-elephant-in-the-room/.
[9] For a full expansion on this topic, see M.K. Bergman, 2011. “Ontology-driven Apps Using Generic Applications,” AI3:::Adaptive Innovation blog, March 7, 2011. See https://www.mkbergman.com/948/ontology-driven-apps-using-generic-applications/.
[10] See M.K. Bergman, 2010. “‘Pay as You Benefit’: A New Enterprise IT Strategy,” AI3:::Adaptive Innovation blog, July 12, 2010. See https://www.mkbergman.com/896/pay-as-you-benefit-a-new-enterprise-it-strategy/.

Posted by AI3's author, Mike Bergman Posted on September 12, 2011 at 3:11 am in Linked Data, Semantic Enterprise, Semantic Web | Comments (4)
The URI link reference to this post is: https://www.mkbergman.com/974/making-the-argument-for-semantic-technologies/
The URI to trackback this post is: https://www.mkbergman.com/974/making-the-argument-for-semantic-technologies/trackback/
Posted:August 15, 2011

World's Tallest Flagpole; see ref [9]The New Paradigm of ‘Substantive Marketing’ for Innovative IT

This decade has clearly marked a sea change in the move of enterprise software from proprietary to open source, as I have recently discussed [1]. It is instructive that only a mere six years ago I was in heated fights with my then Board about open source; today, that seems so quaint and dated.

Also during this period many have noted how open source has changed the capital required to begin a new software startup [2]. Open source both provides the tooling and the components for cobbling together specialty apps and extensions. Six and seven and even eight figure startup costs common just a decade ago have now dropped to four or five figures. When we see the explosion of hundreds of thousands of smartphone apps we are seeing the glowing residue of these additional sea changes. Dropping startup costs by one to three orders of magnitude is truly democratizing innovation.

But something else has been going on that is changing the face of enterprise software (besides consolidation, another factor I also recently commented on). And that factor is “marketing”. Much less commentary is made about this change, but it, too, is greatly lowering costs and fundamentally changing market penetration strategies. That topic — and my personal experience with it — is the focus of this article.

The Obsolete Recent Past

Besides the few remaining big providers of enterprise software — like IBM, Oracle, HP, SAP — most vendors have totally remade their sales practices of just a few years ago. Large sales forces with big commissions and a year to two year sales cycles can no longer be justified when software license fees and the percentage maintenance annuities that flow from them are dropping rapidly. Today’s mantras are doing more with less and doing it faster, hardly consistent with the traditional enterprise software model. Sure, big enterprises, especially big government and big business, have large sunk costs in legacy systems that will continue to be milked by existing vendors. But the flow is constricting with longer-term trends clear to see. The old enterprise software model is obsolete.

Even if it were not dying, it is hard to square huge investments in sales and marketing when product development has become inexpensive and agile. The proliferation of three-letter marketing acronyms for branding “new” product areas and standard formulas for product hype of just a few years ago also feels old and dated. Cozy relationships with conventional trade press pundits and market analysts seem to be diminishing in importance, possibly because the authoritativeness of their influence is also diminishing. It is harder to justify market firm subscription costs when priority budget items are being cut and new information outlets have emerged.

In response to this, many developers have forsaken the enterprise market for the consumer one. Indeed enterprises themselves are looking more and more to the consumer sector and commodity apps for innovation and answers. But, still, problems unique to enterprises remain and how to effectively reach them in this brave new world is today’s marketing problem for enterprise software vendors.

Most entities today, when opining about these challenges, tend to emphasize the need for “laser focus” and “rifle-shot” targeting of prospects. The advice takes the form of: 1) emphasize well-defined verticals; 2) know your market well; and 3) target and go after your likely prospects. Prospect data mining and targeted ad analysis are the proferred elixirs.

But, there is little evidence such refined methods for prospect identification and targeting are really working. Like politicians doing focus groups and opinion polling to capture the desired “message” of their potential electorates, these are all still “push” models of marketing. Yet we are swamped with pushed messages and marketing everywhere we turn. The model is failing.

Besides message overload, there are two issues with laser targeting. First, despite all that we try to know about ready buyers (for enterprise software), we really don’t know if any particular individual is truly needful, in a position to buy, has the authority to buy, or is the right advocate to make the internal sell. Second, though the idea of “laser” carries with it the image of focus and not flailing, it is in fact expensive to identify the targets and send a focused message their way. Because of these issues, decay rates for laser prospects throughout conventional sales pipelines continue to rise.

A New Marketing ParadigmNew Paradigm Roadsign

There has always been the phenomenon of the “fish jumping into the boat“; that is, the unanticipated inbound inquiry from a previously unknown prospect leading to a surprisingly swift sale. But we have seen this phenomenon increase markedly in recent years. Structured Dynamics‘ current customer base — including recurring customers — comes almost exclusively from this source. As we have noted this trend in comparison with more targeted outreach, we have spent much time trying to understand why it is occurring and how we can leverage what Peter Drucker called the “unexpected success” [3].

What we are seeing, I believe, is a shift from sales to marketing, and within marketing from direct or outbound marketing to a new paradigm of marketing. Others have likened this to inbound marketing [4] or content marketing [5] or permission marketing [6]. What we are seeing at Structured Dynamics bears many resemblances to parts of what is claimed for these other approaches, but not all. And, it is also true that what we are seeing may pertain mostly to innovative IT for emerging enterprise markets, and not a generalized paradigm suitable to other products or markets.

For lack of a better term, what we are seeing we can term “substantive marketing”. By this we mean offering valuable content and solutions-oriented systems for free and without restriction. This shares aspects with content marketing. Then, in keeping with the trend for buyers doing their own research and analysis to fulfill their own needs, similar to the premises of inbound or permission marketing, potential consumers can make their own judgments as to relevance and value of our offerings.

Sometimes, of course, some prospects find our approaches and solutions lacking. Sometimes, they may grab what we have offered for free and use them on their own without compensation to us. But where the match is right — and we need to be honest with both ourselves and the customer when it is not — we can better spend the customer’s limited time and resources to tailor our generic solutions to their specific needs. In doing so, we offer higher value (tailored services) while learning better about another spectrum of consumer need that can virtuously enhance our substantive offerings for the next prospect.

So, let’s decompose these components further to see what they can tell us about this new practice of substantive marketing and how to use it as an engine for moving forward.

Substantive Marketing

The Virtuous Cycle Begins with Substantive Solutions

The premise of substantive marketing is to offer square-deal value to the marketplace in the form of solutions-based content. Like content marketing that offers “the creation or sharing of content for the purpose of engaging current and potential consumer bases” [5], substantive marketing goes even further. The whole basis and premise of the approach is to provide substantive content, in one of more of these areas, preferably all:

  • Knowledge — this substantive area includes papers, commentary, survey results or listings of tools and references useful to the target market
  • Analysis — this content area includes unique analysis of market trends, data, technologies or reviews that pertain to the target market
  • Code — this area relates to the provision of open source code and tools, preferably under licenses that allow users to use the software without restriction (two examples are the Apache 2 license and the MIT license)
  • Documentation — a critical substantive area is the documentation in how to install, use, modify or customize these tools, including a prejudice to APIs and tutorial information
  • Methodologies, workflows and best practices — it is important to also discuss how to properly operate and utilize these tools and information. Taking care to document lessons learned and best practices also helps the user community avoid common mistakes and to speed adoption and utility, and
  • Demos — this area involves setting up (and sharing code and procedures for same) demos that show how the code and its methods actually work. Demos also become first use cases to aid the new user in learning and setting up the code bases.

Further, this substantive content is offered without strings, restrictions or customer fill-in forms. The content is not a come on or a teaser. We are not trying to gather leads or prospect names, because we have no intent to dun them with emails or follow-ups.

This substantive content is as complete as can be to enable new users to adopt the information and tools in their current state without further assistance. (In some cases, the information also educates the marketplace in order to prepare future customers for adoption.) Most importantly, this substantive content is offered for free, either open source (for code) or creative commons for documentation and other content. In return, it is fair to request — and we do — attribution when this material is used.

We have previously termed this complete panoply of substantive content a total open solution [7]. Some might find the provision of such robust information crazy: How can we give away the store of our proprietary knowledge and systems?

But we find this kind of thinking old school. In an open source world where so much information is now available online, with a bit of effort customers can find this information anyway. Rather, our mindset is that customers do not want to pay again for what has already been done, but are willing to pay for what can be done with that knowledge for their own specific problems. Offering the complete storehouse of our knowledge in fact signals our interest in only charging the customer for new answers, new value or new formulations. The customers we like to work with feel they are getting an honest, square deal.

Flagpole Venues Help Increase Awareness

Consider your substantive content to be your flag, a unique banner for conveying and packaging your specific brand. It is thus important to find appropriate flagpoles — in the virtual territories that your customers visit — for raising this content high for them to see. Since the role of these flagpoles is to create awareness in potential prospects — who you do not likely know individually or even by group in advance — it makes sense to raise your offerings up on many flagpoles and on the highest flagpoles. Visibility is the object of the approach.

This approach is distinctly not leafletting or cramming links or emails into as many spaces as possible. The idea of substantive marketing is to fly valuable content high enough that desirous potential customers can discover and then inspect the information on their own, and only if they so choose. In this regard, substantive marketing resembles permission marketing [6].

Being visible helps ensure that the needful, questing prospect that you would never have been able to target on your own is able to see and be aware of your offerings. And, since they are seeking information and answers, your collateral needs to be of a similar nature. Solutions and substance are what they are seeking; what you have run up the flagpole should respond to that.

The mindset here is to respect your prospective customers and to allow them to chose to receive and inspect your offerings, but only if they so choose. If flown in the right venues with the right visibility, customers will see your flags and inspect them if they meet their requirements.

Some of the venues at which you can raise your flags include:

  • Blogs — this venue is especially helpful, since you have complete control over content, message, voice and packaging
  • Social networks — the value of social networks is now accepted, and should be a core component of any visibility strategy. However, it is also important to make sure that your contributions are driven by substance and value and do not become part of the cacophonous background noise
  • Vertical media — there are always existing outlets well-read and -respected by your customer propects. Establishing relationships and value with these third-party outlets can extend your reach
  • Web sites — this venue includes your standard Web sites, of course. But, you should also consider setting up specific project-related sites or sites dedicated to documentation (c.f., our TechWiki site of 300+ technical articles) or to methodologies (the excellent MIKE2.0 site is one great example) or to other ways by which particular content (such as tools with the Sweet Tools site) can raise another flag
  • User forums — user discussion groups and forums also become their own attractants for like-interested prospects, and
  • Conferences and tradeshows — while potentially valuable, presence at conferences and tradeshows must be carefully evaluated. Since participation and opportunity costs are high, the venues should be clearly relevant to your market space with likely decision makers in attendance.

The observant reader will have already concluded that each of these venues develops slowly, and therefore raising visibility is generally a slow-and-steady game that requires patience. Start-up vendors backed by venture firms or those looking for quick visibility and cashout will not find this approach suitable. On the other hand, customer prospects looking for answers and self-sustaining solutions are not much interested in flash in the pan vendors, either.

A Model Responsive to the Changing Nature of Customer Prospects

The real drivers for this changing paradigm come from customer prospects. Sophisticated buyers of enterprise IT and instrumental change agents within organizations share most if not all of these characteristics:

  • They are inundated with marketing messages and jaded about hype and “pushed” messages
  • They are generally knowledgeable about their needs and problem spaces and about approximate technologies. They are eager and desirous of learning independently and know that their recommendations affect their personal reputations and standing within their enterprises
  • With the many volatile external and internal changes, including staff reductions and fluid assignments, leadership for new technology adoption can come from many different and unknown corners of the organization; it is extremely difficult to identify and target prospects
  • The economic and competitive environment places a premium on affordability and low-risk evaluations of new technologies
  • Lock-ins of any kind — be it to specific vendors or technologies — are understood as inherently risky. This understanding is raising the importance of open and standards-based approaches
  • Being the subject of a pushy sales effort is distasteful and a negative to an eventual sale. Education and learning, however, is respected
  • Because of all that is at stake, honesty with no bullshit is highly appreciated. If you as a vendor do not offer an appropriate solution or have fulfillment weaknesses, tell the prospect so. Further, tell them who can supply the solution. One never knows when and where the next problem may arise, and providing trustworthy advice can lead to later engagements.

More often than not we find our customers to have already installed and used our existing substantive materials for some time before they approach us about further work. They appreciate the tutorial information and have taught themselves much in advance. By the time we engage, both parties are able to cost-effectively focus on what is truly missing and needed and to deliver those answers in a quick way. Re-engagements tend to occur when a next set of gaps or challenges arise.

Though it may sound trite or even unbelievable to those who have not yet experienced such a relationship, the square deal value offered by substantive marketing can really lead to true partnerships and trust between vendor and customer. We experience it daily with our customers, and vice versa. We also think this is the adaptive approach that our new environment demands.

The Free Path to Open Source and Solutions

Once prospects learn of our substantive offerings, many may decide independently that what we have is not suitable. Others may simply download and use the information on their own, for which we often never know let alone receive revenue. We are completely fine with this, as shown for three different cases.

First, some of these prospects need no more than what we already have. This increases our user base, increases our visibility and often results in contributions to our forums and documentation.

Then, some of these prospects come to learn they need or want more than what our current offerings provide, leading to two possible forks. In one fork, the second case, they may have sufficient skills internally or with other suppliers to extend the system on their own. Some of this flows back to an improved code base or improved installation or documentation bases.

In the other fork, the third case, they may decide to engage us in tailoring a solution for them. That case is the only one of the three that leads to a direct revenue path.

In all three cases we win, and the customer wins. Maybe enterprise software vendors of decades past rue this reality of lower margins and shared benefits; we agree that the absolute profit potential of substantive marketing is much less. But we gladly accept the more enjoyable work and steady revenue relationships resulting from these changes. We are not engaged in some pollyann-ish altruism here, but in a steely-eyed honest brokering that best serves our own self-interest (and fairly that of the customer, as well).

A Square Deal Baseline for Tailored Services

Great IT product does not come from idle musings or dreamed up functionality. It comes solely and directly from solving customer problems. Only via customers can software be refined and made more broadly usable.

A slipstream of those who have previously become aware and tested our offerings will choose to engage our services. This generally takes the form of an inbound call, where the prospect not only qualifies itself, but also establishes the terms and conditions for the sale. They have chosen to select us; they are fish that have jumped into the boat.

To again quote Peter Drucker, “. . . the aim of marketing is to make selling superfluous. The aim of marketing is to know and understand the customer so well that the product or service fits him and sells itself. Ideally, marketing should result in a customer who is ready to buy. All that should be needed then is to make the product or service available . . .” [8]. This is precisely what I meant earlier about the shift in emphasis from sales to marketing.

Even at this point there may be mismatches in needs and our skills and availabilities. If such is the case, we do not hesitate to say so, and attempt to point the prospect in another direction (from which we also gain invaluable market knowledge). If there is indeed a match, we then proceed to try to find common ground on schedule and budget.

Paradoxically, this square deal and honesty about the readiness and weaknesses of our offerings often leads to forgiveness from our customers. For example, for some time we have lacked automated installation scripts that would make it easier for prospects to install our open semantic framework. But, because of compensating value in other areas, such gaps can be overlooked and tackled later on (indeed, as a current customer is now funding). By not pretending to be everything to everyone, we can offer what we do have without embarrassment and get on with the job of solving problems.

For larger potential engagements, we typically suggest a fixed price initial effort to develop an implementation plan. The interviews and research to support this typical 4- to 6-weeks effort (generally in the $5 K to $10 K range, depending) then result in a detailed fulfillment proposal, with firm tasks, budget and schedule, specific to that customer’s requirements. Just as we respect our prospects’ time and budget, we expect the same and do not conduct these detailed plans without compensation. With respect to fulfillment contracts, we cap contract amount and limit milestone payments to pre-set percentages or time expended, whichever is lower.

This approach ensures we understand the customer’s needs and have budgeted and tasked accordingly. Capped contracts also put the onus on us the contractor to understand our own effort and tasking structures and realities, which leads to better future estimating. For the customer, this approach caps risk and potential exposure, and ensures milestones are being met no matter the time expenditures by us, the contractor. This approach extends our square-deal basis to also embrace risks and payments.

New (and Open Source) Developments Fuel the Substance Pipeline

Thus, when customers engage us, they spend almost solely on new functionality specifically tailored to their needs. In doing so, we suggest they agree to release the new developments they fund as open source. We argue — and customers predominantly agree — that they are already benefitting from lower overall costs because other customers have funded sharable, open source before them. We point out that the new customers that follow them will also be independently creating new functionality, to which they will also later benefit.

(This argument does not apply to specific customer data or ontologies, which are naturally proprietary to the customer. Also, if the customer wants to retain intellectual ownership of extensions, we charge higher development fees.)

Once these new developments are completed, they are fed back into a new baseline of valuable content and code. From this new baseline the cycle of substantive marketing can be augmented anew and perpetuated.

Three Guidelines to Leverage Substantive Marketing

All of these points can really be boiled down to three guidelines for how to make substantive marketing effective:

  • First, whatever your domain or market, provide useful and substantive content. The content you offer is indeed your marketing collateral. Prospective customers can gauge from it directly whether it meets their needs, appears sound and workable, and has value. If you have little of substance to offer, this paradigm is not for you
  • Second, plant many flagpoles and raise your flags high in territories your market prospects are likely to visit. This is a process that requires thoughtfulness and patience. Thoughtfulness, because that is how you determine where to plant your flags. If you yourself are a consumer of what you offer, it is easier to find those venues. And patience, because it takes time to stack valuable content upon valuable content in order to raise visibility
  • And, third, be honest and respectful. Help your prospect work within available budget to achieve the most possible at lowest risk. And help them find others, if need be, who might be better able than you to truly solve their problems.

What we are finding — as we continue to refine our understanding of this new paradigm — is that through substantive marketing the fish are finding us and they sometimes jump into the boat. We like our enterprise customers to pre-qualify themselves and already be “sold” once they knock on the door. One never knows when that phone might ring or the email might come in. But when it does, it often results in a collaborative customer as a partner who is a joy to work with to solve exciting new problems.


[1] M.K. Bergman, 2011. “Declining IT Innovation in the Enterprise,” in AI3:::Adaptive Innovation blog, January 17, 2011. See https://www.mkbergman.com/940/declining-it-innovation-in-the-enterprise/.
[2] Paul Graham has been the most prominent observer of this scene; see P. Graham, 2008. “Why There Aren’t Any More Googles,” April 2008 (see http://www.paulgraham.com/googles.html) and subsequent articles.
[3] See esp. Peter F. Drucker, 1985. Innovation and Entrepreneurialship: Practice and Principals, Harper & Row, New York, NY, 277 pp.
[4] Inbound marketing is a marketing strategy that focuses on getting found by customers. According to David Meerman Scott, inbound marketers “earn their way in” (via publishing helpful information on a blog etc.) in contrast to outbound marketing where they used to have to “buy, beg, or bug their way in” (via paid advertisements, issuing press releases in the hope they get picked up by the trade press, or paying commissioned sales people, respectively). Brian Halligan, cofounder and CEO of HubSpot, claims he first coined the term of inbound marketing.
[5] Content marketing is an umbrella term encompassing all marketing formats that involve the creation or sharing of content for the purpose of engaging current and potential consumer bases. In contrast to traditional marketing methods that aim to increase sales or awareness through interruption techniques, content marketing subscribes to the notion that delivering high-quality, relevant and valuable information to prospects and customers drives profitable consumer action. See also Holger Shulze, 2011. B2B Content Marketing Trends slideshow, see http://www.slideshare.net/hschulze/b2b-content-marketing-report.
[6] Seth Godin coined the term permission marketing wherein marketers obtain permission before advancing to the next step in the purchasing process. It is mostly used by online marketers, notably email marketers and search marketers, as well as certain direct marketers who send a catalog in response to a request. Godin contrasts this approach to traditional “interruption marketing” where messages are sent without prior permission.
[7] See the three-part series, M.K. Bergman, 2010. “Listening to the Enterprise: Total Open Solutions,” “Part 1,” “Part 2” and “Part 3,” AI3:::Adaptive Information blog, May 12 – 31, 2010.
[8] Peter F. Drucker, 1974. Management: Tasks, Responsibilities, Practices. New York, NY: Harper & Row. pp. 864. ISBN 0-06-011092-9.
[9] The intro photo is of the world’s tallest flagpole (at 165 m), in Dushanbe, Tajikistan. The photo is courtesy of CentralAsiaOnline.com.
Posted:August 8, 2011

Geshi NetworkVisualization + Analysis Pushes Aside Cytoscape

Though I never intended it, some posts of mine from a few years back dealing with 26 tools for large-scale graph visualization have been some of the most popular on this site. Indeed, my recommendation for Cytoscape for viewing large-scale graphs ranks within the top 5 posts all time on this site.

When that analysis was done in January 2008 my company was in the midst of needing to process the large UMBEL vocabulary, which now consists of 28,000 concepts. Like anything else, need drives research and demand, and after reviewing many graphing programs, we chose Cytoscape, then provided some ongoing guidelines in its use for semantic Web purposes. We have continued to use it productively in the intervening years.

Like for any tool, one reviews and picks the best at the time of need. Most recently, however, with growing customer usage of large ontologies and the development of our own structOntology editing and managing framework, we have begun to butt up against the limitations of large-scale graph and network analysis. With this post, we announce our new favorite tool for semantic Web network and graph analysis — Gephi — and explain its use and showcase a current example.

The Cytoscape Baseline and Limitations

Three and one-half years ago when I first wrote about Cytoscape, it was at version 2.5. Today, it is at version 2.8, and many aspects have seen improvement (including its Web site). However, in other respects, development has slowed. For example, version 3.x was first discussed more than three years ago; it is still not available today.

Though the system is open source, Cytoscape has also largely been developed with external grant funds. Like other similarly funded projects, once and when grant funds slow, development slows as well. While there has clearly been an active community behind Cytoscape, it is beginning to feel tired and a bit long in the tooth. From a semantic Web standpoint, some of the limitations of the current Cytoscape include:

  • Difficult conversion of existing ontologies — Cytoscape requires creating a CSV input; there was an earlier RDFscape plug-in that held great promise to bridge the software into the RDF and semantic Web sphere, but it has not remained active
  • Network analysis — one of the early and valuable generalized network analysis plug-ins was NetworkAnalyzer; however, that component has not seen active development in three years, and dynamic new generalized modules suitable for social network analysis (SNA) and small-world networks have not been apparent
  • Slow performance and too-frequent crashes — Cytoscape has always had a quirky interface and frequent crashes; later versions are a bit more stable, but usability remains a challenge
  • Largely supported by the biomedical community — from the beginning, Cytoscape was a project of the biomedical community. Most plug-ins still pertain to that space. Because of support for OBO (Open Biomedical and Biological Ontologies) formats and a lack of uptake by the broader semantic Web community, RDF- and OWL-based development has been keenly lacking
  • Aside from PDFs, poor ability to output large graphs in a viewable manner
  • Limited layout support — and poor performance for many of those included with the standard package.

Undoubtedly, were we doing semantic technologies in the biomedical space, we might well develop our own plug-ins and contribute to the Cytoscape project to help overcome some of these limitations. But, because I am a tools geek (see my Sweet Tools listing with nearly 1000 semantic Web and -related tools), I decided to check out the current state of large-scale visualization tools and see if any had made progress on some of our outstanding objectives.

Choosing Gephi and Using It

There are three classes of graph tools in the semantic technology space:

  1. Ontology navigation and discovery, to which the Relation Browser and RelFinder are notable examples
  2. Ontology structure visualization (and sometimes editing), such as the GraphViz (OWLViz) or OntoGraf tools used in Protégé (or the nice FlexViz, again used by the OBO community), and
  3. Large-scale graph visualization in order to gain a complete picture and macro relationships in the ontology.

One could argue that the first two categories have received the most current development attention. But, I would also argue that the third class is one of the most critical:  to understand where one is in a large knowledge space, much better larger-scale visualization and navigation tools are needed. Unfortunately, this third category is also the one that appears to be receiving the least development attention. (To be sure, large-scale graphs pose computational and performance challenges.)

In the nearly four years since my last major survey of 26 tools in this category, the new entrants appear quite limited. I’ve surely overlooked some, but the most notable are Gruff, NAViGaTOR, NetworkX and Gephi [1]. Gruff actually appears to belong most in Category #2; I could find no examples of graphs on the scale of thousands of nodes. NAViGaTOR is biomedical only. NetworkX has no direct semantic graph importing and — while apparently some RDF libraries can be used for manipulating imports — alternative workflows were too complex for me to tackle for initial evaluation. This leaves Gephi as the only potential new candidate.

From a clean Web site to well-designed intro tutorials, first impressions of Gephi are strongly positive. The real proof, of course, was getting it to perform against my real use case tests. For that, I used a “big” ontology for a current client that captures about 3000 different concepts and their relationships and more than 100 properties. What I recount here — from first installing the program and plug-ins and then setting up, analyzing, defining display parameters, and then publishing the results — took me less than a day from a totally cold start. The Gephi program and environment is surprisingly easy to learn, aided by some great tutorials and online info (see concluding section).

The critical enabler for being able to use Gephi for this source and for my purposes is the SemanticWebImport plug-in, recently developed by Fabien Gandon and his team at Inria as part of the Edelweiss project [2]. Once the plug-in is installed, you need only open up the SemanticWebImport tab, give it the URL of your source ontology, and pick the Start button (middle panel):

SemWeb Plug-in for GephiNote the SemanticWebImport tool also has the ability (middle panel) to issue queries to a SPARQL endpoint, the results of which return a results graph (partial) from the source ontology. (This feature is not further discussed herein.) This ontology load and display capability worked without error for the five or six OWL 2 ontologies I initially tested against the system.

Once loaded, an ontology (graph) can be manipulated with a conventional IDE-like interface of tabs and views. In the right-hand panels above we are selecting various network analysis routines to run, in this case Average Degrees. Once one or more of these analysis options is run, we can use the results to then cluster or visualize the graph; the upper left panel shows highlighting the Modularity Class, which is how I did the community (clustering) analysis of our big test ontology. (When run you can also assign different colors to the cluster families.) I also did some filtering of extraneous nodes and properties at this stage and also instructed the system via the ranking analysis to show nodes with more link connections as larger than those nodes with fewer links.

At this juncture, you can also set the scale for varying such display options as linear or some power function. You can also select different graph layout options (lower left panel). There are many layout plug-in options for Gephi. The layout plugin called OpenOrd, for instance, is reported to be able to scale to millions of nodes.

At this point I played extensively with the combination of filters, analysis, clusters, partitions and rankings (as may be separately applied to nodes and edges) to: 1) begin to understand the gross structure and characteristics of the big graph; and 2) refine the ultimate look I wanted my published graph to have.

In our example, I ultimately chose the standard Yifan Hu layout in order to get the communities (clusters) to aggregate close to one another on the graph. I then applied the Parallel Force Atlas layout to organize the nodes and make the spacings more uniform. The parallel aspect of this force-based layout allows these intense calculations to run faster. The result of these two layouts in sequence is then what was used for the results displays.

Upon completion of this analysis, I was ready to publish the graph. One of the best aspects of Gephi is its flexibility and control over outputs. Via the main Preview tab, I was able to do my final configurations for the published graph:

Publication Options for GephiThe graph results from the earlier-worked out filters and clusters and colors are shown in the right-hand Preview pane. On the left-hand side, many aspects of the final display are set, such as labels on or off, font sizes, colors, etc. It is worth looking at the figure above in full size to see some of the options available.

Standard output options include either SVG (vector image) or PDFs, as shown at the lower left, with output size scaling via slider bar. Also, it is possible to do standard saves under a variety of file formats or to do targeted exports.

One really excellent publication option is to create a dynamically zoomable display using the Seadragon technology via a separate Seadragon Web Export plug-in. (However, because of cross-site scripting limitations due to security concerns, I only use that option for specific sites. See next section for the Zoom It option — based on Seadragon — to workaround that limitation.)

Outputs Speak for Themselves

I am very pleased with the advances in display and analysis provided by Gephi. Using the Zoom It alternative [3] to embedded Seadragon, we can see our big ontology example with:

  • All 3000 nodes labeled, with connections shown (though you must must zoom to see) and
  • When zooming (use scroll wheel or + icon) or panning (via mouse down moves), wait a couple of seconds to get the clearest image refresh:

Note: at standard resolution, if this graph were to be rendered in actual size, it would be larger than 7 feet by 7 feet square at full zoom !!!

To compare output options, you may also;

Still, Some Improvements Would be Welcomed

It is notable that Gephi still only versions itself as an “alpha”. There is already a robust user community with promise for much more technology to come.

As an alpha, Gephi is remarkably stable and well-developed. Though clearly useful as is, I measure the state of Gephi against my complete list of desired functionality, with these items still missing:

  • Real-time and interactive navigation — the ability to move through the graph interactively and to issue queries and discover relationships
  • Huge node numbers — perhaps the OpenOrd plug-in somewhat addresses this need. We will be testing Gephi against UMBEL, which is an order of magnitude larger than our test big ontology
  • More node and edge control — Cytoscape still retains the advantage in the degree to which nodes and edges can be graphically styled
  • Full round-tripping — being able to use Gephi in an edit mode would be fantastic; the edit functionality is fairly straightforward, but the ability to round-trip in appropriate formats (OWL, RDF or otherwise) may be the greater sticking point.

Ultimately, of course, as I explained in an earlier presentation on a Normative Landscape for Ontology Tools, we would like to see a full-blown graphical program tie in directly with the OWL API. Some initial attempts toward that have been made with the non-Gephi GLOW visualization approach, but it is still in very early phases with ongoing commitments unknown. Optimally, it would be great to see a Gephi plug-in that ties directly to the OWL API.

In any event, while perhaps Cytoscape development has stalled a bit for semantic technology purposes, Gephi and its SemanticWebImport plug-in have come roaring into the lead. This is a fine toolset that promises usefulness for many years to come.

Some Further Gephi Links

To learn more about Gephi, also see the:

Also, for future developments across the graph visualization spectrum, check out the Wikipedia general visualization tools listing on a periodic basis.


[1] The R open source math and statistics package is very rich with apparently some graph visualization capabilities, such as the dedicated network analysis and visualization project statnet. rrdf may also provide an interesting path for RDF imports. R and its family of tools may indeed be quite promising, but the commitment necessary to R appears quite daunting. Longer-term, R may represent a more powerful upgrade path for our general toolsets. Neo4j is also a rising star in graph databases, with its own visualization components. However, since we did not want to convert our underlying data stores, we also did not test this option.
[2] Erwan Demairy is the lead developer and committer for SemanticWebImport. The first version was released in mid-April 2011.
[3] For presentations like this blog post, the Seadragon JavaScript enforces some security restrictions against cross-site scripting. To overcome that, the option I followed was to:
  • Use Gephi’s SVG export option
  • Open the SVG in Inkscape
  • Expand the size of the diagram as needed (with locked dimensions to prevent distortion)
  • Save As a PNG
  • Go to Zoom It and submit the image file
  • Choose the embed function, and
  • Embed the link provided, which is what is shown above.
(Though Zoom.it also accepts SVG files directly, I found performance to be spotty, with many graphical elements dropped in the final rendering.)

Posted by AI3's author, Mike Bergman Posted on August 8, 2011 at 3:27 am in Ontologies, Open Source, Semantic Web Tools, UMBEL | Comments (3)
The URI link reference to this post is: https://www.mkbergman.com/968/a-new-best-friend-gephi-for-large-scale-networks/
The URI to trackback this post is: https://www.mkbergman.com/968/a-new-best-friend-gephi-for-large-scale-networks/trackback/
Posted:July 25, 2011

WordPressOvercoming the Limitations of WordPress Search

Since the inception of this AI3 blog a bit over six years ago, I have gone through five different approaches to local site search, all geared to overcome the limitations of WordPress‘ native search function. The current and last iteration uses the Relevanssi plug-in, the best I have used so far. (Check it out yourself in the search box to the upper right.) I describe these five iterations in this post.

Iteration #1: Native WordPress Search

When first released, AI3 used the native search that comes with the WordPress installation (when first installed that was WP version 1.5; the current version is at 3.2.1). That was OK when few knew of my site and the number of visitors was low.

But the WP search is known to suck, mostly because of search results based on date posted not relevance and its slow performance. Once I began to get more traffic, it was time for a change.

Iteration #2: Google Custom Search

The option I have kept longest on this site is Google’s Custom Search. When first announced at the end of 2006 it was a real godsend and very innovative. I installed my first version in January 2007 and continued to make modifications and use it up through April 2010. I used it on various sites with many different types of Custom Search implementations.

Unfortunately, to use the free version it is necessary to include ads that Google provides. For a while, this served my purposes, since I was actively trying to learn whether ad revenues were viable for a standard blog and what kinds of traffic are necessary to produce meaningful revenues. However, by early 2010 I had come to the conclusion that — even with a quite popular blog for its niche — that ad revenues would never be that meaningful and it was not worth cluttering up my site. So I ended my experiment with Google ads and, being cheap, chose not to use the paid version of the search service and thus dropped the system.

What I liked:

  • Easy set up
  • Familiar search syntax and interface.

What I did not like:

  • Inclusion of Google ad panels
  • Lack of flexibility is styling search results presentation
  • Need for a Google key
  • Inability to tweak ordering of search results
  • Intrusive Google logos in multiple places.

Iteration #3: Bing Site Owner

Microsoft’s Bing was starting to come on strongly at that time so I decided next to try the Bing Site Owner’s service. I began this new approach immediately upon retiring Google.

What I liked:

  • Very easy set up
  • Acceptable flexibility in styling results
  • Nice popup implementation
  • Not overly intrusive with the Bing (MS) brand.

However, without direct notice, Microsoft ended this service as of April of this year.

What I did not like:

  • Service went dark
  • Cancelled service without any notification (except on the Bing webmaster’s site, a location I never visited)
  • No alternatives to the Bing API 2.0 with its difficult set up.

I was pretty pleased with the Bing service and would likely have continued using it because it wasn’t broke. But, the sudden plug-pulling was offputting.

Thus, I decided, heck, if I was going to have to go through the effort of learning the new Bing API, I might as well learn to do it all myself.

Iteration #4: WPSearch 2

So, it was back to researching options and WP plug-ins on the Web. After assembling the options, I first chose to go with WPSearch 2. The thing that most initially attracted me to this option was its reliance on the Lucene open source search engine, the same option that my company Structured Dynamics uses in its Solr text indexing for the Open Semantic Framework (OSF).

Since my AI3 blog theme is of my own design with many changes over the years, I had lost its original capabilities in having a native search form and search results page. So, my first task after installing the WPSearch plugin and indexing my content was to add these pages to my theme. The WP Codex has an OK set of instructions on creating a search page and related discussion.

There are some valuable tutorials out there that explain how this is done; I refer to them rather than repeat such information here.

I completed this work and kept WPSearch 2 up and active on my site for roughly the past week. But, I also kept trying to achieve some of the aspects I wanted in formatting and organizing search results and became increasingly frustrated. I also experienced numerous freezes and white screens and fatal PHP errors while editing new pages or deleting comment spam that told me I simply had to abandon this option.

In summary, what I liked:

  • Use of Lucene search engine
  • Very fast performance
  • Known search syntax.

What I did not like:

  • Duplicate results
  • Freezes and timeouts when managing comments or new edits
  • Inability to capture total search count (at least with my own PHP skills)
  • Inability to highlight search terms.

I’m sorry that I needed to abandon this option, since I do view highly the underlying Lucene text engine. But, the integration with existing WP functionality and other modules was not fully baked. I think with more work, including exposing more of the Lucene search API functionality, that this option could redeem itself. But, as of today, it is not reliable enough for my site.

Iteration #5: Relevanssi

In trying to find hacks and workarounds to some of the desires and issues noted above, I had come across reference to the Relevanssi plug-in, which appeared to embrace much of what I was looking to achieve. The download is quite small (100 K) and must therefore use the native WP MySQL for the index, but it is feature rich and has a strong relevance-ranking and with ranking flexibility. There is great flexibility and configurability in how search results get presented, also an attraction.

Installation of this system and then indexing was very clean and straightforward. It has a syntax that readily supports the Boolean AND operator (the default behavior I have set for the site) (if the AND search finds no matches, it will automatically do an OR search) and phrase searching, with the prior links showing examples from this blog (also see the search form at upper right).

As implemented, then, here is the listing of major features in Relevanssi:

  • Total number of search results (implemented)
  • Search term highlighting (implemented)
  • Contextual excerpt snippets (implemented)
  • Sort by date (not implemented)
  • Category search (not implemented)
  • Filter by date (not implemented)
  • Filter by category or tag (not implemented).

Here is a screen capture of the complete configuration menu in WordPress for Relevanssi:

Relevanssi Configuration Options

For further information, you may also want to see some more advanced search functions and the Relevanssi knowledge base.

Posted by AI3's author, Mike Bergman Posted on July 25, 2011 at 3:40 am in Blogs and Blogging, Site-related | Comments (1)
The URI link reference to this post is: https://www.mkbergman.com/966/five-iterations-of-site-search/
The URI to trackback this post is: https://www.mkbergman.com/966/five-iterations-of-site-search/trackback/