As we see more collaboration forums emerge, one question that naturally arises is how to jointly author or edit images. This is particularly important as “official” slide decks or presentations come to the fore.
There are perhaps many different ways to skin this cat. In this article, I describe how to do so using the free, open source SVG editing program, Inkscape.
Like many of you, I have been creating and editing images for years. I am by no means a graphics artist, but images and diagrams have been essential for communicating my work.
Until a few years back, I was totally a bitmap man. I used Paint Shop Pro (bought by Corel in 2004 and getting long in the tooth) and did a lot of copying and pasting.
I switched to Inkscape about two years ago for the following reasons:
Once you have a working image in Inkscape, make sure all collaborators have a copy of the software. Then:
Of course, it is more often the case that not all collaborators have a copy of Inkscape, or that the image did not begin in the SVG format.
The image below began as a Windows Powerpoint clip art file, which has since gone through some modifications. Note the bearded guy’s hand holding the paper is out of registry (because I screwed up in earlier editing, but I can easily fix it because it is a vector image!). Also note we have the border from Inkscape as suggested above. This file, BTW, is people.png, and was created as a PNG after a screen capture from Inkscape:

When beginning in Powerpoint or as clip art, files in the format of Windows metafile (*.wmf) or enhanced metafile (*.emf) work well. (For example, you can download and play with the native Inkscape format of people.svg, or the people.wmf or people.emf versions of the image above.) If you already have images in a Powerpoint presentation, save in one of these two formats, with (*.emf) preferred. (EMF is generally better for text.)
You can open or load these files directly into Inkscape. Generally, they will come in as a group of vectors; to edit the pieces, you should “ungroup.”
After editing per the instructions in the previous section, if you need to re-insert the image into Powerpoint, use the *.emf format (and make sure you do not save text as paths).
For example, see the following PNG graphic taken from an Inkscape file (figure_text.svg):

We can save it as an EMF (figure_textpath.emf) to a Powerpoint, with the option of converting text to paths:

Or, we can save it as an EMF (figure_text.emf) to a Powerpoint, only this time not converting text to paths and then “ungrouping” once in Powerpoint:

Note the latter option, text not as path, is the far superior one. However, also note that borders are added to the figures and vertical text is rotated 90° back to horizontal. Nonetheless, the figure is fully editable, including text. Also, if the original Inkscape figures are constructed with lines of the same color as fills, the border conversion also works well.
Frankly, especially with text, because there can be orientation and other changes going from Inkscape to Powerpoint, I recommend using Inkscape and its native SVG for all early modifications and to keep a canonical copy of your images. Then, prior to completion of the deck, save as EMF for import into Powerpoint and then clean up. If changes later need to be made to the graphic, I recommend doing so in Inkscape and then re-importing.
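The "keep the SVG canonical, export EMF late" workflow can also be scripted. What follows is only a sketch, assuming the Inkscape 1.x command line (older releases used different export flags) and using the figure_text.svg file name from the example above. The helper name emf_export_command is hypothetical. It only builds the command; it does not run Inkscape.

```python
from pathlib import Path

def emf_export_command(svg_path, text_to_path=False):
    """Build an Inkscape 1.x command line that exports an SVG file to EMF.

    Assumption: Inkscape 1.x is installed and named 'inkscape' on the PATH.
    By default text is left as text, per the recommendation above.
    """
    svg = Path(svg_path)
    emf = svg.with_suffix(".emf")  # same name, .emf extension
    cmd = ["inkscape", str(svg), "--export-type=emf", f"--export-filename={emf}"]
    if text_to_path:
        # Converts text to paths; the text is then no longer editable as text.
        cmd.append("--export-text-to-path")
    return cmd

command = emf_export_command("figure_text.svg")
# To actually run the export (requires Inkscape):
#   import subprocess; subprocess.run(command, check=True)
```

Re-running the export after each round of edits in Inkscape keeps the SVG as the single source of truth and treats the EMF as a disposable build product.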
I should note there is an option, as well, in Inkscape to convert raster images to vector ones (use Path -> Trace bitmap … and invoke the multiple scans with colors). This is doable, but involves quite a bit of image copying, manipulation and color separation to achieve workable results. You may want to see further Inkscape’s documentation on tracing, or more fully this reference dealing with color.
Of course, there are likely many other ways to approach these issues of collaboration and sharing. I will leave it to others to suggest and explain those options.
Well, for another client and another purpose, I was goaded into screening my Sweet Tools listing of semantic Web and -related tools and to assemble stuff from every other nook and cranny I could find. The net result is this enclosed listing of some 140 or so tools — most open source — related to semantic Web ontology building in one way or another.
Ever since I wrote my Intrepid Guide to Ontologies nearly three years ago (and one of the more popular articles of this site, though it is now perhaps a bit long in the tooth), I have been intrigued with how these semantic structures are built and maintained. That interest, in no small measure, is why I continue to maintain the Sweet Tools listing.
As far as I know, the following is the largest and most comprehensive listing of ontology building tools available. I broadly interpret the classification of ‘ontology building’; I include, for example, vocabulary extraction and prompting tools, as well as ontology visualization and mapping.
There are some 140 tools, of which perhaps 90 or so are still in active use. (Given the scope, not every tool could be inspected in detail. Some listed as being perhaps inactive may not be so, and others not in that category perhaps should be.) Of the entire roster of tools, somewhere on the order of 12 to 20 are quite impressive and deserving of local installation, test runs, and close inspection.
There are relatively few tools useful to non-specialists (or useful to engaging knowledgeable publics in the ontology-building exercise). There appear to be key gaps in the entire workflow from domain scoping and initial ontology definition and vocabulary candidates, to longer-term maintenance and revision. For example, spreadsheets would appear to be a possible useful first step in any workflow process (which is why irON is listed), but the spreadsheet tool per se is not listed herein (nor are text editors).
I surely have missed some tools and likely improperly assigned others. Please drop me an email or comment on this post with any revisions or suggestions.
In my own view, there are some tools that definitely deserve a closer look. My favorite candidates — for very different reasons and for very different places in the workflow — are (in no particular order): Apelon DTS, irON, FlexViz, Knoodl, Protégé, diagramic.com, BooWa, COE, ontopia, Anzo, PoolParty, Vine (and voc2rdf), Erca, Graphl, and GrOWL. Each one of these links is more fully described below. Also, all tools in the Vocabulary Prompting Tools category (which also includes extraction) are worth reviewing since all or nearly all have online demos.
Other tools may also be deserving, depending on use case. Some of the more specific analysis and conversion tools, for example, are in the Miscellaneous category.
Also, some purists may quibble with why some tools are listed here (such as inclusion of some stuff related to Topic Maps). Well, my answer to that is there are no real complete solutions, and whatever we can pragmatically do today requires glueing together many disparate parts.
Though not all are relevant, see my post from a couple of years back on large-scale RDF graph software.

If you are like me, you like to clear the decks before the start of major new projects. In Structured Dynamics‘ case, we actually have multiple new initiatives getting underway, so the deck clearing has been especially focused this time.
As a result, we have updated Sweet Tools, AI3’s listing of semantic Web and -related tools, with the addition of some 30 new tools, updates to others, and deletions of five expired entries. The dataset now lists 835 tools. And, as before, there is also now a new structured data view via conStruct (pick the Sweet Tools dataset).
We have also updated SWEETpedia, a listing of 246 research articles that use Wikipedia in one way or another to do semantic-Web related research. Some 20 new papers were added to this update.
Please use the comments section on this post to suggest new tools or new research articles for inclusion in future updates.
I just came across a VC blog pondering the value to a start-up of operating in “Stealth Mode” or not. I’ve amusingly come to the conclusion that all of this — particularly the “stealth” giveaway — is so much marketing hype. When a start-up claims they’re coming out of stealth mode, grab your wallet.
The most interesting and telling example I have of this is Rearden Commerce, which was announced in a breathy cover story in InfoWorld in February 2005 about the company and its founder/CEO Patrick Grady. The company has an obvious “in” with the magazine; in 2001 InfoWorld also carried a similar piece on the predecessor company to Rearden, Talaris Corporation.
According to a recent Business Week article, Rearden Commerce and its predecessors reaching back to an earlier company called Gazoo founded in 1999 have raised $67 million in venture capital. While it is laudable the founder has reportedly put his own money into the venture, this venture through its massive funding and high-water mark of 80 employees or so hardly qualifies as “stealth.”
As early as 2001 with the same technology and business model, this same firm was pushing the “stealth” moniker. According to an October 2001 press release:
“The company, under its stealth name Gazoo, was selected by Red Herring magazine as one of its ‘Ten to Watch’ in 2001.” [emphasis added]
Even today, though no longer using the active name Talaris Corporation, it has close to 115,000 citations on Yahoo! Notable VCs such as Charter Ventures, Foundation Capital, JAFCo and Empire Capital have backed it through its multiple incubations.
Holmes Report, a marketing company, provides some insight into how the earlier Talaris was spun in 2001:
“The goal of the Talaris launch was to gain mindshare among key business and IT trade press and position Talaris as a ‘different kind of start-up’ with a multi-tiered business model, seasoned executive team and tested product offering.”
[Hmmm; grind me a pound!]
The Holmes Report documents the analyst firms and leading journals and newspapers to which it made outreach. Actually, this outreach is pretty impressive. Good companies do the same all of the time and that is to be lauded. What is to be questioned, however, is how many “stealths” a cat can have. Methinks this one is one too many.
“Stealth” thus appears to be code for an existing company of some duration that has had disappointing traction and now has new financing, a new name, new positioning, or all of the above. So, interested in a start-up that just came out of stealth mode? Let me humbly suggest standard due diligence.
This Friday brown bag leftover was first placed into the AI3 refrigerator on October 13, 2005. No changes have been made to the original posting, except the [grinding] bit.
However, as of last year, Rearden had upped its VC funding to $240 million (can we spell multiple ?). Today, it is now focused on the travel industry. Fly me to the moon!

The beginning of a new year and a new decade is a perfect opportunity to take stock of how the world is changing and how we can change with it. Over the past year I have been writing on many foundational topics relevant to the use of semantic technologies in enterprises.
In this post I bring those threads together to present a unified view of these foundations — some seven pillars — to the open semantic enterprise.
By open semantic enterprise we mean an organization that uses the languages and standards of the semantic Web, including RDF, RDFS, OWL, SPARQL and others to integrate existing information assets, using the best practices of linked data and the open world assumption, and targeting knowledge management applications. It does so using some or all of the seven foundational pieces (“pillars”) noted herein.
The foundational approaches to the open semantic enterprise do not necessarily mean open data nor open source (though they are suitable for these purposes with many open source tools available [3]). The techniques can equivalently be applied to internal, closed, proprietary data and structures. The techniques can themselves be used as a basis for bringing external information into the enterprise. ‘Open’ is in reference to the critical use of the open world assumption.
These practices do not require replacing current systems and assets; they can be applied equally to public or proprietary information; and they can be tested and deployed incrementally at low risk and cost. The very foundations of the practice encourage a learn-as-you-go approach and active and agile adaptation. While embracing the open semantic enterprise can lead to quite disruptive benefits and changes, the transition itself can be accomplished with minimal disruption. This is its most compelling aspect.
Like any change in practice or learning, embracing the open semantic enterprise is fundamentally a people process. This is the pivotal piece to the puzzle, but also the one that does not lend itself to ready formula about pillars or best practices. Leadership and vision are necessary to begin the process. People are the fuel for impelling it. So, we’ll take this fuel as a given below, and concentrate instead on the mechanics and techniques by which this vision can be achieved. In this sense, then, there are really eight pillars to the open semantic enterprise, with people residing at the apex.
This article is synthetic, with links to (largely) my preparatory blog postings and topics that preceded it. Assuming you are interested in becoming one of those leaders who wants to bring the benefits of an open semantic enterprise to your organization, I encourage you to follow the reference links for more background and detail.
A Review of the Benefits
OK, so what’s the big deal about an open semantic enterprise and why should my organization care?
We should first be clear that the natural scope of the open semantic enterprise is in knowledge management and representation [1]. Suitable applications include data federation, data warehousing, search, enterprise information integration, business intelligence, competitive intelligence, knowledge representation, and so forth [2]. In the knowledge domain, the benefits for embracing the open semantic enterprise can be summarized as greater insight with lower risk, lower cost, faster deployment, and more agile responsiveness.
The intersection of knowledge domain, semantic technologies and the approaches herein means it is possible to start small in testing the transition to a semantic enterprise. These efforts can be done incrementally and with a focus on early, high-value applications and domains.
There is absolutely no need to abandon past practices. There is much that can be done to leverage existing assets. Indeed, those prior investments are often the requisite starting basis to inform semantic initiatives.
Embracing the pillars of the open semantic enterprise brings these knowledge management benefits:
Moreover, by building on successful Web architectures, we can also put in place loosely coupled, distributed systems that can grow and interoperate in a decentralized manner. These also happen to be perfect architectures for flexible collaboration systems and networks.
These benefits arise both from individual pillars in the open semantic enterprise foundation, as well as in the interactions between them. Let’s now re-introduce these seven pillars.
Pillar #1: The RDF Data Model
As I stated on the occasion of the 10th birthday of the Resource Description Framework data model, I believe RDF is the single most important foundation to the open semantic enterprise [4]. RDF can be applied equally to all structured, semi-structured and unstructured content. By defining new types and predicates, it is possible to create more expressive vocabularies within RDF. This expressiveness enables RDF to define controlled vocabularies with exact semantics. These features make RDF a powerful data model and language for data federation and interoperability across disparate datasets.
Via various processors or extractors, RDF can capture and convey the metadata or information in unstructured (say, text), semi-structured (say, HTML documents) or structured sources (say, standard databases). This makes RDF almost a “universal solvent” for representing data structure.
Because of this universality, there are now more than 150 off-the-shelf ‘RDFizers’ for converting various non-RDF notations (data formats and serializations) to RDF [5]. Because of its diversity of serializations and simple data model, it is also easy to create new converters. Once in a common RDF representation, it is easy to incorporate new datasets or new attributes. It is also easy to aggregate disparate data sources as if they came from a single source. This enables meaningful compositions of data from different applications regardless of format or serialization.
What this practically means is that the integration layer can be based on RDF, but that all source data and schema can still reside in their native forms [6]. If it is easier or more convenient to author, transfer or represent data in non-RDF forms, great [7]. RDF is only necessary at the point of federation, and not all knowledge workers need be versed in the framework.
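To make the “universal solvent” idea concrete, here is a minimal sketch in plain Python (no RDF library). It assumes two hypothetical sources, a spreadsheet row and a database record, and made-up http://example.org/ URIs; none of these names come from an actual system. The only point is that once both sources are expressed as subject-predicate-object triples, aggregating them is a set union.

```python
# Minimal sketch: RDF-style triples as plain Python tuples.
# All URIs, field names, and records below are hypothetical examples.

EX = "http://example.org/"

def row_to_triples(subject_id, row):
    """Convert a flat record (dict) into (subject, predicate, object) triples."""
    subject = EX + "id/" + subject_id
    return {(subject, EX + "prop/" + key, value) for key, value in row.items()}

# Source 1: a row that stays in its native spreadsheet form until federation.
spreadsheet_row = {"name": "Acme Corp", "city": "Iowa City"}
# Source 2: a relational database record about the same entity.
database_row = {"name": "Acme Corp", "employees": 80}

# At the point of federation, both become triples and merge by set union.
# The duplicate "name" statement collapses because sets hold unique triples.
graph = row_to_triples("acme", spreadsheet_row) | row_to_triples("acme", database_row)

def describe(graph, subject):
    """Collect every predicate/object pair known for one subject."""
    return {p: o for s, p, o in graph if s == subject}

merged = describe(graph, EX + "id/acme")
```

Adding a third source, or a new attribute to an existing one, requires no schema change: it is simply more triples in the same set.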
Pillar #2: Linked Data Techniques
Linked data is a set of best practices for publishing and deploying instance and class data using the RDF data model. Two of the best practices are to name the data objects using uniform resource identifiers (URIs), and to expose the data for access via the HTTP protocol. Both of these practices enable the Web to become a distributed database, which means that Web architectures can also be readily employed (see Pillar #5 below).
Linked data is applicable to public or enterprise data, open or proprietary. It is really straightforward to employ. Structured Dynamics has published a useful FAQ on linked data.
Additional linked data best practices relate to how to characterize and classify data, especially in the use of predicates with the proper semantics for establishing the degree of relatedness for linked data items from disparate sources.
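The choice of linking predicate matters because it governs what a consumer may safely do with the link. The sketch below is illustrative only: the two example.org and example.com records are hypothetical, while owl:sameAs and skos:closeMatch are standard predicates with different strengths. Only the identity assertion is treated as grounds for merging.

```python
# Sketch: links between two hypothetical sources, using predicates of
# different strength. The predicate URIs are the standard OWL and SKOS ones;
# every other URI here is made up for illustration.

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SKOS_CLOSE_MATCH = "http://www.w3.org/2004/02/skos/core#closeMatch"

links = [
    # Strong claim: both URIs denote the very same person.
    ("http://example.org/a/person/42", OWL_SAME_AS,
     "http://example.com/b/staff/jdoe"),
    # Weaker claim: the two concepts are similar, not identical.
    ("http://example.org/a/topic/ontology", SKOS_CLOSE_MATCH,
     "http://example.com/b/subject/taxonomy"),
]

def identical_resources(links, uri):
    """Return the URIs asserted (via owl:sameAs only) to be the same thing as `uri`."""
    same = set()
    for subject, predicate, obj in links:
        if predicate != OWL_SAME_AS:
            continue  # weaker relations are not grounds for merging records
        if subject == uri:
            same.add(obj)
        elif obj == uri:
            same.add(subject)
    return same

person_matches = identical_resources(links, "http://example.org/a/person/42")
topic_matches = identical_resources(links, "http://example.org/a/topic/ontology")
```

Here the person record can be merged with its counterpart, while the topic is merely related to a similar concept and stays distinct.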
Linked data has been a frequent topic of this blog, including how adding linkages creates value for existing data, with a four-part series about a year ago on linked data best practices [8]. As advocated by Structured Dynamics, our linked data best practices are geared to data interconnections, interrelationships and context that is equally useful to both humans and machine agents.
Pillar #3: Adaptive Ontologies
Ontologies are the guiding structures for how information is interrelated and made coherent using RDF and its related schema and ontology vocabularies, RDFS and OWL [10]. Thousands of off-the-shelf ontologies exist (a minority of which are suitable for re-use), and new ones appropriate to any domain or scope at hand can be readily constructed.
In standard form, semantic Web ontologies may range from the small and simple to the large and complex, and may perform the roles of defining relationships among concepts, integrating instance data, orienting to other knowledge and domains, or mapping to other schema [11]. These are explicit uses in the way that we construct ontologies; we also believe it is important to keep concept definitions and relationships expressed separately from instance data and their attributes [9].
But, in addition to these standard roles, we also look to ontologies to stand on their own as guiding structures for ontology-driven applications (see next pillar). With relatively few minor, new best practices, ontologies can take on the double role of informing user interfaces in addition to standard information integration.
In this vein we term our structures adaptive ontologies [11,12,13]. Some of the user interface considerations that can be driven by adaptive ontologies include: attribute labels and tooltips; navigation and browsing structures and trees; menu structures; auto-completion of entered data; contextual dropdown list choices; spell checkers; online help systems; etc. Put another way, what makes an ontology adaptive is to supplement the standard machine-readable purpose of ontologies to add human-readable labels, synonyms, definitions and the like.
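As a rough illustration of how such annotations can drive an interface, the sketch below stores a preferred label, synonyms and a definition for each concept, then uses them for auto-completion and tooltips. The concepts, labels and function names are all hypothetical, and a real adaptive ontology would hold these annotations in RDFS or OWL rather than a Python dictionary.

```python
# Sketch: an ontology fragment carrying human-readable annotations.
# Concept identifiers, labels, synonyms and definitions are hypothetical.

ontology = {
    "ex:Organization": {
        "prefLabel": "Organization",
        "altLabels": ["Company", "Firm"],
        "definition": "A group of people with a shared purpose.",
    },
    "ex:Ontology": {
        "prefLabel": "Ontology",
        "altLabels": ["Vocabulary"],
        "definition": "A structure describing concepts and their relations.",
    },
}

def autocomplete(ontology, typed):
    """Suggest preferred labels whose label or any synonym starts with `typed`."""
    typed = typed.lower()
    matches = []
    for annotations in ontology.values():
        labels = [annotations["prefLabel"]] + annotations["altLabels"]
        if any(label.lower().startswith(typed) for label in labels):
            matches.append(annotations["prefLabel"])
    return sorted(matches)

def tooltip(ontology, concept):
    """Return the human-readable definition to show when hovering over a concept."""
    return ontology[concept]["definition"]
```

A user typing "fi" is offered "Organization" because "Firm" is a recorded synonym; improving the interface then means editing the ontology, not the code.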
A neat trick occurs with this slight expansion of roles. The knowledge management effort can now shift to the actual description, nature and relationships of the information environment. In other words, ontologies themselves become the focus of effort and development. The KM problem no longer needs to be abstracted to the IT department or third-party software. The actual concepts, terminology and relations that comprise coherent ontologies now become the explicit focus of KM activities.
Any existing structure (or multiples thereof) can become a starting basis for these ontologies and their vocabularies, from spreadsheets to naïve data structures and lists and taxonomies. So, while producing an operating ontology that meets the best practice thresholds noted herein has certain requirements, kicking off or contributing to this process poses few technical or technology demands.
The skills needed to create these adaptive ontologies are logic, coherent thinking and domain knowledge. That is, any subject matter expert or knowledge worker likely has the necessary skills to contribute to useful ontology development and refinement. With adaptive ontologies powering ontology-driven apps (see next), we thus see a shift in roles and responsibilities away from IT to the knowledge workers themselves. This shift acts to democratize the knowledge management function and flatten the organization.
Pillar #4: Ontology-driven Applications
The complement to adaptive ontologies is ontology-driven applications. By definition, ontology-driven apps are modular, generic software applications designed to operate in accordance with the specifications contained in an adaptive ontology. The relationships and structure of the information driving these applications are based on the standard functions and roles of ontologies, as supplemented by the human and user interface roles noted above [11,12,13].
Ontology-driven apps fulfill specific generic tasks. Examples of current ontology-driven apps include imports and exports in various formats, dataset creation and management, data record creation and management, reporting, browsing, searching, data visualization, user access rights and permissions, and similar. These applications provide their specific functionality in response to the specifications in the ontologies fed to them.
The applications are designed more similarly to widgets or API-based frameworks than to the dedicated software of the past, though the dedicated functionality (e.g., graphing, reporting, etc.) is obviously quite similar. The major change in these ontology-driven apps is to accommodate a relatively common abstraction layer that responds to the structure and conventions of the guiding ontologies. The major advantage is that single generic applications can supply shared functionality based on any properly constructed adaptive ontology.
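A toy version of this idea: one generic display function whose output is determined entirely by the specification handed to it. The spec format here (a display order plus a label map) is an assumption invented for the sketch, not the format used by any actual product.

```python
# Sketch: a single generic "record viewer" driven by whatever ontology
# specification it is given. The spec structure and the data are hypothetical.

def render_record(ontology_spec, record):
    """Render a record as labeled lines, in the order the ontology dictates."""
    lines = []
    for attribute in ontology_spec["display_order"]:
        if attribute in record:
            # Fall back to the raw attribute name when no label is defined.
            label = ontology_spec["labels"].get(attribute, attribute)
            lines.append(f"{label}: {record[attribute]}")
    return lines

# Two unrelated domains, each described only by data.
people_spec = {
    "display_order": ["name", "email"],
    "labels": {"name": "Full name", "email": "E-mail"},
}
product_spec = {
    "display_order": ["sku", "price"],
    "labels": {"sku": "Product code"},
}

person_view = render_record(people_spec, {"email": "ada@example.org", "name": "Ada"})
product_view = render_record(product_spec, {"sku": "X1", "price": 9.5})
```

The same function serves both domains; supporting a third domain means supplying another specification, with no change to the software.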
This design thus limits software brittleness and maximizes software re-use. Moreover, as noted above, it shifts the locus of effort from software development and maintenance to the creation and modification of knowledge structures. The KM emphasis can shift from programming and software to logic and terminology [12].
Pillar #5: A Web-oriented Architecture
A Web-oriented architecture (WOA) is a subset of the service-oriented architectural (SOA) style, wherein discrete functions are packaged into modular and shareable elements (“services”) that are made available in a distributed and loosely coupled manner. WOA uses the representational state transfer (REST) style. REST provides principles for how resources are defined and used and addressed with simple interfaces without additional messaging layers such as SOAP or RPC. The principles are couched within the framework of a generalized architectural style and are not limited to the Web, though they are a foundation to it [14].
REST and WOA stand in contrast to earlier Web service styles that are often known by the WS-* acronym (such as WSDL, etc.). WOA has proven itself to be highly scalable and robust for decentralized users since all messages and interactions are self-contained.
Enterprises have much to learn from the Web’s success. WOA has a simple design with REST and idempotent operations, simple messaging, distributed and modular services, and simple interfaces. It has a natural synergy with linked data via the use of URI identifiers and the HTTP transport protocol. As we see with the explosion of searchable dynamic databases exposed via the Web, so too can we envision the same architecture and design providing a distributed framework for data federation. Our daily experience with browser access of the Web shows how incredibly diverse and distributed systems can meaningfully interoperate [15].
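Idempotency is one reason this style tolerates unreliable, decentralized networks: a client that is unsure whether a request arrived can simply send it again. The in-memory sketch below mimics the semantics of HTTP PUT and POST without any networking. The class, its method names and the URIs are hypothetical stand-ins, not a real framework.

```python
# Sketch: idempotent vs. non-idempotent operations on URI-named resources.
# An in-memory stand-in for a REST service; all names are hypothetical.

class ResourceStore:
    def __init__(self):
        self.resources = {}
        self._next_id = 1

    def put(self, uri, representation):
        """Like HTTP PUT: store at a client-chosen URI. Repeating it changes nothing."""
        self.resources[uri] = representation
        return uri

    def post(self, collection_uri, representation):
        """Like HTTP POST to a collection: every call creates a new resource."""
        uri = f"{collection_uri}/{self._next_id}"
        self._next_id += 1
        self.resources[uri] = representation
        return uri

    def get(self, uri):
        """Like HTTP GET: read-only, so safe to repeat."""
        return self.resources.get(uri)

store = ResourceStore()
# A retried PUT leaves exactly one resource behind.
store.put("/datasets/tools", {"title": "Sweet Tools"})
store.put("/datasets/tools", {"title": "Sweet Tools"})
count_after_puts = len(store.resources)

# A retried POST creates a second, duplicate resource.
first = store.post("/comments", {"text": "Nice list"})
second = store.post("/comments", {"text": "Nice list"})
count_after_posts = len(store.resources)
```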
This same architecture has worked beautifully in linking documents; it is now pointing the way to linking data; and we are seeing but the first phases of linking people and groups together via meaningful collaboration. While generally based on only the most rudimentary basis of connections, today’s social networking platforms are changing the nature of contacts and interaction.
The foundations herein provide a basis for marrying data and documents in a design geared from the ground up for collaboration. These capabilities are proven and deployable today. The only unclear aspects will be the scale and nature of the benefits [16].
Pillar #6: An Incremental, Layered Approach
To this point, you’ll note that we have been speaking in what are essentially “layers”. We began with existing assets, both internal and external, in many diverse formats. These are then converted or transformed into RDF-capable forms. These various sources are then exposed via a WOA Web services layer for distributed and loosely-coupled access. Then, we integrate and federate this information via adaptive ontologies, which then can be searched, inspected and managed via ontology-driven apps. We have presented this layered architecture before [13], and have also expressed this design in relation to current Structured Dynamics’ products [17].
A slight update of this layered view is presented below, made even more general for the purposes of this foundational discussion:
Semantic technology does not change or alter the fact that most activities of the enterprise are transactional, communicative or documentary in nature. Structured, relational data systems for transactions or records are proven, performant and understood. On its very face, it should be clear that the meaning of these activities — their semantics, if you will — is by nature an augmentation or added layer to how to conduct the activities themselves.
This simple truth affirms that semantic technologies are not a starting basis, then, for these activities, but a way of expressing and interoperating their outcomes. Sure, some semantic understanding and common vocabularies at the front end can help bring consistency and a common language to an enterprise’s activities. This is good practice, and the more that can be done within reason while not stifling innovation, all the better. But we all know that the budget department and function has its own way of doing things separate from sales or R&D. And that is perfectly OK and natural.
Clearly, then, an obvious benefit to the semantic enterprise is to federate across existing data silos. This should be an objective of the first semantic “layer”, and to do so in a way that leverages existing information already in hand. This approach is inherently incremental; if done right, it is also low cost and low risk.
Pillar #7: The Open World Mindset
As these pillars took shape in our thinking and arguments over the past year, an elusive piece seemed always to be missing. It was like having one of those meaningful dreams, and then waking up in the morning racking your memory trying to recall that essential, missing insight.
As I most recently wrote [1], that missing piece for this story is the open world assumption (OWA). I argue that this somewhat obscure concept holds within it the key as to why there have been decades of too-frequent failures in the enterprise in business intelligence, data warehousing, data integration and federation, and knowledge management.
Enterprises have been captive to the mindset of traditional relational data management and its (most often unstated) closed world assumption (CWA). Given the success of relational systems for transaction and operational systems — applications for which they are still clearly superior — it is understandable and not surprising that this same mindset has seemed logical for knowledge management problems as well. But knowledge and KM are by their nature incomplete, changing and uncertain. A closed-world mindset carries with it certainty and logic implications not supportable by real circumstances.
This is not an esoteric point, but a fundamental one. How one thinks about the world and evaluates it is pivotal to what can be learned and how and with what information. Transactions require completeness and performance; insight requires drawing connections in the face of incompleteness or unknowns.
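The difference between the two assumptions shows up in how a missing fact is answered. The sketch below is deliberately simplified: it ignores inference and explicit negation, and the facts are hypothetical. Under the closed world assumption an unstated fact is false; under the open world assumption it is merely unknown.

```python
# Sketch: the same query answered under closed-world and open-world assumptions.
# Simplified on purpose: no inference, no explicit negation. Facts are made up.

facts = {
    ("acme", "locatedIn", "Iowa"),
}

def ask_closed_world(facts, statement):
    """CWA: whatever is not stated is assumed to be false."""
    return statement in facts

def ask_open_world(facts, statement):
    """OWA: whatever is not stated is unknown (None), not false."""
    return True if statement in facts else None

known = ("acme", "locatedIn", "Iowa")
unstated = ("acme", "hasSupplier", "globex")

cwa_answer = ask_closed_world(facts, unstated)  # treated as a definite "no"
owa_answer = ask_open_world(facts, unstated)    # left open until more is learned
```

For a transaction system the definite "no" is what you want: an order not in the database does not exist. For knowledge management the same answer asserts something nobody actually knows, whereas the open-world "unknown" lets new facts be added later without contradicting earlier conclusions.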
The absolute applicability of the semantic Web stack to an open-world circumstance is the elephant in the room [1]. By itself, the open world mindset provides no assurance of gaining insight or wisdom. But, absent it, we place thresholds on information and understanding that may neither be affordable nor achievable with traditional, closed-world approaches.
And, by either serendipity or some cosmic beauty, the open world mindset also enables incremental development, testing and refinement. Even if my basic argument of the open world advantage for knowledge management purposes is wrong, we can test that premise at low cost and risk. So, within available budget, pick a doable proof-of-concept, and decide for yourself.
The Foundations for the Open Semantic Enterprise
The seven pillars above are not magic bullets and each is likely not absolutely essential. But, based on today’s understandings and with still-emerging use cases being developed, we can see our open semantic enterprise as resulting from the interplay of these seven factors:

Thirty years of disappointing knowledge management projects and much wasted money and effort demand that better ways be found. On the other hand, until recently, too much of the semantic Web discussion has been either revolutionary (“change everything!!”) or argued from pie-in-the-sky bases. Something needs to give.
Our work over the past few years, and especially over the last 12 months, tells us that meaningful semantic Web initiatives can be mounted in the enterprise with potentially huge benefits, all at manageable risks and costs. These seven pillars point the way to how this might happen. What is now required is that eighth pillar: you.
Structured Dynamics and its Citizen DAN project have been selected as one of the finalists to proceed with a formal proposal for the 2010 $5 million Knight News Challenge. The proposal extends SD’s basic structWSF and conStruct Drupal frameworks to provide a data appliance and network (DAN) to support citizen journalists with data and analysis at the local, community level.
Thanks to all of you who submitted votes in support of the earlier draft proposal. The News Challenge received 2,489 proposals for the 2010 contest, according to Jose Zamora, journalism program associate at the Knight Foundation. According to the Nieman Journalism Lab, Zamora said 65 percent of proposals came through the closed category and 35 percent were open.
The next-round full proposals are due by January 31. Eventual winners are slated to be announced around mid-June 2010.
According to iProspect, about 56 percent of users use search engines every day, based on a population of which more than 70 percent use the Internet more than 10 hours per week.[1] The average knowledge worker spends 2.3 hrs per day — or about 25% of work time — searching for critical job information.[2] IDC estimates that enterprises employing 1,000 knowledge workers may waste well over $6 million per year each in searching for information that does not exist, failing to find information that does, or recreating information that could have been found but was not.[3]
Vendors and customers often use time savings by knowledge workers as a key rationale for justifying a document or content initiative. This comes about because many studies over the years have noted that white collar employees spend a consistent 20% to 25% of their time seeking information. The premise is that more effective search will save time and drop these percentages. For example, EDS has suggested that improvements of 50 percent in the time spent searching for data can be achieved through improved consolidation and access to data.[4]
Using these premises, consultants often calculate that every 1% reduction in the total work time devoted to search works out illustratively on a fully burdened basis as a big cost savings benefit:
$50,000 (base salary) * 1.8 (burden rate) * 1.0% = $900/ employee
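For reference, the consultants' arithmetic is trivial to reproduce. The following is only a minimal Python sketch of the calculation above; the function and parameter names are my own illustrative choices, and the inputs are the same illustrative figures, not measured data.

```python
def search_savings_per_employee(base_salary, burden_rate, reduction_fraction):
    """Annual fully burdened cost attributed to a reduction in search time.

    reduction_fraction is the share of total work time saved (0.01 means 1%).
    """
    return base_salary * burden_rate * reduction_fraction


# Illustrative inputs taken from the formula above.
savings = search_savings_per_employee(
    base_salary=50_000, burden_rate=1.8, reduction_fraction=0.01
)
print(f"${savings:,.0f} per employee per year")
```

The calculation silently assumes that every hour not spent searching becomes an hour of other productive work.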
Beware such facile analysis!
The fact that many studies over the years have noted white collar employees spend a consistent 20% to 25% of their time devoted to search suggests it is the “satisficing” allocation of time to information search. (In other words, knowledge workers are willing to devote a quarter of their time to finding relevant information; the remainder for analysis and documentation.)
Thus, while better tools to aid better discovery may lead to finding better information and making better decisions more productively — an important justification in itself — there may not result a strict time or labor savings from more efficient search.[5] Be careful of justifying project expenditures based on “time savings” related to search. Search is likely to remain the “25% solution.” The more relevant question is whether the time that is spent on search produces better information or not.
This Friday brown bag leftover was first placed into the AI3 refrigerator on September 14, 2005. No changes have been made to the original posting.
In speaking of the semantic Web, it is not infrequent that the open world assumption (OWA) gets mentioned. What this post argues is that this somewhat obscure concept may hold within it the key as to why there have been decades of too-frequent failures in the enterprise in business intelligence, data warehousing, data integration and federation, and knowledge management.
This is a fairly bold assertion. In order to support it, we first need to look to the logic and mindset assumptions associated with traditional relational data management and the semantic Web. We then need to look to the nature of knowledge itself and its relation to data federation. It is in this intersection that the key to decades of faulty premises may reside.
The main argument is that the closed world assumption (CWA) and its prevalent mindset in traditional database systems have hindered the ability of enterprises and the vendors that support them to adopt incremental, low-risk means to knowledge systems and management. CWA, in turn, has led to over-engineered schema, too-complicated architectures and massive specification efforts that have led to high deployment costs, blown schedules and brittleness.
The good news is that abandoning these failed practices and embracing the open world approach can be done immediately based on existing assets. Simply shifting from the closed world to open world premise can, I argue, improve the odds for enterprise IT success in these areas.
It is time to meet the elephant in the room.
It is, of course, a bit of editorial hyperbole to label most enterprise initiatives in business intelligence and knowledge management as being failures over the past few decades. And, insofar as failures have occurred, I also do not believe they are the result of vendor greed or cynicism, or IT management mistakes or incompetence. Rather, I believe the fault resides in the attempt to pound a square peg (relational model) into a round hole (knowledge representation).
The scope of these failures is not known. We have seen anecdotal claims of trillions of dollars in annual losses due to IT project failures worldwide; failure rates for major IT projects in the 65% to 80% range; and analyses of waste and failures in individual firms that are fairly eye-popping [1]. The real point of this post is not to try to quantify these problems. However, in my many years within IT it has been a common perception and concern that many, if not most, large-scale information technology deployments have disappointed in one way or another.
These disappointments range from cost overruns, to late delivery, to unmet objectives, or to low user acceptance. Many initiatives are simply cancelled before any such metrics can be documented. Whatever the absolute quantification, I think most experienced IT managers and executives would agree that these failures and disappointments have been all too commonplace.
Why might this be?
I truly believe the reasons for these disappointments do not reside in bad faith or incompetence. The potential importance of IT knowledge projects to improve competitive position, lower costs, or aid innovation for new markets is understood by all. Dilbert aside, I find it simply incomprehensible that disappointments or failures are rooted in these causes.
Rather, I suspect the root cause resides in the success of the relational model in the enterprise.
For transaction systems and for modeling narrowly bounded and structured domains (such as products, inventory or customer lists), the relational model, with its proven and optimized RDBMSs and SQL query language, has been a resounding success. It is natural to take a successful approach and try to extend it to other areas.
However, beginning with data warehouses in the 1980s, business intelligence (BI) systems in the 1990s, and the general issue of most enterprise information being bound up in documents for decades, the application of the relational model to these areas has been disappointing.
The reasons for this do not reside in areas such as storage or hardware; these areas have seen remarkable improvements over the decades. Rather, the problem resides in the nature of the relational model itself, and its lack of suitability to knowledge-based problems.
I have noted the importance of the open world assumption to the semantic enterprise in many of my more recent posts [3,4]. But I, like many others, often refer to the open world assumption with facile summaries, such as that a lack of information does not imply that the missing information is false. Yet to fully understand the implications of OWA and many of its associated assumptions, it is necessary to delve deeper.
I am using here a shorthand that poses the closed world assumption (CWA) vs. the open world assumption (OWA). Actually, the data models behind these approaches (Datalog or non-monotonic logic in the case of CWA; monotonic in the case of OWA [5]; OWA is also firmly grounded in description logics [4]) tend to be coupled with a few other assumptions. I use the shorthand of relational approach vs. (open) semantic Web approach to contrast these two models.
There are instances where the relational model can embrace the open world assumption (for example, the null in SQL) and there are instances where semantic Web approaches can be closed world (as with frame logic or Prolog or other special considerations; see conclusion). But, as generally applied and as generally understood, this contrast between typical relational practice and the semantic Web (based on RDF and OWL) tends to hold.
From a theoretical standpoint, I have found the treatment of Patel-Schneider and Horrocks [6] to be most useful in comparing these approaches. However, the Description Logics Handbook and some other varied sources are also helpful [7,5]. Many of the technical aspects summarized in the table below are from these sources; I refer you to them for more informed technical discussions:
| Relational Approach | (Open) Semantic Web Approach |
| --- | --- |
| Closed World Assumption (CWA): That which is not known to be true is presumed to be false; it needs to be explicitly stated as true. Negation as failure (NAF) is a related assumption, since it assumes as false every predicate that cannot be proven to be true. Under CWA, any statement not known to be true is false. Everything is prohibited until it is permitted. | Open World Assumption (OWA): The lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Everything is permitted until it is prohibited. |
| Unique Name Assumption (UNA): The unique name assumption (UNA) is premised that different names always refer to different entities in the world. | Duplicate Labels Allowed: OWL allows different synonym labels to be used for the same object; same names may refer to different objects. Identity assertions must be explicitly stated. |
| Complete Information: The data system at hand is assumed to be complete. (Missing information is often handled via the null statement in SQL, but that has been controversial and contentious in its own right.) This is also known as the domain-closure assumption. | Incomplete Information: A central tenet of OWA is that information is incomplete. A corollary is that the attributes of specific objects or instances may also be incomplete or partially known. |
| Single Schema (one world): A single schema is necessary to define the scope and interpretation of the world (domain at hand). | Many World Interpretations: Schema and data instance assertions are kept separate. Multiple interpretations (worlds) for the same data are possible. |
| Integrity Constraints: Integrity constraints prevent “incorrect” values from being asserted in the relational model. It is useful for validation/parsing/data input and is related to the single model that contains only the facts asserted. Strict cardinality is used for checking validation. | Logical Axioms (restrictions): Logical axioms provide restrictions through property domains and ranges. Everything can be true unless proven otherwise, and multiple possible models can satisfy the axioms. This provides more powerful inferencing, though can also be unintuitive at times. Cardinality and range restrictions exhibit different behavior for objects (inferred) or datatypes. |
| Non-monotonic Logic: The set of conclusions warranted on the basis of a given knowledge base does not increase (in fact, it likely shrinks) with the size of the knowledge base [5]. | Monotonic Logic: The hypotheses of any derived fact may be freely extended with additional assumptions. Additional assertions tend to reduce the inferences or entailments that can be applied. A new piece of knowledge cannot reduce what is known [5]. New knowledge can arise through inference. |
| Fixed and Brittle: Changing the schema requires re-architecting the database; not inherently extensible. | Reusable and Extensible: Designed from the ground up to reuse existing ontologies (axioms) and to be extensible. Database design and management can be more agile, with schema evolving incrementally. |
| Flat Structure; Strong Typing: Information organized into flat tables; linkages and connections between tables based on foreign keys or joins. Strong data typing orientation. | Graph Structure; Open Typing: Inherent graph structure, supporting of linkage and connectivity analysis. Datatypes are inherently loose, though axioms can add strong types. Datatypes treated in the same way as classes, and datatype values are treated in the same way as individual identifiers (i.e., a data value is treated as referring to an object). |
| Querying and Tooling: SQL and query optimizers well developed. Tooling well developed. Disjunction not supported; negation must be accommodated through approaches such as NAF. Sums and counts are easier due to unique name premise. Answer closure (one answer passable to a next calculation) is easier than OWA. Most tools are not suitable for any arbitrary schema. | Querying and Tooling: SPARQL and emerging rule languages used for querying; performance at scale and with broad distribution a concern. Queries require contextual information for proper set selection. Negation and disjunction are allowed and are powerful constructs. Tools generally less developed. Exciting opportunities for ontology-driven applications working against a small set of generic tools. |
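The first row of the table is the one that matters most in practice, so a toy illustration may help. The Python sketch below is only a caricature under stated assumptions: the facts, the `worksFor` predicate and the function names are all hypothetical, and real reasoners do far more than set lookups. It shows the same question receiving a different answer under each assumption.

```python
from enum import Enum


class Answer(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


# Hypothetical facts as (subject, predicate, object) tuples.
facts = {("alice", "worksFor", "acme")}
# Explicit negative assertions; only the open world reading needs these.
negated = {("alice", "worksFor", "globex")}


def ask_closed_world(fact):
    """CWA with negation as failure: whatever is not asserted is false."""
    return Answer.TRUE if fact in facts else Answer.FALSE


def ask_open_world(fact):
    """OWA: a missing fact is neither true nor false, just not known."""
    if fact in facts:
        return Answer.TRUE
    if fact in negated:
        return Answer.FALSE
    return Answer.UNKNOWN


question = ("bob", "worksFor", "acme")
print(ask_closed_world(question).value)  # false
print(ask_open_world(question).value)    # unknown
```

Nothing has been recorded about bob, so the closed world reading concludes he does not work for acme, while the open world reading declines to conclude anything.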
In well-characterized or self-contained domains (seats on a plane, books in a library, customers of a company, products sold via distribution channels), the traditional relational model works well. A closed-world assumption is performant for transaction operations with easier data validation. The number of negative facts about a given domain is typically much greater than the number of the positive ones. So, in many bounded applications, the number of negative facts is so large that their explicit representation can become practically impossible [7]. In such cases, it is simpler and shorter to state known “true” statements than to enumerate all “false” conditions.
However, the relational model is a paradigm where the information must be complete and it must be described by a single schema. Traditional databases require an agreement on a schema, which must be made before data can be stored and queried. The relational model assumes that the only objects and relationships that exist in the domain are those that are explicitly represented in the database, and that names uniquely identify objects in this domain. The result of these assumptions is that there is a single (canonical) model for relational systems where objects and relationships are in a one-to-one correspondence with the data in the database [6].
This makes CWA and its related assumptions a very poor choice when attempting to combine information from multiple sources, to deal with uncertainty or incompleteness in the world, or to try to integrate internal, proprietary information with external data.
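The unique name assumption is one place where this bites as soon as two sources are combined. As a rough Python sketch (the company names and the `count_entities` helper are hypothetical, and the identity pairs stand in for something like explicit identity assertions in an ontology), counting distinct entities depends on whether identity must be stated or is presumed from the names:

```python
def count_entities(names, same_as=()):
    """Count distinct entities given names and explicit identity assertions.

    With no assertions this behaves like the unique name assumption:
    every distinct name counts as a distinct entity.
    Every name used in same_as must also appear in names.
    """
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for left, right in same_as:
        parent[find(left)] = find(right)

    return len({find(name) for name in names})


names = ["IBM", "International Business Machines", "Acme Corp"]

una_count = count_entities(names)
merged_count = count_entities(
    names, same_as=[("IBM", "International Business Machines")]
)
print(una_count, merged_count)  # 3 2
```

Without the unique name assumption, the second figure is best read as an upper bound: unless names are also declared to be different, further identity assertions could still merge them.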
The process of describing an open, semantic Web “world” can proceed incrementally, sequentially asserting new statements or conditions. The schema in the open semantic Web — the ontology — consists of sets of statements (called axioms) that describe characteristics that must be satisfied by the ontology designer’s idea of “reasonable” states of the world. Formally, such statements correspond to logical sentences, and an ontology corresponds to a logical theory [6].
Irregularity and incompleteness are toxic to relational model design. In the open semantic Web, data that is structured differently can still be stored together via RDF triple statements (subject – predicate – object). For example, OWA allows suppliers without cities and names to be stored alongside suppliers with that information. Information can be combined about similar objects or individuals even though they have different or non-overlapping attributes. Duplicate checking now occurs based on the logic of the system and not unique name evaluations. Data validation in OWA systems can become either more complicated (via testing against restriction statements) or partially easier (via inference).
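The supplier example can be sketched in a few lines of Python. This is an illustration only: plain tuples stand in for RDF triples, and the identifiers, predicates and helper functions are hypothetical rather than drawn from any particular triple store.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) tuple.
triples = [
    ("supplier:1", "name", "Smith"),
    ("supplier:1", "city", "London"),
    ("supplier:2", "name", "Jones"),  # no city is known for this supplier
    ("supplier:3", "status", "20"),   # neither name nor city is known
]


def describe(subject, statements):
    """Collect whatever happens to be known about one subject."""
    properties = defaultdict(list)
    for s, p, o in statements:
        if s == subject:
            properties[p].append(o)
    return dict(properties)


def subjects_with(predicate, statements):
    """List the subjects that have at least one value for a predicate."""
    return sorted({s for s, p, _ in statements if p == predicate})


print(describe("supplier:3", triples))  # {'status': ['20']}
print(subjects_with("city", triples))   # ['supplier:1']
```

No schema change or null placeholder is needed to hold the second and third suppliers; their missing attributes are simply statements that have not been made.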
It is interesting to note that the theoretical underpinnings of CWA by Reiter [8] began to be understood about the same time (1978) that data federation and knowledge representation (KR) activities also began to come to the fore. CWA and later work on (for example) default reasoning [5] appear to have informed early work in description logics and its alternative OWA approach. This heavily influenced the development of the semantic Web languages RDF and OWL. However, the early path toward KM work based on the relational model also appears to have been set in this timeframe.
We are still reaping the whirlwind from this unfortunate early choice of the relational model for KR, KM and BI purposes. Moreover, though there is quite a bit of theoretical and logical discussion of the alternative OWA and CWA data models, there are surprisingly few discussions of what the implications of these models are to the enterprise. (That is, the elephant in the room.) The next two sections tackle this gap.
The above should make clear that the relational model and CWA are appropriate for defined and bounded systems. However, many of the new knowledge economy challenges are anything but defined and bounded. These applications all reside in the broad category of knowledge management (KM), and include such applications as data federation, data warehousing, enterprise information integration, business intelligence, competitive intelligence, knowledge representation, and so forth.
Let’s look at the characteristics of such knowledge systems and why they are more appropriately modeled through the open world assumption (OWA) rather than the relational model and CWA:
To be sure, there are many circumstances where large stores of instance data and their analysis are necessary for knowledge purposes. In these cases, hybrid CWA-OWA systems (see conclusion) may make sense.
But, as these points emphasize, the general assembly and organization of knowledge is open world in nature. Trying to fit KM and related applications into the straitjacket of the relational model is folly. The relational model and CWA for KM is the elephant in the room. Three decades of failures and disappointments affirm this fact.
Besides the native match of knowledge systems with OWA, there are sound business arguments for embracing the (open) semantic enterprise as well. These arguments can be summarized as lower risk, lower cost, faster deployment, and more agile responsiveness. What is there not to love?
It should now be clear that it is possible to start small in testing the transition to a semantic enterprise. These efforts can be done incrementally and with a focus on early, high-value applications and domains.
Open world does not necessarily mean open data and it does not mean open source. Open world is simply a way to think about the information we have and how we act on it. OWA technologies are neutral to the question of open or public sources. The techniques can equivalently be applied to internal, closed, proprietary data and structures. Moreover, the technologies can themselves be used as a basis for bringing external information into the enterprise. An open world assumption merely asserts that we never have all necessary information and lacking that information does not itself lead to any conclusions.
Further, we need not abandon past practices. There is much that can be done to leverage existing assets. Indeed, those prior investments are often the requisite starting basis to inform semantic initiatives. However, in leveraging those assets, it is important that the enterprise begin to embrace and understand the open world assumption.
We also see that RDF and OWL, while important behind the scenes as a canonical data model and languages for organizing this information, need not be exposed as such to most users. Most instance data can be expressed as is with the data languages of choice such as XML, JSON or whatever. We are merely using the techniques of the (open) semantic Web as the data model to organize our information assets at hand. These assets need not themselves be represented in the native RDF or OWL languages.
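To make this concrete, here is a minimal Python sketch of instance data staying in JSON while being read into a triple-style model behind the scenes. The record layout, the `id` key and the `record_to_triples` helper are assumptions for illustration; a real deployment would map keys to proper predicates in an ontology.

```python
import json


def record_to_triples(record, id_key="id"):
    """Turn one flat JSON object into (subject, predicate, object) tuples."""
    subject = record[id_key]
    return [
        (subject, key, value)
        for key, value in record.items()
        if key != id_key
    ]


# Hypothetical source data, left in its native JSON form.
raw = """
[
  {"id": "cust:1", "name": "Ann", "city": "Oslo"},
  {"id": "cust:2", "name": "Raj"}
]
"""

triples = [
    triple
    for record in json.loads(raw)
    for triple in record_to_triples(record)
]

for triple in triples:
    print(triple)
```

The second record has no city and needs no special handling, and users of the source system continue to see ordinary JSON.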
Thus, open world frameworks provide some incredibly important benefits for knowledge management applications in the enterprise:
One might argue, as we believe, that the biggest impediment to the semantic enterprise is the mind shift necessary to start thinking about and accepting the open world premise. Again, this perspective is not applicable to all problems and domains. But, where it is, much can be left in place and leveraged with semantic technologies, so long as the enterprise begins to look at these existing assets through a different open-world lens.
In most real world circumstances, there is much we don’t know and we interact in complex and external environments. Knowledge management inherently occupies this space. Ultimately, data interoperability implies a global context. Open world is the proper logic premise for these circumstances. Via the OWA framework, we can readily change and grow our conceptual understanding and coverage of the world, including incorporation of external ontologies and data. Since this can easily co-exist with underlying closed-world data, the semantic enterprise can readily bridge both worlds.
So, we can now define the open semantic enterprise as one that embraces OWA for its knowledge management applications and engages in rapid and low-risk testing of incremental learning. The open world assumption is the proper framework to reverse decades of failure and disappointment for knowledge projects in the enterprise.
In our own discussions about ABox – TBox splits [10], we have, in essence, supported a hybrid OWA-CWA argument for the enterprise. It is beyond the scope of this current piece to describe these approaches in detail, but some of the options include local CWA, the addition of rule languages and constraints to basic OWA, use of the new OWL 2, TopQuadrant’s SPIN notation, and others [11]. I will address some of these in a later post.
There are also questions about performance and scalability with open semantic technologies. Here, too, progress is rapid, with billion triple thresholds rapidly falling with daily reports of better performance [12]. Fortunately, the incremental approach that we advocate herein dovetails well with these rapid developments. There should be no arguing the benefits of a successful incremental project in a smaller domain, perhaps repeated across multiple domains, in comparison to large, costly initiatives that never produce (even though their underlying technologies are performant).
There are also architecture issues inherent in these OWA designs. In one of our next posts, we return to the topic of Web-oriented architecture and its role in support of these OWA knowledge management initiatives.
In the end, there is no substitute for doing and learning. KM based on OWA for the open semantic enterprise can be started today, in a focused manner with tangible benefits and outcomes, at low cost and risk. Let’s push the elephant out of the room and let the learning and doing begin.