Posted:February 15, 2010

Two Faces in Circle, from http://energeticrelations.com/Our Own Approach is Adaptive and Incremental

It is gratifying to see the emergence of the term semantic enterprise, with much increased attention and commentary. But, similar to different styles and patterns in software programming, there is not a single (nor best, depending on circumstance) way to approach becoming a semantic enterprise.

In this piece I contrast two styles. The more traditional and familiar one is comprehensive, complete and “engineered” in its approach. The second, and emerging style, is more adaptive and incremental. While Structured Dynamics is a proponent and thought leader for the adaptive style, the use and applicability of either approach is really a function of objectives and circumstances. The choice of approach depends on use case, and should not be a dogmatic one.

Any time a contrast is posed, one should be on guard about setting up a rhetorical strawman. There may perhaps be a bit of this flavor in this article; if so, it is unintended. It is probably best to realize that there is a gradient — or spectrum — of possible approaches between these contrasting styles. The real message is to understand these differences such that you can comfortably place your own organization at the right points along this spectrum.

A Spectrum of Advantages and Differences

The general idea of semantics in the enterprise preceeds the use of the term, having been somewhat captured before by the ideas of enterprise application integration, enterprise information integration and other concepts even related to data federation and data warehousing stretching back to the 1980s. However, as a specific label, we can look back to the first mentions in the late 1990s and more concerted attention beginning from about 2002 or so onward [1]. As another indicator, since 2005 the Semantic Technology Conference has given specific prominence to the enterprise [2].

Throughout this period, the sense from academic papers, many vendors, and most pundits [3] has been on things like automated reasoning, machine-aided decision making, aspects of artificial intelligence, and so forth. The general tone is often framed as “revolution” or “massive changes” or something “entirely new.” If you are a consultant or software/implementation vendor — especially where VC money is backing the venture with hopes for big returns and home runs — it may make cynical sense to sell such large and costly change.

I believe there are circumstances where the Semantic Enterprise writ this large may make sense and be financially justified. But, this kind of “big change” view has also seen relatively few visible (or successful) deployments. It has colored what it means to be a semantic enterprise. And, I believe, it has weakened market credibility by perhaps overpromising and underdelivering. The conventional view of what it is be a semantic enterprise deserves to be balanced.

So, as we balance this understanding of the semantic enterprise to one that is more nuanced, we can contrast the characteristics of the two apposite styles as follows:

Characteristics of the
Comprehensive, ‘Engineered’ Style
Characteristics of the
Adaptive, Incremental Style
  • A focus on a more complete, comprehensive coverage of the semantics in the domain
  • More enterprise-wide, less partial or departmental
  • Greater emphasis on “closed world” approaches [4]; more akin to relational database architecting and schema
  • Expansion is possible, but effort may be somewhat complex
  • A general implication is to replace or supplant existing information structures with semantic ones
  • Not necessarily based on semantic Web standards and languages [5] (e.g., may include Common Logic, frame logics, etc.)
  • Richer set of predicates (relations)
  • Though a distinction is maintained between schema and instances, their separation may not be consistently (physically) enforced
  • Often more complicated inferencing and logic tests
  • More complete enumeration and characterization of items
  • Much process around semantics agreement across groups
  • Fairly well-developed implementation tools, including for ontology engineering
  • Implementation times in months to years
  • Implementation costs akin to traditional large-scale IT projects
  • An emphasis on a simpler, incremental, “learn as you go” approach
  • Start with single departments or limited vertical apps
  • Embedded in the “open world” approach [4], with incorporation of external information
  • Design and approach inherently allows incremental expansion and adaptation
  • A key premise is to build from and leverage existing information structures, vocabularies and assets
  • Fully based on semantic Web standards and languages [5], often including linked data [6]
  • Tends to start simply with hierarchical or related concepts (e.g., SKOS)
  • Conscious distinction in the structure for handling schema separate from instances [7]
  • Inferencing logic based more on concept matching, or parent-child or part-of relationships
  • Degree of item characterization based on current scope
  • Initial semantic matching can be driven from existing assets
  • Fairly well-developed implementation tools, except for how to engage publics in the development process
  • Implementation times in weeks to months
  • Implementation costs driven by available budgets (and thus scope)

Note we have labeled the conventional approach as the “comprehensive, engineering” style; its contrast, and the one we position more closely to, is the “adaptive, incremental” style.

[Others have posited contrasting styles, most often as "top down" v. "bottom up." However, in one interpretation of that distinction, "top down" means a layer on top of the existing Web [8]. On the other hand, “top down” is more often understood in the sense of a “comprehensive, engineered” view, consistent with my own understanding [9]. Yet no matter which characterization, neither captures what I feel to be the more important considerations of mindset, logic and premise.]

Though the table above contrasts many points, I think there are two main distinctions to the adaptive approach. First, it firmly embraces the open world assumption. OWA is key to an incremental, “learn as you go” deployment that is also well suited to incorporation of external information. The second main distinction is to leverage and build from existing assets.

A Spectrum of Applications

Yet as noted in the opening, which of these approaches makes better sense depends on circumstance. One aspect of circumstance is available budget and deployment times for pilots or proofs-of-concept. Another aspect, of course, is the planned use or application for the deployment.

These are by no means hard distinctions, but in general we can see these contrasting approaches applying to the following uses:

Applications and Uses for the
Comprehensive, ‘Engineered’ Style
(i.e., more CWA driven)
Applications and Uses for the
Adaptive, Incremental Style
(i.e., more OWA driven)
  • Bounded, “inward” applications (high degree of control and completeness)
  • Engineering enterprises
  • Technical domains and organizations
  • Aeronautics
  • Pharmaceuticals
  • Chemicals
  • Petroleum
  • Energy
  • A/E firms (construction)
  • External facing applications, organizations (customers, incorporation of external data)
  • Faceted Search
  • Taxonomy updates
  • Multi-domain master data management (MDM)
  • Simple (initially) inferencing
  • Consumer products
  • Finance
  • Health care
  • Knowledge enterprises

A critical distinction is the nature of the enterprise itself. “External-facing” enterprises or functions that want or need to incorporate much external information (say, marketing or competitive intelligence) are advised to look closely at the adaptive approach. Organizations that have more complete control over their circumstances should perhaps focus on the conventional approach.

Adoption Thresholds and Risks

In previous writings I have pointed to the manifest benefits that can accrue to the semantic enterprise [see, esp. 10]. But we also have witnessed nearly a decade of promotion for semantics in the enterprise, with perhaps a lack of progress in some areas or unmet promises in others. These raise questions and skepticism of the real eventual costs and benefits.

I believe some of this skepticism is inherent with anything new — the general IT fatigue from what the current “next great thing” might be. But I also believe that some of this skepticism results from an approach to semantics in the enterprise that is both lengthy to deploy and high cost.

The key advantage of the adaptive, incremental approach is that the whole IT game in the enterprise can change. An open world approach enables adoption as it proves itself and as budgets allow. Commitments made under this approach have, in essence, permanent value. Past fears and concerns about making “wrong” bets no longer apply. With learning, targets can be re-adjusted, structure re-defined and applications re-focused, all as new discoveries and broadening scope dictate.

This does not make the adaptive approach better than the conventional one. But, it does make it less risky and, well, more adaptive.


[1] For example, the earliest Google mentions on “semantic enterprise” date to about 1998 or 1999. In 2002, the University of Georgia and Amit Sheth offered the first known academic course on the Semantic Enterprise; see http://lsdis.cs.uga.edu/SemanticEnterprise/.
[2] See the conference guide for the Semantic Technology Conference 2005. The sixth one, the 2010 Semantic Technology Conference, is upcoming on June 21-25 in San Francisco.
[3] See, for example, Mitchell Ummell, ed., 2009. “The Rise of the Semantic Enterprise,” special dedicated edition of the Cutter IT Journal, Vol. 22(9), 40 pp., September 2009. See http://www.cutter.com/offers/semanticenterprise.html (after filling out contact form). Partially in response to this conventional view, I wrote [10]. In that article I offered as a working definition that “a semantic enterprise is one that adopts the languages and standards of the semantic Web . . . and applies them to the issues of information interoperability, preferably using the best practices of linked data.” That happens to be Structured Dynamics’ preferred definition, though as this posting indicates, there is a spectrum of definitions of the term.
[4] See, M.K. Bergman, 2009. “The Open World Assumption: Elephant in the Room“, AI3:::Adaptive Information blog, December 21, 2009.
[5] See for example RDF, RDFS, OWL , SKOS and SPARQL and others.
[6] Linked data is a set of best practices for publishing and deploying instance and class data using the RDF data model. Two of the best practices are to name the data objects using uniform resource identifiers (URIs), and to expose the data for access via the HTTP protocol. Both of these practices enable the Web to become a distributed database, which also means that Web architectures can also be readily employed.

[7] We use a basis in description logics for defining the roles and splits in schema and instances. As we define it:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[8] One article that got quite a bit of play a few years back was A. Iskold, 2007. “Top Down: A New Approach to the Semantic Web,” in ReadWrite Web, Sept. 20, 2007. The problem with this terminology is that it offers a completely different sense of “top down” to traditional uses. In Iskold’s argument, his “top down” is a layering on top of the existing Web.
[9] The more traditional view of “top down” with respect to the semantic Web is in relation to how the system is constructed. This is reflected well in a presentation from the NSF Workshop on DB & IS Research for Semantic Web and Enterprises, April 3, 2002, entitled “The ‘Emergent, Semantic Web: Top Down Design or Bottom Up Consensus?“. Under this view, top down is design and committee-driven; bottom up is more decentralized and based on social processes, which is more akin to Iskold’s “top down.”
[10] M.K. Bergman, 2009. “Fresh Perspectives on the Semantic Enterprise,” AI3:::Adaptive Information blog, Sept. 28, 2009.

Posted by AI3's author, Mike Bergman Posted on February 15, 2010 at 10:36 am in Adaptive Information, Semantic Web, Structured Dynamics | Comments (4)
The URI link reference to this post is: http://www.mkbergman.com/866/two-contrasting-styles-for-the-semantic-enterprise/
The URI to trackback this post is: http://www.mkbergman.com/866/two-contrasting-styles-for-the-semantic-enterprise/trackback/
Posted:February 2, 2010

Inkscape Logo

The Inkscape Process Can Also Aid Image Interchanges with Powerpoint

As we see more collaboration forums emerge, one question that naturally arises is the joint authoring or editing of images. This is particularly important as “official” slide decks or presentations come to the fore.

There are perhaps many different ways to skin this cat. In this article, I describe how to do so using the free, open source SVG editing program, Inkscape.

Why Inkscape?

Like many of you, I have been creating and editing images for years. I am by no means a graphics artist, but images and diagrams have been essential for communicating my work.

Until a few years back, I was totally a bitmap man. I used Paint Shop Pro (bought by Corel in 2004 and getting long in the tooth) and did a lot of copying and pasting.

I switched to Inkscape about two years ago for the following reasons:

  • I wanted re-use of image components via re-sizing and re-coloring, etc., and vector graphics are far superior to raster images for this purpose
  • I wanted a stable, free, usable editor and Inkscape was beginning to mature nicely (the current version 0.47 is even nicer and more stable)
  • Its SVG (scalable vector graphics) format was a standard adopted by the W3C after initial development by Adobe
  • SVG is an easily read and editable XML format
  • There was a growing source of online documentation
  • There was a growing repository of SVG graphics examples, including the broadscale use within Wikipedia (a good way to find stuff from this site is with the search “keywords site:http://commons.wikimedia.org filetype:svg” on your favorite search engine, after substituting your specific keywords).

How to Collaborate with Inkscape

Once you have a working image in Inkscape, make sure all collaborators have a copy of the software. Then:

  1. Isolate the picture (sometimes there are multiple images in a single file) by deleting all extraneous image stuff in the file
  2. From the toolbar, click on the Zoom to fit drawing in window icon [Zoom to fit drawing in window]; this will resize and put your target image in the full display window
  3. Under File -> Document Properties … check Show page border and Show border shadow, then Fit page to selection. This helps size the image properly in the exported file for sharing or collaboration
  4. Save the file as an *.svg option, and name the file with a date/time stamp and author extension (useful for tracking multiple author edits over time)
  5. If in multiple author mode, make sure who has current “ownership” of the image is clear.

How to Share with Powerpoint

Of course, it is more often the case that not all collaborators may have a copy of Inkscape or that the image began in the SVG format.

The image below began as a Windows Powerpoint clip art file, which has then gone through some modifications. Note the bearded guy’s hand holding the paper is out of registry (because I screwed up in earlier editing, but I also can easily fix because it is a vector image!  ;)  ). Also note we have the border from Inkscape as suggested above.  This file, BTW, is people.png, and was created as a PNG after a screen capture from Inkscape:

PNG representation of an SVG

When beginning in Powerpoint or as clip art, files in the format of Windows metafile (*.wmf) or extended WMF (*.emf) work well. (For example, you can download and play with the native Inkscape format of people.svg, or the people.wmf or people.emf versions of the image above.) If you already have images in a Powerpoint presentation, save in one of these two formats, with (*.emf) preferred. (EMF is generally better for text.)

You can open or load these files directly into Inkscape. Generally, they will come in as a group of vectors; to edit the pieces, you should “ungroup.”

After editing per the instructions in the previous section, if you need to re-insert back into Powerpoint, please use the *.emf format (and make sure you do not save text as paths).

For example, see the following PNG graphic taken from a Inkscape file (figure_text.svg):

PNG representation of an SVG

We can save it as an EMF (figure_textpath.emf) to a Powerpoint, with the option of converting text to paths:

Text-to-path EMF

Or, we can save it as an EMF (figure_text.emf) to a Powerpoint, only this time not converting text to paths and then “ungrouping” once in Powerpoint:

EMF with no text to path

Note the latter option, text not as path, is the far superior one. However, also note that borders are added to the figures and vertical text is rotated 90o back to horizontal. Nonetheless, the figure is fully editable, including text. Also, if the original Inkscape figures are constructed with lines of the same color as fills, the border conversion also works well.

Frankly, especially with text, because there can be orientation and other changes going from Inkscape to Powerpoint, I recommend using Inkscape and its native SVG for all early modifications and to keep a canonical copy of your images. Then, prior to completion of the deck, save as EMF for import into Powerpoint and then clean up. If changes later need to be made to the graphic, I recommend doing so in Inkscape and then re-importing.

Other Alternatives

I should note there is an option, as well, in Inkscape to convert raster images to vector ones (use Path -> Trace bitmap … and invoke the multiple scans with colors). This is doable, but involves quite a bit of image copying, manipulation and color separation to achieve workable results. You may want to see further Inkscape’s documentation on tracing, or more fully this reference dealing with color.

Of course, there are likely many other ways to approach these issues of collaboration and sharing. I will leave it to others to suggest and explain those options.

Posted:January 26, 2010

AI3's Ontologies category
140 Tools: 20 Must Haves, 70 Possible Usefuls, and 50 Has Beens and Marginals

Well, for another client and another purpose, I was goaded into screening my Sweet Tools listing of semantic Web and -related tools and to assemble stuff from every other nook and cranny I could find. The net result is this enclosed listing of some 140 or so tools — most open source — related to semantic Web ontology building in one way or another.

Ever since I wrote my Intrepid Guide to Ontologies nearly three years ago (and one of the more popular articles of this site, though it is now perhaps a bit long in the tooth), I have been intrigued with how these semantic structures are built and maintained. That interest, in no small measure, is why I continue to maintain the Sweet Tools listing.

As far as I know, the following is the largest and most comprehensive listing of ontology building tools available. I broadly interpret the classification of ‘ontology building’; I include, for example, vocabulary extraction and prompting tools, as well as ontology visualization and mapping.

There are some 140 tools, perhaps 90 or so are still in active use. (Given the scope, not every tool could be inspected in detail. Some listed as being perhaps inactive may not be so, and others not in that category perhaps should be.) Of the entire roster of tools, somewhere on the order of 12 to 20 are quite impressive and deserving of local installation, test runs, and close inspection.

There are relatively few tools useful to non-specialists (or useful to engaging knowledgeable publics in the ontology-building exercise). There appear to be key gaps in the entire workflow from domain scoping and initial ontology definition and vocabulary candidates, to longer-term maintenance and revision. For example, spreadsheets would appear to be a possible useful first step in any workflow process (which is why irON is listed), but the spreadsheet tool per se is not listed herein (nor are text editors).

I surely have missed some tools and likely improperly assigned others. Please drop me an email or comment on this post with any revisions or suggestions.

Some Worth A Closer Look

In my own view, there are some tools that definitely deserve a closer look. My favorite candidates — for very different reasons and for very different places in the workflow — are (in no particular order): Apelon DTS, irON, FlexViz, Knoodl, Protégé, diagramic.com, BooWa, COE, ontopia, Anzo, PoolParty, Vine (and voc2rdf), Erca, Graphl, and GrOWL. Each one of these links is more fully described below. Also, all tools in the Vocabulary Prompting Tools category (which also includes extraction) are worth reviewing since all or nearly all have online demos.

Other tools may also be deserving, depending on use case. Some of the more specific analysis and conversion tools, for example, are in the Miscellaneous category.

Also, some purists may quibble with why some tools are listed here (such as inclusion of some stuff related to Topic Maps). Well, my answer to that is there are no real complete solutions, and whatever we can pragmatically do today requires glueing together many disparate parts.

Comprehensive Ontology Tools

  • Altova SemanticWorks is a visual RDF and OWL editor that auto-generates RDF/XML or nTriples based on visual ontology design. No open source version available
  • Amine is a rather comprehensive, open source platform for the development of intelligent and multi-agent systems written in Java. As one of its components, it has an ontology GUI with text- and tree-based editing modes, with some graph visualization
  • The Apelon DTS (Distributed Terminology System) is an integrated set of open source components that provides comprehensive terminology services in distributed application environments. DTS supports national and international data standards, which are a necessary foundation for comparable and interoperable health information, as well as local vocabularies. Typical applications for DTS include clinical data entry, administrative review, problem-list and code-set management, guideline creation, decision support and information retrieval.. Though not strictly an ontology management system, Apelon DTS has plug-ins that provide visualization of concept graphs and related functionality that make it close to a complete solution
  • DOME is a programmable XML editor which is being used in a knowledge extraction role to transform Web pages into RDF, and available as Eclipse plug-ins. DOME stands for DERI Ontology Management Environment
  • FlexViz is a Flex-based, Protégé-like client-side ontology creation, management and viewing tool; very impressive. The code is distributed from Sourceforge; there is a nice online demo available; there is a nice explanatory paper on the system, and the developer, Chris Callendar, has a useful blog with Flex development tips
  • Knoodl facilitates community-oriented development of OWL based ontologies and RDF knowledge bases. It also serves as a semantic technology platform, offering a Java service-based interface or a SPARQL-based interface so that communities can build their own semantic applications using their ontologies and knowledgebases. It is hosted in the Amazon EC2 cloud and is available for free; private versions may also be obtained. See especially the screencast for a quick introduction
  • The NeOn toolkit is a state-of-the-art, open source multi-platform ontology engineering environment, which provides comprehensive support for the ontology engineering life-cycle. The v2.3.0 toolkit is based on the Eclipse platform, a leading development environment, and provides an extensive set of plug-ins covering a variety of ontology engineering activities. You can add these plug-ins or get a current listing from the built-in updating mechanism
  • ontopia is a relative complete suite of tools for building, maintaining, and deploying Topic Maps-based applications; open source, and written in Java. Could not find online demos, but there are screenshots and there is visualization of topic relationships
  • Protégé is a free, open source visual ontology editor and knowledge-base framework. The Protégé platform supports two main ways of modeling ontologies via the Protégé-Frames and Protégé-OWL editors. Protégé ontologies can be exported into a variety of formats including RDF(S), OWL, and XML Schema. There are a large number of third-party plugins that extends the platform’s functionality
    • Protégé Plugin Library – frequently consult this page to review new additions to the Protégé editor; presently there are dozens of specific plugins, most related to the semantic Web and most open source
    • Collaborative Protégé is a plug-in extension of the existing Protégé system that supports collaborative ontology editing as well as annotation of both ontology components and ontology changes. In addition to the common ontology editing operations, it enables annotation of both ontology components and ontology changes. It supports the searching and filtering of user annotations, also known as notes, based on different criteria. There is also an online demo
  • TopBraid Composer is an enterprise-class modeling environment for developing Semantic Web ontologies and building semantic applications. Fully compliant with W3C standards, Composer offers comprehensive support for developing, managing and testing configurations of knowledge models and their instance knowledge bases. It is based on the Eclipse IDE. There is a free version (after registration) for small ontologies.

Not Apparently in Active Use

  • Adaptiva is a user-centred ontology building environment, based on using multiple strategies to construct an ontology, minimising user input by using adaptive information extraction
  • Exteca is an ontology-based technology written in Java for high-quality knowledge management and document categorisation, including entity extraction. Though code is still available, no updates have been provided since 2006. It can be used in conjunction with search engines
  • IODT is IBM’s toolkit for ontology-driven development. The toolkit includes EMF Ontolgy Definition Metamodel (EODM), EODM workbench, and an OWL Ontology Repository (named Minerva)
  • KAON is an open-source ontology management infrastructure targeted for business applications. It includes a comprehensive tool suite allowing easy ontology creation and management and provides a framework for building ontology-based applications. An important focus of KAON is scalable and efficient reasoning with ontologies
  • Ontolingua provides a distributed collaborative environment to browse, create, edit, modify, and use ontologies. The server supports over 150 active users, some of whom have provided us with descriptions of their projects. Provided as an online service; software availability not known.

Vocabulary Prompting Tools

  • AlchemyAPI from Orchestr8 provides an API based application that uses statistical and natural language processing methods. Applicable to webpages, text files and any input text in several languages
  • BooWa is a set expander for any language (formerly known as SEALS); developed by RC Wang of Carnegie Mellon
  • Google Keywords allows you to enter a few descriptive words or phrases or a site URL to generate keyword ideas
  • Google Sets for automatically creating sets of items from a few examples
  • Open Calais is free limited API web service to automatically attach semantic metadata to content, based on either entities (people, places, organizations, etc.), facts (person ‘x’ works for company ‘y’), or events (person ‘z’ was appointed chairman of company ‘y’ on date ‘x’). The metadata results are stored centrally and returned to you as industry-standard RDF constructs accompanied by a Globally Unique Identifier (GUID)
  • Query-by-document from BlogScope has a nice phrase extraction service, with a choice of ranking methods. Can also be used in a Firefox plug-in (not texted with 3.5+)
  • SemanticHacker (from Textwise) is an API that does a number of different things, including categorization, search, etc. By using ‘concept tags’, the API can be leveraged to generate metadata or tags for content
  • TagFinder is a Web service that automatically extracts tags from a piece of text. The tags are chosen based on both statistical and linguistic analysis of the original text
  • Tagthe.net has a demo and an API for automatic tagging of web documents and texts. Tags can be single words only. The tool also recognizes named entities such as people names and locations
  • TermExtractor extracts terminology consensually referred in a specific application domain. The software takes as input a corpus of domain documents, parses the documents, and extracts a list of “syntactically plausible” terms (e.g. compounds, adjective-nouns, etc.)
  • TermFinder uses Poisson statistics, the Maximum Likelihood Estimation and Inverse Document Frequency between the frequency of words in a given document and a generic corpus of 100 million words per language; available for English, French and Italian
  • TerMine is an online and batch term extractor that emphasizes part of speech (POS) and n-gram (phrase extraction). TerMine is the terminological management system with the C-Value term extraction and AcroMine acronym recognition integrated
  • Topia term extractor is a part-of-speech and frequency based term extraction tool implemented in python. Here is a term extraction demo based on this tool
  • Topicalizer is a service which automatically analyses a document specified by a URL or a plain text regarding its word, phrase and text structure. It provides a variety of useful information on a given text including the following: Word, sentence and paragraph count, collocations, syllable structure, lexical density, keywords, readability and a short abstract on what the given text is about
  • TrMExtractor does glossary extraction on pure text files for either English or Hungarian
  • Wikify! is a system to automatically “wikify” a text by adding Wikipedia-like tags throughout the document. The system extracts keywords and then disambiguates and matches them to their corresponding Wikipedia definition
  • Yahoo! Placemaker is a freely available geoparsing Web service. It helps developers make their applications location-aware by identifying places in unstructured and atomic content – feeds, web pages, news, status updates – and returning geographic metadata for geographic indexing and markup
  • Yahoo! Term Extraction Service is an API to Yahoo’s term extraction service, as well as many other APIs and services in a variety of languages and for a variety of tasks; good general resource. The service has been reported to be shut down numerous times, but apparently is kept alive due to popular demand.

Initial Ontology Development

  • COE COE (CmapTools Ontology Editor) is a specialized version of the CmapTools from IMHC. COE — and its CmapTools parent — is based on the idea of concept maps. A concept map is a graph diagram that shows the relationships among concepts. Concepts are connected with labeled arrows, with the relations manifesting in a downward-branching hierarchical structure. COE is an integrated suite of software tools for constructing, sharing and viewing OWL encoded ontologies based on these constructs
  • Conzilla2 is a second generation concept browser and knowledge management tool with many purposes. It can be used as a visual designer and manager of RDF classes and ontologies, since its native storage is in RDF. It also has an online collaboration server
  • http://diagramic.com/ has an online Flex network graph demo, which also has a neat facility for quick entry and visualization of relationships; mostly small scale; pretty cool. Does not appear to be code available anywhere
  • DogmaModeler is a free and open source, ontology modeling tool based on ORM. The philosophy of DogmaModeler is to enable non-IT experts to model ontologies with a little or no involvement of an ontology engineer; project is quite old, but the software is still available and it may provide some insight into naive ontology development
  • Erca is a framework that eases the use of Formal and Relational Concept Analysis, a neat clustering technique. Though not strictly an ontology tool, Erca could be implemented in a work flow that allows easy import of formal contexts from CSV files, then algorithms that computes the concept lattice of the formal contexts that can be exported as dot graphs (or in JPG, PNG, EPS and SVG formats). Erca is provided as an Eclipse plug-in
  • GraphMind is a mindmap editor for Drupal. It has the basic mindmap features and some Drupal specific enhancements. There is a quick screencast about how GraphMind looks like and what is does. The Flex source is also available from Github
  • GrOWL is the software framework to provide graphical, intuitive browsing and editing of knowledge maps. GrOWL is open source and is used in several projects worldwide. None of the online demos apparently work, but the screenshots look interesting and the code is still available
  • irON using spreadsheets, via its notation and specification. Spreadsheets can be used for initial authoring, esp if the irON guidelines are followed. See further this case study of Sweet Tools in a spreadsheet using irON (commON)
  • ITM T3 stands for Terminology, Thesaurus, Taxonomy, Metadata dictionary. ITM T3 includes a range of functions for managing enterprise shareable multilingual domain-specific taxonomies, thesaurus, terminologies in a unified way. It uses XML, SKOS and RDF standards. Commercial; from Mondeca
  • MindRaider is Semantic Web outliner. It aims to connect the tradition of outline editors with emerging technologies. MindRaider mission is to organize not only the content of your hard drive but also your cognitive base and social relationships in a way that enables quick navigation, concise representation and inferencing
  • Topincs is a Topic Map authoring software that allows groups to share their knowledge over the web. It makes use of a variety of modern technologies. The most important are Topic Maps, REST and Ajax. It consists of three components: the Wiki, the Editor, and the Server. The servier requires AMP; the Editor and Wiki are based on browser plug-ins.

Ontology Editing

  • First, see all of the Comprehensive Tools listing above
  • Anzo for Excel includes an (RDFS and OWL-based) ontology editor that can be used directly within Excel. In addition to that, Anzo for Excel includes the capability to automatically generate an ontology from existing spreadsheet data, which is very useful for quick bootstrapping of an ontology.
  • Hozo is an ontology visualization and development tool that brings version control constructs to group ontology development; limited to a prototype, with no online demo
  • Lexaurus Editor is for off-line creation and editing of vocabularies, taxonomies and thesauri. It supports import and export in Zthes and SKOS XML formats, and allows hierarchical / poly-hierarchical structures to be loaded for editing, or even multiple vocabularies to be loaded simultaneously, so that terms from one taxonomy can be re-used in another, using drag and drop. Not available in open source
  • Model Futures OWL Editor combines simple OWL tools, featuring UML (XMI), ErWin, thesaurus and imports. The editor is tree-based and has a “navigator” tool for traversing property and class-instance relationships. It can import XMI (the interchange format for UML) and Thesaurus Descriptor (BT-NT XML), and EXPRESS XML files. It can export to MS Word.
  • OntoTrack is a browsing and editing ontology authoring tool for OWL Lite. It combines a sophisticated graphical layout with mouse enabled editing features optimized for efficient navigation and manipulation of large ontologies
  • OWLViz is an attractive visual editor for OWL and is available as a Protégé plug-in
  • PoolParty is a triple store-based thesaurus management environment which uses SKOS and text extraction for tag recommendations. See further this manual, which describes more fully the system’s functionality. Also, there is a PoolParty Web service that enables a Zthes thesaurus in XML format to be uploaded and converted to SKOS (via skos:Concepts)
  • SKOSEd is a plugin for Protege 4 that allows you to create and edit thesauri (or similar artefacts) represented in the Simple Knowledge Organisation System (SKOS).
  • TemaTres is a Web application to manage controlled vocabularies, taxonomies and thesaurus. The vocabularies may be exported in Zthes, Skos, TopicMap, etc.
  • ThManager is a tool for creating and visualizing SKOS RDF vocabularies. ThManager facilitates the management of thesauri and other types of controlled vocabularies, such as taxonomies or classification schemes
  • Vitro is a general-purpose web-based ontology and instance editor with customizable public browsing. Vitro is a Java web application that runs in a Tomcat servlet container. With Vitro, you can: 1) create or load ontologies in OWL format; 2) edit instances and relationships; 3) build a public web site to display your data; and 4) search your data with Lucene. Still in somewhat early phases, with no online demos and with minimal interfaces.

Not Apparently in Active Use

  • Omnigator The Omnigator is a form-based manipulaton tool centered on Topic Maps, though it enables the loading and navigation of any conforming topic map in XTM, HyTM, LTM or RDF formats. There is a free evaluation version.
  • OntoGen is a semi-automatic and data-driven ontology editor focusing on editing of topic ontologies (a set of topics connected with different types of relations). The system combines text-mining techniques with an efficient user interface. It requires .Net.
  • OWL-S-editor is an editor for the development of services in OWL-S, with graphical, WSDL and import/export support
  • ReTAX+ is an aide to help a taxonomist create a consistent taxonomy and in particular provides suggestions as to where a new entity could be placed in the taxonomy whilst retaining the integrity of the revised taxonomy (c.f., problems in ontology modelling)
  • SWOOP is a lightweight ontology editor. (Swoop is no longer under active development at mindswap. Continuing development can be found on SWOOP’s Google Code homepage at http://code.google.com/p/swoop/)
  • WebOnto supports the browsing, creation and editing of ontologies through coarse grained and fine grained visualizations and direct manipulation.

Ontology Mapping

  • COMA++ is a schema and ontology matching tool with a comprehensive infrastructure. Its graphical interface supports a variety of interaction
  • ConcepTool is a system to model, analyse, verify, validate, share, combine, and reuse domain knowledge bases and ontologies, reasoning about their implication
  • MatchIT automates and facilitates schema matching and semantic mapping between different Web vocabularies. MatchIT runs as a stand-alone or plug-in Eclipse application and can be integrated with popular third party applications. MatchIT’s uses Adaptive Lexicon™ as an ontology-driven dictionary and thesaurus of English language terminology to quantify and ank the semantic similarity of concepts. It apparently is not available in open source
  • myOntology is used to produce the theoretical foundations, and deployable technology for the Wiki-based, collaborative and community-driven development and maintenance of ontologies instance data and mappings
  • OLA/OLA2 (OWL-Lite Alignment) matches ontologies written in OWL. It relies on a similarity combining all the knowledge used in entity descriptions. It also deal with one-to-many relationships and circularity in entity descriptions through a fixpoint algorithm
  • Potluck is a Web-based user interface that lets casual users—those without programming skills and data modeling expertise—mash up data themselves. Potluck is novel in its use of drag and drop for merging fields, its integration and extension of the faceted browsing paradigm for focusing on subsets of data to align, and its application of simultaneous editing for cleaning up data syntactically. Potluck also lets the user construct rich visualizations of data in-place as the user aligns and cleans up the data.
  • PRIOR+ is a generic and automatic ontology mapping tool, based on propagation theory, information retrieval technique and artificial intelligence model. The approach utilizes both linguistic and structural information of ontologies, and measures the profile similarity and structure similarity of different elements of ontologies in a vector space model (VSM).
  • Vine is a tool that allows users to perform fast mappings of terms across ontologies. It performs smart searches, can search using regular expressions, requires a minimum number of clicks to perform mappings, can be plugged into arbitrary mapping framework, is non-intrusive with mappings stored in an external file, has export to text files, and adds metadata to any mapping. See also http://sourceforge.net/projects/vine/.

Not Apparently in Active Use

  • ASMOV (Automated Semantic Mapping of Ontologies with Validation) is an automatic ontology matching tool which has been designed in order to facilitate the integration of heterogeneous systems, using their data source ontologies
  • Chimaera is a software system that supports users in creating and maintaining distributed ontologies on the web. Two major functions it supports are merging multiple ontologies together and diagnosing individual or multiple ontologies
  • CMS (CROSI Mapping System) is a structure matching system that capitalizes on the rich semantics of the OWL constructs found in source ontologies and on its modular architecture that allows the system to consult external linguistic resources
  • ConRef is a service discovery system which uses ontology mapping techniques to support different user vocabularies
  • DRAGO reasons across multiple distributed ontologies interrelated by pairwise semantic mappings, with a vision of peer-to-peer mapping of many distributed ontologies on the Web. It is implemented as an extension to an open source Pellet OWL Reasoner
  • Falcon-AO (Finding, aligning and learning ontologies) is an automatic ontology matching tool that includes the three elementary matchers of String, V-Doc and GMO. In addition, it integrates a partitioner PBM to cope with large-scale ontologies
  • FOAM is the Framework for ontology alignment and mapping. It is based on heuristics (similarity) of the individual entities (concepts, relations, and instances)
  • hMAFRA (Harmonize Mapping Framework) is a set of tools supporting semantic mapping definition and data reconciliation between ontologies. The targeted formats are XSD, RDFS and KAON
  • IF-Map is an Information Flow based ontology mapping method. It is based on the theoretical grounds of logic of distributed systems and provides an automated streamlined process for generating mappings between ontologies of the same domain
  • LILY is a system matching heterogeneous ontologies. LILY extracts a semantic subgraph for each entity, then it uses both linguistic and structural information in semantic subgraphs to generate initial alignments. The system is presently in a demo version only
  • MAFRA Toolkit – the Ontology MApping FRAmework Toolkit allows users to create semantic relations between two (source and target) ontologies, and apply such relations in translating source ontology instances into target ontology instances
  • OntoEngine is a step toward allowing agents to communicate even though they use different formal languages (i.e., different ontologies). It translates data from a “source” ontology to a “target”
  • OWLS-MX is a hybrid semantic Web service matchmaker. OWLS-MX 1.0 utilizes both description logic reasoning, and token based IR similarity measures. It applies different filters to retrieve OWL-S services that are most relevant to a given query
  • RiMOM (Risk Minimization based Ontology Mapping) integrates different alignment strategies: edit-distance based strategy, vector-similarity based strategy, path-similarity based strategy, background-knowledge based strategy, and three similarity-propagation based strategies
  • semMF is a flexible framework for calculating semantic similarity between objects that are represented as arbitrary RDF graphs. The framework allows taxonomic and non-taxonomic concept matching techniques to be applied to selected object properties
  • Snoggle is a graphical, SWRL-based ontology mapper. Snoggle attempts to solve the ontology mapping problem by providing a graphical user interface (similar to which of the Microsoft Visio) to guide the process of ontology vocabulary alignment. In Snoggle, user-defined mappings can be serialized into rules, which is expressed using SWRL
  • Terminator is a tool for creating term to ontology resource mappings (documentation in Finnish).

Ontology Visualization/Analysis

Though all are not relevant, see my post from a couple of years back on large-scale RDF graph software.

  • Social network graphing tools (many covered elsewhere)
  • Cytoscape is a bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data; I have also written specifically about Cytoscape’s use in UMBEL
    • RDFScape is a project that brings Semantic Web “features” to the popular Systems Biology software Cytoscape
    • NetworkAnalyzer performs analysis of biological networks and calculates network topology parameters including the diameter of a network, the average number of neighbors, and the number of connected pairs of nodes. It also computes the distributions of more complex network parameters such as node degrees, average clustering coefficients, topological coefficients, and shortest path lengths. It displays the results in diagrams, which can be saved as images or text files; used by SD
  • Graphl is a tool for collaborative editing and visualisation of graphs, representing relationships between resources or concepts of the real world. Graphl may be thought of as a visual wiki, a place where everybody can contribute to a shared repository of knowledge
  • igraph is a free software package for creating and manipulating undirected and directed graphs
  • Network Workbench is a very complex, comprehensive; Swiss Army Knife
  • NetworkX – Python; very clean
  • Stanford Network Analysis Package (SNAP) is a general purpose network analysis and graph mining library. It is written in C++ and easily scales to massive networks with hundreds of millions of nodes
  • Social Networks Visualizer (SocNetV) is a flexible and user-friendly tool for the analysis and visualization of Social Networks. It lets you construct networks (mathematical graphs) with a few clicks on a virtual canvas or load networks of various formats (GraphViz, GraphML, Adjacency, Pajek, UCINET, etc) and modify them to suit your needs. SocNetV also offers a built-in web crawler, allowing you to automatically create networks from all links found in a given initial URL
  • Tulip may be incredibly strong
  • Springgraph component for Flex
  • VizierFX is a Flex library for drawing network graphs. The graphs are laid out using GraphViz on the server side, then passed to VizierFX to perform the rendering. The library also provides the ability to run ActionScript code in response to events on the graph, such as mousing over a node or clicking on it.

Miscellaneous Ontology Tools

  • Apolda (Automated Processing of Ontologies with Lexical Denotations for Annotation) is a plugin (processing resource) for GATE (http://gate.ac.uk/). The Apolda processing resource (PR) annotates a document like a gazetteer, but takes the terms from an (OWL) ontology rather than from a list
  • DL-Learner is a tool for learning complex classes from examples and background knowledge. It extends Inductive Logic Programming to Description Logics and the Semantic Web. DL-Learner now has a flexible component based design, which allows to extend it easily with new learning algorithms, learning problems, reasoners, and supported background knowledge sources. A new type of supported knowledge sources are SPARQL endpoints, where DL-Learner can extract knowledge fragments, which enables learning classes even on large knowledge sources like DBpedia, and includes an OWL API reasoner interface and Web service interface.
  • LexiLink is a tool for building, curating and managing multiple lexicons and ontologies in one enterprise-wide Web-based application. The core of the technology is based on RDF and OWL
  • mopy is the Music Ontology Python library, designed to provide easy to use python bindings for ontology terms for the creation and manipulation of music ontology data. mopy can handle information from several ontologies, including the Music Ontology, full FOAF vocab, and the timeline and chord ontologies.
  • OBDA (Ontology Based Data Access) is a plugin for Protégé aimed to be a full-fledged OBDA ontology and component editor. It provides data source and mapping editors, as well as querying facilities that, in sum, allow you to design and test every aspect of an OBDA system. It supports relational data sources (RDBMS) and GLAV-like mappings. In its current beta form, it requires Protege 3.3.1, a reasoner implementing the OBDA extensions to DIG 1.1 (e.g., the DIG server for QuOnto) and Jena 2.5.5
  • OntoComP is a Protégé 4 plugin for completing OWL ontologies. It enables the user to check whether an OWL ontology contains “all relevant information” about the application domain, and extend the ontology appropriately if this is not the case
  • Ontology Browser is a browser created as part of the CO-ODE (http://www.co-ode.org/) project; rather simple interface and use
  • Ontology Metrics is a web-based tool that displays statistics about a given ontology, including the expressivity of the language it is written in
  • OntoSpec is a SWI-Prolog module, aiming at automatically generating XHTML specification from RDF-Schema or OWL ontologies
  • OWL API is a Java interface and implementation for the W3C Web Ontology Language (OWL), used to represent Semantic Web ontologies. The API is focused towards OWL Lite and OWL DL and offers an interface to inference engines and validation functionality
  • OWL Module Extractor is a Web service that extracts a module for a given set of terms from an ontology. It is based on an implementation of locality-based modules that is part of the OWL API.
  • OWL Syntax Converter is an online tool for converting ontologies between different formats, including several OWL syntaxes, RDF/XML, KRSS
  • OWL Verbalizer is an on-line tool that verbalizes OWL ontologies in (controlled) English
  • OwlSight is an OWL ontology browser that runs in any modern web browser; it’s developed with Google Web Toolkit and uses Gwt-Ext, as well as OWL-API. OwlSight is the client component and uses Pellet as its OWL reasoner
  • Pellint is an open source lint tool for Pellet which flags and (optionally) repairs modeling constructs that are known to cause performance problems. Pellint recognizes several patterns at both the axiom and ontology level.
  • PROMPT is a tab plug-in for Protégé is for managing multiple ontologies by comparing versions of the same ontology, moving frames between included and including project, merging two ontologies into one, or extracting a part of an ontology.
  • SegmentationApp is a Java application that segments a given ontology according to the approach described in “Web Ontology Segmentation: Analysis, Classification and Use” (http://www.co-ode.org/resources/papers/seidenberg-www2006.pdf)
  • SETH is a software effort to deeply integrate Python with Web Ontology Language (OWL-DL dialect). The idea is to import ontologies directly into the programming context so that its classes are usable alongside standard Python classes
  • SKOS2GenTax is an online tool that converts hierarchical classifications available in the W3C SKOS (Simple Knowledge Organization Systems) format into RDF-S or OWL ontologies
  • SpecGen (v5) is an ontology specification generator tool. It’s written in Python using Redland RDF library and licensed under the MIT license
  • Text2Onto is a framework for ontology learning from textual resources that extends and re-engineers an earlier framework developed by the same group (TextToOnto). Text2Onto offers three main features: it represents the learned knowledge at a metalevel by instantiating the modelling primitives of a Probabilistic Ontology Model (POM), thus remaining independent from a specific target language while allowing the translation of the instantiated primitives
  • Thea is a Prolog library for generating and manipulating OWL (Web Ontology Language) content. Thea OWL parser uses SWI-Prolog’s Semantic Web library for parsing RDF/XML serialisations of OWL documents into RDF triples and then it builds a representation of the OWL ontology
  • TONES Ontology Repository is primarily designed to be a central location for ontologies that might be of use to tools developers for testing purposes; it is part of the TONES project
  • Visual Ontology Manager (VOM) is a family of tools enables UML-based visual construction of component-based ontologies for use in collaborative applications and interoperability solutions.
  • Web Ontology Manager is a lightweight, Web-based tool using J2EE for managing ontologies expressed in Web Ontology Language (OWL). It enables developers to browse or search the ontologies registered with the system by class or property names. In addition, they can submit a new ontology file
  • RDF evoc (external vocabulary importer) is an RDF external vocabulary importer module (evoc) for Drupal caches any external RDF vocabulary and provides properties to be mapped to CCK fields, node title and body. This module requires the RDF and the SPARQL modules.

Not Apparently in Active Use

  • Almo is an ontology-based workflow engine in Java supporting the ARTEMIS project; part of the OntoWare initiative
  • ClassAKT is a text classification web service for classifying documents according to the ACM Computing Classification System
  • Elmo provides a simple API to access ontology oriented data inside a Sesame RDF repository. The domain model is simplified into independent concerns that are composed together for multi-dimensional, inter-operating, or integrated applications
  • ExtrAKT is a tool for extracting ontologies from Prolog knowledge bases.
  • F-Life is a tool for analysing and maintaining life-cycle patterns in ontology development.
  • Foxtrot is a recommender system which represents user profiles in ontological terms, allowing inference, bootstrapping and profile visualization.
  • HyperDAML creates an HTML representation of OWL content to enable hyperlinking to specific objects, properties, etc.
  • LinKFactory is an ontology management tool, it provides an effective and user-friendly way to create, maintain and extend extensive multilingual terminology systems and ontologies (English, Spanish, French, etc.). It is designed to build, manage and maintain large, complex, language independent ontologies.
  • LSW – the Lisp semantic Web toolkit enables OWL ontologies to be visualized. It was written by Alan Ruttenberg
  • Ontodella is a Prolog HTTP server for category projection and semantic linking
  • OntoWeaver is an ontology-based approach to Web sites, which provides high level support for web site design and development
  • OWLLib is a PHP library for accessing OWL files. OWL is w3.org standard for storing semantic information
  • pOWL is a Semantic Web development platform for ontologies in PHP. pOWL consists of a number of components, including RAP
  • ROWL is the Rule Extension of OWL; it is from the Mobile Commerce Lab in the School of Computer Science at Carnegie Mellon University
  • Semantic Net Generator is a utlity for generating Topic Maps automatically from different data sources by using rules definitions specified with Jelly XML syntax. This Java library provides Jelly tags to access and modify data sources (also RDF) to create a semantic network
  • SMORE is OWL markup for HTML pages. SMORE integrates the SWOOP ontology browser, providing a clear and consistent way to find and view Classes and Properties, complete with search functionality
  • SOBOLEO is a system for Web-based collaboration to create SKOS taxonomies and ontologies and to annotate various Web resources using them
  • SOFA is a Java API for modeling ontologies and Knowledge Bases in ontology and Semantic Web applications. It provides a simple, abstract and language neutral ontology object model, inferencing mechanism and representation of the model with OWL, DAML+OIL and RDFS languages; from java.dev
  • WebScripter is a tool that enables ordinary users to easily and quickly assemble reports extracting and fusing information from multiple, heterogeneous DAMLized Web sources.
Posted:November 11, 2009

irON - instance record and Object Notation

A Case Study of Turning Spreadsheets into Structured Data Powerhouses

In a former life, I had the nickname of ‘Spreadsheet King’ (perhaps among others that I did not care to hear). I had gotten the nick because of my aggressive use of spreadsheets for financial models, competitors tracking, time series analyses, and the like. However, in all honesty, I have encountered many others in my career much more knowledgeable and capable with spreadsheets than I’ll ever be. So, maybe I was really more like a minor duke or a court jester than true nobility.

Yet, pro or amateur, there are perhaps 1 billion spreadsheet users worldwide [1], making spreadsheets undoubtedly the most prevalent data authoring environment in existence. And, despite moans and wails about how spreadsheets can lead to chaos, spaghetti code, or violations of internal standards, they are here to stay.

Spreadsheets often begin as simple notetaking environments. With the addition of new findings and more analysis, some of these worksheets may evolve to become full-blown datasets. Alternatively, some spreadsheets start from Day One as intended datasets or modeling environments. Whatever the case, clearly there is much accumulated information and data value “locked up” in existing spreadsheets.

How to “unlock” this value for sharing and collaboration was a major stimulus to development of the commON serialization of irON (instance record and Object Notation) [2]. I recently published a case study [3] that describes the reasons and benefits of dataset authoring in a spreadsheet, and provides working examples and code based on Sweet Tools [4] to aid users in understanding and using the commON notation. I summarize portions of that study herein.

This is the second article of a two-part series related to the recent Sweet Tools update.

Background on Sweet Tools and irON

The dataset that is the focus of this use case, Sweet Tools, began as an informal tracking spreadsheet about four years ago. I began it as a way to learn about available tools in the semantic Web and -related spaces. I began publishing it and others found it of value so I continued to develop it.

As it grew over time, however, it gained in structure and size. Eventually, it became a reference dataset, with which many other people desired to use and interact. The current version has well over 800 tools listed, characterized by many structured data attributes such as type, programming language, description and so forth. As it has grown, a formal controlled vocabulary has also evolved to bring consistency to the characterization of many of these attributes.

It was natural for me to maintain this listing as a spreadsheet, which was also reinforced when I was one of the first to adopt an Exhibit presentation of the data based on a Google spreadsheet about three years back. Here is a partial view of this spreadsheet as I maintain it locally:

Sweet Tools Main Spreadsheet Screen
(click to expand)

When we began to develop irON in earnest as a simple (“naïve”) dataset authoring framework, it was clear that a comma-separated value, or CSV [5], option should join the other two serializations under consideration, XML and JSON. CSV, though less expressive and capable as a data format than the other serializations, still has an attribute-value pair (also known as key-value pairs and many other variants [6]) orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc.

As a dataset very familiar to us as irON‘s editors, and directly relevant to the semantic Web, Sweet Tools provided a perfect prototype case study for helping to guide the development of irON, and specifically what came to be known as the commON serialization for irON. The Sweet Tools dataset is relatively large for a speciality source, has many different types and attributes, and is characterized by text, images, URLs and similar.

The premise was that if Sweet Tools could be specified and represented in commON sufficiently to be parsed and converted to interoperable RDF, then many similar instance-oriented datasets could likely be so as well. Thus, as we tried and refined notation and vocabulary, we tested applicability against the CSV representation of Sweet Tools in addition to other CSV, JSON and XML datasets.

Dataset Authoring in a Spreadsheet

A large portion of the case study describes the many advantages of authoring small datasets within spreadsheets. The useful thing about the CSV format is that these full functional capabilities of the spreadsheet are available during authoring or later updates and modifications, but, when exported, the CSV provides a relatively clean format for processing and parsing.

So, some of the reasons for small dataset authoring in a spreadsheet include:

  • Formatting and on-sheet management -  the first usefulness of a spreadsheet comes from being able to format and organize the records. Records can be given background colors to highlight distinctions (new entries, for example); live URL links can be embedded; contents can be wrapped and styled within cells; and the column and row heads can be “frozen”, useful when scrolling large workspaces
  • Named blocks and sorting – named blocks are a powerful feature of modern spreadsheets, useful for data manipulation, printing and internal referencing by formulas and the like.  Sorting with named blocks is especially important as an aid to check consistency of terminology, records completeness, duplicates checks, missing value checks, and the like. Named blocks can also be used as references in calculations. All of these features are real time savers, especially when datasets grow large and consistency of treatment and terminology is important
  • Multiple sheets and consolidated accesscommON modules can be specified on a single worksheet or multiple worksheets and saved as individual CSV files; because of its size and relative complexity, the Sweet Tools dataset is maintained on multiple sheets. Multi-worksheet environments help keep related data and notes consolidated and more easily managed on local hard drives
  • Completeness and counts - the spreadsheet counta function is useful to sum counts for cell entries by both column and row, a useful aid to indicate if an attribute or type value is missing or if a record is incomplete.  Of course, similar helps and uses can be found for many of the hundreds of embedded functions within a spreadsheet
  • Controlled vocabularies and data entry validation – quality datasets often hinge on consistency and uniform values and terminology; the data validation utilities within spreadsheets can be applied to Boolean, ranges and mins and maxes, and to controlled vocabulary lists. Here is an example for Sweet Tools, enforcing proper tool category assignments from a 50-item pick list:
Controlled Vocabularies and Data Entry Validation
  • Specialized functions and macrosall functionality of spreadsheets may be employed in the development of commON datasets. Then, once employed, only the values embedded within the sheets are then exported as CSV.

Staging Sweet Tools for commON

The next major section of the case study deals with the minor conventions that must be followed in order to stage spreadsheets for commON. Not much of the specific commON vocabulary or notation is discussed below; for details, see [7].

Because you can create multiple worksheets within a spreadsheet, it is not necessary to modifiy existing worksheets or tabs. Rather, if you are reluctant or can not change existing information, merely create parallel duplicate sheets of the source information. These duplicate sheets have as their sole purpose export to commON CSV. You can maintain your spreadsheet as is while staging for commON.

To do so, use the simple = formula to create cross-references between the existing source spreadsheet tab and the target commON CSV export tab. (You can also do this for complete, highlighted blocks from source to target sheet.) Then, by adding the few minor conventions of commON, you have now created a staged export tab without modifying your source information in the slightest.

In standard form and for Excel and Open Office, single quotes, double quotes and commas when entered into a spreadsheet cell are automatically ‘escaped‘ when issued as CSV. commON allows you to specify your own delimiter for lists (the standard is the pipe ‘|’ character) and what the parser recognizes as the ‘escape’ character (‘\’ is the standard). However, you probably should not change for most conditions.

The standard commON parsers and converters are UTF-8 compatible. If your source content has unusual encodings, try to target UTF-8 as your canonical spreadsheet output.

In the irON specification there are a small number of defined modules or processing sections. In commON, these modules are denoted by the double-ampersand character sequence (‘&&‘), and apply to lists of instance records (&&recordList), dataset specifications and associated metadata describing the dataset (&&dataset), and mappings of attributes and types to existing schema (&&linkage). Similarly, attributes and types are denoted by a single ampersand prefix (&attributeName).

In commON, any or all of the modules can occur within a single CSV file or in multiple files. In any case, the start of one of these processing modules is signaled by the module keyword and &&keyword convention.

The RecordList Module

The first spreadsheet figure above shows a Sweet Tools example for the &&recordList module. The module begins with that keyword, indicating one of more instance records will follow. Note that the first line after the &&recordList keyword is devoted to the listing of attributes and types for the instance records (designated by the &attributeName convention in the columns for the first row after the &&recordList keyword is encountered).

The &&recordList format can also include the stacked style (see similar Dataset example below) in addition to the single row style shown above.

At any rate, once a worksheet is ready with its instance records following the straightforward irON and commON conventions, it can then be saved as a CSV file and appropriately named. Here is an example of what this “vanilla” CSV file now looks like when shown again in a spreadsheet:

Spreadsheet View of the CSV File
(click to expand)

Alternatively, you could open this same file in a text editor. Here is how this exact same instance record view looks in an editor:

Editor View of the CSV Record File
(click to expand)

Note that the CSV format separates each column by the comma separator, with escapes shown for the &description attribute when it includes a comma-separated clause. Without word wrap, each record in this format occupies a single row (though, again, for the stacked style, multiple entries are allowed on individual rows so long as a new instance record &id is not encountered in the first column).

The Dataset Module

The &&dataset module defines the dataset parameters and provides very flexible metadata attributes to describe the dataset [8]. Note the dataset specification is exactly equivalent in form to the instance record (&&recordList) format, and also allows the single row or stacked styles (see these instance record examples), with this one being the stacked style:

The Dataset Module
(click to expand)

The Linkage Module

The &&linkage module is used to map the structure of the instance records to some structural schema, which can also include external ontologies. The module has a simple, but specific structure.

Either attributes (presented as the &attributeList) or types (presented as the &typeList) are listed sequentially by row until the listing is exhausted [8]. By convention, the second column in the listing is the targeted &mapTo value. Absent a prior &prefixList value, the &mapTo value needs to be a full URL to the corresponding attribute or type in some external schema:

The Linkage Module

Notice in the case of Sweet Tools that most values are from the actual COSMO mini-ontology underlying the listing. These need to be listed as well, since absent the specifications in commON the system has NO knowledge of linkages and mappings.

The Schema (structure) Module

In its current state of development, commON does not support a spreadsheet-based means for specifying the schema structure (lightweight ontology) governing the datasets [2]. Another irON serialization, irJSON, does. Either via this irJSON specification or via an offline ontology, a link reference is presently used by commON (and, therefore, Sweet Tools for this case study) to establish the governing structure of the input instance record datasets.

A spreadsheet-based schema structure for commON has been designed and tested in prototype form. commON should be enhanced with this capability in the near future [8].

Saving and Importing

If the modules are spread across more than one worksheet, then each worksheet must be saved as its own CSV file. In the case of Sweet Tools, as exhibited by its reference current spreadsheet, sweet_tools_20091110.xls, three individual CSV files get saved. These files can be named whatever you would like. However, it is essential that the names be remembered for later referencing.

My own naming convention is to use a format of appname_date_modulename.csv because it sorts well in a file manager accommodating multiple versions (dates) and keeps related files clustered. The appname in the case of Sweet Tools is generally swt. The modulename is generally the dataset, records, or linkage convention. I tend to use the date specification in the YYYYMMDD format. Thus, in the case of the records listings for Sweet Tools, its filename could be something like:  swt_20091110_records.csv.

Once saved, these files are now ready to be imported into a structWSF [9] instance, which is where the CSV parsing and conversion to interoperable RDF occurs [8]. In this case study, we used the Drupal-based conStruct SCS system [10]. conStruct exposes the structWSF Web services via a user interface and a user permission and access system. The actual case study write-up offers more details about the import process.

Using the Dataset

We are now ready to interact with the Sweet Tools structured dataset using conStruct (assuming you have a Drupal installation with the conStruct modules) [10].

Introduction to the App

The screen capture below shows a couple of aspects of the system:

  • First, the left hand panel (according to how this specific Drupal install was themed) shows the various tools available to conStruct.  These include (with links to their documentation) Search, Browse, View Record, Import, Export, Datasets, Create Record, Update Record, Delete Record and Settings [11];
  • The Browse tree in the main part of the screen shows the full mini-ontology that classifies Sweet Tools. Via simple inferencing, clicking on any parent link displays all children projects for that category as well (click to expand):
conStruct (Drupal) Browse Screen for Sweet Tools(click to expand)

One of the absolutely cool things about this framework is that all tools, inferencing, user interfaces and data structure are a direct result of the ontology(ies) underlying the system (plus the irON instance ontology, as well). This means that switching datasets or adding datasets causes the entire system structure to now reflect those changes — without lifting a finger!!

Some Sample Uses

Here are a few sample things you can do with these generic tools driven by the Sweet Tools dataset:

Note, if you access this conStruct instance you will do so as a demo user. Unfortunately, as such, you may not be able to see all of the write and update tools, which in this case are reserved for curators or admins. Recall that structWSF has a comprehensive user access and permissions layer.

Exporting in Alternative Formats

Of course, one of the real advantages of the irON and structWSF designs is to enable different formats to be interchanged and to interoperate. Upon submission, the commON format and its datasets can then be exported in these alternate formats and serializations [8]:

  • commON
  • irJSON
  • irXML
  • N-Triples/CSV
  • N-Triples/TSV
  • RDF+N3
  • RDF+XML

As should be obvious, one of the real benefits of the irON notation — in addition to easy dataset authoring — is the ability to more-or-less treat RDF, CSV, XML and JSON as interoperable data formats.

The Formal Case Study

The formal Sweet Tools case study based on commON, with sample download files and PDF, is available from Annex: A commON Case Study using Sweet Tools, Supplementary Documentation [3].


[1] In 2003, Microsoft estimated its worldwide users of the Excel spreadsheet, which then had about a 90% market share globally, at 400 million. Others at that time estimated unauthorized use to perhaps double that amount. There has been significant growth since then, and online spreadsheets such as Google Docs and Zoho have also grown wildly. This surely puts spreadsheet users globally into the 1 billion range.
[2] See Frédérick Giasson and Michael Bergman, eds., Instance Record and Object Notation (irON) Specification, Specification Document, version 0.82, 20 October 2009.  See http://openstructs.org/iron/iron-specification. Also see the irON Web site, Google discussion group, and code distribution site.
[3] Michael Bergman, 2009. Annex: A commON Case Study using Sweet Tools, Supplementary Documentation, prepared by Structured Dynamics LLC, November 10, 2009. See http://openstructs.org/iron/common-swt-annex. It may also be downloaded in PDF .
[4] See Michael K. Bergman’s AI3:::Adaptive Information blog, Sweet Tools (Sem Web). In addition, the commON version of Sweet Tools is available at the conStruct site.
[5] The CSV mime type is defined in Common Format and MIME Type for Comma-Separated Values (CSV) Files [RFC 4180]. A useful overview of the CSV format is provided by The Comma Separated Value (CSV) File Format. Also, see that author’s related CTX reference for a discussion of how schema and structure can be added to the basic CSV framework; see http://www.creativyst.com/Doc/Std/ctx/ctx.htm, especially the section on the comma-delimited version (http://www.creativyst.com/Doc/Std/ctx/ctx.htm#CTC).
[6] An attribute-value system is a basic knowledge representation framework comprising a table with columns designating “attributes” (also known as properties, predicates, features, parameters, dimensions, characteristics or independent variables) and rows designating “objects” (also known as entities, instances, exemplars, elements or dependent variables). Each table cell therefore designates the value (also known as state) of a particular attribute of a particular object. This is the basic table presentation of a spreadsheet or relational data table.

Attribute-values can also be presented as pairs in a form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the “object” is implied, or is introduced as the name of the array..

[7] See especially SUB-PART 3: commON PROFILE in, Frédérick Giasson and Michael Bergman, eds., Instance Record and Object Notation (irON) Specification, Specification Document, version 0.82, 20 October 2009.
[8] As of the date of this case study, some of the processing steps in the commON pipeline are manual. For example, the parser creates an intermediate N3 file that is actually submitted to the structWSF. Within a week or two of publication, these capabilities should be available as a direct import to a structWSF instance. However, there is one exception to this:  the specification for the schema structure. That module has been prototyped, but will not be released with the first commON upgrade. That enhancement is likely a few weeks off from the date of this posting. Please check the irON or structWSF discussion groups for announcements.
[9] structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data, with generic tools driven by underlying data structures. Its central perspective is that of the dataset. Access and user rights are granted around these datasets, making the framework enterprise-ready and designed for collaboration. Since a structWSF layer may be placed over virtually any existing datastore with Web access — including large instance record stores in existing relational databases — it is also a framework for Web-wide deployments and interoperability.
[10] conStruct SCS is a structured content system built on the Drupal content management framework. conStruct enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces. It is based on RDF and SD’s structWSF platform-independent Web services framework [6]. In addition to user access control and management and a general user interface, conStruct provides Drupal-level CRUD, data display templating, faceted browsing, full-text search, and import and export over structured data stores based on RDF.
[11] More Web services are being added to structWSF on a fairly constant basis, and the existng ones have been through a number of upgrades.
Posted:October 18, 2009

instance record and Object Notation

New Cross-Scripting Frameworks for XML, JSON and Spreadsheets

On behalf of Structured Dynamics, I am pleased to announce our release into the open source community of irON — the instance record and Object Notation — and its family of frameworks and tools [1]. With irON, you can now author and conduct business solely in the formats and tools most familiar and comfortable to you, all the while enabling your data to interact with the semantic Web.

irON is an abstract notation and associated vocabulary for specifying RDF triples and schema in non-RDF forms. Its purpose is to allow users and tools in non-RDF formats to stage interoperable datasets using RDF. The notation supports writing RDF and schema in JSON (irJSON), XML (irXML) and comma-delimited (CSV) formats (commON).

The surprising thing about irON is that — by following its simple conventions and vocabulary — you will be authoring and creating interoperable RDF datasets without doing much different than your normal practice.

This first specification for the irON notation includes guidance for creating instance records (including in bulk), linkages to existing ontologies and schema, and schema definitions. In this newly published irON specificatiion, profiles and examples are also provided for each of the irXML, irJSON and commON serializations. The irON release also includes a number of parsers and converters of the specification into RDF [2]. Data ingested in the irON frameworks can also be exported as RDF and staged as linked data.

UPDATE: Fred Giasson announced on his blog today (10/20) the release of the irJSON and commON parsers.

Background and Rationale

The objective of irON is to make it easy for data owners to author, read and publish data. This means the starting format should be a human readable, easily writable means for authoring and conveying instance records (that is, instances and their attributes and assigned values) and the datasets that contain them. Among other things, this means that irON‘s notation does not use RDF “triples“, but rather the native notations of the host serializations.

irON is premised on these considerations and observations:

  • RDF (Resource Description Framework) is a powerful canonical data model for data interoperability [3]
  • However, most existing data is not written in RDF and many authors and publishers prefer other formats for various reasons
  • Many formats that are easier to author and read than RDF are variants of the attribute-value pair construct [4], which can readily be expressed as RDF, and
  • A common abstract notation for converting to RDF would also enable non-RDF formats to become somewhat interchangeable, thus allowing the strengths of each to be combined.

The irON notation and vocabulary is designed to allow the conceptual structure (“schema”) of datasets to be described, to facilitate easy description of the instance records that populate those datasets, and to link different structures for different schema to one another. In these manners, more-or-less complete RDF data structures and instances can be described in alternate formats and be made interoperable. irON provides a simple and naïve information exchange notation expressive enough to describe most any data entity.

The notation also provides a framework for extending existing schema. This means that irON and its three serializations can represent many existing, common data formats and standards, while also providing a vehicle for extending them. Another intent of the specification is to be sparse in terms of requirements. For instance, this reserved vocabulary is fairly minimal and optional in most all cases. The irON specification supports skeletal submissions.

irON Concepts and Vocabulary

The aim of irON is to describe instance records. An instance record is simply a means to represent and convey the information (”attributes”) describing a given instance. An instance is the thing at hand, and need not represent an individual; it could, for example, represent the entire holdings or collection of books in a given library. Such instance records are also known as the ABox [5]. The simple design of irON is in keeping with the limited roles and work associated with this ABox role.

Attributes provide descriptive characteristics for each instance. Every attribute is matched with a value, which can range from descriptive text strings to lists or numeric values. This design is in keeping with simple attribute-value pairs where, in using the terminology of RDF triples, the subject is the instance itself, the predicate is the attribute, and the object is the value. irON has a vocabulary of about 40 reserved attribute terms, though only two are ever required, with a few others strongly recommended for interoperability and interface rendering purposes.

A dataset is an aggregation of instance records used to keep a reference between the instance records and their source (provenance). It is also the container for transmitting those records and providing any metadata descriptions desired. A dataset can be split into multiple dataset slices. Each slice is written to a file serialized in some way. Each slice of a dataset shares the same <id> of the dataset.

Instances can also be assigned to types, which provide the set or classificatory structure for how to relate certain kinds of things (instances) to other kinds of things. The organizational relationships of these types and attributes is described in a schema. irON also has conventions and notations for describing the linkage of attributes and types in a given dataset to existing schema. These linkages are often mapped to established ontologies.

Each of these irON concepts of records, attributes, types, datasets, schema and linkages share similar notations with keywords signaling to the irON parsers and converters how to interpret incoming files and data. There are also provisions for metadata, name spaces, and local and global references.

In these manners, irON and its three serializations can capture virtually the entire scope and power of RDF as a data model, but with simpler and familiar terminology and constructs expected for each serialization.

The Three Serializations

For different reasons and for different audiences, the formats of XML, JSON and CSV (spreadsheets) were chosen as the representative formats across which to formulate the abstract irON notation.

XML, or eXtensible Markup Language, has become the leading data exchange format and syntax for modern applications. It is frequently adopted by industry groups for standards and standard exchange formats. There is a rich diversity of tools that support the language, importantly including capable parsers and query languages. There is also a serialization of RDF in XML. As implemented in the irON notation, we call this serialization irXML.

JSON, the JavaScript Object Notation, has become very popular as a Web 2.0 data exchange format and is often the format of choice to drive JavaScript applications. There is a growing richness of tools that support JSON, including support from leading Web and general scripting languages such as JavaScript, Python, Perl, Ruby and PHP. JSON is relatively easy to read, and is also now growing in popularity with lightweight databases, such as CouchDB. As implemented in the irON notation, we call this serialization irJSON.

CSV, or comma-separated values, is a format that has been in existence for decades. It was made famous by Microsoft as a spreadsheet exchange format, which makes CSV very useful since spreadsheets are the most prevalent data authoring environment in existence. CSV is less expressive and capable as a data format than the other irON serializations, yet still has a attribute-value pair orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc. As implemented in the irON notation, we call this serialization commON.

The following diagram shows how these three formats relate to irON and then the canonical RDF target data model:

Data transformations path

We have used the unique differences amongst XML, JSON and CSV to guide the embracing abstract notations within irON. Note the round-tripping implications of the framework.

One exciting prospect for the design is how, merely by following the simple conventions within irON, each of these three data formats — and RDF !! — can be used more-or-less interchangeably, and can be used to extend existing schema within their domains.

Links, References and More

This first release of irON is in version 0.8. Updates and revisions are likely with use. Here are some key links for irON:

Mid-week, the parsers and converters for structWSF [6] will be released and announced on Fred Giasson’s blog.

In addition, within the next week we will be publishing a case study of converting the Sweet Tools semantic Web and -related tools dataset to commON.

The irON specification and notation by Structured Dynamics LLC is licensed under a Creative Commons Attribution-Share Alike 3.0. irON‘s parsers or converters are available under the Apache License, Version 2.0.

Editors’ Notes

irON is an important piece in the semantic enterprise puzzle that we are building at Structured Dynamics. It reflects our belief that knowledge workers should be able to author and create interoperable datasets without having to learn the arcana of RDF. At the same time we also believe that RDF is the appropriate data model for interoperability. irOn is an expression of our belief that many data formats have appropriate places and uses; there is no need to insist on a single format.

We would like to thank Dr. Jim Pitman for his advocacy of the importance of human-readable and easily authored datasets and formats. Via his leadership of the Bibliographic Knowledge Network (BKN) project and our contractual relationship with it [7], we have learned much regarding the BKN’s own format, BibJSON. Experience with this format has been a catalytic influence in our own work on irON.

Mike Bergman and Fred Giasson, editors


[1] Please see here for how irON fits within Structured Dynamics’ vision and family of products.
[2] Presently parsers and converters are available for the irJSON and commON serializations, and will be released this week. We have tentatively spec’ed the irXML converter, and would welcome working with another party to finalize a converter. Absent an immediate contribution from a third party, contractual work will likely result in our completing the irXML converter within the reasonable future.
[3] A pivotal premise of irON is the desirability of using the RDF data model as the canonical basis for interoperable data. RDF provides a data model capable of representing any extant data structure and any extant data format. This flexibility makes RDF a perfect data model for federating across disparate data sources. For a detailed discussion of RDF, see Michael K. Bergman, 2009. “Advantages and Myths of RDF,” in AI3 blog, April 8, 2009. See http://www.mkbergman.com/483/advantages-and-myths-of-rdf/.
[4] An attribute-value system is a basic knowledge representation framework comprising a table with columns designating “attributes” (also known as properties, predicates, features, parameters, dimensions, characteristics or independent variables) and rows designating “objects” (also known as entities, instances, exemplars, elements or dependent variables). Each table cell therefore designates the value (also known as state) of a particular attribute of a particular object. This is the basic table presentation of a spreadsheet or relational data table.

Attribute-values can also be presented as pairs in the form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the “object” is implied, or is introduced as the name of the array.

[5]We use the reference to the “ABox” and “TBox” in accordance with this working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[6] structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data, with generic tools driven by underlying data structures. Its central perspective is that of the dataset. Access and user rights are granted around these datasets, making the framework enterprise-ready and designed for collaboration. Since a structWSF layer may be placed over virtually any existing datastore with Web access — including large instance record stores in existing relational databases — it is also a framework for Web-wide deployments and interoperability.
[7] BKN is a project to develop a suite of tools and services to encourage formation of virtual organizations in scientific communities of various types. BKN is a project started in September 2008 with funding by the NSF Cyber-enabled Discovery and Innovation (CDI) Program (Award # 0835851). The major participating organizations are the American Institute of Mathematics (AIM), Harvard University, Stanford University and the University of California, Berkeley.