Posted: August 9, 2016

Continued Visibility for the Award-winning Web Portal

Laszlo Pinter, the individual who hired us as the technical contractor for the Peg community portal (www.mypeg.ca), recently gave a talk on the project at a TEDx conference in Winnipeg. Peg is the well-being indicator system for the community of Winnipeg. Laszlo’s talk is a 15-minute, high-level overview of the project and its rationale and role.

Peg helps identify and track indicators that relate to the economic, environmental, cultural and social well-being of the people of Winnipeg. There are scores of connected datasets underneath Peg that relate all information from stories to videos and indicator data to one another using semantic technologies. I first wrote about Peg when it was released at the end of 2013.

In 2014, Peg won the international Community Indicators Consortium Impact Award. The Peg Web site is a joint project of the United Way of Winnipeg (UWW) and the International Institute for Sustainable Development (IISD). Our company, Structured Dynamics, was the lead developer for the project, which is also based on SD’s Open Semantic Framework (OSF) platform.

Congratulations to the Peg team for the well-deserved visibility!

Posted: June 27, 2016

Constant Change Makes Adopting New Practices and Learning New Tools Essential

My business partner, Fred Giasson, has an uncanny ability to sense where the puck is heading. Perhaps that is due to his French Canadian heritage and his love of hockey. I suspect rather it is due to his having his finger firmly on the pulse.

When I first met Fred ten years ago he was already a seasoned veteran of the early semantic Web, with new innovative services at the time such as PingTheSemanticWeb under his belt. More recently, as we at Structured Dynamics began broadening from semantic technologies to more general artificial intelligence, Fred did his patient, quiet research and picked Clojure as our new official development language. We had much invested in our prior code base; switching main programming languages is always one of the most serious decisions a software shop can make. The choice has been brilliant, and our productivity has risen substantially. We are also able to exploit fundamentally new potentials based on a functional programming language that runs in the Java VM and has intellectual roots in Lisp.

As our work continues to shift more to knowledge bases and their use for mapping, classification, tagging, learning, etc., our challenges have been of a still different nature. Knowledge bases used in this manner are inherently open world, not only because the underlying knowledge is constantly changing, but also because the test, build and maintenance scripts and steps used to stage them for machine learners and training sets are themselves constantly changing. Dealing with knowledge management brings substantial technical debt, and systems and procedures must be in place to deal with that. Literate programming is one means to help.

Because Clojure’s REPL enables code to be interpreted and run dynamically as it is entered, we also have been looking seriously at the notebook paradigm that has come out of interactive science lab books, with such exemplars as iPython Notebook (Jupyter), Org-mode, Wolfram Alpha, Zeppelin, Gorilla, and others [1]. Fred had been interested in and poking at literate programming for quite a while, but his testing and use of Org-mode to keep track of our constant tests and revisions led him to take the question more seriously. You can see this example, a more-detailed example, or still another example of literate programming from Fred’s blog.

Literate programming is an even greater change than switching programming languages. To do literate programming right requires a focused commitment: workflows need to change and new tools must be learned. Is the effort worth it?

Literate Programming

Literate programming is a style of writing code and documentation first proposed by Donald Knuth. As any aspect of a coding effort is written — including its tests, configurations, installation, deployment, maintenance or experiments — a narrative accompanies it, explaining what it is, the logic behind it, what it is doing, and how to exercise it. This documentation far exceeds the best practices of in-line code commenting.

Literate programming narratives might provide background and thinking about what is being tested or tried, design objectives, workflow steps, recipes or whatever. The style and scope of documentation are similar to what might be expected in a scientist’s or inventor’s lab notebook. Indeed, the emerging breed of electronic notebooks, combined with REPL coding approaches, now enables interactive execution of functions and visualization and manipulation of results, including supporting macros.

Systems that support literate programming, such as Org-mode, can “tangle” their documents to extract the code portions for compilation and execution. They can also “weave” their documents to extract all of the documentation formatted for human readability, including using HTML. Some electronic systems can process multiple programming languages and translate functions. Some electronic systems have built-in spreadsheets and graphing libraries, and most open-source systems can be extended (though with varying degrees of difficulty and in different languages). Some of the systems are designed to interact with or publish Web pages.
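To make this concrete, here is a minimal sketch of what such a document might look like in Org-mode; the heading, file name and function are purely illustrative, not drawn from our actual build documents. The :tangle header is what lets Org-babel extract the source block to a file, while the surrounding narrative is what gets woven into the human-readable documentation.

```org
* Clean the input categories
  A short narrative explains why administrative categories are removed
  before the build step. The block below can be run in place with
  C-c C-c, or extracted ("tangled") to clean.clj via org-babel-tangle.

  #+BEGIN_SRC clojure :tangle clean.clj
  (defn remove-admin-categories
    "Drops administrative entries from a sequence of category names."
    [categories]
    (remove #(re-find #"^(Wikipedia|Template|Portal):" %) categories))
  #+END_SRC
```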

Code and programs do not reside in isolation. Their operation needs to be explained and understood by others for bug fixing, use or maintenance. If they are processing systems, there are parameters and input data that are the result of much testing and refinement; they may need to be refined further again. Systems must be installed and deployed. Libraries and languages are frequently being updated for security and performance reasons; executables and environments need to be updated as well. When systems are updated, there are tests that need to be run to check for continued expected performance and accuracy. The severity of some updates may require revision to whole portions of the underlying systems. New employees need tech transfer and training and managers need to know how to take responsibility for the systems. These are all needs that literate programming can help support.

One may argue that transaction systems in more stable environments may have a lesser requirement for literate programming. But, in any knowledge-intensive or knowledge management scenario, the inherent open world nature of knowledge makes something like literate programming an imperative. Everything is in constant flux, with an ongoing need for updates.

The objective of programmers should not be solely to write code, but to write systems that can be used and re-used to meet desired purposes at acceptable cost. Documentation is integral to that objective. Early experiments need to be improved, codified and documented such that they can be re-implemented across time and environment. Any revision of code needs to be accompanied by a revision or update to documentation. A lines-of-code (LOC) mentality is counter-productive to effective software for knowledge purposes. Literate programming is meant to be the workflow most conducive to achieving these ends.

The Nature of Knowledge

For quite some time now I have made the repeated argument that the nature of knowledge and knowledge management functions compel an emphasis on the open world assumption (OWA) [2]. Though it is a granddaddy of knowledge bases, let’s take the English Wikipedia as an example of why literate programming makes sense for knowledge management purposes. Let’s first look at the nature of Wikipedia itself, and then look at (next section) the various ways it must be processed for KBAI (knowledge-based artificial intelligence) purposes.

The nature of knowledge is that it is constantly changing. We learn new things, understand more about existing things, see relations and connections between things, and find knowledge in other arenas that causes us to re-think what we already knew. Such changes are the definition of an “open world” and pose major challenges to keeping information and systems up to date as these changes constantly flow in the background.

We can illustrate these points by looking at the changes in Wikipedia over a 20-month period from October 2012 to June 2014 [3]. Though growth of Wikipedia has been slowing somewhat since its peak growth years, the kinds of changes seen over this period are fairly indicative of the qualitative and structural changes that constantly affect knowledge.

First, the number of articles in the English Wikipedia (the largest, but only one of the 200+ Wikipedia language versions) increased 12% to over 4.6 million articles over the 20-month period, or greater than 0.6% per month. Actual article changes were greater than this amount. Total churn over the period was about 15.3%, with 13.8% totally new articles and 1.5% deleted articles [3].

Second, even greater changes occurred in Wikipedia’s category structure. More than 25% of the categories were net additions over this period. Another 4% were deleted [3]. Fairly significant categorical changes continue because of a concerted effort by the project to address category problems in the system.

And, third, edits of individual articles continued apace. Over this same period, more than 65 million edits were made to existing English articles, or about 0.75 edits per article per month [3]. Many of these changes, such as link or category or infobox assignments, affect the attributes or characteristics of the article subject, which has a direct effect on KBAI efforts. Also, even text changes affect many of the NLP-based methods for analyzing the knowledge base.
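These rates are simple to verify with a quick back-of-the-envelope check at a Clojure REPL, using the approximate figures cited above:

```clojure
;; Rough check of the Wikipedia change rates cited in the text;
;; all figures are approximate.
(let [end-articles   4.6e6                  ; articles at end of the period
      start-articles (/ end-articles 1.12)  ; ~12% growth over 20 months
      avg-articles   (/ (+ start-articles end-articles) 2)
      edits          65e6
      months         20]
  {:article-growth-per-month (/ 0.12 months)      ; ~0.006, or ~0.6%
   :edits-per-article-per-month
   (/ edits avg-articles months)})                ; ~0.75
```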

Granted, Wikipedia is perhaps an extreme example given its size and prominence. But the kinds of qualitative and substantive changes we see — new information, deletion of old information, adding or changing specifics to existing information, or changing how information is connected and organized — are common occurrences in any knowledge base.

The implications of working with knowledge bases are clear. KBs are constantly in flux. Single-event, static processing is dated as soon as the procedures are run. The only way to manage and use KB information comes from a commitment to constant processing and updates. Further, with each processing event, more is learned about the nature of the underlying information that causes the processing scripts and methods to need tweaking and refinement. Without detailed documentation of what has been done with prior processing and why, it is impossible to know how to best tweak next steps or to avoid dead-ends or mistakes of the past. KBAI processing cannot be cost-effective and responsive without a memory. Literate programming, properly done, provides just that.

The Nature of Systems to Manage Knowledge

Of course, KBAI may also involve multiple input sources, all moving at different speeds of change. There are also multiple steps involved in processing and updating the input information, the “systems”, if you will, required to stage and use the information for artificial intelligence purposes. The artifacts associated with these activities range from functional code and code scripts; to parameter, configuration and build files; to the documentation of those files and scripts; to the logic of the systems; to the process and steps followed to achieve desired results; and to the documentation of the tests and alternatives investigated at any stage in the process. The kicker is that all of these components, without a systematic approach, will need updates, and conventional (non-literate) coding approaches will not be remembered easily, causing costly re-discovery and re-work.

We have tallied up at least ten major steps associated with a processing pipeline for KBAI purposes. I briefly describe each below to give a flavor of the overall flux that needs to be captured by literate programming.

1. Updating Changing Knowledge

The section above dealt with this step, which is ensuring that the input knowledge bases to the overall KBAI process are current and accurate. Depending on the nature of the KM system, there may be multiple input KBs involved, each demanding updates. Besides capturing the changes in the base information itself, many of the steps below may also be required to properly process this changing input knowledge.

2. Processing Input KBs

For KBAI purposes, input KBs must be processed so as to be machine readable. Processing is also desirable to expose features for machine learners and to do other clean up of the input sources, such as removal of administrative categories and articles, cleaning up category structures, consolidating similar or duplicative inputs into canonical forms, and the like.
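A hypothetical cleanup pass of this kind, sketched in Clojure; the record keys, title patterns and function name are illustrative assumptions, not our production scripts:

```clojure
(require '[clojure.string :as str])

;; Hypothetical cleanup of an input KB: drop administrative pages and
;; normalize labels to a canonical trimmed, lower-cased form.
(defn clean-input-kb [articles]
  (->> articles
       (remove #(re-find #"^(Wikipedia|Template|Portal|Help):" (:title %)))
       (map #(update % :label (comp str/trim str/lower-case)))
       distinct))
```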

3. Installing, Running and Updating the System

The KBs themselves require host databases or triple stores. Each of the processing steps may have functional code or scripts associated with it. All general management systems need to be installed, kept current, and secured. The management of system infrastructure sometimes requires a staff of its own, not to mention its own installation, deployment, monitoring and update systems.

4. Testing and Vetting Placements

New entities and types added to the knowledge base need to be placed into the overall knowledge graph and tested for logical placement and connections. Though final placements should be manually verified, the sheer number of concepts in the system places a premium on semi-automatic tests and placements. Placement metrics are also important to help screen candidates.

5. Testing and Vetting Mappings

One key aspect of KBAI is its use in interoperating with other datasets and knowledge bases. As a result, new or updated concepts in the KB need to be tested and mapped with appropriate mapping predicates to external or supporting KBs. In the case of UMBEL, Structured Dynamics routinely attempts to map all concepts to Wikipedia (DBpedia), Cyc and Wikidata. Any changes to the base KB causes all of these mappings to be re-investigated and confirmed.

6. Testing and Vetting Assertions

Testing does not end with placements and mappings. Concepts are often characterized by attributes and values; they are often given internal assignments such as SuperTypes; and all of these assertions must be tested against what already exists in the KB. Though the tests may individually be fairly straightforward, there are thousands to test and cross-consistency is important. Each of these assertions is subject to unit tests.
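In practice, these checks translate into ordinary unit tests. A minimal sketch using clojure.test follows; the assertion records and the consistency predicate are hypothetical stand-ins for the real knowledge-graph checks:

```clojure
(require '[clojure.test :refer [deftest is run-tests]])

;; Hypothetical assertion records keyed by concept.
(def assertions
  {"Lion"   {:super-type :Animals       :asserted-parent "Mammal"}
   "Basalt" {:super-type :NaturalMatter :asserted-parent "Rock"}})

(defn consistent-assertion?
  "A stand-in check; the real test would consult the knowledge graph."
  [{:keys [super-type asserted-parent]}]
  (boolean (and super-type asserted-parent)))

(deftest assertions-are-consistent
  (doseq [[concept assertion] assertions]
    (is (consistent-assertion? assertion)
        (str "Inconsistent assertion for " concept))))

(run-tests)
```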

7. Ensuring Completeness

As we have noted elsewhere, our KBAI best practices call for each new concept in the KB to be accompanied by a definition, complete characterization and connections, and synonyms or synsets to aid in NLP tasks. These requirements, too, require scripts and systems for completion.

8. Testing and Vetting Coherence

As the broader structure is built and extended, system tests are applied to ensure the overall graph remains coherent and that outliers, orphans and fragments are addressed. Some of this testing is done via component typologies, and some occurs using various network and graph analyses. Possible problems need to be flagged and presented for manual inspection. Like other manual vetting requirements, confidence scoring and ranking of problems and candidates speed up the screening process.
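As one small illustration, orphan detection reduces to a few lines when the subclass structure is held as a child-to-parent map; the concept names here are invented for the example:

```clojure
;; Flag concepts that appear in neither a parent nor a child position of
;; the subClassOf map -- i.e., orphans detached from the graph.
(defn orphans [subclass-of all-concepts]
  (let [parents  (set (vals subclass-of))
        children (set (keys subclass-of))]
    (remove #(or (parents %) (children %)) all-concepts)))

(orphans {"Lion" "Mammal", "Mammal" "Animal"}
         ["Lion" "Mammal" "Animal" "Phlogiston"])
;;=> ("Phlogiston")
```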

9. Generating Training Sets

A key objective of the KBAI approach is the creation of positive and negative training sets. Candidates need to be generated; they need to be scored and tested; and their final acceptance needs to be vetted. Once vetted, the training sets themselves may need to be expressed in different formats or structures (such as finite state transducers, one of the techniques we often use) in order for them to be performant in actual analysis or use.

10. Testing and Vetting Learners

Machine learners can then be applied to the various features and training sets produced by the system. Each learning application involves the testing of one or more learners; the varying of input feature or training sets; and the testing of various processing thresholds and parameters (including possibly white and black lists). This set of requirements is one of the most intensive on this listing, and definitely requires documentation of test results, alternatives tested, and other observations useful to cost-effective application.

Rinse and Repeat

None of these ten steps is a static, one-time event. Rather, given the constant change inherent to knowledge sources, the entire workflow must be repeated on a periodic basis. To reduce the tension between updating effort and current accuracy, greater automation of the steps, with complete documentation, is essential. A lack of automation leads to outdated systems because of the effort and delays involved in updates. The imperative for automation, then, is a function of the change frequency in the input KBs.
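Sketched in Clojure, the recurring build amounts to threading an evolving build state through each of the steps above; the step names are placeholders for our actual scripts, with identity standing in for the real work:

```clojure
;; A schematic of the recurring build: each step is a function from
;; build state to build state. All step functions here are placeholders.
(def pipeline
  [[:update-input-kbs       identity]
   [:process-input-kbs      identity]
   [:update-system          identity]
   [:test-placements        identity]
   [:test-mappings          identity]
   [:test-assertions        identity]
   [:ensure-completeness    identity]
   [:test-coherence         identity]
   [:generate-training-sets identity]
   [:test-learners          identity]])

(defn run-build [state]
  (reduce (fn [s [step-name step-fn]]
            (println "running" step-name)
            (step-fn s))
          state
          pipeline))

;; re-run on a schedule as the input KBs change
(run-build {:kb-version "2016-06"})
```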

KBAI, perhaps at the pinnacle of knowledge management services, requires more of these steps and perhaps more frequent updates. But any knowledge management activity will incur a portion of these management requirements.

Yes, Literate Programming is Worth It

As I stated in a prior article in this series [4], “The only sane way to tackle knowledge bases at these structural levels is to seek consistent design patterns that are easier to test, maintain and update. Open world systems must embrace repeatable and largely automated workflow processes, plus a commitment to timely updates, to deal with the constant, underlying change in knowledge.” Literate programming is, we have come to believe, one of the integral ways to keep sane.

The effort to adopt literate programming is justified. But, as Fred noted in one of his recent posts, literate programming does impose a cost on teams and requires some cultural and mindset changes. However, in the context of KBAI, these are not simply nice-to-haves, they are imperatives.

Choice of tools and systems thus becomes important in supporting a literate programming environment. As Fred has noted, he has chosen Org-mode for Structured Dynamics’ literate programming efforts. Besides Org-mode, that also (generally) requires adoption by programmers of the Emacs editor. Both of these tools are a bit problematic for non-programmers.

Though no literate programming tools yet support WOPE (write once, publish everywhere), they can make much progress toward that goal. By “weaving” we can create standalone documentation. With the converter tool Pandoc, we can make (mostly) accurate translations between documents in dozens of formats. The system is open and can be extended. Pandoc works best with lightweight markup formats like Org-mode, Markdown, wikitext, Textile, and others.
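For instance, weaving to a standalone document and then converting it can be scripted in a couple of lines; the file names below are illustrative, while -f, -t and -o are standard Pandoc options:

```clojure
(require '[clojure.java.shell :refer [sh]])

;; Convert a woven Org document to Markdown via Pandoc
;; (file names are illustrative).
(sh "pandoc" "notes.org" "-f" "org" "-t" "markdown" "-o" "notes.md")
```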

We’re still working hard on the tooling infrastructure surrounding literate programming. We like the interactive notebooks approach, and also want easy and straightforward ways to deploy code snippets, demos and interactive Web pages.

Because of the factors outlined in this article, we see a renewed emphasis on literate programming. That, combined with the Web and its continued innovations, would appear to point to a future rich in responsive tooling and workflows geared to the knowledge economy.


[1] Other known open source electronic lab notebook options include Beaker Notebook, Flow, nteract, OpenWetWare, Pineapple, Rodeo, RStudio, SageMathCloud, Session, Shiny, Spark Notebook, and Spyder, among others certainly missed. Terminology for these apps includes notebook, electronic lab notebook, data notebook, and data scientist notebook.
[2] See M. K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Information blog, December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems.
OWA is a formal logic assumption that the truth-value of a statement is independent of whether or not it is known by any single observer or agent to be true. OWA is used in knowledge representation to codify the informal notion that in general no single agent or observer has complete knowledge, and therefore cannot make the closed world assumption. The OWA limits the kinds of inference and deductions an agent can make to those that follow from statements that are known to the agent to be true. OWA is useful when we represent knowledge within a system as we discover it, and where we cannot guarantee that we have discovered or will discover complete information. In the OWA, statements about knowledge that are not included in or inferred from the knowledge explicitly recorded in the system may be considered unknown, rather than wrong or false. Semantic Web languages such as OWL make the open world assumption.
Also, you can search on OWA on this blog.
[3] See Ramakrishna B. Bairi, Mark Carman and Ganesh Ramakrishnan, 2015. “On the Evolution of Wikipedia: Dynamics of Categories and Articles,” Wikipedia, a Social Pedia: Research Challenges and Opportunities: Papers from the 2015 ICWSM Workshop; also, see https://stats.wikimedia.org/EN/TablesWikipediaEN.htm
[4] M.K. Bergman, 2016. “Rationales for Typology Designs in Knowledge Bases,” AI3:::Adaptive Information blog, June 6, 2016.
Posted: May 11, 2016

UMBEL Version 1.50 Fully Embraces a Typology Design, Gets Other Computability Improvements

The year since the last major release of UMBEL (Upper Mapping and Binding Exchange Layer) has been spent in a significant re-think of how the system is organized. Four years ago, in version 1.05, we began to split UMBEL into a core and a series of swappable modules. The first module adopted was in geographical information; the second was in attributes. This design served us well, but it was becoming apparent that we were on a path of multiple modules. Each of UMBEL’s major so-called ‘SuperTypes’ — that is, major cleavages of the overall UMBEL structure that are largely disjoint from one another, such as between Animals and Facilities — was amenable to the module design. This across-the-board potential cleavage of the UMBEL system caused us to stand back and question whether a module design alone was the best approach. Ultimately, after much thought and testing, we adopted instead a typology design that brought additional benefits beyond simple modularity.

Today, we are pleased to announce the release of these efforts in UMBEL version 1.50. Besides standard release notes, this article discusses this new typology design, and explains its uses and benefits.

Basic UMBEL Background

The Web and enterprises in general are characterized by growing, diverse and distributed information sources and data. Some of this information resides in structured databases; some resides in schema, standards, metadata, specifications and semi-structured sources; and some resides in general text or media where the content meaning is buried in unstructured form. Given these huge amounts of information, how can one bring together what subsets are relevant? And, then for candidate material that does appear relevant, how can it be usefully combined or related given its diversity? In short, how does one go about actually combining diverse information to make it interoperable and coherent?

UMBEL thus has two broad purposes. UMBEL’s first purpose is to provide a general vocabulary of classes and predicates for describing and mapping domain ontologies, with the specific aim of promoting interoperability with external datasets and domains. UMBEL’s second purpose is to provide a coherent framework of reference subjects and topics for grounding relevant Web-accessible content. UMBEL presently has about 34,000 of these reference concepts drawn from the Cyc knowledge base, organized into 31 mostly disjoint SuperTypes.

The grounding of information mapped by UMBEL occurs by common reference to the permanent URIs (identifiers) for UMBEL’s concepts. The connections within the UMBEL upper ontology enable concepts from sources at different levels of abstraction or specificity to be logically related. Since UMBEL is an open source extract of the OpenCyc knowledge base, it can also take advantage of the reasoning capabilities within Cyc.

UMBEL in Linked Open Data

Diagram showing linked data datasets. UMBEL is near the hub, below and to the right of the central DBpedia.

UMBEL’s vocabulary is designed to recognize that different sources of information have different contexts and different structures, and meaningful connections between sources are not always exact. UMBEL’s 34,000 reference concepts form a knowledge graph of subject nodes that may be related to external classes and individuals (instances and entities). Via this coherent structure, we gain some important benefits:

  • Mapping to other ontologies — disparate and heterogeneous datasets and ontologies may be related to one another by mapping to the UMBEL structure
  • A scaffolding for domain ontologies — more specific domain ontologies can be made interoperable by using the UMBEL vocabulary and tying their more general concepts into the UMBEL structure
  • Inferencing — the UMBEL reference concept structure is coherent and designed for inferencing, which supports better semantic search and look-ups
  • Semantic tagging — UMBEL, and ontologies mapped to it, can be used as input bases to ontology-based information extraction (OBIE) for tagging text or documents; UMBEL’s “semsets” broaden these matches and can be used across languages
  • Linked data mining — via the reference ontology, direct and related concepts may be retrieved and mined and then related to one another
  • Creating computable knowledge bases — with complete mappings to key portions of a knowledge base, say, for Wikipedia articles, it is possible to use the UMBEL graph structure to create a computable knowledge source, with follow-on benefits in artificial intelligence and KB testing and improvements, and
  • Categorizing instances and named entities — UMBEL can bring a consistent framework for typing entities and relating their descriptive attributes to one another.

UMBEL is written in the semantic Web languages of SKOS and OWL 2. It is a class structure used in linked data, along with other reference ontologies. Besides data integration, UMBEL has been used to aid concept search, concept definitions, query ranking, ontology integration, and ontology consistency checking. It has also been used to build large ontologies and for online question answering systems [1].

Including OpenCyc, UMBEL has about 65,000 formal mappings to DBpedia, PROTON, GeoNames, and schema.org, and provides linkages to more than 2 million Wikipedia pages (English version). All of its reference concepts and mappings are organized under a hierarchy of 31 different SuperTypes, which are mostly disjoint from one another. Development of UMBEL began in 2007. UMBEL was first released in July 2008. Version 1.00 was released in February 2011.

Summary of Version 1.50 Changes

These are the principal changes between the last public release, version 1.20, and this version 1.50. In summary, these changes include:

  • Removed all instance or individual listings from UMBEL; this change does NOT affect the punning used in UMBEL’s design (see Metamodeling in Domain Ontologies)
  • Re-aligned the SuperTypes to better support computability of the UMBEL graph and its resulting disjointedness
  • These SuperTypes were eliminated with concepts re-assigned: Earthscape, Extraterrestrial, Notations and Numbers
  • These new SuperTypes were introduced: AreaRegion, AtomsElements, BiologicalProcesses, Forms, LocationPlaces, and OrganicChemistry, with logically reasoned assignments of RefConcepts
  • The Shapes SuperType is a new ST that is inherently non-disjoint because it is shared with about half of the RefConcepts
  • Situations is an important ST, overlooked in prior efforts, that helps better establish context for Activities and Events
  • Made re-alignments in UMBEL’s upper structure and introduced additional upper-level categories to better accommodate these refinements in SuperTypes
  • A typology was created for each of the resulting 31 disjoint STs, which enabled missing concepts to be identified and added and to better organize the concepts within each given ST
  • The broad adoption of the typology design for all of the (disjoint) SuperTypes also meant that prior module efforts, specifically Geo and Attributes, could now be made general to all of UMBEL. This re-integration also enabled us to retire these older modules without affecting functionality
  • The tests and refinements necessary to derive this design caused us to create flexible build and testing scripts, documented via literate programming (using Clojure)
  • Updated all mappings to DBpedia, Wikipedia, and schema.org
  • Incorporated donated mappings to five additional LOV vocabularies [2]
  • Tested the UMBEL structure for consistency and coherence
  • Updated all prior UMBEL documentation
  • Expanded and updated the UMBEL.org Web site, with access and demos of UMBEL.

UMBEL’s SuperTypes

The re-organizations noted above have resulted in some minor changes to the SuperTypes and how they are organized. These changes have made UMBEL more computable with a higher degree of disjointedness between SuperTypes. (Note, there are also organizational SuperTypes that work largely to aid the top levels of the knowledge graph, but are explicitly designed to NOT be disjoint. Important SuperTypes in this category include Abstractions, Attributes, Topics, Concepts, etc. These SuperTypes are not listed below.)

UMBEL thus now has 31 largely disjoint SuperTypes, organized into 10 or so clusters or “dimensions”:

  • Constituents: Natural Phenomena, Area or Region, Location or Place, Shapes, Forms, Situations
  • Time-related: Activities, Events, Times
  • Natural Matter: Atoms and Elements, Natural Substances, Chemistry
  • Organic Matter: Organic Chemistry, Biochemical Processes
  • Living Things: Prokaryotes, Protists & Fungus, Plants, Animals, Diseases
  • Agents: Persons, Organizations, Geopolitical
  • Artifacts: Products, Food or Drink, Drugs, Facilities
  • Information: Audio Info, Visual Info, Written Info, Structured Info
  • Social: Finance & Economy, Society

These disjoint SuperTypes provide the basis for the typology design described next.

The Typology Design

After a few years of working with SuperTypes it became apparent that each SuperType could become its own “module”, with its own boundaries and hierarchical structure. Since nearly 90% of the reference concepts across the UMBEL structure are themselves entity classes, properly organizing them lets us achieve a maximum of disjointness, modularity, and reasoning efficiency. Our early experience with modules pointed the way to a design for each SuperType that was as distinct and disjoint from other STs as possible. And, through a logical design of natural classes [3] for the entities in that ST, we could achieve a flexible, ‘accordion-like’ design that provides entity tie-in points from the general to the specific for each given SuperType. The design is effective for being able to interoperate across both fine-grained and coarse-grained datasets. For specific domains, the same design approach allows even finer-grained domain concepts to be effectively integrated.

All entity classes within a given SuperType are thus organized under the SuperType itself as the root. The classes within that ST are then organized hierarchically, with children classes having a subClassOf relation to their parent. Each class within the typology can become a tie-in point for external information, providing a collapsible or expandable scaffolding (the ‘accordion’ design). Via inferencing, multiple external sources may be related to the same typology, even though at different levels of specificity. Further, very detailed class structures can also be accommodated in this design for domain-specific purposes. Moreover, because of the single tie-in point for each typology at its root, it is also possible to swap out entire typology structures at once, should design needs require this flexibility.
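A toy fragment shows the idea: a typology is just a child-to-parent (subClassOf) map under its SuperType root, any node can serve as a tie-in point, and walking upward yields the expandable scaffolding. The concept names below are illustrative, not actual UMBEL RefConcepts:

```clojure
;; A toy typology fragment; the map encodes the subClassOf relation.
(def products-typology
  {"HandTool"  "Tool"
   "PowerTool" "Tool"
   "Tool"      "Product"
   "Furniture" "Product"})

(defn ancestors-of
  "Walks from a tie-in point up toward the SuperType root."
  [typology concept]
  (take-while some? (rest (iterate typology concept))))

(ancestors-of products-typology "PowerTool")
;;=> ("Tool" "Product")
```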

We have thus generalized the earlier module design to where every (mostly) disjoint SuperType now has its own separate typology structure. The typologies provide the flexible lattice for tying external content together at various levels of specificity. Further, the STs and their typologies may be removed or swapped out at will to deal with specific domain needs. The design also dovetails nicely with UMBEL’s build and testing scripts. Indeed, the evolution of these scripts via literate programming has also been a reinforcing driver for being able to test and refine the complete ST and typologies structure.

Still a Work in Progress

Though UMBEL retains its same mission as when the system was first formulated nearly a decade ago, we also see its role expanding. The two key areas of expansion are in UMBEL’s use to model and map instance data attributes and in acting as a computable overlay for Wikipedia (and other knowledge bases). These two areas of expansion are still a work in progress.

The mapping to Wikipedia is now about 85% complete. While we are testing automated mapping mechanisms, because of its central role we also need to vet all UMBEL-Wikipedia mapping assignments. This effort is pointing out areas of UMBEL that are over-specified, under-specified, and sometimes duplicative or in error. Our goal is to get to a 100% coverage point with Wikipedia, and then to exercise the structure for machine learning and other tests against the KB. These efforts will enable us to enhance the semsets in UMBEL as well as to move toward multilingual versions. This effort, too, is still a work in progress.

Despite these desired enhancements, we are using all aspects of UMBEL and its mappings to both aid these expansions and to test the existing mappings and structure. These efforts are proving the virtuous circle of improvements that is at the heart of UMBEL’s purposes.

Where to Get UMBEL and Learn More

The UMBEL Web site provides various online tools and Web services for exploring and using UMBEL. The UMBEL GitHub site is where you can download the UMBEL Vocabulary or the UMBEL Reference Concept ontology, both under a Creative Commons Attribution 3.0 license. Other documents and backup are also available from that location.

Technical specifications for UMBEL and its various annexes are available from the UMBEL wiki site. You can also download a PDF version of the specifications from there. You are also welcome to participate on the UMBEL mailing list or LinkedIn group.


[2] Courtesy of Jana Vataščinová (University of Economics, Prague) and Ondřej Zamazal (University of Economics, Prague, COSOL project).
[3] See, for example, M.K. Bergman, 2015. “‘Natural Classes’ in the Knowledge Web,” AI3:::Adaptive Information blog, July 13, 2015.

Posted: December 10, 2014

Ten Management Reasons for Choosing Clojure for Adaptive Knowledge Apps

It is not unusual to see articles touting one programming language or listing the reasons for choosing another, but they are nearly always written from the perspective of the professional developer. As an executive with much experience directing software projects, I thought a management perspective could be a useful addition to the dialog. My particular perspective is software development in support of knowledge management, specifically leveraging artificial intelligence, semantic technologies, and data integration.

Context is important in guiding the selection of programming languages. C is sometimes a choice for performance reasons, such as in data indexing or transaction systems. Java is the predominant language for enterprise applications and enterprise-level systems. Scripting languages are useful for data migrations and translations and quick applications. Web-based languages of many flavors help in user interface development or data exchange. Every one of the hundreds of available programming languages has a context and rationale that is argued by advocates.

We at Structured Dynamics have recently made a corporate decision to emphasize the Clojure language in the specific context of knowledge management development. I’d like to offer our executive-level views for why this choice makes sense to us. Look to the writings of SD’s CTO, Fred Giasson, for arguments related to the perspective of the professional developer.

Some Basic Management Premises

I have overseen major transitions in programming languages from Fortran or Cobol to C, from C to C++ and then Java, and from Java to more Web-oriented approaches such as Flash, JavaScript and PHP [1]. In none of these cases, of course, was a sole language adopted by the company, but there always was a core language choice that drove hiring and development standards. Making these transitions is never easy, since it affects employee choices, business objectives and developer happiness and productivity. Because of these trade-offs, there is rarely a truly “correct” choice.

Languages wax and wane in popularity, and market expectations and requirements shift over time. Twenty years ago, Java brought forward a platform-independent design well-suited for client-server computing, and was (comparatively) quickly adopted by enterprises. At about the same time Web developments added browser scripting languages to the mix. Meanwhile, hardware improvements overcame many previous performance limitations in favor of easier-to-use syntaxes and tooling. Hardly anyone programs in assembler anymore. Sometimes, as with Flash, circumstances and competition lead to rapid (and unanticipated) abandonment.

The fact that such transitions naturally occur over time, and the fact that distributed and layered architectures are here to stay, has led to my design premise to emphasize modularity, interfaces and APIs [2]. From the browser, client and server sides we see differential timings of changes and options. It is important that piece parts be able to be swapped out in favor of better applications and alternatives. Irrespective of language, architectural design holds the trump card in adaptive IT systems.

Open source has also had a profound influence on these trends. Commercial and product offerings are no longer monolithic and proprietary. Rather, modern product development is often more based on assembly of a diversity of open source applications and libraries, likely written in a multitude of languages, which are then assembled and “glued” together, often with a heavy reliance on scripting languages. This approach has certainly been the case with Structured Dynamics’ Open Semantic Framework, but OSF is only one example of this current trend.

The trend to interoperating open source applications has also raised the importance of data interoperability (or ETL in various guises) plus reconciling semantic heterogeneities in the underlying schema and data definitions of the contributing sources. Language choices increasingly must recognize these heterogeneities.

I have believed strongly in the importance of interest and excitement by the professional developers in the choice of programming languages. The code writers — be it for scripting or integration or fundamental app development — know the problems at hand and read about trends and developments in programming languages. The developers I have worked with have always been the source of identifying new programming options and languages. Professional developers read much and keep current. The best programmers are always trying and testing new languages.

I believe it is important for management within software businesses to communicate anticipated product changes within the business to its developers. I believe it is important for management to signal openness and interest in hearing the views of its professional developers in language trends and options. No viable software development company can avoid new upgrades of its products, and choices as to architecture and language must always be at the forefront of version planning.

When developer interest and broader external trends conjoin, it is time to do serious due diligence about a possible change in programming language. Tooling is important, but not dispositive. Tooling rapidly catches up with trending and popular new languages. As important to tooling is the “fit” of the programming language to the business context and the excitement and productivity of the developers to achieve that fit.

“Fitness” is a measure of adaptiveness to a changing environment. Though I suppose some of this can be quantified — as it can in evolutionary biology — I also see “fit” as a qualitative, even aesthetic, thing. I recall sensing the importance of platform independence and modularity in Java when it first came out, results (along with tooling) that were soon borne out. Sometimes there is just a sense of “rightness” or alignment when contemplating a new programming language for an emerging context.

Such is the case for us at Structured Dynamics with our recent choice to emphasize Clojure as our core development language. This choice does not mean we are abandoning our current code base, just that our new developments will emphasize Clojure. Again, because of our architectural designs and use of APIs, we can make these transitions seamlessly as we move forward.

However, for this transition, unlike the prior ones noted above, I wanted to be explicit as to the reasons and justifications. Presenting these reasons for Clojure is the purpose of this article.

Brief Overview of Clojure

Clojure is a relatively new language, first released in 2007 [3]. Clojure is a dialect of Lisp, explicitly designed to address some Lisp weaknesses in a more modern package suitable to the Web and current, practical needs. Clojure is a functional programming language, which means it has roots in the lambda calculus and functions are “first-class citizens” in that they can be passed as arguments to other functions, returned as values, or stored in data structures. These features make the language well suited to mathematical manipulations and the building up of more complicated functions from simpler ones.
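A few lines at the REPL show what “first-class” means in practice; this is a generic illustration, not SD code:

```clojure
;; Functions passed as arguments, returned as values, and composed.
(defn apply-twice [f x] (f (f x)))
(apply-twice inc 5)                ;=> 7

(defn adder [n] (fn [x] (+ x n)))  ; returns a new function
((adder 10) 32)                    ;=> 42

((comp str inc) 41)                ;=> "42", built from simpler functions
```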

Clojure was designed to run on the Java Virtual Machine (JVM) (now expanded to other environments such as ClojureScript and Clojure CLR) rather than any specific operating system. It was designed to support concurrency. Other modern features were added in relation to Web use and scalability. Some of these features are elaborated in the rationales noted below.

As we look to the management reasons for selecting Clojure, we can really lump them into two categories: a) those that arise mostly from Lisp, as the overall language basis; and b) specific aspects added to Clojure that overcome or enhance the basis of Lisp.

Reasons Deriving from Lisp

Lisp (defined as a list processing language) is one of the older computer languages around, dating back to 1958, and has evolved to become a family of languages. “Lisp” has many variants, with Common Lisp one of the most prevalent, and many dialects that have extended and evolved from it.

Lisp was invented as a language for expressing algorithms. Lisp has a very simple syntax and (in base form) comparatively few commands. Lisp syntax is notable (and sometimes criticized) for its common use of parentheses, in a nested manner, to specify the order of operations.

Lisp is often associated with artificial intelligence programming, since it was first specified by John McCarthy, the acknowledged father of AI, and was the favored language for many early AI applications. Many have questioned whether Lisp has any special usefulness to AI or not. Though it is hard to point to any specific reason why Lisp would be particularly suited to artificial intelligence, it does embody many aspects highly useful to knowledge management applications. It was these reasons that caused us to first look at Lisp and its variants when we were contemplating language alternatives:

  1. Open — I have written often that knowledge management, given its nature grounded in the discovery of new facts and connections, is well suited to being treated under the open world assumption [4]. OWA is a logic that explicitly allows the addition of new assertions and connections, built upon a core of existing knowledge. In both optical and literal senses, the Lisp language is an open one that is consistent with OWA. The basis of Lisp has a “feel” much like RDF (Resource Description Framework), the prime language of OWA. RDF is built from simple statements, as is Lisp. New objects (data) can be added easily to both systems. Adding new predicates (functions) is similarly straightforward. Lisp’s nested parentheses also recall Boolean logic, another simple means for logically combining relationships. As with semantic technologies, Lisp just seems “right” as a framework for addressing knowledge problems;
  2. Extensible — one of the manifestations of Lisp’s openness is its extensiblity via macros. Macros are small specifications that enable new functions to be written. This extensibility means that a “bottom up” programming style can be employed [5] that allows the syntax to be expanded with new language to suit the problem, leading in more complete cases to entirely new — and tailored to the problem — domain-specific languages (DSLs). As an expression of Lisp’s openness, this extensibility means that the language itself can morph and grow to address the knowledge problems at hand. And, as that knowledge continues to grow, as it will, so may the language by which to model and evaluate it;
  3. Efficient — Lisp, then, has a minimum of extraneous functions and, after an (often steep) initial learning curve, allows rapid development of ones appropriate to the domain. Since anything Lisp can do to a data structure it can do to code, it is an efficient language. When developing code, Lisp provides a read-eval-print-loop (REPL) environment that allows developers to enter single expressions (s-expressions) and evaluate them at the command line, leading to faster and more efficient ways to add features, fix bugs, and test code. Accessing and managing data is similarly efficient, along with code and data maintenance. A related efficiency with Lisp is lazy evaluation, wherein only the given code or data expression at hand is evaluated as its values are needed, rather than building evaluation structures in advance of execution (several of these properties are illustrated in the short sketch following this list);
  4. Data-centric — the design aspect of Lisp that makes many of these advantages possible is its data-centric nature. This nature comes from Lisp’s grounding in the lambda calculus and its representation of all code objects via an abstract syntax tree. These aspects allow data to be immutable, and for data to act as code, and code to act as data [6]. Thus, while we can name and reference data or functions as in any other language, they are both manipulated and usable in the same way. Since knowledge management is the combination of schema with instance data, Lisp (or other homoiconic languages) is perfectly suited; and,
  5. Malleable — a criticism of Lisp through the decades has been its proliferation of dialects, and lack of consistency between them. This is true, and Lisp is therefore not likely the best vehicle in its native form for interoperability. (It may also not be the best basis for large-scale, complicated applications with responsive user interfaces.) But for all of the reasons above, Lisp can be morphed into many forms, including the manipulation of syntax. In such malleable states, the dialect maintains its underlying Lisp advantages, but can be differently purposed for different uses. Such is the case with Clojure, discussed in the next section.
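A brief sketch shows a few of these Lisp properties as they appear in Clojure; the unless macro is a toy teaching example, not SD code:

```clojure
;; Extensibility via macros: add an 'unless' control form to the language.
(defmacro unless [test & body]
  `(if (not ~test) (do ~@body)))

(unless false (println "runs when the test is false"))

;; Code as data: an expression is just a list that can be inspected or run.
(def expr '(+ 1 2 3))
(first expr)   ;=> +
(eval expr)    ;=> 6

;; Lazy evaluation: only the elements actually requested are computed.
(take 5 (map #(* % %) (range)))   ;=> (0 1 4 9 16)
```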

Reasons Specific to the Clojure Language

Clojure was invented by Rich Hickey, who knew explicitly what he wanted to accomplish leveraging Lisp for new, more contemporary uses [7]. (Though some in the Lisp community have bristled against the idea that dialects such as Common Lisp are not modern, the points below really make a different case.) Some of the design choices behind Clojure are unique and quite different from the Lisp legacy; others leverage and extend basic Lisp strengths. Thus, with Clojure, we can see both a better Lisp, at least for our stated context, and one designed for contemporary environments and circumstances.

Here are what I see as the unique advantages of Clojure, again in specific reference to the knowledge management context:

  1. Virtual machine design (interoperability) — the masterstroke in Clojure is to base it upon the Java Virtual Machine. All of the language’s base functions were written to run on the JVM. Three significant advantages accrue from this design. First, Clojure programs can be compiled as jar files and run interactively with any other Java program. In the context of knowledge management and semantic uses, fully 60% of existing applications can now interoperate with Clojure apps [8], an instant boon for leveraging many open source capabilities. Second, certain advantages from Java, including platform independence and the leverage of debugging and profiling tools, among others (see below), are gained. And, third, this same design approach has been applied to integrating with JavaScript (via ClojureScript) and the Common Language Runtime execution engine of Microsoft’s .Net Framework (via Clojure CLR), both highly useful for Web purposes and as exemplars for other integrations;
  2. Scalable performance — by leveraging Java’s multithreaded and concurrent design, plus a form of caching called memoization in conjunction with the lazy evaluation noted above, Clojure gains significant performance advantages and scalability (see the brief sketch following this list). The immutability of Clojure data means minimal data access conflicts (it is thread-safe), further adding to performance advantages. We (SD) have yet to require such advantages, but it is a comfort to know that large-scale datasets and problems likely have a ready programmatic solution when using Clojure;
  3. More Lispiness — Clojure extends the basic treatment of Lisp’s s-expressions to maps and vectors, basically making all core constructs into extensible abstractions; Clojure is explicit in how metadata and reifications can be applied using Lisp principles, really useful for semantic applications; and Clojure EDN (Extensible Data Notation) was developed as a Lisp extension to provide an analog to JSON for data specification and exchange using ClojureScript [9]. SD, in turn, has taken this approach and extended it to work with the OSF Web services [10]. These extensions go to the core of the Lisp mindset, again reaffirming the malleability and extensibility of Lisp;
  4. Process-oriented — many knowledge management tasks are themselves the results of processing pipelines, or are semi-automatic in nature and require interventions by users or subject matter experts in filtering or selecting results. Knowledge management tasks and the data pre-processing that precedes them often require step-wise processes or workflows. The immutability of data and functions in Lisp means that state is also immutable. Clojure takes advantage of these Lisp constructs to make explicit the treatment of time and state changes. Further, based on Hickey’s background in scheduling systems, a construct of transducers [11] is being introduced in version 1.70. The design of these systems is powerful for defining and manipulating workflows and rule-based branching. Data migration and transformations benefit from this design; and
  5. Active and desirable — hey, developers want to work with this stuff and think it is fun. SD’s clients and partners are also desirous of working with program languages attuned to knowledge management problems, diverse native data, and workflows controlled and directed by knowledge workers themselves. These market interests are key affirmations that Clojure, or dialects similar to it, are the wave of the future for knowledge management programming. Combined with its active developer community, Clojure is our choice for KM applications for the foreseeable future.
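Three of these points are easy to see in a few lines of core Clojure; a minimal sketch, with nothing SD-specific assumed:

```clojure
;; (1) JVM interop: Java classes are used directly, no wrappers needed.
(def fmt (java.text.SimpleDateFormat. "yyyy-MM-dd"))
(.format fmt (java.util.Date.))

;; (2) Memoization: cache the results of a pure but expensive function.
(def slow-square (fn [x] (Thread/sleep 500) (* x x)))
(def fast-square (memoize slow-square))
(fast-square 12)   ; slow the first time
(fast-square 12)   ; answered from the cache thereafter

;; (3) A transducer (Clojure 1.7+): a composable transformation applied
;; without building intermediate collections.
(into [] (comp (map inc) (filter even?)) (range 10))
;;=> [2 4 6 8 10]
```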

Clojure for Knowledge Apps

I’m sure had this article been written from a developer’s perspective, different emphases and different features would have arisen. There is no perfect programming language and, even if there were, its utility would vary over time. The appropriateness of program languages is a matter of context. In our context of knowledge management and artificial intelligence applications, Clojure is our due diligence choice from a business-level perspective.

There are alternatives to the points raised herein, like Scheme, Erlang or Haskell. Scala offers some of the same JVM benefits as noted. Further, tooling for Clojure is still limited (though growing), and it requires Java to run and develop. Even with extensions and DSLs, there is still the initial awkwardness of learning Lisp’s mindset.

Yet, ultimately, the success of a programming language is based on its degree of use and longevity. We are already seeing very small code counts and strong productivity from our use of Clojure. We are pleased to see continued language dynamism from such developments as Transit [9] and transducers [11]. We think many problem areas in our space — from data transformations and lifting, to ontology mapping, and then machine learning and AI and integrations with knowledge bases, all under the control of knowledge workers versus developers — lend themselves to Clojure DSLs of various sorts. We have plans for these DSLs and look forward to contributing them to the community.

We are excited to now find an aesthetic programming fit with our efforts in knowledge management. We’d love to see Clojure become the go-to language for knowledge-based applications. We hope to work with many of you in helping to make this happen.


[1] I have also been involved with the development of two new languages, Promula and VIQ, and conducted due diligence on C#, Ruby and Python, but none of these languages were ultimately selected.
[2] Native apps on smartphones are likely going through the same transition.
[3] As of the date of this article, Clojure is in version 1.60.
[4] See M. K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Information blog, December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems.
OWA is a formal logic assumption that the truth-value of a statement is independent of whether or not it is known by any single observer or agent to be true. OWA is used in knowledge representation to codify the informal notion that in general no single agent or observer has complete knowledge, and therefore cannot make the closed world assumption. The OWA limits the kinds of inference and deductions an agent can make to those that follow from statements that are known to the agent to be true. OWA is useful when we represent knowledge within a system as we discover it, and where we cannot guarantee that we have discovered or will discover complete information. In the OWA, statements about knowledge that are not included in or inferred from the knowledge explicitly recorded in the system may be considered unknown, rather than wrong or false. Semantic Web languages such as OWL make the open world assumption.
Also, you can search on OWA on this blog.
[5] Paul Graham, 1993. “Programming Bottom-Up,” is a re-cap on Graham’s blog related to some of his earlier writings on programming in Lisp. By “bottom up” Graham means “. . . changing the language to suit the problem. . . . Language and program evolve together.”
[6] A really nice explanation of this approach is in James Donelan, 2013. “Code is Data, Data is Code,” on the Mulesoft blog, September 26, 2013.
[7] Rich Hickey is a good public speaker. Two of his seminal videos related to Clojure are “Are We There Yet?” (2009) and “Simple Made Easy” (2011).
[8] My Sweet Tools listing of knowledge management software is dominated by Java, with about half of all apps in that language.
[9] See the Clojure EDN; also Transit, EDN’s apparent successor.
[10] structEDN is a straightforward RDF serialization in EDN format.
[11] For transducers in Clojure version 1.70, see this Hickey talk, “Transducers” (2014).
Posted: October 16, 2014

AI3 Pulse

Peg, the well-being indicator system for the community of Winnipeg, recently won the international Community Indicators Consortium Impact Award, presented in Washington, D.C. on Sept. 30. Peg is a joint project of the United Way of Winnipeg (UWW) and the International Institute for Sustainable Development (IISD). Our company, Structured Dynamics, was the lead developer for the project, which is also based on SD’s Open Semantic Framework (OSF) platform.
Peg is an innovative Web portal that helps identify and, on an ongoing basis, track indicators that relate to the economic, environmental, cultural and social well-being of the people of Winnipeg. Datasets may be selected and compared with a variety of charting, mapping and visualization tools at the level of the entire city, neighborhoods or communities. All told, there are now more than 60 indicators within Peg, ranging from active transportation to youth unemployment rates, representing thousands of individual records and entities. We last discussed Peg upon its formal release in December 2013.

Congratulations to all team members!
