Refining a Generator, Testing the Coherence, and Computing a KB
The six months since the last major release of UMBEL (Upper Mapping and Binding Exchange Layer) have been spent improving the coherence and broadening the usefulness of the ontology. Structured Dynamics is today releasing version 1.20 of the open source UMBEL.
UMBEL’s first purpose is to provide a general vocabulary of classes and predicates for describing domain ontologies, with the specific aim of promoting interoperability with external datasets and domains. UMBEL’s second purpose is to provide a coherent framework of reference subjects and topics for grounding relevant Web-accessible content. UMBEL presently has about 35,000 of these reference concepts drawn from the Cyc knowledge base, split into ‘core’ and a series of optional modules, which are organized into 32 mostly disjoint SuperTypes.
The key advances in this new 1.20 version of UMBEL include refinements to the UMBEL generator, improved tests for satisfiability and coherence, and additional mappings and structure to aid UMBEL’s role as a computing overlay for existing knowledge bases, such as Wikipedia. The latter advance is aided in part by the addition of an Attributes Ontology to UMBEL, as described in the prior articles “An UMBEL Extension for Attributes” and “Conceptual and Practical Distinctions in the Attributes Ontology”.
Summary of Changes
These are the principal changes between the last public release, version 1.10, and this version 1.20:
- Expanded mappings to OpenCyc to better capture coverage of Wikipedia content; there are now 35,533 reference concepts (RCs) in UMBEL, 35,302 of which are mapped to OpenCyc (the unmapped RCs are mostly used for organizational purposes in the Attributes Ontology and OpenCyc mismatches with key external ontologies)
- Created a new Attributes Ontology (AO), with the purpose of enabling property (attribute) mappings to UMBEL (see further the UMBEL Annex L discussion for more details on this version update)
- Created a new Attributes module, with 1,002 RCs assigned
- Created a new Entities SuperType, with 20,393 RCs designated. The Entities ST is by definition non-disjoint with UMBEL’s other SuperTypes
- Created a new Entities module, with 9,317 RCs assigned; the remainder of the Entities RCs are in core
- Expanded the direct UMBEL RC to Wikipedia page mappings, with 25,582 currently mapped, or nearly three-quarters (72%) of RCs now assigned
- Created a new Annex Z to hold updated statistics about UMBEL
- Deprecated the Workplaces SuperType, and merged with the
- Deprecated the MarketIndustries SuperType, and merged with the
- Reviewed and greatly improved ST assignments across the board; notably, the distinction between the Activities SuperTypes was improved. See Annex Z for the updated ST assignment statistics
- Greatly expanded and improved the UMBEL generator to handle satisfiability tests and modules creation
- Expanded and updated the UMBEL.org Web site.
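The counts in the list above can be cross-checked with a little arithmetic. The sketch below uses only the figures stated in this article; the variable and function names are illustrative and are not part of any UMBEL tooling.

```python
# Figures copied from the change list above; all names are illustrative only.
TOTAL_RCS = 35_533
MAPPED_TO_OPENCYC = 35_302
MAPPED_TO_WIKIPEDIA = 25_582
ENTITIES_ST_RCS = 20_393
ENTITIES_MODULE_RCS = 9_317


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Share of reference concepts with a direct Wikipedia page mapping.
wikipedia_coverage = share(MAPPED_TO_WIKIPEDIA, TOTAL_RCS)

# RCs that have no OpenCyc mapping.
unmapped_to_opencyc = TOTAL_RCS - MAPPED_TO_OPENCYC

# Entities RCs that are not in the Entities module, so they sit in core.
entities_in_core = ENTITIES_ST_RCS - ENTITIES_MODULE_RCS

print(f"Wikipedia coverage: {wikipedia_coverage}%")
print(f"RCs without an OpenCyc mapping: {unmapped_to_opencyc}")
print(f"Entities RCs in core: {entities_in_core}")
```

The Wikipedia figure agrees with the 72% quoted above, and the subtraction shows how many Entities RCs remain in core once the Entities module is split out.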
A Short Primer on UMBEL
The Web and enterprises in general are characterized by growing, diverse and distributed information sources and data. Some of this information resides in structured databases; some resides in schema, standards, metadata, specifications and semi-structured sources; and some resides in general text or media where the content meaning is buried in unstructured form. Given these huge amounts of information, how can one bring together what subsets are relevant? And, then for candidate material that does appear relevant, how can it be usefully combined or related given its diversity? In short, how does one go about actually combining diverse information to make it interoperable and coherent?
UMBEL was conceived to provide a reference grounding to achieve these very aims. UMBEL’s vocabulary is designed to recognize that different sources of information have different contexts and different structures, and meaningful connections between sources are not always exact. UMBEL’s 35,000 reference concepts — drawn from the logically consistent Cyc knowledge base backed by 1000 person-years of development and testing — provide a set of fixed references by which we can orient, map and navigate external content. These UMBEL reference concepts form a knowledge graph (you can see a big graph visualization of this structure) of subject nodes that may be related to external classes and individuals (instances and named entities). Via this coherent structure, we gain some important benefits:
- Mapping to other ontologies: disparate and heterogeneous datasets and ontologies may be related to one another by mapping to the UMBEL structure
- A scaffolding for domain ontologies: more specific domain ontologies can be made interoperable by using and tying their more general concepts into the UMBEL structure
- Inferencing: the UMBEL reference concept structure is designed for inferencing, which supports better semantic search and look-ups
- Semantic tagging: UMBEL, and ontologies mapped to it, can be used as input bases to ontology-based information extraction (OBIE) for tagging text or documents; UMBEL’s “semsets” broaden these matches and can be used across languages
- Linked data mining: via the reference ontology, direct and related concepts may be retrieved and mined and then related to one another
- Creating computable knowledge bases: with complete mappings to key portions of a knowledge base, say, for Wikipedia articles, it is possible to use the UMBEL graph structure to create a computable knowledge source, with follow-on benefits in artificial intelligence and KB testing and improvements, and
- Categorizing instances and named entities: UMBEL can bring a consistent framework for typing entities and relating their descriptive attributes to one another.
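To make the mapping and inferencing benefits concrete, here is a minimal sketch of how two terms from different vocabularies can be related once both are mapped to a shared concept hierarchy. Everything in it is a toy assumption: the concept names, the `vocabA:` and `vocabB:` prefixes, and the helper functions are hypothetical and are not taken from UMBEL or its mappings.

```python
from collections import deque

# Toy subclass edges (concept -> direct super-concepts); hypothetical names.
SUBCLASS_OF = {
    "Dog": {"Mammal"},
    "Cat": {"Mammal"},
    "Mammal": {"Animal"},
    "Animal": set(),
}

# Hypothetical mappings from two external vocabularies to reference concepts.
MAPPINGS = {
    "vocabA:Canine": "Dog",
    "vocabB:Feline": "Cat",
}


def ancestors(concept):
    """Return every super-concept reachable by following subclass edges."""
    seen = set()
    queue = deque(SUBCLASS_OF.get(concept, ()))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.add(parent)
            queue.extend(SUBCLASS_OF.get(parent, ()))
    return seen


def shared_concepts(term_a, term_b):
    """Return the reference concepts that both external terms fall under."""
    concept_a = MAPPINGS[term_a]
    concept_b = MAPPINGS[term_b]
    return (ancestors(concept_a) | {concept_a}) & (ancestors(concept_b) | {concept_b})


print(sorted(shared_concepts("vocabA:Canine", "vocabB:Feline")))
```

Neither vocabulary mentions the other, yet the shared reference structure is enough to infer that both terms fall under the same broader concepts. A real reasoner works over far richer axioms than this breadth-first walk.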
UMBEL is being developed and refined via large-scale use cases. A number of improvements have been brought to the system to make it more testable, manageable, and flexible.
The first improvement was to introduce the so-called SuperTypes to UMBEL. All UMBEL reference concepts are assigned to one or more of 32 SuperTypes, organized into nine dimensions (details may be found here). The four SuperTypes of Attributes, Abstract-level, Entities and Topics/Categories are designed to be fully non-disjoint, and do not participate in any disjoint assertions. The remaining 28 SuperTypes are designed to be as disjoint as possible:
|Natural World||Natural Phenomena|
|Protists & Fungus|
|Finance & Economy|
|Food or Drink|
|Notations & References|
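Disjoint SuperTypes are what make the structure testable: a reference concept assigned to two SuperTypes that are declared disjoint signals a mis-assignment. The sketch below illustrates that idea with a toy check. The concept names, the assignments, and the choice of which pairs are disjoint are assumptions made for illustration, and the function is hypothetical; it is not how the UMBEL generator or the OWL API performs satisfiability tests.

```python
from itertools import combinations

# The four SuperTypes the article describes as fully non-disjoint.
NON_DISJOINT = {"Attributes", "Abstract-level", "Entities", "Topics/Categories"}

# Assumed disjointness assertions between SuperTypes (illustrative only).
DISJOINT_PAIRS = {
    frozenset({"Protists & Fungus", "Food or Drink"}),
    frozenset({"Natural Phenomena", "Finance & Economy"}),
}

# Hypothetical reference concepts and their SuperType assignments.
ASSIGNMENTS = {
    "Mushroom": {"Protists & Fungus", "Food or Drink", "Entities"},
    "Lightning": {"Natural Phenomena", "Entities"},
}


def disjointness_violations(assignments, disjoint_pairs):
    """Map each concept to the disjoint SuperType pairs it is assigned to."""
    violations = {}
    for concept, supertypes in assignments.items():
        # Non-disjoint SuperTypes never take part in disjoint assertions.
        candidates = sorted(set(supertypes) - NON_DISJOINT)
        clashes = [
            pair
            for pair in combinations(candidates, 2)
            if frozenset(pair) in disjoint_pairs
        ]
        if clashes:
            violations[concept] = clashes
    return violations


for concept, clashes in disjointness_violations(ASSIGNMENTS, DISJOINT_PAIRS).items():
    print(concept, "is assigned to disjoint SuperTypes:", clashes)
```

In this toy data only `Mushroom` is flagged; `Lightning` passes because Entities does not participate in any disjoint assertion.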
To make UMBEL more tractable, we have also modularized it into ‘core’, ‘geo’, ‘entities’, and ‘attributes’ modules (the latter two modules being added in this new release). The modules can be swapped out with other external options or left out of analysis if not needed for a given domain interest. We also have formal mappings to other important external reference sets such as Wikipedia, OpenCyc, schema.org, the DBpedia ontology, GeoNames and PROTON. UMBEL’s GitHub site provides these mappings.
Beginning with version 1.10, we also added a new UMBEL generator written in Clojure that allows the entire system to be built and tested from a series of simple input files. We are now using this system aggressively to discover gaps and mis-assignments in the UMBEL structure, as well as to achieve balance in scope and coverage. The system ties into the OWL API for certain tests and capabilities (UMBEL is OWL 2-compliant).
Still a Work in Progress
Though UMBEL retains its same mission as when the system was first formulated eight years ago, we also see its role expanding. The two key areas of expansion are in UMBEL’s use to model and map instance data attributes and in acting as a computable overlay for Wikipedia (and other knowledge bases). These two areas of expansion are still a work in progress.
This UMBEL version 1.20 marks the first expression of the Attributes Ontology. While we have organized the attribute concepts that already existed (that is, those concepts that capture the descriptive data used to characterize instance records), some gaps remain in both UMBEL and the source Cyc. Using the new ontology to map against the properties in the DBpedia and schema.org vocabularies is the next priority. These direct use cases are needed to ground the ontology in important, real-world information markup systems. We will also be looking at linking to an existing units and measurements ontology such as QUDT. A series of releases over time will likely be needed to capture and test these uses.
The mapping to Wikipedia is now about 72% complete. While we are testing automated mapping mechanisms, because of its central role we also need to vet all UMBEL-Wikipedia mapping assignments. This effort is pointing out areas of UMBEL that are over-specified, under-specified, and sometimes duplicative. By placing UMBEL in an intermediate position between Cyc and Wikipedia we are finding differences and gaps on both ends, as well as gaps within UMBEL itself. Our goal is to get to a 100% coverage point with Wikipedia, and then to exercise the structure for machine learning and other tests against the KB. These efforts will enable us to enhance the semsets in UMBEL as well as to move toward multilingual versions. This effort, too, is still a work in progress.
Despite these desired enhancements, we are using all aspects of UMBEL and its mappings to both aid these expansions and to test the existing mappings and structure. These efforts are proving the virtuous circle of improvements that is at the heart of UMBEL’s purposes.
Where to Get UMBEL and Learn More
The UMBEL Web site provides various online tools and Web services for exploring and using UMBEL. The UMBEL GitHub site is where you can download the UMBEL Vocabulary or the UMBEL Reference Concept ontology, both under a Creative Commons Attribution 3.0 license. Other documents and backup are also available from that location.
Technical specifications for UMBEL and its various annexes are available from the UMBEL wiki site. You can also download a PDF version of the specifications from there. You are also welcome to participate on the UMBEL mailing list or LinkedIn group.