Posted:June 6, 2016

Rationales for Typology Designs in Knowledge Bases

Design is Aimed to Improve Computability

In the lead up to our most recent release of UMBEL, I began to describe our increasing reliance on the use of typologies. In this article, I’d like to expand on our reasons for this design and the benefits we see.

‘Typology’ is not a common term within semantic technologies, though it is used extensively in such fields as archaeology, urban planning, theology, linguistics, sociology, statistics, psychology, anthropology and others. In the base semantic technology language of RDF, a ‘type’ is what is used to declare an instance of a given class. This is in keeping with our usage, where an instance is a member of a type.

Strictly speaking, ‘typology’ is the study of types. However, as used within the fields noted, a ‘typology’ is the result of the classification of things according to their characteristics. As stated by Merriam Webster, a ‘typology’ is “a system used for putting things into groups according to how they are similar.” Though some have attempted to make learned distinctions between typologies and similar notions such as classifications or taxonomies [1], we think this idea of grouping by similarity is the best way to think of a typology.

In Structured Dynamics‘ usage as applied to UMBEL and elsewhere, we are less interested in the sense of ‘typology’ as comparisons across types and more interested in the classification of types that are closely related, what we have termed ‘SuperTypes’. In this classification, each of our SuperTypes gets its own typology. The idea of a SuperType, in fact, is exactly equivalent to a typology, wherein the multiple entity types with similar essences and characteristics are related to one another via a natural classification. I speak elsewhere how we actually go about making these distinctions of natural kinds [2].

In this article, I want to stand back from how a typology is constructed to deal more about their use and benefits. Below I discuss the evolution of our typology design, the benefits that accrue from the ‘typology’ approach, and then conclude with some of the application areas to which this design is most useful. All of this discussion is in the context of our broader efforts in KBAI, or knowledge-based artificial intelligence.

Evolution of the Design

I wish we could claim superior intelligence or foresight in how our typology design came about, but it was truthfully an evolution of needing to deal with pragmatic concerns in our use of UMBEL over the past near-decade. The typology design has arisen from the intersection of: 1) our efforts with SuperTypes, and creating a computable structure that uses powerful disjoint assertions; 2) an appreciation of the importance of entity types as a focus of knowledge base terminology; and 3) our efforts to segregate entities from other key constructs of knowledge bases, including attributes, relations and annotations. Though these insights may have resulted from serendipity and practical use, they have brought a new understanding of how best to organize knowledge bases for artificial intelligence uses.

The Initial Segreation into SuperTypes

We first introduced SuperTypes into UMBEL in 2009 [3]. The initiative arose because we observed about 90% of the concepts in UMBEL were disjoint from one another. Disjoint assertions are computationally efficient and help better organize a knowledge graph. To maximize these benefits we did both top-down and bottom-up testing to derive our first groupings of SuperTypes into 29 mostly disjoint types, with four non-disjoint (or cross-cutting and shared) groups [3]. Besides computational efficiency and its potential for logical operations, we also observed that these SuperTypes could also aid our ability to drive display widgets (such as being able to display geolocational types on maps).

All entity classes within a given SuperType are thus organized under the SuperType (ST) itself as the root. The classes within that ST are then organized hierarchically, with children classes having a subClassOf relation to their parent. By the time of UMBEL’s last release [4], this configuration had evolved into the following 31 largely disjoint SuperTypes, organized into 10 or so clusters or “dimensions”:

Constituents
	Natural Phenomena
	Area or Region
	Location or Place
	Shapes
	Forms
Situations
Time-related
	Activities
	Events
	Times
Natural Matter
	Atoms and Elements
	Natural Substances
	Chemistry
Organic Matter
	Organic Chemistry
	Biochemical Processes
Living Things
	Prokaryotes
	Protists & Fungus
	Plants
	Animals
	Diseases
Agents
	Persons
	Organizations
	Geopolitical
Artifacts
	Products
	Food or Drink
	Drugs
	Facilities
Information
	Audio Info
	Visual Info
	Written Info
	Structured Info
Social
	Finance & Economy
	Society

Current SuperType Structure of UMBEL

We also used the basis in SuperTypes to begin cleaving UMBEL into modules, with geolocational types being the first to be separated. We initially began splitting into modules as a way to handle UMBEL’s comparatively large size (~ 30K concepts). As we did so, however, we also observed that most of the SuperTypes could be also be segregated into modules. This architectural view and its implications were another reason leading to the eventual typology design.

A Broadening Appreciation for the Pervasiveness of Entity Types

The SuperType tagging and possible segregation of STs into individual modules led us to review other segregations and tags. Given that the SuperTypes were all geared to entities and entity types — and further represented about 90% of all concepts in UMBEL — we began to look at entities as a category with more care and attention. This analysis took us back to the beginnings of entity recognition and tagging in natural language processing. We saw the progression of understanding from named entities and just a few entity types, to the more recent efforts in so-called fine-grained entity recognition [5].

What was blatantly obvious, but which had been previously overlooked by us and other researchers investigating entity types, was that most knowledge graphs (or upper ontologies) were themselves made of largely entity types [5]. In retrospect, this should not be surprising. Most knowledge graphs deal with real things in the world, which, by definition, tend to be entities. Entities are the observable, often nameable, things in the world around us. And how we organize and refer to those entities — that is, the entity types — constitutes the bulk of the vocabulary for a knowledge graph.

We can see this progression of understanding moving from named entities and fine-grained entity types, all the way through to an entity typology — UMBEL’s SuperTypes — that then becomes the tie-in point for individual entities (the ABox):

Evolving Sophistication of Entity Types

The key transition is moving from the idea of discrete numbers of entity types to a system and design that supports continuous interoperability through an “accordion-like” typology structure.

The General Applicability of ‘Typing’ to All Aspects of Knowledge Bases

The “type-orientation” of a typology was also attractive because it offers a construct that can be applied to all other (non-entity) parts of the knowledge base. Actions can be typed; attributes can be typed; events can be typed; and relations can be typed. A mindset around natural kinds and types helps define the speculative grammar of KBAI (a topic of a next article). We can thus represent these overall structural components in a knowledge base as:

Typology View of a Knowledge Base

The shading is in reference to that which is external to the scope of UMBEL.

The intersection of these three factors — SuperTypes, an “accordion” design for continuous entity types, and overall typing of the knowledge base — set the basis for how to formalize an overall typology design.

Formalizing the Typology Design

We have first applied this basis to typologies for entities, based on the SuperTypes. Each SuperType becomes its own typology with natural boundaries and a hierarchical structure. No instances are allowed in the typology; only types.

Initial construction of the typology first gathers the relevant types (concepts) and automatically evaluates those concepts for orphans (unconnected concepts) and fragments (connected portions missing intermediary parental concepts). For the initial analysis, there are likely multiple roots, multiple fragments, and multiple orphans. We want to get to a point where there is a single root and all concepts in the typology are connected. Source knowledge bases are queried for the missing concepts and evaluated again in a recursive manner. Candidate placements are then written to CSV files and evaluated with various utilities, including crucially manual inspection and vetting. (Because the system bootstraps what is already known and structured in the system, it is important to build the structure with coherent concepts and relations.)

Once the overall candidate structure is completed, it is then analyzed against prior assignments in the knowledge base. ST disjoint analysis, coherent inferencing, and logical placement tests again prompt the creation of CSV files that may be viewed and evaluated with various utilities, but, again, ultimately manually vetted.

The objective of the build process is a fully connected typology that passes all coherency, consistency, completeness and logic tests. If errors are subsequently discovered, the build process must be run again with possible updates to the processing scripts. Upon acceptance, each new type added to a typology should pass a completeness threshold, including a definition, synonyms, guideline annotations, and connections. The completed typology may be written out in both RDF and CSV formats. (The current UMBEL and its typologies are available here.)

Integral to the design must be build, testing and maintenance routines, scripts, and documentation. Knowledge bases are inherently open world [6], which means that the entities and their relationships and characteristics are constantly growing and changing due to new knowledge underlying the domain at hand. Such continuous processing and keeping track of the tests, learnings and workflow steps place a real premium on literate programming, about which Fred Giasson, SD’s CTO, is now writing [7].

Because of the very focused nature of each typology (as set by its root), each typology can be easily incorporated or excised from a broader structure. Each typology is rather simple in scope and simple in structure, given its hierarchical nature. Each typology is readily maintained and built and tested. Typologies pose relatively small ontological commitments.

Benefits of the Design

The simple bounding and structure of the typology design makes each typology understandable merely by inspecting its structure. But the typologies can also be read into programs such as Protégé in order to inspect or check complete specifications and relationships.

Because each typology is (designed to be) coherent and consistent, new concepts or structures may be related to any part of its hierarchical design. This gives these typologies an “accordion-like” design, similar to the multiple levels and aggregation made possible by an accordion menu:

An ‘Accordion’ Design Accommodates Different Granularities

The combination of logical coherence with a flexible, accordion structure gives typologies a unique set of design benefits. Some have been mentioned before, but to recap they are:

Computable

Each type has a basis — ranging from attributes and characteristics to hierarchical placement and relationship to other types — that can inform computability and logic tests, potentially including neighbor concepts. Ensuring that type placements are accurate and meet these tests means that the now-placed types and their attributes may be used to test the placement and logic of subsequent candidates. The candidates need not be only internal typology types, but may also be used against external sources for classification, tagging or mapping.

Because the essential attributes or characteristics across typologies in an entire domain can differ broadly — such as entities v attributes, living v inanimate things, natural things v man-made things, ideas v physical objects, etc. — it is possible to make disjointedness assertions between entire groupings of natural classes. Disjoint assertions, combined with logical organization and inference, provide a typology design that lends itself to reasoning and tractability.

The internal process to create these typologies also has the beneficial effect of testing placements in the knowledge graph and identifying gaps in the structure as informed by fragments and orphans. This computability of the structure is its foundational benefit, since it determines the accuracy of the typology itself and drives all other uses and services.

Pluggable and Modular

Since each typology has a single root, it is readily plugged into or removed from the broader structure. This means the scale and scope of the overall system may be easily adjusted, and the existing structure may be used as a source for extensions (see next). Unlike more interconnected knowledge graphs (which can have many network linkages), typologies are organized strictly along these lines of shared attributes, which is both simpler and also provides an orthogonal means for investigating type-class membership.

Interoperable

The idea of nested, hierarchical types organized into broad branches of different typologies also provides a very flexible design for interoperating with a diversity of world views and degrees of specificity. A typology design, logically organized and placed into a consistent grounding of attributes, can readily interoperate with these different world views. So far, with UMBEL, this interoperable basis is limited to concepts and things, since only the entity typologies have been initially completed. But, once done, the typologies for attributes and relations will extend this basis to include full data interoperability of attribute:value pairs.

Extensible

A typology design for organizing entities can thus be visualized as a kind of accordion or squeezebox, expandable when detail requires, or collapsed to more coarse-grained when relating to broader views. The organization of entity types also has a different structure than the more graph-like organization of higher-level conceptual schema, or knowledge graphs. In the cases of broad knowledge bases, such as UMBEL or Wikipedia, where 70 percent or more of the overall schema is related to entity types, more attention can now be devoted to aspects of concepts or relations.

Each class within the typology can become a tie-in point for external information, providing a collapsible or expandable scaffolding (the ‘accordion’ design). Via inferencing, multiple external sources may be related to the same typology, even though at different levels of specificity. Further, very detailed class structures can also be accommodated in this design for domain-specific purposes. Moreover, because of the single tie-in point for each typology at its root, it is also possible to swap out entire typology structures at once, should design needs require this flexibility.

Testable and Maintainable

The only sane way to tackle knowledge bases at these structural levels is to seek consistent design patterns that are easier to test, maintain and update. Open world systems must embrace repeatable and largely automated workflow processes, plus a commitment to timely updates, to deal with the constant, underlying change in knowledge.

Listing of Broad Application Areas

Some of the more evident application areas for this design — and in keeping with current client and development activities for Structured Dynamics — are the following:

Domain extension — the existing typologies and their structure provide a ready basis for adding domain details and extensions;
Tagging — there are many varieties of tagging approaches that may be driven from these structures, including, with the logical relationships and inferencing, ontology-based information tagging;
Classification — the richness of the typology structures means that any type across all typologies may be a possible classification assignment when evaluating external content, if the overall system embracing the typologies is itself coherent;
Interoperating datasets — the design is based on interoperating concepts and datasets, and provides more semantic and inferential means for establishing MDM systems;
Machine learning (ML) training — the real driver behind this design is lowering the costs for supervised machine learning via more automated and cost-effective means of establishing positive and negative training sets. Further, the system’s feature richness (see next) lends itself to unsupervised ML techniques as well; and
Rich feature set — a design geared from the get-go to emphasize and expose meaningful knowledge base features [8] perhaps opens up many new fruitful avenues for machine learning and other AI. More expressed structure may help in the interpretability of latent feature layers in deep learning. In any case, more and coherent structure with testability can only be goodness for KBAI going forward.

One Building Block Among Many

The progressions and learning from the above were motivated by the benefits that could be seen with each structural change. Over nearly a decade, as we tried new things, structured more things, we discovered more things and adapted our UMBEL design accordingly. The benefits we see from this learning are not just additive to benefits that might be obtained by other means, but they are systemic. The ability to make knowledge bases computable — while simultaneously increasing the features space for training machine learners — at much lower cost should be a keystone enabler at this particular point in AI’s development. Lowering the costs of creating vetted training sets is one way to improve this process.

Systems that can improve systems always have more leverage than individual innovations. The typology design outlined above is the result of the classification of things according to their shared attributes and essences. The idea is that the world is divided into real, discontinuous and immutable ‘kinds’. Expressed another way, in statistics, typology is a composite measure that involves the classification of observations in terms of their attributes on multiple variables. In the context of a global KB such as Wikipedia, about 25,000 entity types are sufficient to provide a home for the millions of individual articles in the system.

As our next article will discuss, Charles Sanders Peirce’s consistent belief that the real world can be logically conceived and organized provides guidance for how we can continue to structure our knowledge bases into computable form. We now have a coherent base for treating types and natural classes as an essential component to that thinking. These insights are but one part of the KB innovations suggested by Peirce’s work.

[1] See, for example, Alberto Marradi, 1990. “Classification, Typology, Taxonomy“, Quality & Quantity 24, no. 2 (1990): 129-157.

[2] M.K. Bergman, 2015. “‘Natural Classes’ in the Knowledge Web,” AI3:::Adaptive Information blog, July 13, 2015.

[3] M.K. Bergman, 2009. “‘SuperTypes’ and Logical Segmentation of Instances,” AI3:::Adaptive Information blog, September 2, 2009.

[4] umbel.org, “New, Major Upgrade of UMBEL Released,” UMBEL press release, May 10, 2016 (for UMBEL v. 1.50 release)

[5] M.K. Bergman, 2016. “How Fine Grained Can Entity Types Get?,” AI3:::Adaptive Information blog, March 8, 2016.

[6] M.K. Bergman, 2012. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Information blog, December 21, 2009.

[7] See http://fgiasson.com/blog/index.php/category/programming/literate-programming/; the last update was, Frédérick Giasson, 2016. “Creating and Running Unit Tests Directly in Source Files with Org-mode“, fgiasson.com/blog, May 30, 2016.

[8] M.K. Bergman, 2015. “A (Partial) Taxonomy of Machine Learning Features,” AI3:::Adaptive Information blog, November 23, 2015.

Posted:June 1, 2016

Squatty Potty and Millenials

Checking Out Where Things Sit from Upwind

I have distinct memories of drinking booze and swimming in the faculty pool, chasing out the National Guard, and facing Capt. Joel Honey in downtown Santa Barbara. UCSB and Isla Vista, over multiple occasions, were but one of many focal points for anti-war demonstrations during Vietnam. There was actually more than one summer of love, and it was hard to cut my hair. (It was even harder to grow a beard.) I even lived in a treehouse for a summer. My generation was politically active, engaged, high, and mostly free from STDs. Peace, love, dope.

We were empowered with new political and social consciences and better ideas and ideals than our parents. The whole materialistic thing was a drag. We were hip, we fed our heads, and we burned our tighty whiteys (or bras, depending). It seems so innocent now that we could actually stick out our thumbs and hitchhike across the country. Everything was actually pretty cool, except for crabs.

Then, this past Christmas, my kids and their partners (used to be what we called significant others), introduced my wife (partner) and me to a new perspective: Squatty Potty. We howled like hyenas watching the video, but it was only a laugh at that time.

Now, my kids (with their partners), all heading to get their medical degrees, and so should know something about bodily stuff, have raised again the better angle-of-attack of Squatty Potty. They are faithful adherents (or exherents, depending on your perspective). Only now the sentiment is not laughs; it is reverence. It turns out that my generation, and our parental generation of squares, and perhaps for generations, have not really known how to shit. We may have been wasting TP, too (but that is largely unspoken).

It was this second reference to Squatty Potty that caused me in repose to wonder: What was it with these millenials? Why aren’t they hip like we used to be? Why do I feel so constipated?

See, these millenials are constantly talking about local organic food, clean air, keeping the planet in balance, a clean colon. In just a generation, we have advanced from politics to emissions, global and personal. I actually think this is progress. We have embraced the global, but turned it inward. We can monitor our progress to a polyp-free alimentary canal!

So, there is a reason that Millenials have a smug look when you see them on the street. They know how to breathe, eat and shit like no generation before them. We have progressed as a society from one of ideas and political protest, to how to feed and void ourselves. Still, I think getting at the basics is a good idea. This generation coming down the chute knows what it takes to keep a clear pathway ahead.

Posted:May 11, 2016

New, Major Upgrade of UMBEL Released

Version 1.50 Fully Embraces a Typology Design, Gets Other Computability Improvements

The year since the last major release of UMBEL (Upper Mapping and Binding Exchange Layer) has been spent in a significant re-think of how the system is organized. Four years ago, in version 1.05, we began to split UMBEL into a core and a series of swappable modules. The first module adopted was in geographical information; the second was in attributes. This design served us well, but it was becoming apparent that we were on a path of multiple modules. Each of UMBEL’s major so-called ‘SuperTypes‘ — that is, major cleavages of the overall UMBEL structure that are largely disjoint from one another, such as between Animals and Facilities — were amenable to the module design. This across-the-board potential cleavage of the UMBEL system caused us to stand back and question whether a module design alone was the best approach. Ultimately, after much thought and testing, we adopted instead a typology design that brought additional benefits beyond simple modularity.

Today, we are pleased to announce the release of these efforts in UMBEL version 1.50. Besides standard release notes, this article discusses this new typology design, and explains its uses and benefits.

Basic UMBEL Background

The Web and enterprises in general are characterized by growing, diverse and distributed information sources and data. Some of this information resides in structured databases; some resides in schema, standards, metadata, specifications and semi-structured sources; and some resides in general text or media where the content meaning is buried in unstructured form. Given these huge amounts of information, how can one bring together what subsets are relevant? And, then for candidate material that does appear relevant, how can it be usefully combined or related given its diversity? In short, how does one go about actually combining diverse information to make it interoperable and coherent?

UMBEL thus has two broad purposes. UMBEL’s first purpose is to provide a general vocabulary of classes and predicates for describing and mapping domain ontologies, with the specific aim of promoting interoperability with external datasets and domains. UMBEL’s second purpose is to provide a coherent framework of reference subjects and topics for grounding relevant Web-accessible content. UMBEL presently has about 34,000 of these reference concepts drawn from the Cyc knowledge base, organized into 31 mostly disjoint SuperTypes.

The grounding of information mapped by UMBEL occurs by common reference to the permanent URIs (identifiers) for UMBEL’s concepts. The connections within the UMBEL upper ontology enable concepts from sources at different levels of abstraction or specificity to be logically related. Since UMBEL is an open source extract of the OpenCyc knowledge base, it can also take advantage of the reasoning capabilities within Cyc.

Diagram showing linked data datasets. UMBEL is near the hub, below and to the right of the central DBpedia.

UMBEL’s vocabulary is designed to recognize that different sources of information have different contexts and different structures, and meaningful connections between sources are not always exact. UMBEL’s 34,000 reference concepts form a knowledge graph of subject nodes that may be related to external classes and individuals (instances and entities). Via this coherent structure, we gain some important benefits:

Mapping to other ontologies — disparate and heterogeneous datasets and ontologies may be related to one another by mapping to the UMBEL structure
A scaffolding for domain ontologies — more specific domain ontologies can be made interoperable by using the UMBEL vocabulary and tieing their more general concepts into the UMBEL structure
Inferencing — the UMBEL reference concept structure is coherent and designed for inferencing, which supports better semantic search and look-ups
Semantic tagging — UMBEL, and ontologies mapped to it, can be used as input bases to ontology-based information extraction (OBIE) for tagging text or documents; UMBEL’s “semsets” broaden these matches and can be used across languages
Linked data mining — via the reference ontology, direct and related concepts may be retrieved and mined and then related to one another
Creating computable knowledge bases — with complete mappings to key portions of a knowledge base, say, for Wikipedia articles, it is possible to use the UMBEL graph structure to create a computable knowledge source, with follow-on benefits in artificial intelligence and KB testing and improvements, and
Categorizing instances and named entities — UMBEL can bring a consistent framework for typing entities and relating their descriptive attributes to one another.

UMBEL is written in the semantic Web languages of SKOS and OWL 2. It is a class structure used in linked data, along with other reference ontologies. Besides data integration, UMBEL has been used to aid concept search, concept definitions, query ranking, ontology integration, and ontology consistency checking. It has also been used to build large ontologies and for online question answering systems [1].

Including OpenCyc, UMBEL has about 65,000 formal mappings to DBpedia, PROTON, GeoNames, and schema.org, and provides linkages to more than 2 million Wikipedia pages (English version). All of its reference concepts and mappings are organized under a hierarchy of 31 different SuperTypes, which are mostly disjoint from one another. Development of UMBEL began in 2007. UMBEL was first released in July 2008. Version 1.00 was released in February 2011.

Summary of Version 1.50 Changes

These are the principal changes between the last public release, version 1.20, and this version 1.50. In summary, these changes include:

Removed all instance or individual listings from UMBEL; this change does NOT affect the punning used in UMBEL’s design (see Metamodeling in Domain Ontologies)
Re-aligned the SuperTypes to better support computability of the UMBEL graph and its resulting disjointedness
These SuperTypes were eliminated with concepts re-assigned: Earthscape, Extraterrestrial, Notations and Numbers
These new SuperTypes were introduced: AreaRegion, AtomsElements, BiologicalProcesses, Forms, LocationPlaces, and OrganicChemistry, with logically reasoned assignments of RefConcepts
The Shapes SuperType is a new ST that is inherently non-disjoint because it is shared with about half of the RefConcepts
The Situations is an important ST, overlooked in prior efforts, that helps better establish context for Activities and Events
Made re-alignments in UMBEL’s upper structure and introduced additional upper-level categories to better accommodate these refinements in SuperTypes
A typology was created for each of the resulting 31 disjoint STs, which enabled missing concepts to be identified and added and to better organize the concepts within each given ST
The broad adoption of the typology design for all of the (disjoint) SuperTypes also meant that prior module efforts, specifically Geo and Attributes, could now be made general to all of UMBEL. This re-integration also enabled us to retire these older modules without affecting functionality
The tests and refinements necessary to derive this design caused us to create flexible build and testing scripts, documented via literate programming (using Clojure)
Updated all mappings to DBpedia, Wikipedia, and schema.org
Incorporated donated mappings to five additional LOV vocabularies [2]
Tested the UMBEL structure for consistency and coherence
Updated all prior UMBEL documentation
Expanded and updated the UMBEL.org Web site, with access and demos of UMBEL.

UMBEL’s SuperTypes

The re-organizations noted above have resulted in some minor changes to the SuperTypes and how they are organized. These changes have made UMBEL more computable with a higher degree of disjointedness between SuperTypes. (Note, there are also organizational SuperTypes that work largely to aid the top levels of the knowledge graph, but are explicitly designed to NOT be disjoint. Important SuperTypes in this category include Abstractions, Attributes, Topics, Concepts, etc. These SuperTypes are not listed below.)

UMBEL thus now has 31 largely disjoint SuperTypes, organized into 10 or so clusters or “dimensions”:

Constituents
	Natural Phenomena
	Area or Region
	Location or Place
	Shapes
	Forms
Situations
Time-related
	Activities
	Events
	Times
Natural Matter
	Atoms and Elements
	Natural Substances
	Chemistry
Organic Matter
	Organic Chemistry
	Biochemical Processes
Living Things
	Prokaryotes
	Protists & Fungus
	Plants
	Animals
	Diseases
Agents
	Persons
	Organizations
	Geopolitical
Artifacts
	Products
	Food or Drink
	Drugs
	Facilities
Information
	Audio Info
	Visual Info
	Written Info
	Structured Info
Social
	Finance & Economy
	Society

These disjoint SuperTypes provide the basis for the typology design described next.

The Typology Design

After a few years of working with SuperTypes it became apparent each SuperType could become its own “module”, with its own boundaries and hierarchical structure. Since across the UMBEL structure nearly 90% of the reference concepts are themselves entity classes, if these are properly organized, we can achieve a maximum of disjointness, modularity, and reasoning efficiency. Our early experience with modules pointed the way to a design for each SuperType that was as distinct and disjoint from other STs as possible. And, through a logical design of natural classes [3] for the entities in that ST, we could achieve a flexible, ‘accordion-like’ design that provides entity tie-in points from the general to the specific for each given SuperType. The design is effective for being able to interoperate across both fine-grained and coarse-grained datasets. For specific domains, the same design approach allows even finer-grained domain concepts to be effectively integrated.

All entity classes within a given SuperType are thus organized under the SuperType itself as the root. The classes within that ST are then organized hierarchically, with children classes having a subClassOf relation to their parent. Each class within the typology can become a tie-in point for external information, providing a collapsible or expandable scaffolding (the ‘accordion’ design). Via inferencing, multiple external sources may be related to the same typology, even though at different levels of specificity. Further, very detailed class structures can also be accommodated in this design for domain-specific purposes. Moreover, because of the single tie-in point for each typology at its root, it is also possible to swap out entire typology structures at once, should design needs require this flexibility.

We have thus generalized the earlier module design to where every (mostly) disjoint SuperType now has its own separate typology structure. The typologies provide the flexible lattice for tieing external content together at various levels of specificity. Further, the STs and their typologies may be removed or swapped out at will to deal with specific domain needs. The design also dovetails nicely with UMBEL’s build and testing scripts. Indeed, the evolution of these scripts via literate programming has also been a reinforcing driver for being able to test and refine the complete ST and typologies structure.

Still a Work in Progress

Though UMBEL retains its same mission as when the system was first formulated nearly a decade ago, we also see its role expanding. The two key areas of expansion are in UMBEL’s use to model and map instance data attributes and in acting as a computable overlay for Wikipedia (and other knowledge bases). These two areas of expansion are still a work in progress.

The mapping to Wikipedia is now about 85% complete. While we are testing automated mapping mechanisms, because of its central role we also need to vet all UMBEL-Wikipedia mapping assignments. This effort is pointing out areas of UMBEL that are over-specified, under-specified, and sometimes duplicative or in error. Our goal is to get to a 100% coverage point with Wikipedia, and then to exercise the structure for machine learning and other tests against the KB. These efforts will enable us to enhance the semsets in UMBEL as well as to move toward multilingual versions. This effort, too, is still a work in progress.

Despite these desired enhancements, we are using all aspects of UMBEL and its mappings to both aid these expansions and to test the existing mappings and structure. These efforts are proving the virtuous circle of improvements that is at the heart of UMBEL’s purposes.

Where to Get UMBEL and Learn More

The UMBEL Web site provides various online tools and Web services for exploring and using UMBEL. The UMBEL GitHub site is where you can download the UMBEL Vocabulary or the UMBEL Reference Concept ontology, both under a Creative Commons Attribution 3.0 license. Other documents and backup are also available from that location.

Technical specifications for UMBEL and its various annexes are available from the UMBEL wiki site. You can also download a PDF version of the specifications from there. You are also welcomed to participate on the UMBEL mailing list or LinkedIn group.

[1] See further https://en.wikipedia.org/wiki/UMBEL.

[2] Courtesy of Jana Vataščinová (University of Economics, Prague) and Ondřej Zamazal (University of Economics, Prague, COSOL project).

[3] See, for example, M.K. Bergman, 2015. “‘Natural Classes’ in the Knowledge Web,” AI3:::Adaptive Information blog, July 13, 2015.

Posted:April 5, 2016

‘Deep Graphs’: A New Framework for Network Analysis

New Method Appears Promising for Machine Learning, Feature Generation

An exciting new network analysis framework was published today. The paper, Deep Graphs – A General Framework to Represent and Analyze Heterogeneous Complex Systems Across Scales, presents the background information and derivation of methods applied to this new approach for analyzing networks [1]. The authors of the paper, Dominik Traxl, Niklas Boers and Jürgen Kurths, also released the open source DeepGraph network analysis package, written in Python, for undertaking and conducting the analysis. Detailed online documentation accompanies the entire package.

The basic idea behind Deep Graphs is to segregate graph nodes and edges into types, which form supernodes and superedges, respectively. These grouped types then allow the graph to be partitioned into lattices, which can be intersected (combinations of nodes and edges) into representing deeper graph structures embedded in the initial graph. The method can be applied to a graph representation of anything, since the approach is grounded in the graph primitives of nodes and edges using a multi-layer network (MLN) representation.

AI3 Pulse These deeper graph structures can themselves be used as new features for machine learning or other applications. A deep graph, which the authors formally define as a geometric partition lattice of the source graph, conserves the original information in the graph and allows it to be redistributed to the supernodes and superedges. Intersections of these may surface potentially interesting partitions of the graph that deserve their own analysis.

The examples the authors present show the suitability of the method for time-series data, using precipitation patterns in South America. However, as noted, the method applies to virtually any data that can be representated as a graph.

Though weighted graphs and other techniques have been used, in part, for portions of this kind of analysis in the past, this appears to be the first generalized method applicable to the broadest ways to aggregate and represent graph information. The properties associated with a given node may similarly be representated and aggregated. The aggregation of attributes may provide an additional means for mapping and relating external datasets to one another.

There are many aspects of this approach that intrigue us here at Structured Dynamics. First, we are always interested in network and graph analytical techniques, since all of our source schema are represented as knowledge graphs. Second, our specific approach to knowledge-based artificial intelligence places a strong emphasis on types and typologies for organizing entities (nodes and event relations) and we also separately segregate attribute property information [2]. And, last, finding embedded superstructures within the source graphs should also work to enhance the feature sets available for supervised machine learning.

We will later post our experiences in working with this promising framework.

[1] Dominik Traxl, Niklas Boers and Jürgen Kurths, “Deep Graphs – A General Framework to Represent and Analyze Heterogeneous Complex Systems Across Scales“, arXiv:1604.00971, April 5, 2016. To be published in Chaos: An Interdisciplinary Journal of Nonlinear Science.

[2] See M. K. Bergman, 2014. “Knowledge-based Artificial Intelligence,” from AI3:::Adaptive Information blog, November 17, 2014.

Posted:March 28, 2016

Withstanding the Test of Time

Long-lost Global Warming Paper is Still Pretty Good

My first professional job was being assistant director and then project director for a fifty-year look at the future of coal use by the US Environmental Protection Agency. The effort, called the Coal Technology Assessment (CTA), was started under the Carter Administration in the late 1970s, and then completed after Reagan took office in 1981. That era also spawned the Congressional Office of Technology Assessment. Trying to understand and forecast technological change was a big deal at that time.

We produced many, many reports from the CTA program, some of which were never published because of politics and whether they were at odds or not with official policies of one or the other administration. Nonetheless, we did publish quite a few reports. Perhaps it is the sweetness of memory, but I also recollect we did a pretty good job. Now that more than 35 years have passed, it is possible to see whether we did a good job or not in our half-century forecasts.

The CTA program was the first to publish an official position of EPA on global warming [1], which we also backed up with a more formal academic paper [2]. I have thought much of that paper on occasion over the years, but I did not have a copy myself and only had a memory, but not hard copy, of the paper.

Last week, however, I was contacted by a post-doctoral researcher in Europe trying to track down early findings and recollections of some of the earliest efforts on global climate change. She had a copy of our early paper and was kind enough to send me a copy. I have since been able to find other copies online [2].

In reading over the paper again, I am struck by two things. First, the paper is pretty good, and still captures (IMO) the uncertainty of the science and how to conduct meaningful policy in the face of that uncertainty. And, second, but less positive, is the sense of how little truly has gotten done in the intervening decades. This same sense of déjà vu all over again applies to many of the advanced energy technologies — such as fuel cells, photovoltaics, and passive solar construction — we were touting at that time.

Of course, my own career has moved substantially from energy technologies and policy to a different one of knowledge representation and artificial intelligence. But, it is kind of cool to look back on the passions of youth, and to see that my efforts were not totally silly. It is also kind of depressing to see how little has really changed in nearly four decades.

[1] M.K. Bergman, 1980. “Atmospheric Pollution: Carbon Dioxide,” Environmental Outlook — 1980, Strategic Analysis Group, U.S. Environmental Protection Agency, EPA 600/8 80 003, July 1980, pp. 225-261.

[1] Kan Chen, Richard C. Winter, and Michael K. Bergman, 1980. “Carbon dioxide from fossil fuels: Adapting to uncertainty.” Energy Policy 8, no. 4 (1980): 318-330.

Main Links

Search

Author: Mike Bergman