Posted: September 27, 2016

KR Models Need to Represent Knowledge via Context and Perspective

What does the idea of knowledge mean to you?

The entry for knowledge on Wikipedia says:

“Knowledge is a familiarity, awareness or understanding of someone or something, such as facts, information, descriptions, or skills, which is acquired through experience or education by perceiving, discovering, or learning. Knowledge can refer to a theoretical or practical understanding of a subject. It can be implicit (as with practical skill or expertise) or explicit (as with the theoretical understanding of a subject); it can be more or less formal or systematic.” [1]

OK, that’s a lot of words. Rather than parse its specifics, let’s look at the basis of “knowledge” using a variety of simple examples.

Perplexing Perspectives and Confusing Contexts

Let’s take as an example the statement, the sky is blue. We can accept this as a factual statement (thus, assumed knowledge). But we also know that the sky might be dark or black if it is night, or gray if it is cloudy. Indeed, when we hear the statement that the sky is blue, if we believe the source or can see the sky for ourselves, then we can readily infer that the observation is occurring during daylight, under a clear sky. Our acceptance of an assertion as factual or true carries with it the implications of its related contexts. On the other hand, were the simple statement to be le ciel est bleu, and if we did not know French, we would not know what to make of the statement, true or false, with context or not, even if all of the assertions are still correct.

Rabbit or Duck?

This simple example carries with it two profound observations. First, context helps to determine whether or not we believe a given statement, and, if we believe it, what the related context implied by the statement might be. Second, this information is conveyed to us via symbols — in this case, the English language, but the point applies to all human languages, as well as to artificial and formal notations such as mathematics — which we may or may not be able to interpret correctly. If I am monolingual in English and I see French statements, I do not know what the symbols mean.

Knowledge is that which is “true” in a given context or perspective. Is it a duck, or is it a rabbit? Knowledge may reside solely in our own minds, and not be part of “common knowledge”. But, ultimately, even personal beliefs not held by others only become “knowledge” that we can rely upon in our discourse once others have “acknowledged” the truth. Forward-looking thinkers like Copernicus or Galileo or Einstein may have understood something in their own minds not yet shared by others, but we do not “acknowledge” those understandings as knowledge until we can share and discuss the insight. (That is, what scientists would call independent verification.) In this manner, knowledge, like language and symbol-creation, is inherently a social phenomenon. If I coin a new word, but no one else understands what I am saying, that is not part of knowledge; that is gibberish.

OK, let’s take another example. This time we’ll take the simple case of a flower. What this panel of images shows is how a composite flower may be seen first by humans, then bees, and then butterflies:

How Humans, Bees and Butterflies See A Daisy, courtesy of Photography of the Invisible World


In this example [2], we are highlighting the fact that different insects see objects with different wavelengths of light than we (humans) do. Bees see much more in the ultraviolet spectrum. The daisy flower knows how to attract this most important pollinator.

In a different example focused on human perception alone, look at these two panels:


Color Blind, on left; What is ‘bank’?, on right

The left panel shows us how the color-blind person “sees”, with reds and yellows washed out [3].

Or, let’s take another example, in this case the black-and-white word ‘bank’. We can see this word, and if we speak English, even recognize it, but what does this symbol mean? A financial institution? The shore of a river? Turning an airplane? A kind of pool shot? Tending a fire for the evening?

In all of these examples, there is an actual object that is the focus of attention. But what we “know” about this object depends on what we perceive or understand and who or what is doing the perceiving and the understanding. We can never fully “know” the object because we can never encompass all perspectives and interpretations.

KR Models Need to Represent Knowledge via Context and Perspective

Every knowledge structure used for knowledge representation (KR) or knowledge-based artificial intelligence (KBAI) needs to be governed by some form of conceptual schema. In the semantic Web space, such schema are known as ontologies, since they attempt to capture the nature or being (Greek ὄντως, or ontós) of the knowledge domain at hand. Because the word ‘ontology’ is a bit intimidating, a better variant has proven to be the knowledge graph (because all semantic ontologies take the structural form of a graph). In Cognonto‘s KBAI efforts, we tend to use the terms ontology and knowledge graph interchangeably.
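The graph form of such ontologies can be sketched with plain subject-predicate-object triples, in the spirit of RDF. A minimal sketch follows; all concept names here are invented for illustration, not actual KBpedia terms:

```python
# A tiny knowledge graph as a set of subject-predicate-object triples.
# Every subject and object is a node; every predicate is a labeled edge.
triples = {
    ("Daisy", "subClassOf", "Flower"),
    ("Flower", "subClassOf", "Plant"),
    ("Bee", "pollinates", "Daisy"),
}

def nodes(graph):
    """Nodes are every subject and object appearing in the graph."""
    return {s for s, _, _ in graph} | {o for _, _, o in graph}

def edges_from(graph, subject):
    """Outgoing edges of a node, as (predicate, object) pairs."""
    return {(p, o) for s, p, o in graph if s == subject}

print(sorted(nodes(triples)))        # ['Bee', 'Daisy', 'Flower', 'Plant']
print(edges_from(triples, "Daisy"))  # {('subClassOf', 'Flower')}
```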

In general knowledge domains, such schema are also known as upper ontologies. However, one of the first things we see with existing ontologies is that they tend to be organized around a single, dyadic dimension, even though guided by a diversity of conceptual approaches. In the venerable Cyc knowledge structure, one of the major divisions is between what is tangible and what is intangible. In BFO, the Basic Formal Ontology, the split is between a “snapshot” view of the world (continuant) and its entities versus a “spanning” view that is explicit about changes in things over time (occurrent). Other upper ontologies have different dyadic splits, such as abstract v. physical, perdurant v. endurant, dependent v. independent, particulars v. universals, or determinate v. indeterminate [4]. I’m sure there are others.

Ontologies are designed for specific purposes, and the bases for these splits in other ontologies have their rationales and uses. But in Cognonto’s case of needing to design an ontology whose specific purpose is knowledge representation, we need to explicitly model the nature of knowledge. Knowledge is not black and white, nor is it shades of gray along a single dimension. Knowledge is an incredibly rich construct intimately related to context and perspective. The minimum cardinality that can provide such perspective is three.

Three Aspects of a Sign

If we return to the examples that began this article, we begin to see the interaction of three separate things. We have the actual thing itself, be it an object or a phenomenon. It is what it is. Then, we have a way that that thing is conveyed or represented. It might be an image, a sound, a perception, a finger pointing at it, or a symbol (or combination of symbols such as a description) of it. Then we have how that representation is perceived. It is in the interplay of these three separate things that something is “understood” or becomes “knowledge” (that is, a sign). This triadic view of the world was first articulated by Charles Sanders Peirce (1839-1914) (pronounced “purse”), the great American logician, philosopher and polymath.

Peirce’s logic of signs in fact is a taxonomy of sign relations, in which signs get reified and expanded via still further signs, ultimately leading to communication, understanding and an approximation of “canonical” truth. Peirce saw the scientific method as itself an example of this process [5].

A given sign is a representation amongst the triad of the sign itself (which Peirce called a representamen, the actual signifying item that stands in a well-defined kind of relation to the two other things), its object and its interpretant. The object is the actual thing itself. The interpretant is how the agent or the perceiver of the sign understands and interprets the sign. Depending on the context and use, a sign (or representamen) may be either an icon (a likeness), an indicator or index (a pointer or physical linkage to the object) or a symbol (understood convention that represents the object, such as a word or other meaningful signifier).

An interpretant in its barest form is a sign’s meaning, implication, or ramification. For a sign to be effective, it must represent an object in such a way that it is understood and used again. This makes the assignment and use of signs a community process of understanding and acceptance [6], as well as a truth-verifying exercise of testing and confirming accepted associations (such as the meanings of words or symbols).

The key aspect of signs for Peirce, though, is the ongoing process of interpretation and reference to further signs, a process he called semiosis. A sign of an object leads to interpretants, which, as signs, then lead to further interpretants.
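As a loose illustration (not Peirce’s own formalism), the sign triad and one step of semiosis can be modeled as a small data structure, where the interpretant of one sign becomes the representamen of the next:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sign:
    representamen: str  # the signifying item: an icon, index, or symbol
    obj: str            # the actual object the sign stands for
    interpretant: str   # how a perceiver understands the sign

def semiose(sign: Sign, new_interpretant: str) -> Sign:
    """One step of semiosis: the interpretant becomes the representamen
    of a further sign about the same object."""
    return Sign(sign.interpretant, sign.obj, new_interpretant)

s1 = Sign("the word 'bank'", "a financial institution",
          "a place that holds money")
s2 = semiose(s1, "an institution that takes deposits and makes loans")
print(s2.representamen)  # a place that holds money
```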

Relation to Cognonto’s KBpedia Knowledge Ontology

The essence of knowledge is that it is ever-growing and expandable. New insights bring new relations and new truths. The structures we use to represent this knowledge must themselves adapt and reflect the best of our current, testable understandings. Peirce saw the trichotomous parts of his sign logic as the fewest “decomposables” needed to model the real world; we would call these “primitives” in modern terminology. Robert Burch has called Peirce’s ideas of “indecomposability” the ‘Reduction Thesis’ [7]. The basic thesis is that ternary relations suffice to construct any and all arbitrary relations, but that not all relations can be constructed from unary and binary relations alone. Threes are thus the irreducible basis for capturing knowledge.
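One half of the Reduction Thesis, that ternary relations suffice, can be sketched constructively: any n-ary relation can be re-expressed as a chain of ternary relations linked by fresh intermediate nodes. The function and names below are invented for the example:

```python
import itertools

_fresh = itertools.count()

def to_ternary(args):
    """Express an n-ary relation (n > 3) as a chain of ternary relations
    linked by fresh intermediate nodes. Relations of arity 3 or less are
    returned unchanged."""
    args = list(args)
    if len(args) <= 3:
        return [tuple(args)]
    node = f"_:r{next(_fresh)}"  # fresh intermediate node
    return [(args[0], args[1], node)] + to_ternary([node] + args[2:])

# A 4-ary 'gives' relation: giver, gift, recipient, time (invented).
chain = to_ternary(["Alice", "book", "Bob", "Tuesday"])
print(chain)  # two ternary relations sharing one fresh node
```

The other half, that unary and binary relations alone cannot do this, is the deeper claim; Burch’s monograph [7] gives the formal argument.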

With its express purpose to provide a sound basis for modeling knowledge, essential to knowledge-based artificial intelligence, Cognonto’s governing schema, the KBpedia Knowledge Ontology (KKO), is the first knowledge graph to explicitly embrace this triadic logic. Later articles will discuss KKO in much greater detail. Peirce’s logic of semiosis and his three universal categories provide the missing perspective of classing and categorizing the world around us. The irreducible truth of ‘threes’ is the essential foundation for representing knowledge and language.


[1] Similar senses are conveyed by the Wiktionary definition of knowledge.
[4] See, for example, Ludger Jansen, 2008. “Categories: The Top-level Ontology,” Applied ontology: An introduction (2008): 173-196, and Nicola Guarino, 1997. “Some Organizing Principles For A Unified Top-Level Ontology,” National Research Council, LADSEB-CNR Int. Report, V3.0, August 1997.
[5] M.K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?“, AI3:::Adaptive Information blog, January 24, 2012.
[6] See further Catherine Legg, 2010. “Pragmaticism on the Semantic Web,” in Bergman, M., Paavola, S., Pietarinen, A.-V., & Rydenfelt, H. eds., Ideas in Action: Proceedings of the Applying Peirce Conference, pp. 173–188. Nordic Studies in Pragmatism 1. Helsinki: Nordic Pragmatism Network.
[7] See Robert Burch, 1991. A Peircean Reduction Thesis: The Foundations of Topological Logic, Texas Tech University Press, Lubbock, TX. Peirce’s reduction thesis is never stated explicitly by Peirce, but is alluded to in numerous snippets.
Posted: September 20, 2016


Six Large-scale Knowledge Bases Interact to Help Automate Machine Learning Setup

Fred Giasson and I today announced the unveiling of a new venture, Cognonto. We have been working on this venture very hard for at least the past two years. But, frankly, Cognonto represents bringing into focus ideas and latent opportunities that we have been seeing for much, much longer.

The fundamental vision for Cognonto is to organize the information in large-scale knowledge bases so as to efficiently support knowledge-based artificial intelligence (KBAI), a topic I have been writing about much over the past year. Once such a vision is articulated, the threads necessary to bring it to fruition come into view quickly. First, of course, the maximum amount of information possible in the source knowledge bases needs to be made digital and represented with semantic Web technologies such as RDF and OWL. Second, since no source alone is adequate, the contributing knowledge bases need to be connected and made to work with one another in a logical and consistent manner. And, third, an overall schema needs to be put in place that is coherent and geared specifically to knowledge representation and machine learning.
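The second thread, making the contributing knowledge bases work with one another, depends on linking identifiers that denote the same entity across sources. A minimal sketch, using invented identifiers and owl:sameAs-style equivalence pairs:

```python
def merge_equivalences(pairs):
    """Union-find over owl:sameAs-style equivalence pairs; returns a map
    from each identifier to a canonical representative."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    return {x: find(x) for x in parent}

same_as = [("wikidata:Q1", "dbpedia:Universe"),
           ("dbpedia:Universe", "opencyc:Universe")]
canon = merge_equivalences(same_as)
print(canon["wikidata:Q1"] == canon["opencyc:Universe"])  # True
```

Equivalence is transitive, so the two stated pairs are enough to link all three identifiers to one canonical node.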

The result from achieving these aims is to greatly lower the time and cost to prepare inputs to, and improve the accuracy in, machine learning.  This result applies particularly to supervised machine learning for knowledge-related applications. But, if achieved, the resulting rich structure and extensive features also lend themselves to unsupervised and deep learning, as well as to provide a powerful substrate for schema mapping and data interoperability.

Today, we’ve now made sufficient progress on this vision to enable us to release Cognonto, and the KBpedia knowledge structure at its core. Combined with local data and schema, there is much we can do with the system. But another exciting part is that the sky is the limit in terms of honing the structure, growing it, and layering more AI applications upon it. Today, with Cognonto’s release, we begin that process.

Entry Point for the Cognonto Demo

Screen Shot of the Entry Point for the Cognonto Demo

You can begin to see the power and the structure yourself via Cognonto’s online demo, as shown above, which showcases a portion of the system’s functionality.

Problem and Opportunity

Artificial intelligence (AI) and machine learning are revolutionizing knowledge systems. Improved algorithms and faster graphics chips have been contributors. But the most important factor in knowledge-based AI’s renaissance, in our opinion, has been the availability of massive digital datasets for the training of machine learners.

Wikipedia and data from search engines are central to recent breakthroughs. Wikipedia is at the heart of Siri, Cortana, the former Freebase, DBpedia, Google’s Knowledge Graph and IBM’s Watson, to name just a few prominent AI question-answering systems. Natural language understanding is showing impressive gains across a range of applications. To date, all of these examples have been the result of bespoke efforts. It is very expensive for standard enterprises to leverage these knowledge resources on their own.

Today’s practices require significant upfront and testing effort. Much latent knowledge remains unexpressed and not easily available to learners; it must be exposed, cleaned and vetted. Further upfront effort needs to be spent on selecting the features (variables) used and then on accurately labeling the positive and negative training sets. Without “gold standards” — at still more cost — it is difficult to tune and refine the learners. The cost to develop tailored extractors, taggers, categorizers, and natural language processors is simply too high.

So recent breakthroughs demonstrate the promise; now it is time to systematize the process and lower the costs. The insight behind Cognonto is that existing knowledge bases can be staged to automate much of the tedium and reduce the costs now required to set up and train machine learners for knowledge purposes. Cognonto’s mission is to make knowledge-based artificial intelligence (KBAI) cheaper, repeatable, and applicable to enterprise needs.

Cognonto (a portmanteau of ‘cognition’ and ‘ontology’) exploits large-scale knowledge bases and semantic technologies for machine learning, data interoperability and mapping, and fact and entity extraction and tagging. Cognonto puts its insight into practice through a knowledge structure, KBpedia, designed to support AI, and a management framework, the Cognonto Platform, for integrating external data to gain the advantage of KBpedia’s structure. We automate away much of the tedium and reduce costs in many areas, but three of the most important are:

  • Pre-staging labels for entity and relation types, essential for supervised machine learning training sets and reference standards; KBpedia’s structure-rich design is further useful for unsupervised and deep learning;
  • Fine-grained entity and relation type taggers and extractors; and
  • Mapping to external schema to enable integration and interoperability of structured, semi-structured and unstructured data (that is, everything from text to databases).
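The first bullet, pre-staging labels, can be sketched in a few lines: because the core typologies are largely disjoint, members of one type can serve as negative training examples for another. The types and entities below are illustrative, not drawn from KBpedia:

```python
# A toy typed knowledge structure: disjoint entity types and their members.
entities_by_type = {
    "Astronaut": {"Neil Armstrong", "Sally Ride"},
    "BreakfastCereal": {"corn flakes", "granola"},
}

def training_sets(target_type, kb):
    """Positives come from the target type; negatives are drawn from the
    other (disjoint) types."""
    positives = kb[target_type]
    negatives = set().union(*(members for t, members in kb.items()
                              if t != target_type))
    return positives, negatives

pos, neg = training_sets("Astronaut", entities_by_type)
print(sorted(pos), sorted(neg))
```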

The KBpedia Knowledge Structure

KBpedia is a computable knowledge structure resulting from the combined mapping of six, large-scale, public knowledge bases — Wikipedia, Wikidata, OpenCyc, GeoNames, DBpedia and UMBEL. The KBpedia structure separately captures entities, attributes, relations and topics. These are classed into a natural and rich diversity of types, with their meaning and relationships logically and coherently organized. This diagram, one example from the online demo, shows the topics captured for the main Cognonto page in relation to the major typologies within KBpedia:

Example Network Graph as One Part of the Demo Results

Example Network Graph as One Part of the Demo Results

Each of the six knowledge bases has been mapped and re-expressed into the KBpedia Knowledge Ontology. KKO follows the universal categories and logic of the 19th century American mathematician and philosopher, Charles Sanders Peirce, the subject of my last article. KKO is a computable knowledge graph that supports inference, reasoning, aggregations, restrictions, intersections, and other logical operations. KKO’s logic basis provides a powerful way to represent individual things, classes of things, and how those things may combine or emerge as new knowledge. You can inspect the upper portions of the KKO structure on the Cognonto Web site. Better still, if you have an ontology editor, you can download and inspect the open source KKO directly.
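One of the simplest inferences such a computable structure supports is inheriting type membership up a subclass chain. A minimal sketch with invented class names (not actual KKO concepts):

```python
# Each entry reads: key is a subclass of value.
subclass_of = {
    "Astronaut": "Person",
    "Person": "Agent",
}

def supertypes(cls, hierarchy):
    """All classes inferred for cls via the transitive subClassOf chain."""
    found = []
    while cls in hierarchy:
        cls = hierarchy[cls]
        found.append(cls)
    return found

# An entity typed as Astronaut is inferred to also be a Person and an Agent.
print(supertypes("Astronaut", subclass_of))  # ['Person', 'Agent']
```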

KBpedia contains nearly 40,000 reference concepts (RCs) and about 20 million entities. The combination of these and KBpedia’s structure results in nearly 7 billion logical connections across the system, as these KBpedia statistics (current as of today’s version 1.02 release) show:

Measure                                              Value
No. KBpedia reference concepts (RCs)                38,930
No. mapped vocabularies                                 27
  Core knowledge bases                                   6
  Extended vocabularies                                 21
No. mapped classes                                 138,868
  Core knowledge bases                             137,203
  Extended vocabularies                              1,665
No. typologies (SuperTypes)                             63
  Core entity types                                     33
  Other core types                                       5
  Extended                                              25
Typology assignments                               545,377
No. aspects                                             80
  Direct entity assignments                     88,869,780
  Inferred entity aspects                      222,455,858
No. unique entities                             19,643,718
Inferred no. of entity mappings              2,772,703,619
Total no. of “triples”                       3,689,849,726
Total no. of inferred and direct assertions  6,482,197,063

First Release KBpedia Statistics

About 85% of the RCs are themselves entity types — that is, 33,000 natural classes of similar entities such as ‘astronauts’ or ‘breakfast cereals’ — which are organized into about 30 “core” typologies that are mostly disjoint (non-overlapping) with one another. KBpedia has extended mappings to a further 20 other vocabularies, including schema.org, Dublin Core, and others; client vocabularies are typical additions. The typologies provide a flexible means for slicing-and-dicing the knowledge structure; the entity types provide the tie-in points to KBpedia’s millions of individual instances (and for your own records). KBpedia is expressed in the semantic Web languages of OWL and RDF. Thus, most W3C standards may be applied against the KBpedia structure, including linked data, a standard option.

KBpedia is purposefully designed to enable meaningful splits across any of its structural dimensions — concepts, entities, relations, attributes, or events. Any of these splits — or other portions of KBpedia’s rich structure — may be the computable basis for training taggers, extractors or classifiers. Standard NLP and machine learning reference standards and statistics are applied during the parameter-tuning and learning phases. Multiple learners and recognizers may also be combined as different signals to an ensemble approach to overall scoring. Alternatively, KBpedia’s slicing-and-dicing capabilities may drive export routines to use local or third-party ML services under your own control.
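The ensemble idea mentioned above can be sketched as a weighted combination of per-learner confidence scores. The learner names, scores, and weights below are invented for the example:

```python
def ensemble_score(signals, weights):
    """Weighted average of per-learner confidence scores for one candidate."""
    total = sum(weights[name] for name in signals)
    return sum(weights[name] * score
               for name, score in signals.items()) / total

# Three hypothetical recognizers voting on one candidate entity mention.
signals = {"ner_tagger": 0.9, "type_classifier": 0.7, "string_matcher": 0.4}
weights = {"ner_tagger": 2.0, "type_classifier": 1.5, "string_matcher": 0.5}
print(ensemble_score(signals, weights))
```

Weighting lets stronger signals (here, the hypothetical NER tagger) dominate weaker ones, rather than giving every recognizer an equal vote.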

Though usable in a standalone mode, only slices of KBpedia may be applicable to a given problem or domain, which then most often need to be extended with local data and schema. Cognonto has services to incorporate your own domain and business data, critical to fulfill domain purposes and to respond to your specific needs. We transform your external and domain data into KBpedia’s canonical forms for interacting with the overall structure. Such data may include other public databases, but also internal, customer, product, partner, industry, or research information. Data may range from unstructured text in documents to semi-structured tags or metadata to spreadsheets or fully structured databases. The formats of the data may span hundreds of document types to all flavors of spreadsheets and databases.

Platform and Technology

Cognonto’s modular technology is based on Web-oriented architectures. All functionality is exposed via Web services and programmatically in a microservice design. The technology for Cognonto resides in three inter-related areas:

  • Cognonto Platform – the technology for storing, accessing, mapping, visualizing, querying, managing, analyzing, tagging, reporting and machine learning using KBpedia;
  • KBpedia Structure – the central knowledge structure of organized and mapped knowledge bases and their millions of instances; and
  • Build Infrastructure – repeatable and modifiable build and coherence and consistency testing scripts, including reference standards.

The Cognonto Web services may be manipulated directly from the command line or via cURL calls, or by simple HTML interfaces, by SPARQL, or programmatically. The Web services are written in Clojure and follow literate programming practices.

The base KBpedia knowledge graph may be explored interactively across billions of combinations with sample exports of its content. Here is the example for automobile.

Example Screen Shot for a Portion of the Knowledge Graph

Example Screen Shot for a Portion of the Knowledge Graph Results

There is a lot going on with many results panels and with links throughout the structure. There is a ‘How to’ for the knowledge graph if you really want to get your hands dirty.

These platform, technology, and knowledge structure capabilities combine to enable us to offer services across the full spectrum of KBAI applications.

Cognonto is a foundation for doing serious knowledge-based artificial intelligence.

Today and Tomorrow

Despite the years we have been working on this, it very much feels like we are at the beginning. There is so much more that can be done.

First, we need to continue to wring out errors and mis-assignments in the structure. We estimate an accuracy error rate of 1-2% currently, but that still represents millions of potential errors. The objective is not to be more accurate than alternatives, which we already are, but to be the most effective foundation possible for training machine learners. Further cleaning will result in still better standards and mappings. Throughout the interactive knowledge graph we have a button for submitting errors; please submit if you see any problems!

Second, we are seeing the value of exposing structure, and the need to keep doing so. Each iteration of structure gets easier, because prior ones may be applied to automate much of the testing and vetting effort for the subsequent ones. Structure provides the raw feature (variable) grist used by machine learners. We have a very long punch list of where we can effectively add more structure to KBpedia.

And, last, we need to extend the mappings to more knowledge bases, more vocabularies, and more schema. This kind of integration is really what smooths the way to data integration and interoperability. Virtually every problem and circumstance requires including local and external information.

We know there are many important uses — and an upside of potential — for codifying knowledge bases for AI and machine learning purposes. Drop me a line if you’d like to discuss how we can help you leverage your own domain and business data using knowledge-based AI.

Posted: September 12, 2016

Charles Sanders Peirce, courtesy Wikimedia

His Triadic Logic is the Mindset for Categorization

Many of us involved in semantic technologies or information science grapple with the question of categorization. How do we provide a coherent organization of the world that makes sense? Better still, how might we represent this coherent structure in a manner that informs how we can extend or grow our knowledge domains? Most problems of a practical nature require being able to combine information together so as to inform new knowledge. Categories that bring together (generalize) similar things are a key way to aid that.

Embracing semantic technologies means, among standards and other things, that the natural structural representation of domains is the graph. These are formally specified using either RDF or OWL. These ontologies have objects as nodes, and properties between those nodes as edges. I believe in this model, and have worked for at least a decade to promote its use. It is the model used by Google’s knowledge graph, for example.

Knowledge graphs that are upper ontologies typically have 80% to 85% of their nodes acting to group similar objects, mostly what could be axiomatized as ‘classes’ or ‘types’. This realization naturally shifts focus to, then, how are these groups formed? What are the bases to place multiple instances into a given class? Are types the same things as classes?

Knowledge, inherently open and dynamic, can only be used for artificial intelligence when it is represented by structures readable by machines. Digitally readable structures of knowledge and features are essential for machine learning, natural language understanding, or other AI functions. Indeed, were such structures able to be expressed in a mostly automatic way, the costs and efforts to perform AI and natural language processing and understanding functions (NLP and NLU) would be greatly lessened.

Open and dynamic also means that keeping the knowledge base current requires simple principles to educate and train those charged with keeping the structure up to date. Nothing is perfect, humans or AI. Discovery and truth only result from questioning and inspection. The entire knowledge graph is fallible and subject to growth and revision. Human editors — trained and capable — are essential to maintain the integrity of such structures, automation or AI notwithstanding. Fundamentally, then, the challenge becomes how to think simply about grouping things and forming categories. Discovery of simplicity is hard without generalization and deep thought.

A Peircean View in Thirdness

Scholars of Charles Sanders Peirce (“purse”) (1839 – 1914) [1] all acknowledge how infused his writings on logic, semiosis, philosophy, and knowledge are with the idea of “threes”. His insights are perhaps most studied with respect to his semiosis of signs, with the triad formed by object, representation, and interpretation. But Peirce recognized many prior philosophers, particularly Kant and Hegel, had also made “threes” a cornerstone of their views. Peirce studied and wrote on what makes “threes” essential and irreducible. His generalization, or abstraction if you will, he called simply the Three Categories, and to reflect their fundamental nature, called each separately as Firstness, Secondness and Thirdness. In his writings over decades, he related or described this trichotomy in dozens of contexts [2].

Across his voluminous writings, which unfortunately are not all available since they are still being transcribed from tens of thousands of original handwritten notes, I glean from the available materials this understanding of his three categories from a knowledge representation standpoint:

  • Firstness [1ns] — these are potentials, the basic qualities that may combine together or interact in various ways to enable the real things we perceive in the world. They are unexpressed potentialities, the substrate of the actual. These are the unrealized building blocks, or primitives, the essences or attributes or possible juxtapositions; indeed, “these” and “they” are misnomers because, once conceived, the elements of Firstness are no longer Firstness;
  • Secondness [2ns] — these are the particular realized things or concepts in the world, what we can perceive, point to and describe (including the idea of Firstness, Thirdness, etc.) A particular is also known as an entity, instance or individual;
  • Thirdness [3ns] — these are the laws, habits, regularities and continuities that may be generalized from particulars. All generals — what are also known as classes, kinds or types — belong to this category. The process of finding and deriving these generalities also leads to new insights or emergent properties, what Peirce called the “surprising fact.” 

Understanding, inquiry and knowledge require this irreducible structure; connections, meaning and communication depend on all three components, standing in relation to one another and subject to interpretation by multiple agents (Peirce’s semiosis of signs). Contrast this Peircean view with traditional classification schemes, which have a dyadic or dichotomous nature and do not support such rich views of context and interpretation.

Peirce’s “surprising fact” is new knowledge that emerges from anomalies observed when attempting to generalize or to form habits. Abductive reasoning, a major contribution by Peirce, attempts to probe why the anomaly occurs. The possible hypotheses so formed constitute the Firstness or potentials of a new categorization (identification of particulars and generalization of the phenomena). The scientific method is grounded in this process and reflects the ideal of this approach (what Peirce called the “methodeutic”).

Peirce at a High Altitude

Significant terms we associate with knowledge and its discovery include open, dynamic, process, representation, signification, interpretation, logic, coherence, context, reality, and truth. These were all topics of Peirce’s deep inquiry and explained by him via his triadic world view. For example, Peirce believed in the real as having existence apart from the mind (a refutation of Descartes’ view). He believed there is truth, that it can be increasingly revealed by the scientific method and social consensus (agreement of signs), but current belief as to what is “truth” is fallible and can never be realized in the absolute (it is a limit function). There is always distance and different interpretation between the object, its representation, and its interpretation. But this same logic provides the explanation for the process of categorization, also grounded in Firstness, Secondness and Thirdness [2].

Of course, some Peircean scholars may rightfully see these explanations as a bit of a cartoon, and a possible injustice to his lifetime of work. For more than 100 years philosophers and logicians have tried to plumb Peirce’s insights and writings. This summary by no means captures many subtleties. But, if we ourselves generalize across Peirce’s writings and his application of the Three Categories, we can gain a mindset that, I submit, is both easily grasped and applied, the result of which is a logical, coherent approach to categorization and knowledge representation.

First, we decide the focus of the categorization effort. That may arise from one of three sources. We are either trying to organize a knowledge domain anew; we are splitting an existing category that has become too crowded and difficult to reason over; or we have found a “surprising fact” or are trying to plumb an anomaly. Any of these can trigger the categorization process (and, notice, they are in 1ns, 2ns and 3ns splits). The breadth or scope of the category follows from the domain and the purpose of the categorization effort.

How to think about the new category and decide its structure comes from the triad:

  • Firstness – the potential things, ideas, concepts, entities, forces, factors, events, or whatever else that potentially bears upon or has relevance to the category; think of it as the universe of thought that might be brought to bear for the new category of inquiry
  • Secondness – the particular instances, real and imagined, that may populate the information space for the category, including the ideas of attributes and relations, which also need to be part of the Firstness
  • Thirdness – the generals, types, regularities, patterns, or logical groupings that may arise from combinations of any of these factors. Similarities, “truth” and predictability help inform these groupings.

What constitutes the potentials, realized particulars, and generalizations that may be drawn from a query or investigation is contextual in nature. I outlined more of the categorization process in an earlier article [2].

Peirce’s triadic logic is a powerful mindset for how to think about and organize the things and ideas in our world. Peirce’s triadic logic and views on categorization are fractal in nature. We can apply this triadic logic to any level of information granularity. The graph structure arises from the connections amongst all of these 1ns, 2ns and 3ns factors.

We will be talking further about how this 40,000-ft view of the Peircean mindset helps create practical knowledge graphs and ontological structures. We will also be showing an example suitable for knowledge-based artificial intelligence (KBAI). The exciting point is that we have found a simple grounding of three aspects that is logically sound and can be readily trained. We will also be showing how we can do so much more work against this kind of natural KBAI structure.

Stay tuned.


[1] A tremendous starting point for information on Peirce is the category about him on Wikipedia, starting with his eponymous page.
[2] M.K. Bergman, 2016. “A Foundational Mindset: Firstness, Secondness, Thirdness,” AI3:::Adaptive Information blog, March 21, 2016.
Posted:August 9, 2016

Peg Project: Continued Visibility for the Award-winning Web Portal

Laszlo Pinter, the individual who hired us as the technical contractor for the Peg community portal (www.mypeg.ca), recently gave a talk on the project at a TEDx conference in Winnipeg. Peg is the well-being indicator system for the community of Winnipeg. Laszlo’s talk is a 15-minute, high-level overview of the project and its rationale and role:

Peg helps identify and track indicators that relate to the economic, environmental, cultural and social well-being of the people of Winnipeg. Scores of connected datasets underneath Peg relate all of its information, from stories and videos to indicator data, to one another using semantic technologies. I first wrote about Peg when it was released at the end of 2013.

In 2014, Peg won the international Community Indicators Consortium Impact Award. The Peg Web site is a joint project of the United Way of Winnipeg (UWW) and the International Institute for Sustainable Development (IISD). Our company, Structured Dynamics, was the lead developer for the project, which is also based on SD’s Open Semantic Framework (OSF) platform.

Congratulations to the Peg team for the well-deserved visibility!

Posted:July 18, 2016

NLP and ML Gold Standards: Reference Standards are Not Just for Academics

It is common — if not nearly obligatory — for academic researchers in natural language processing (NLP) and machine learning (ML) to compare the results of their new studies to benchmark reference standards. I outlined some of the major statistical tests in a prior article [1]. The requirement to compare research results to existing gold standards makes sense: it provides an empirical basis for how the new method compares to existing ones, and by how much. Precision, recall, and the combined F1 score are the most prominent amongst these statistical measures.

Of course, most enterprise or commercial projects are done for proprietary purposes, with results infrequently published in academic journals. But, as I argue in this article, even though enterprise projects are geared to the bottom line and not the journal byline, the need for benchmarks, and reference and gold standards, is just as great — perhaps greater — for commercial uses. But there is more than meets the eye with some of these standards and statistics. Why following gold standards makes sense and how my company, Structured Dynamics, does so are the subjects of this article.

A Quick Primer on Standards and Statistics

The most common scoring methods to gauge the “accuracy” of natural language or supervised machine learning analysis involve statistical tests based on the ideas of negatives and positives, true or false. We can measure our correct ‘hits’ by applying our statistical tests to a “gold standard” of known results. This gold standard provides a representative sample of what our actual population looks like, one in which we have characterized in advance whether each result is true or not for the question at hand. Further, we can use this same gold standard over and over again to gauge improvements in our test procedures.

‘Positive’ and ‘negative’ are simply the assertions (predictions) arising from our test algorithm of whether or not there is a match or a ‘hit’. ‘True’ and ‘false’ merely indicate whether these assertions proved to be correct or not as determined by the reference standard. A false positive is a false alarm, a “crying wolf”; a false negative is a missed result. Combining these thoughts leads to a confusion matrix, which lays out how to interpret the true and false, positive and negative results:

                              Test Assertion
+-------------+---------------------+---------------------+
| Correctness | Positive            | Negative            |
+-------------+---------------------+---------------------+
| True        | TP (True Positive)  | TN (True Negative)  |
| False       | FP (False Positive) | FN (False Negative) |
+-------------+---------------------+---------------------+

These four characterizations — true positive, false positive, true negative, false negative — now give us the ability to calculate some important statistical measures.
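As a minimal sketch (the label lists here are hypothetical, not drawn from any dataset in this article), tallying the four confusion-matrix cells from paired gold-standard and predicted labels might look like:

```python
# Tally the four confusion-matrix cells from paired gold-standard
# and predicted boolean labels.

def confusion_counts(gold, predicted):
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)          # true positives
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)      # false alarms
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)  # correct rejections
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)      # missed results
    return tp, fp, tn, fn

gold      = [True, True, False, True, False, False]
predicted = [True, False, True, True, False, False]
print(confusion_counts(gold, predicted))  # (2, 1, 2, 1)
```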

The first metric captures the concept of coverage. In standard statistics, this measure is called sensitivity; in IR (information retrieval) and NLP contexts it is called recall. Basically it measures the ‘hit’ rate for identifying true positives out of all potential positives, and is also called the true positive rate, or TPR:

TPR = TP / P = TP / (TP + FN)

Expressed as a fraction of 1.00 or a percentage, a high recall value means the test has a high “yield” for identifying positive results.

Precision is the complementary measure to recall, in that it measures how often positive identifications prove to be true:

precision = TP / (TP + FP)

Precision is something, then, of a “quality” measure, also expressed as a fraction of 1.00 or a percentage. It is also known as the positive predictive value, defined as the proportion of true positives among all positive results (both true positives and false positives).

Thus, recall gives us a measure as to the breadth of the hits captured, while precision is a statement of whether our hits are correct or not. We also see why false positives need to be a focus of attention in test development: they directly lower precision and efficiency of the test.
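The two measures can be computed directly from the confusion cells, per the formulas above (the counts here are illustrative, not from any study in this article):

```python
# Recall (true positive rate) and precision from the raw confusion cells.

def recall(tp, fn):
    # breadth: share of all actual positives that were captured
    return tp / (tp + fn)

def precision(tp, fp):
    # quality: share of asserted positives that were correct
    return tp / (tp + fp)

tp, fp, fn = 80, 20, 20
print(recall(tp, fn))     # 0.8
print(precision(tp, fp))  # 0.8
```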

That precision and recall are complementary and linked is reflected in one of the preferred overall measures of IR and NLP statistics, the F-score, which is the adjusted (beta) mean of precision and recall. The general formula for positive real β is:

F_β = (1 + β²) · (precision · recall) / ((β² · precision) + recall)

which can be expressed in terms of TP, FN and FP as:

F_β = ((1 + β²) · TP) / ((1 + β²) · TP + β² · FN + FP)

In many cases the harmonic mean is used (that is, a beta of 1), which is called the F1 statistic:

F₁ = 2 · (precision · recall) / (precision + recall)

But F1 displays a tension. Either precision or recall may be improved to achieve an improvement in F1, but with divergent benefits or effects. What is more highly valued? Yield? Quality? These choices dictate what areas of improvement need to receive focus. As a result, the weight of beta can be adjusted to favor either precision or recall.
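A small sketch makes the beta weighting concrete (the precision and recall values are hypothetical): with precision high and recall low, F2 penalizes the score while F0.5 rewards it.

```python
def f_beta(p, r, beta=1.0):
    # Weighted harmonic mean of precision and recall; beta > 1
    # favors recall (yield), beta < 1 favors precision (quality).
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

p, r = 0.9, 0.6
print(round(f_beta(p, r), 4))        # 0.72   (F1, balanced)
print(round(f_beta(p, r, 2.0), 4))   # 0.6429 (F2, leans toward recall)
print(round(f_beta(p, r, 0.5), 4))   # 0.8182 (F0.5, leans toward precision)
```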

Accuracy is another metric that can factor into this equation, though it is a less referenced measure in the IR and NLP realms. Accuracy is the statistical measure of how well a binary classification test correctly identifies or excludes a condition:

accuracy = (TP + TN) / (TP + FP + FN + TN)

An accuracy of 100% means that the measured values are exactly the same as the given values.

All of the measures above simply require the measurement of false and true, positive and negative, as do a variety of predictive values and likelihood ratios. Relevance, prevalence and specificity are some of the other notable measures that depend solely on these metrics in combination with total population [2].
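Accuracy and specificity both fall out of the same four cells, as a quick sketch shows (counts illustrative):

```python
# Accuracy and specificity from the four confusion cells; both measures
# depend only on TP, FP, TN, FN.

def accuracy(tp, tn, fp, fn):
    # share of all assertions, positive and negative, that were correct
    return (tp + tn) / (tp + tn + fp + fn)

def specificity(tn, fp):
    # true negative rate: share of actual negatives correctly excluded
    return tn / (tn + fp)

print(accuracy(80, 90, 10, 20))  # 0.85
print(specificity(90, 10))       # 0.9
```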

Not All Gold Standards Shine

Gold standards that themselves contain false positives and false negatives, by definition, immediately introduce errors. These errors make it difficult to test and refine existing IR and NLP algorithms, because the baseline is skewed. And, because gold standards also often inform training sets, errors there propagate into errors in machine learning. It is also important to include true negatives in a gold standard, in the likely ratio expected by the overall population, so as to improve overall accuracy [3].

There is a reason that certain standards, such as the NYT Annotated Corpus or the Penn Treebank [4], are often referenced as gold standards. They have been in public use for some time, with many errors edited out of the systems. Vetted standards such as these may have inter-annotator agreements [5] in the range of 80% to 90% [4]. More typical use cases in biomedical notes [6] and encyclopedic topics [7] tend to show inter-annotator agreements in the range of 75% to 80%.

A proper gold standard should also be constructed to provide meaningful input to performance statistics. Per above, we can summarize these again as:

  • TP = standard provides labels for instances of the same types as in the target domain; manually scored
  • FP = manually scored for test runs based on the current configuration; test indicates as positive, but deemed not true
  • TN = standard provides somewhat similar or ambiguous instances from disjoint types labeled as negative; manually scored
  • FN = manually scored for test runs based on the current configuration; test indicates as negative, but the instance is actually positive.

It is further widely recognized that the best use for a reference standard is when it is constructed in exact context to its problem domain, including the form and transmission methods of the message. A reference standard appropriate to Twitter is likely not a good choice to analyze legal decisions, for example.

So, we can see many areas by which gold, or reference, standards may not be constructed equally:

  1. They may contain false positives
  2. They have variable inter-annotator agreement
  3. They have variable mechanisms, most with none, for editing and updating the labels
  4. They may lack sufficient inclusion of negatives
  5. They may be applied to an out-of-context domain or circumstance.

Being aware of these differences and seeking hard information about them are essential considerations whenever a serious NLP or ML project is being contemplated.

Seemingly Good Statistics Can Lead to Bad Results

We may hear quite high numbers for some NLP experiments, sometimes in the mid-90% to higher range. Such numbers sound impressive, but what do they mean and what might they not be saying?

We humans have a remarkable ability to see when things are not straight, level or plumb. We have a similar ability to spot errors in long lists and orderings of things. While a claimed accuracy of even, say, 95% sounds impressive, applied to a large knowledge graph such as UMBEL [8], with its 35,000 concepts, it still translates into 1,750 misassignments. That sounds like a lot, and it is. Yet misassignments of some nature occur within any standard. When they occur, they are sometimes glaringly obvious, like being out of plumb. It is actually pretty easy to find most errors in most systems.

Still, for the sake of argument, let’s accept we have applied a method that has a claimed accuracy of 95%. But, remember, this is a measure applied against the gold standard. If we take the high-end of the inter-annotator agreements for domain standards noted above, namely 80%, then we have this overall accuracy within the system:

0.80 × 0.95 = 0.76

Whoa! Now, using this expanded perspective, for a candidate knowledge graph the size of UMBEL — that is, about 35 K items — we could see as many as 8,400 misassignments. Those numbers now sound really huge, and they are. They are unacceptable.
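The arithmetic is simple enough to replay directly (the figures come from the discussion above):

```python
# Compound accuracy: gold-standard reliability times method accuracy.
inter_annotator = 0.80   # high end of domain inter-annotator agreement
method_accuracy = 0.95   # claimed accuracy against that standard

overall = inter_annotator * method_accuracy
print(round(overall, 2))                # 0.76

concepts = 35_000                       # approximate size of UMBEL
print(round(concepts * (1 - overall)))  # 8400 potential misassignments
```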

A couple of crucial implications result from this simple analysis. First, it is essential to take a holistic view of the error sources across the analysis path, including and most especially the reference standards. (They are, more often than not IMO, the weak link in the analysis path.) And, second, getting the accuracy of reference standards as high as possible is crucial to training the best learners for the domain problem. We discuss this second implication next.

How to Get the Standards High

There is a reason the biggest Web players are in the forefront of artificial intelligence and machine learning. They have the resources — and most critically the data — to create effective learners. But, short of the biggest of Big Data, how can smaller players compete in the NLP and machine learning front?

Today, we have high-quality (if imperfect) public data sets: millions of entity types and concepts in all languages from Wikipedia, a complementary set of nearly 20 million entities in Wikidata, and thousands of other high-quality public datasets. For a given enterprise need, if this information can be coherently organized, structured to the maximum extent, and subject to logic and consistency tests for typing, relationships, and attributes, we have the basis to train learners with standards of unprecedented accuracy. (Of course, proprietary concepts and entity data should also figure prominently into this mix.) Indeed, this is the premise behind Structured Dynamics’ efforts in knowledge-based artificial intelligence.

KBAI is based on a curated knowledge base eating its own tail, working through cycles of consistency and logic testing to reduce misassignments, while continually seeking to expand structure and coverage. There is a network effect to these efforts, as adding and testing structure or mapping to new structures and datasets continually gets easier. These efforts enable the knowledge structure to be effectively partitioned for training specific recognizers, classifiers and learners, while also providing a logical reference structure for adding new domain and public data and structure.

This basic structure — importantly supplemented by the domain concepts and entities relevant to the customer at hand — is then used to create reference structures for training the target recognizers, classifiers and learners. The process of testing and adding structure identifies previously hidden inconsistencies. As corrected, the overall accuracy of the knowledge structure to act in a reference mode increases. At Structured Dynamics, we began this process years ago with the initial UMBEL reference concept structure. To that we have mapped and integrated a host of public data systems, including OpenCyc, Wikipedia, DBpedia, and, now, Wikidata. Each iteration broadens our scope and reduces errors, leading to a constantly more efficient basis for KBAI.

An integral part of that effort is to create gold standards for each project we engage. You see, every engagement has its own scope and wrinkles. Besides domain data and contexts, there are always specific business needs and circumstances that need to be applied to the problem at hand. The domain coverage inevitably requires new entity or relation recognizers, or the mapping of new datasets. The nature of the content at hand may range from tweets to ads to Web pages or portions thereof, or academic papers, with specific tests and recognizers from copyrights to section headings informing new learners. Every engagement requires its own reference standards. Being able to create these efficiently and with a high degree of accuracy is a competitive differentiator.

SD’s General Approach to Enterprise Standards

Though Structured Dynamics’ efforts are geared to enterprise projects, and not academic papers, the best practices of scientific research still apply. We insist upon the creation of gold standards for every discrete recognizer, classifier or learner we undertake for major clients. This requirement is not a hard argument to make, since we have systems in place to create initial standards and can quantify the benefits from the applied standards. Since major engagements often involve the incorporation of new data and structure, new feature recognizers, or bespoke forms of target content, the gold standards give us the basis for testing all wrinkles and parameters. The cost advantages of testing alternatives efficiently are demonstrable. On average, we can create a new reference standard in 10-20 labor hours (each for us and the client).

Specifics may vary, but we typically seek about 500 true positive instances per standard, with 20 or so true negatives. (As a note, there are more than 1,900 entity and relation types in Wikidata — 800 and 1,100 types, respectively — that meet this threshold. However, it is also not difficult to add hundreds of new instances from alternative sources.) All runs are calibrated with statistics reporting. In fact, any of our analytic runs may invoke the testing statistics, which are typically presented like this for each run:

True positives:  362
False positives:  85
True negatives:  19
False negatives:  45

+--------------+------------+
| key          | value      |
+--------------+------------+
| :precision   | 0.8098434  |
| :recall      | 0.8894349  |
| :specificity | 0.1826923  |
| :accuracy    | 0.7455969  |
| :f1          | 0.84777516 |
| :f2          | 0.8722892  |
| :f0.5        | 0.82460135 |
+--------------+------------+
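All of the reported values can be reproduced from the four raw counts alone. As a sketch (the computation is standard; the table formatting of SD's internal tooling may differ):

```python
# Recompute the reported statistics from the four raw counts above.
tp, fp, tn, fn = 362, 85, 19, 45

precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy    = (tp + tn) / (tp + fp + tn + fn)

def f_beta(beta):
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(precision, 7))    # 0.8098434
print(round(recall, 7))       # 0.8894349
print(round(specificity, 7))  # 0.1826923
print(round(accuracy, 7))     # 0.7455969
print(round(f_beta(1), 7))    # 0.8477752
print(round(f_beta(2), 7))    # 0.8722892
print(round(f_beta(0.5), 7))  # 0.8246014
```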

When we are in active testing mode we are able to iterate parameters and configurations quickly, and to discover which thrusts have more or less effect on desired outcomes. We embed these runs in electronic notebooks using literate programming to capture and document our decisions and approach as we go [9]. Overall, the process has proven to be highly effective (and keeps improving!).

We could conceivably lower the requirement for 500 true positive instances as we see the underlying standards improve. However, since getting this de minimis of examples has become systematized, we really have not had reason to test and validate smaller standard sizes. We are also not seeking definitive statistical test values, but a framework for evaluating different parameters and methods. In most cases, we have seen our reference sets grow over time as new wrinkles and perspectives emerge that require testing.

In all cases, the most important factor in this process has been to engage customers in manual review and scoring. More often than not we see client analysts understand and detect patterns that then inform improved methods. Both we, as the contractor, and the client gain a stake in, and an understanding of, the importance of reference standards.

Clean, vetted gold standards and training sets are thus a critical component to improving our client’s results — and our own knowledge bases — going forward. The very practice of creating gold standards and training sets needs to receive as much attention as algorithm development because, without it, we are optimizing algorithms to fuzzy objectives.


[1] M.K. Bergman, 2015. “A Primer on Knowledge Statistics,” AI3:::Adaptive Information blog, May 18, 2015.
[2] By bringing in some other rather simple metrics, it is also possible to expand beyond this statistical base to cover such measures as information entropy, statistical inference, pointwise mutual information, variation of information, uncertainty coefficients, information gain, AUCs and ROCs. But we’ll leave discussion of some of those options until another day.
[3] George Hripcsak and Adam S. Rothschild, 2005. “Agreement, the F-measure, and Reliability in Information Retrieval.” Journal of the American Medical Informatics Association 12, no. 3 (2005): 296-298.
[4] See Eleni Miltsakaki, Rashmi Prasad, Aravind K. Joshi, and Bonnie L. Webber, 2004. “The Penn Discourse Treebank,” in LREC. 2004. For additional useful statistics and an example of high inter-annotator agreement, see Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel, 2006. “OntoNotes: the 90% Solution,” in Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pp. 57-60. Association for Computational Linguistics, 2006.
[5] Inter-annotator agreement is the degree of agreement among raters or annotators of scoring or labeling for reference standards. The phrase embraces or overlaps a number of other terms for multiple-judge systems, such as inter-rater agreement, inter-observer agreement, or inter-rater reliability. See also Ron Artstein and Massimo Poesio, 2008. “Inter-coder Agreement for Computational Linguistics,” Computational Linguistics 34, no. 4 (2008): 555-596. Also see Kevin A. Hallgren, 2012. “Computing Inter-rater Reliability for Observational Data: An Overview and Tutorial,” Tutorials in Quantitative Methods for Psychology 8, no. 1 (2012): 23.
[6] Philip V. Ogren, Guergana K. Savova, and Christopher G. Chute, 2007. “Constructing Evaluation Corpora for Automated Clinical Named Entity Recognition,” in Medinfo 2007: Proceedings of the 12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems, p. 2325. IOS Press, 2007. This study shows inter-annotator agreement of .75 for biomedical notes.
[7] Veselin Stoyanov and Claire Cardie, 2008. “Topic Identification for Fine-grained Opinion Analysis,” in Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pp. 817-824. Association for Computational Linguistics, 2008; the study shows inter-annotator agreement of ~76% for fine-grained topics. David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin, 2010. “Automatic Evaluation of Topic Coherence,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 100-108. Association for Computational Linguistics, 2010, shows inter-annotator agreement in the .73 to .82 range.
[8] UMBEL (Upper Mapping and Binding Exchange Layer) is a logically organized knowledge graph of about 35,000 concepts and entity types that can be used in information science for relating information from disparate sources to one another. This open-source ontology was originally developed by Structured Dynamics, which still maintains it. It is used to assist data interoperability and the mapping of disparate datasets.
[9] Fred Giasson, Structured Dynamics’ CTO, has been writing a series of blog posts on literate programming and the use of Org-mode as an electronic notebook. I have provided a broader overview of SD’s efforts in this area; see M.K. Bergman, 2016. “Literate Programming for an Open World,” AI3:::Adaptive Information blog, June 27, 2016.

Posted by AI3's author, Mike Bergman Posted on July 18, 2016 at 10:59 am in Knowledge-based Artificial Intelligence, UMBEL | Comments (0)