Posted: July 18, 2016

Reference Standards are Not Just for Academics

It is common — if not nearly obligatory — for academic researchers in natural language processing (NLP) and machine learning (ML) to compare the results of their new studies against benchmark reference standards. I outlined some of the major statistical tests in a prior article [1]. The requirement to compare research results to existing gold standards makes sense: it provides an empirical basis for how the new method compares to existing ones, and by how much. Precision, recall, and the combined F1 score are the most prominent of these statistical measures.

Of course, most enterprise or commercial projects are done for proprietary purposes, with results infrequently published in academic journals. But, as I argue in this article, even though enterprise projects are geared to the bottom line and not the journal byline, the need for benchmarks, and reference and gold standards, is just as great — perhaps greater — for commercial uses. But there is more than meets the eye with some of these standards and statistics. Why following gold standards makes sense and how my company, Structured Dynamics, does so are the subjects of this article.

A Quick Primer on Standards and Statistics

The most common scoring methods to gauge the “accuracy” of natural language or supervised machine learning analysis involve statistical tests based on the ideas of negatives and positives, true or false. We can measure our correct ‘hits’ by applying our statistical tests to a “gold standard” of known results. This gold standard provides a representative sample of what our actual population looks like, one for which we have characterized in advance whether each result in the sample is true or not for the question at hand. Further, we can use this same gold standard over and over again to gauge improvements in our test procedures.

‘Positive’ and ‘negative’ are simply the assertions (predictions) arising from our test algorithm of whether or not there is a match or a ‘hit’. ‘True’ and ‘false’ merely indicate whether these assertions proved to be correct or not as determined by the reference standard. A false positive is a false alarm, a “crying wolf”; a false negative is a missed result. Combining these thoughts leads to a confusion matrix, which lays out how to interpret the true and false, positive and negative results:

+-------------+-----------------------+-----------------------+
| Correctness | Test Asserts Positive | Test Asserts Negative |
+-------------+-----------------------+-----------------------+
| True        | TP (True Positive)    | TN (True Negative)    |
| False       | FP (False Positive)   | FN (False Negative)   |
+-------------+-----------------------+-----------------------+

These four characterizations — true positive, false positive, true negative, false negative — now give us the ability to calculate some important statistical measures.

The first metric captures the concept of coverage. In standard statistics, this measure is called sensitivity; in IR (information retrieval) and NLP contexts it is called recall. Basically it measures the ‘hit’ rate for identifying true positives out of all potential positives, and is also called the true positive rate, or TPR:

\mathit{TPR} = \mathit{TP} / P = \mathit{TP} / (\mathit{TP}+\mathit{FN})

Expressed as a fraction of 1.00 or a percentage, a high recall value means the test has a high “yield” for identifying positive results.

Precision is the complementary measure to recall, in that it measures how many of the test’s positive identifications are in fact true:

\text{precision} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FP}}

Precision is something, then, of a “quality” measure, also expressed as a fraction of 1.00 or a percentage. It is equivalent to the positive predictive value, defined as the proportion of true positives among all positive results (both true positives and false positives).

Thus, recall gives us a measure as to the breadth of the hits captured, while precision is a statement of whether our hits are correct or not. We also see why false positives need to be a focus of attention in test development: they directly lower precision and efficiency of the test.

That precision and recall are complementary and linked is reflected in one of the preferred overall measures of IR and NLP statistics, the F-score, which is a weighted harmonic mean of precision and recall, adjusted by beta. The general formula for positive real β is:

F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}.

which can be expressed in terms of TP, FN and FP as:

F_\beta = \frac {(1 + \beta^2) \cdot \mathrm{true\ positive} }{(1 + \beta^2) \cdot \mathrm{true\ positive} + \beta^2 \cdot \mathrm{false\ negative} + \mathrm{false\ positive}}\,

In many cases the balanced harmonic mean is used, corresponding to a beta of 1; this is called the F1 statistic:

F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

But F1 displays a tension. Either precision or recall may be improved to achieve an improvement in F1, but with divergent benefits or effects. What is more highly valued? Yield? Quality? These choices dictate what areas of improvement need to receive focus. As a result, the weight of beta can be adjusted to favor either precision or recall.
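As a quick illustration with made-up values, suppose precision is 0.80 and recall is 0.90:

F_1 = \frac{2 \cdot 0.80 \cdot 0.90}{0.80 + 0.90} \approx 0.85 \qquad F_2 = \frac{5 \cdot 0.80 \cdot 0.90}{4 \cdot 0.80 + 0.90} \approx 0.88 \qquad F_{0.5} = \frac{1.25 \cdot 0.80 \cdot 0.90}{0.25 \cdot 0.80 + 0.90} \approx 0.82

Because recall exceeds precision in this example, the recall-weighted F2 comes out highest and the precision-weighted F0.5 lowest; the choice of beta encodes which kind of error the project can least afford.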

Accuracy is another metric that can factor into this equation, though it is a less referenced measure in the IR and NLP realms. Accuracy is the statistical measure of how well a binary classification test correctly identifies or excludes a condition:

\text{accuracy} = \frac{\mathit{TP} + \mathit{TN}}{\mathit{TP} + \mathit{FP} + \mathit{TN} + \mathit{FN}}

An accuracy of 100% means that the test’s assertions agree exactly with the reference standard.

All of the measures above simply require the measurement of false and true, positive and negative, as do a variety of predictive values and likelihood ratios. Relevance, prevalence and specificity are some of the other notable measures that depend solely on these metrics in combination with total population [2].
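Since each of these measures derives from the same four counts, they are simple to compute directly. Below is a minimal sketch in Clojure, the language Structured Dynamics uses; the function names and the count map are illustrative only, not taken from any particular library:

;; confusion-matrix counts for a run, e.g. {:tp 362 :fp 85 :tn 19 :fn 45}

(defn precision [{tp :tp fp :fp}]
  (/ tp (double (+ tp fp))))

(defn recall [{tp :tp fneg :fn}]
  (/ tp (double (+ tp fneg))))

(defn specificity [{tn :tn fp :fp}]
  (/ tn (double (+ tn fp))))

(defn accuracy [{tp :tp fp :fp tn :tn fneg :fn}]
  (/ (+ tp tn) (double (+ tp fp tn fneg))))

(defn f-score
  "General F-beta score; beta > 1 favors recall, beta < 1 favors precision."
  [beta counts]
  (let [p  (precision counts)
        r  (recall counts)
        b2 (* beta beta)]
    (/ (* (+ 1 b2) p r)
       (+ (* b2 p) r))))

With a beta of 1, f-score reduces to the F1 statistic above; betas of 2 and 0.5 give the F2 and F0.5 variants that appear in the run statistics reported later in this article.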

Not All Gold Standards Shine

Gold standards that themselves contain false positives and false negatives, by definition, immediately introduce errors. These errors make it difficult to test and refine existing IR and NLP algorithms, because the baseline is skewed. And, because gold standards also often inform training sets, errors there propagate into errors in machine learning. It is also important to include true negatives in a gold standard, in the likely ratio expected by the overall population, so as to improve overall accuracy [3].

There is a reason that certain standards, such as the NYT Annotated Corpus or the Penn Treebank [4], are often referenced as gold standards. They have been in public use for some time, with many errors edited from the systems. Vetted standards such as these may have inter-annotator agreements [5] in the range of 80% to 90% [4].  More typical use cases in biomedical notes [6] and encyclopedic topics [7] tend to show inter-annotator agreements in the range of 75% to 80%.

A proper gold standard should also be constructed to provide meaningful input to performance statistics. Per above, we can summarize these again as:

  • TP = standard provides labels for instances of the same types as in the target domain; manually scored
  • FP = manually scored for test runs based on the current configuration; test indicates as positive, but deemed not true
  • TN = standard provides somewhat similar or ambiguous instances from disjoint types labeled as negative; manually scored
  • FN = manually scored for test runs based on the current configuration; test indicates as negative, but deemed not true.
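In practice, each manually scored instance can be captured in a simple, explicit record. The sketch below is illustrative only; the field names and example sentences are hypothetical, not Structured Dynamics’ actual schema:

;; a manually scored positive instance for a person recognizer
{:text      "Abraham Lincoln was born in Kentucky."
 :candidate "Abraham Lincoln"
 :type      :person
 :label     :positive       ;; scored manually as a true instance of the type
 :scored-by "analyst-1"}

;; a deliberately similar but disjoint case, included as a negative
{:text      "The Lincoln Navigator is a full-size SUV."
 :candidate "Lincoln"
 :type      :person
 :label     :negative       ;; scored manually as not an instance of the type
 :scored-by "analyst-1"}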

It is further widely recognized that the best use for a reference standard is when it is constructed in exact context to its problem domain, including the form and transmission methods of the message. A reference standard appropriate to Twitter is likely not a good choice to analyze legal decisions, for example.

So, we can see many areas by which gold, or reference, standards may not be constructed equally:

  1. They may contain false positives
  2. They have variable inter-annotator agreement
  3. They have variable mechanisms, most with none, for editing and updating the labels
  4. They may lack sufficient inclusion of negatives
  5. They may be applied to an out-of-context domain or circumstance.

Being aware of these differences and seeking hard information about them are essential considerations whenever a serious NLP or ML project is being contemplated.

Seemingly Good Statistics Can Lead to Bad Results

We may hear quite high numbers for some NLP experiments, sometimes in the mid-90% to higher range. Such numbers sound impressive, but what do they mean and what might they not be saying?

We humans have a remarkable ability to see when things are not straight, level or plumb. We have a similar ability to spot errors in long lists and orders of things. A claimed accuracy of even, say, 95% sounds impressive, but applied to a large knowledge graph such as UMBEL [8], with its 35,000 concepts, it translates into 1,750 misassignments. That sounds like a lot, and it is. Yet misassignments of some nature occur within any standard. When they occur, they are sometimes glaringly obvious, like being out of plumb. It is actually pretty easy to find most errors in most systems.

Still, for the sake of argument, let’s accept we have applied a method that has a claimed accuracy of 95%. But, remember, this is a measure applied against the gold standard. If we take the high-end of the inter-annotator agreements for domain standards noted above, namely 80%, then we have this overall accuracy within the system:

.80 x .95 = 0.76

Whoa! Now, using this expanded perspective, for a candidate knowledge graph the size of UMBEL — that is, about 35 K items — we could see as many as 8,400 misassignments. Those numbers now sound really huge, and they are. They are unacceptable.
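Spelling out the arithmetic behind that estimate:

0.80 \times 0.95 = 0.76 \qquad 35{,}000 \times (1 - 0.76) = 8{,}400 \ \text{potential misassignments}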

A couple of crucial implications result from this simple analysis. First, it is essential to take a holistic view of the error sources across the analysis path, including and most especially the reference standards. (They are, more often than not IMO, the weak link in the analysis path.) And, second, getting the accuracy of reference standards as high as possible is crucial to training the best learners for the domain problem. We discuss this second implication next.

How to Get the Standards High

There is a reason the biggest Web players are in the forefront of artificial intelligence and machine learning. They have the resources — and most critically the data — to create effective learners. But, short of the biggest of Big Data, how can smaller players compete in the NLP and machine learning front?

Today, we have high-quality (though still error-prone) public data sets: millions of entities and concepts in all languages from Wikipedia, a complementary set of nearly 20 million entities in Wikidata, and thousands more high-quality public datasets. For a given enterprise need, if this information can be coherently organized, structured to the maximum extent, and subjected to logic and consistency tests for typing, relationships, and attributes, we have the basis to train learners with standards of unprecedented accuracy. (Of course, proprietary concepts and entity data should also figure prominently into this mix.) Indeed, this is the premise behind Structured Dynamics’ efforts in knowledge-based artificial intelligence.

KBAI is based on a curated knowledge base eating its own tail, working through cycles of consistency and logic testing to reduce misassignments, while continually seeking to expand structure and coverage. There is a network effect to these efforts, as adding and testing structure or mapping to new structures and datasets continually gets easier. These efforts enable the knowledge structure to be effectively partitioned for training specific recognizers, classifiers and learners, while also providing a logical reference structure for adding new domain and public data and structure.

This basic structure — importantly supplemented by the domain concepts and entities relevant to the customer at hand — is then used to create reference structures for training the target recognizers, classifiers and learners. The process of testing and adding structure identifies previously hidden inconsistencies. As corrected, the overall accuracy of the knowledge structure to act in a reference mode increases. At Structured Dynamics, we began this process years ago with the initial UMBEL reference concept structure. To that we have mapped and integrated a host of public data systems, including OpenCyc, Wikipedia, DBpedia, and, now, Wikidata. Each iteration broadens our scope and reduces errors, leading to a constantly more efficient basis for KBAI.

An integral part of that effort is to create gold standards for each project we engage. You see, every engagement has its own scope and wrinkles. Besides domain data and contexts, there are always specific business needs and circumstances that need to be applied to the problem at hand. The domain coverage inevitably requires new entity or relation recognizers, or the mapping of new datasets. The nature of the content at hand may range from tweets to ads to Web pages (or portions of them) to academic papers, with specific tests and recognizers, from copyrights to section headings, informing new learners. Every engagement requires its own reference standards. Being able to create these efficiently and with a high degree of accuracy is a competitive differentiator.

SD’s General Approach to Enterprise Standards

Though Structured Dynamics’ efforts are geared to enterprise projects, and not academic papers, the best practices of scientific research still apply. We insist upon the creation of gold standards for every discrete recognizer, classifier or learner we undertake for major clients. This requirement is not a hard argument to make, since we have systems in place to create initial standards and can quantify the benefits from the applied standards. Since major engagements often involve the incorporation of new data and structure, new feature recognizers, or bespoke forms of target content, the gold standards give us the basis for testing all wrinkles and parameters. The cost advantages of testing alternatives efficiently are demonstrable. On average, we can create a new reference standard in 10-20 labor hours (each for us and the client).

Specifics may vary, but we typically seek about 500 true positive instances per standard, with 20 or so true negatives. (As a note, there are more than 1,900 entity and relation types in Wikidata — 800 and 1,100 types, respectively — that meet this threshold. However, it is also not difficult to add hundreds of new instances from alternative sources.) All runs are calibrated with statistics reporting. In fact, any of our analytic runs may invoke the testing statistics, which are typically presented like this for each run:

True positives:  362
False positives:  85
True negatives:  19
False negatives:  45

+--------------+------------+
| key          | value      |
+--------------+------------+
| :precision   | 0.8098434  |
| :recall      | 0.8894349  |
| :specificity | 0.1826923  |
| :accuracy    | 0.7455969  |
| :f1          | 0.84777516 |
| :f2          | 0.8722892  |
| :f0.5        | 0.82460135 |
+--------------+------------+
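As a check, plugging the four counts from this run into the sketch functions shown earlier reproduces the reported values; the code is again illustrative, not our production reporting:

(def run-counts {:tp 362 :fp 85 :tn 19 :fn 45})

(precision run-counts)     ;; ≈ 0.80984  (:precision in the table)
(recall run-counts)        ;; ≈ 0.88943  (:recall)
(specificity run-counts)   ;; ≈ 0.18269  (:specificity)
(accuracy run-counts)      ;; ≈ 0.74560  (:accuracy)
(f-score 1.0 run-counts)   ;; ≈ 0.84778  (:f1)
(f-score 2.0 run-counts)   ;; ≈ 0.87229  (:f2, weights recall more heavily)
(f-score 0.5 run-counts)   ;; ≈ 0.82460  (:f0.5, weights precision more heavily)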

When we are in active testing mode we are able to iterate parameters and configurations quickly, and discover which changes have more or less effect on desired outcomes. We embed these runs in electronic notebooks using literate programming to capture and document our decisions and approach as we go [9]. Overall, the process has proven highly effective, and it keeps improving.

We could conceivably lower the requirement for 500 true positive instances as we see the underlying standards improve. However, since getting this de minimis number of examples has become systematized, we really have not had reason to test and validate smaller standard sizes. We are also not seeking definitive statistical test values, but rather a framework for evaluating different parameters and methods. In most cases, we have seen our reference sets grow over time as new wrinkles and perspectives emerge that require testing.

In all cases, the most important factor in this process has been to engage customers in manual review and scoring. More often than not we see client analysts understand and detect patterns that then inform improved methods. Both we, as the contractor, and the client gain a stake in, and an understanding of, the importance of reference standards.

Clean, vetted gold standards and training sets are thus a critical component to improving our client’s results — and our own knowledge bases — going forward. The very practice of creating gold standards and training sets needs to receive as much attention as algorithm development because, without it, we are optimizing algorithms to fuzzy objectives.


[1] M.K. Bergman, 2015. “A Primer on Knowledge Statistics,” AI3:::Adaptive Information blog, May 18, 2015.
[2] By bringing in some other rather simple metrics, it is also possible to expand beyond this statistical base to cover such measures as information entropy, statistical inference, pointwise mutual information, variation of information, uncertainty coefficients, information gain, AUCs and ROCs. But we’ll leave discussion of some of those options until another day.
[3] George Hripcsak and Adam S. Rothschild, 2005. “Agreement, the F-measure, and Reliability in Information Retrieval.” Journal of the American Medical Informatics Association 12, no. 3 (2005): 296-298.
[4] See Eleni Miltsakaki, Rashmi Prasad, Aravind K. Joshi, and Bonnie L. Webber, 2004. “The Penn Discourse Treebank,” in LREC. 2004. For additional useful statistics and an example of high inter-annotator agreement, see Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel, 2006. “OntoNotes: the 90% Solution,” in Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pp. 57-60. Association for Computational Linguistics, 2006.
[5] Inter-annotator agreement is the degree of agreement among raters or annotators of scoring or labeling for reference standards. The phrase embraces or overlaps a number of other terms for multiple-judge systems, such as inter-rater agreement, inter-observer agreement, or inter-rater reliability. See also Ron Artstein and Massimo Poesio, 2008. “Inter-coder Agreement for Computational Linguistics,” Computational Linguistics 34, no. 4 (2008): 555-596. Also see Kevin A. Hallgren, 2012. “Computing Inter-rater Reliability for Observational Data: An Overview and Tutorial,” Tutorials in Quantitative Methods for Psychology 8, no. 1 (2012): 23.
[6] Philip V. Ogren, Guergana K. Savova, and Christopher G. Chute, 2007. “Constructing Evaluation Corpora for Automated Clinical Named Entity Recognition,” in Medinfo 2007: Proceedings of the 12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems, p. 2325. IOS Press, 2007. This study shows inter-annotator agreement of .75 for biomedical notes.
[7] Veselin Stoyanov and Claire Cardie, 2008. “Topic Identification for Fine-grained Opinion Analysis,” in Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, pp. 817-824. Association for Computational Linguistics, 2008; shows inter-annotator agreement of ~76% for fine-grained topics. David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin, 2010. “Automatic Evaluation of Topic Coherence,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 100-108. Association for Computational Linguistics, 2010; shows inter-annotator agreement in the .73 to .82 range.
[8] UMBEL (Upper Mapping and Binding Exchange Layer) is a logically organized knowledge graph of about 35,000 concepts and entity types that can be used in information science for relating information from disparate sources to one another. This open-source ontology was originally developed by Structured Dynamics, which still maintains it. It is used to assist data interoperability and the mapping of disparate datasets.
[9] Fred Giasson, Structured Dynamics’ CTO, has been writing a series of blog posts on literate programming and the use of Org-mode as an electronic notebook. I have provided a broader overview of SD’s efforts in this area; see M.K. Bergman, 2016. “Literate Programming for an Open World,” AI3:::Adaptive Information blog, June 27, 2016.

Posted: June 27, 2016

Constant Change Makes Adopting New Practices and Learning New Tools Essential

My business partner, Fred Giasson, has an uncanny ability to sense where the puck is heading. Perhaps that is due to his French Canadian heritage and his love of hockey. I suspect rather it is due to his having his finger firmly on the pulse.

When I first met Fred ten years ago he was already a seasoned veteran of the early semantic Web, with new innovative services at the time such as PingTheSemanticWeb under his belt. More recently, as we at Structured Dynamics began broadening from semantic technologies to more general artificial intelligence, Fred did his patient, quiet research and picked Clojure as our new official development language. We had much invested in our prior code base; switching main programming languages is always one of the most serious decisions a software shop can make. The choice has been brilliant, and our productivity has risen substantially. We are also able to exploit fundamentally new potentials based on a functional programming language that runs in the Java VM and has intellectual roots in Lisp.

As our work continues to shift more to knowledge bases and their use for mapping, classification, tagging, learning, etc., our challenges have been of a still different nature. Knowledge bases used in this manner are inherently open world, not only because the underlying knowledge itself keeps changing, but also because the test, build and maintenance scripts and steps used to stage them for machine learners and training sets are constantly changing. Dealing with knowledge management brings substantial technical debt, and systems and procedures must be in place to deal with that. Literate programming is one means to help.

Because of Clojure and its REPL abilities that enable code to be interpreted and run dynamically at time of input, we also have been looking seriously at the notebook paradigm that has come out of interactive science lab books, and now has exemplar programs such as iPython Notebook (Jupyter), Org-mode, Wolfram Alpha, Zeppelin, Gorilla, and others [1]. Fred had been interested in and poking at literate programming for quite a while, but his testing and use of Org-mode to keep track of our constant tests and revisions led him to take the question more seriously. You can see this example, a more-detailed example, or still another example of literate programming from Fred’s blog.

Literate programming is even a greater change than switching programming languages. To do literate programming right, there needs to be a focused commitment. To do literate programming right, workflows need to change and new tools must be learned. Is the effort worth it?

Literate Programming

Literate programming is a style of writing code and documentation first proposed by Donald Knuth. As any aspect of a coding effort is written — including its tests, configurations, installation, deployment, maintenance or experiments — written narrative and documentation accompanies it, explaining what it is, the logic of it, and what it is doing and how to exercise it. This documentation far exceeds the best-practices of in-line code commenting.

Literate programming narratives might provide background and thinking about what is being tested or tried, design objectives, workflow steps, recipes or whatever. The style and scope of documentation are similar to what might be expected in a scientist’s or inventor’s lab notebook. Indeed, the breed of emerging electronic notebooks, combined with REPL coding approaches, now enable interactive execution of functions and visualization and manipulation of results, including supporting macros.

Systems that support literate programming, such as Org-mode, can “tangle” their documents to extract the code portions for compilation and execution. They can also “weave” their documents to extract all of the documentation formatted for human readability, including using HTML. Some electronic systems can process multiple programming languages and translate functions. Some electronic systems have built-in spreadsheets and graphing libraries, and most open-source systems can be extended (though with varying degrees of difficulty and in different languages). Some of the systems are designed to interact with or publish Web pages.
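Here is a small, illustrative Org-mode fragment (the headline, narrative and file names are made up for this example) showing how narrative and code sit side by side; tangling this file would extract the Clojure block to src/metrics.clj, while weaving exports the narrative and code together as readable HTML:

* Scoring the latest test run
  Precision for the current gold-standard run, computed from the
  confusion-matrix counts reported by the scoring step.

  #+BEGIN_SRC clojure :tangle src/metrics.clj
  (defn precision [tp fp]
    (/ tp (double (+ tp fp))))

  (precision 362 85)   ;; => 0.8098434...
  #+END_SRC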

Code and programs do not reside in isolation. Their operation needs to be explained and understood by others for bug fixing, use or maintenance. If they are processing systems, there are parameters and input data that are the result of much testing and refinement; they may need to be refined further again. Systems must be installed and deployed. Libraries and languages are frequently being updated for security and performance reasons; executables and environments need to be updated as well. When systems are updated, there are tests that need to be run to check for continued expected performance and accuracy. The severity of some updates may require revision to whole portions of the underlying systems. New employees need tech transfer and training and managers need to know how to take responsibility for the systems. These are all needs that literate programming can help support.

One may argue that transaction systems in more stable environments may have a lesser requirement for literate programming. But, in any knowledge-intensive or knowledge management scenario, the inherent open world nature of knowledge makes something like literate programming an imperative. Everything is in constant flux with a positive need for ongoing updates.

The objective of programmers should not be solely to write code, but to write systems that can be used and re-used to meet desired purposes at acceptable cost. Documentation is integral to that objective. Early experiments need to be improved, codified and documented such that they can be re-implemented across time and environment. Any revision of code needs to be accompanied by a revision or update to documentation. A lines-of-code (LOC) mentality is counter-productive to effective software for knowledge purposes. Literate programming is meant to be the workflow most conducive to achieve these ends.

The Nature of Knowledge

For quite some time now I have made the repeated argument that the nature of knowledge and knowledge management functions compel an emphasis on the open world assumption (OWA) [2]. Though it is a granddaddy of knowledge bases, let’s take the English Wikipedia as an example of why literate programming makes sense for knowledge management purposes. Let’s first look at the nature of Wikipedia itself, and then look at (next section) the various ways it must be processed for KBAI (knowledge-based artificial intelligence) purposes.

The nature of knowledge is that it is constantly changing. We learn new things, understand more about existing things, see relations and connections between things, and find knowledge in other arenas that causes us to re-think what we already knew. Such changes are the definition of an “open world” and pose major challenges to keeping information and systems up to date as these changes constantly flow in the background.

We can illustrate these points by looking at the changes in Wikipedia over a 20-month period from October 2012 to June 2014 [3]. Though growth of Wikipedia has been slowing somewhat since its peak growth years, the kinds of changes seen over this period are fairly indicative of the qualitative and structural changes that constantly affect knowledge.

First, the number of articles in the English Wikipedia (the largest, but only one of the 200+ Wikipedia language versions) increased 12% to over 4.6 million articles over the 20-month period, or greater than 0.6% per month. Actual article changes were greater than this amount. Total churn over the period was about 15.3%, with 13.8% totally new articles and 1.5% deleted articles [3].

Second, even greater changes occurred in Wikipedia’s category structure. More than 25% of the categories were net additions over this period. Another 4% were deleted [3]. Fairly significant categorical changes continue because of a concerted effort by the project to address category problems in the system.

And, third, edits of individual articles continued apace. Over this same period, more than 65 million edits were made to existing English articles, or about 0.75 edits per article per month [3]. Many of these changes, such as link or category or infobox assignments, affect the attributes or characteristics of the article subject, which has a direct effect on KBAI efforts. Also, even text changes affect many of the NLP-based methods for analyzing the knowledge base.

Granted, Wikipedia is perhaps an extreme example given its size and prominence. But the kinds of qualitative and substantive changes we see — new information, deletion of old information, adding or changing specifics to existing information, or changing how information is connected and organized — are common occurrences in any knowledge base.

The implications of working with knowledge bases are clear. KBs are constantly in flux. Single-event, static processing is dated as soon as the procedures are run. The only way to manage and use KB information comes from a commitment to constant processing and updates. Further, with each processing event, more is learned about the nature of the underlying information that causes the processing scripts and methods to need tweaking and refinement. Without detailed documentation of what has been done with prior processing and why, it is impossible to know how to best tweak next steps or to avoid dead-ends or mistakes of the past. KBAI processing can not be cost-effective and responsive without a memory. Literate programming, properly done, provides just that.

The Nature of Systems to Manage Knowledge

Of course, KBAI may also involve multiple input sources, all moving at different speeds of change. There are also multiple steps involved in processing and updating the input information, the “systems”, if you will, required to stage and use the information for artificial intelligence purposes. The artifacts associated with these activities range from functional code and code scripts; to parameter, configuration and build files; to the documentation of those files and scripts; to the logic of the systems; to the process and steps followed to achieve desired results; and to the documentation of the tests and alternatives investigated at any stage in the process. The kicker is that, without a systematic approach, all of these components will need updates, and conventional (non-literate) coding approaches leave little that is easily remembered, causing costly re-discovery and re-work.

We have tallied up at least ten major steps associated with a processing pipeline for KBAI purposes. I briefly describe each below so as to convey a better flavor of the overall flux that needs to be captured by literate programming.

1. Updating Changing Knowledge

The section above dealt with this step, which is ensuring that the input knowledge bases to the overall KBAI process are current and accurate. Depending on the nature of the KM system, there may be multiple input KBs involved, each demanding updates. Besides capturing the changes in the base information itself, many of the steps below may also be required to properly process this changing input knowledge.

2. Processing Input KBs

For KBAI purposes, input KBs must be processed so as to be machine readable. Processing is also desirable to expose features for machine learners and to do other clean up of the input sources, such as removal of administrative categories and articles, cleaning up category structures, consolidating similar or duplicative inputs into canonical forms, and the like.
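To give a flavor of this kind of clean up, here is a small Clojure sketch that filters Wikipedia administrative and maintenance categories out of a category listing; the prefix and suffix patterns treated as “administrative” are illustrative assumptions, not an actual production list:

(def admin-patterns
  ;; illustrative patterns for administrative/maintenance categories
  [#"^Articles_" #"^Wikipedia_" #"^Pages_" #"_stubs$" #"^Hidden_"])

(defn administrative? [category]
  (boolean (some #(re-find % category) admin-patterns)))

(defn content-categories [categories]
  (remove administrative? categories))

(content-categories
  ["Presidents_of_the_United_States"
   "Articles_with_unsourced_statements"
   "American_lawyers"
   "History_stubs"])
;; => ("Presidents_of_the_United_States" "American_lawyers")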

3. Installing, Running and Updating the System

The KBs themselves require host databases or triple stores. Each of the processing steps may have functional code or scripts associated with it. All general management systems need to be installed, kept current, and secured. The management of system infrastructure sometimes requires a staff of its own, let alone install, deploy, monitoring and update systems.

4. Testing and Vetting Placements

New entities and types added to the knowledge base need to be placed into the overall knowledge graph and tested for logical placement and connections. Though final placements should be manually verified, the sheer number of concepts in the system places a premium on semi-automatic tests and placements. Placement metrics are also important to help screen candidates.

5. Testing and Vetting Mappings

One key aspect of KBAI is its use in interoperating with other datasets and knowledge bases. As a result, new or updated concepts in the KB need to be tested and mapped with appropriate mapping predicates to external or supporting KBs. In the case of UMBEL, Structured Dynamics routinely attempts to map all concepts to Wikipedia (DBpedia), Cyc and Wikidata. Any changes to the base KB causes all of these mappings to be re-investigated and confirmed.

6. Testing and Vetting Assertions

Testing does not end with placements and mappings. Concepts are often characterized by attributes and values; they are often given internal assignments such as SuperTypes; and all of these assertions must be tested against what already exists in the KB. Though the tests may individually be fairly straightforward, there are thousands to test and cross-consistency is important. Each of these assertions is subject to unit tests.
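A minimal sketch of one such unit test in Clojure, using clojure.test; the SuperType disjointness declarations and the map- and set-based representation are hypothetical, not UMBEL’s actual structure:

(ns kb.assertions-test
  (:require [clojure.test :refer [deftest is]]))

;; hypothetical disjointness assertions between SuperTypes
(def disjoint-supertypes
  #{#{:person :chemical-substance}
    #{:animal :geopolitical-entity}})

(defn consistent-assignment?
  "True when no pair of assigned SuperTypes is declared disjoint."
  [supertypes]
  (not-any? #(every? supertypes %) disjoint-supertypes))

(deftest lincoln-supertypes-are-consistent
  (is (consistent-assignment? #{:person :politician})))

(deftest disjoint-pair-is-flagged
  (is (not (consistent-assignment? #{:person :chemical-substance}))))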

7. Ensuring Completeness

As we have noted elsewhere, our KBAI best practices call for each new concept in the KB to be accompanied by a definition, complete characterization and connections, and synonyms or synsets to aid in NLP tasks. These requirements, too, require scripts and systems for completion.

8. Testing and Vetting Coherence

As the broader structure is built and extended, system tests are applied to ensure the overall graph remains coherent and that outliers, orphans and fragments are addressed. Some of this testing is done via component typologies, and some occurs using various network and graph analyses. Possible problems need to be flagged and presented for manual inspection. Like other manual vetting requirements, confidence scoring and ranking of problems and candidates speed up the screening process.

9. Generating Training Sets

A key objective of the KBAI approach is the creating of positive and negative training sets. Candidates need to be generated; they need to be scored and tested; and their final acceptance needs to be vetted. Once vetted, the training sets themselves may need to be expressed in different formats or structures (such as finite state transducers, one of the techniques we often use) in order for them to be performant in actual analysis or use.

10. Testing and Vetting Learners

Machine learners can then be applied to the various features and training sets produced by the system. Each learning application involves the testing of one or more learners; the varying of input feature or training sets; and the testing of various processing thresholds and parameters (including possibly white and black lists). This set of requirements is one of the most intensive on this listing, and definitely requires documentation of test results, alternatives tested, and other observations useful to cost-effective application.
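As one small illustration of this kind of parameter testing, the sketch below sweeps an acceptance threshold for a single learner and returns the threshold with the best F1 against the gold standard; the evaluation function and threshold range are stand-ins for whatever learner and configuration are actually under test:

(defn f1 [{:keys [tp fp] fneg :fn}]
  ;; assumes at least one true positive at each threshold
  (let [p (/ tp (double (+ tp fp)))
        r (/ tp (double (+ tp fneg)))]
    (/ (* 2 p r) (+ p r))))

(defn best-threshold
  "evaluate-fn takes a threshold and returns {:tp .. :fp .. :tn .. :fn ..}
  counts from a scoring run against the gold standard."
  [evaluate-fn thresholds]
  (apply max-key #(f1 (evaluate-fn %)) thresholds))

;; usage sketch, with a made-up evaluation function:
;; (best-threshold run-against-gold-standard (range 0.50 0.96 0.05))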

Rinse and Repeat

Each of these 10 steps is not a static event. Rather, given the constant change inherent to knowledge sources, the entire workflow must be repeated on a periodic basis. In order to reduce the tension between updating effort and current accuracy, the greater automation of steps with complete documentation is essential. A lack of automation leads to outdated systems because of the effort and delays in updates. The imperative for automation, then, is a function of the change frequency in the input KBs.

KBAI, perhaps at the pinnacle of knowledge management services, requires more of these steps and perhaps more frequent updates. But any knowledge management activity will incur a portion of these management requirements.

Yes, Literate Programming is Worth It

As I stated in a prior article in this series [4], “The only sane way to tackle knowledge bases at these structural levels is to seek consistent design patterns that are easier to test, maintain and update. Open world systems must embrace repeatable and largely automated workflow processes, plus a commitment to timely updates, to deal with the constant, underlying change in knowledge.” Literate programming is, we have come to believe, one of the integral ways to keep sane.

The effort to adopt literate programming is justified. But, as Fred noted in one of his recent posts, literate programming does impose a cost on teams and requires some cultural and mindset changes. However, in the context of KBAI, these are not simply nice-to-haves, they are imperatives.

Choice of tools and systems thus becomes important in supporting a literate programming environment. As Fred has noted, he has chosen Org-mode for Structured Dynamics’ literate programming efforts. Besides Org-mode, that also (generally) requires adoption by programmers of the Emacs editor. Both of these tools are a bit problematic for non-programmers.

Though no literate programming tools yet support WOPE (write once, publish everywhere), they can make much progress toward that goal. By “weaving” we can create standalone documentation. With the converter tool Pandoc, we can make (mostly) accurate translations of documents in dozens of formats against one another. The system is open and can be extended. Pandoc works best with lightweight markup formats like Org-mode, Markdown, wikitext, Textile, and others.

We’re still working hard on the tooling infrastructure surrounding literate programming. We like the interactive notebooks approach, and also want easy and straightforward ways to deploy code snippets, demos and interactive Web pages.

Because of the factors outlined in this article, we see a renewed emphasis on literate programming. That, combined with the Web and its continued innovations, would appear to point to a future rich in responsive tooling and workflows geared to the knowledge economy.


[1] Other known open source electronic lab notebook options include Beaker Notebook, Flow, nteract, OpenWetWare, Pineapple, Rodeo, RStudio, SageMathCloud, Session, Shiny, Spark Notebook, and Spyder, among others certainly missed. Terminology for these apps includes notebook, electronic lab notebook, data notebook, and data scientist notebook.
[2] See M. K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Information blog, December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems.
OWA is a formal logic assumption that the truth-value of a statement is independent of whether or not it is known by any single observer or agent to be true. OWA is used in knowledge representation to codify the informal notion that in general no single agent or observer has complete knowledge, and therefore cannot make the closed world assumption. The OWA limits the kinds of inference and deductions an agent can make to those that follow from statements that are known to the agent to be true. OWA is useful when we represent knowledge within a system as we discover it, and where we cannot guarantee that we have discovered or will discover complete information. In the OWA, statements about knowledge that are not included in or inferred from the knowledge explicitly recorded in the system may be considered unknown, rather than wrong or false. Semantic Web languages such as OWL make the open world assumption.
Also, you can search on OWA on this blog.
[3] See Ramakrishna B. Bairi, Mark Carman and Ganesh Ramakrishnan, 2015. “On the Evolution of Wikipedia: Dynamics of Categories and Articles,” Wikipedia, a Social Pedia: Research Challenges and Opportunities: Papers from the 2015 ICWSM Workshop; also, see https://stats.wikimedia.org/EN/TablesWikipediaEN.htm
[4] M.K. Bergman, 2016. “Rationales for Typology Designs in Knowledge Bases,” AI3:::Adaptive Information blog, June 6, 2016.
Posted: June 20, 2016

Finding a Natural Classification to Support AI and Machine Learning

Effective use of knowledge bases (KBs) for artificial intelligence (AI) would benefit from a definition and organization of KB concepts and relationships specific to those AI purposes. Like any language, the construction of logical statements within KBAI (knowledge-based artificial intelligence) requires basic primitives for how to express these arguments. Just as in human language where we split our words into roughly nouns and verbs and modifiers and conjunctions of the same, we need a similar primitive vocabulary and basic rules of statement construction to actually begin this process. In all language variants, these basic building blocks are known as the grammar of the language. A well-considered grammar is the first step to being able to construct meaningful and coherent statements about our knowledge bases. The context for how to construct this meaningful grammar needs to be viewed through the lens of the KB’s purpose, which, in our specific case, is for artificial intelligence and machine learning.

In one of my recent major articles I discussed Charles Sanders Peirce and his ideas of the three universal categories and their relation to semiosis [1]. We particularly focused on how Peirce approached categorization and its basis in his logic of semiosis. I’d like to restrict and deepen that discussion a bit, now concentrating on what Peirce called the speculative grammar, the starting point and Firstness of his overall method.

The basic idea of the speculative grammar is simple. What are the vocabulary and relationships that may be involved in the understanding of the question or concept at hand? What is the “grammar” for the question at hand that may help guide how to increase our understanding of it? What are the concepts and terms and relationships that populate our domain of inquiry?

Hearing the term ‘semiosis’ in relation to Peirce most often brings to mind his theory of signs. But for Peirce semiosis was a broader construct still, representing his overall theory of logic and truth-testing. Signs, symbols and representation were but the first part of this theory, the ‘Firstness’ or speculative grammar about how to formulate and analyze logic.

Though he provides his own unique take on it, Peirce’s idea of speculative grammar, which he erroneously ascribed to Duns Scotus, can actually be traced back to the 1300s and the writings of Thomas of Erfurt, one of the so-called Modists of the medieval philosophers [2]. Here is how Peirce in his own words placed speculative grammar in relation to his theory of logic [3]:

“All thought being performed by means of signs, logic may be regarded as the science of the general laws of signs. It has three branches: (1) Speculative Grammar, or the general theory of the nature and meanings of signs, whether they be icons, indices, or symbols; (2) Critic, which classifies arguments and determines the validity and degree of force of each kind; (3) Methodeutic, which studies the methods that ought to be pursued in the investigation, in the exposition, and in the application of truth.” (CP 2:260)

Speculative grammar is thus a Firstness in Peirce’s category structure, with logic methods being a Secondness and the process of logic inquiry, the methodeutic, being a Thirdness.

Still, What Exactly is a Speculative Grammar?

Charles S. Peirce’s view of logic was that it was a formalization of signs, what he termed semiosis. As stated, three legs provide the basis of this formal logic. The first leg is a speculative grammar, in which one strives to capture the signs that most meaningfully and naturally describe the current domain of discourse. The second leg is the means of logical inference, be it deductive, inductive or abductive (hypothesis generating). The third leg is the method or process of inquiry, what came to be known from Peirce and others as pragmaticism. The methods of research or science, including the scientific method, result from the application of this logic. The “pragmatic” part arises from how to select what is important and economically viable to investigate among multiple hypotheses.

In Peirce’s universal categories, Firstness is meant to capture the potentialities of the domain at hand, the speculative grammar; Secondness is meant to capture the particular facts or real things of the domain at hand, the critic; and Thirdness is meant to capture methods for discovering the generalities, laws or emergents within the domain, the methodeutic. This mindset can really be applied to any topic, from signs themselves to logic and to science [1]. The “surprising fact” or new insight arising from Thirdness points to potentially new topics that may themselves become new targets for this logic of semiosis.

In its most general sense, Peirce describes this process or method of discovery and explication of new topics as follows [3]:

“. . . introduce the monadic idea of »first« at the very outset. To get at the idea of a monad, and especially to make it an accurate and clear conception, it is necessary to begin with the idea of a triad and find the monad-idea involved in it. But this is only a scaffolding necessary during the process of constructing the conception. When the conception has been constructed, the scaffolding may be removed, and the monad-idea will be there in all its abstract perfection. According to the path here pursued from monad to triad, from monadic triads to triadic triads, etc., we do not progress by logical involution — we do not say the monad involves a dyad — but we pursue a path of evolution. That is to say, we say that to carry out and perfect the monad, we need next a dyad. This seems to be a vague method when stated in general terms; but in each case, it turns out that deep study of each conception in all its features brings a clear perception that precisely a given next conception is called for.” (CP 1.490)

The ideas of Firstness, Secondness and Thirdness in Peirce’s universal categories are not intended to be either sequential or additive. Rather, each interacts with the others in a triadic whole. Each alone is needed, and each is irreducible.

As Peirce says in his Logic of Relatives paper [4]:

“The fundamental principles of formal logic are not properly axioms, but definitions and divisions; and the only facts which it contains relate to the identity of the conceptions resulting from those processes with certain familiar ones.” (CP 3.149)

Without the right concepts, terminology, or bounding — that is, the speculative grammar — it is clearly impossible to properly understand or compose the objects or Secondness that populate the domain at hand. Without the right language and concepts to capture the connections and implications of the domain at hand — again, part of its speculative grammar — it is not possible to discover the generalities or the “surprising fact” or Thirdness of the domain.

The speculative grammar is thus needed to provide the right constructs for describing, analyzing, and reasoning over the given domain. Our logic and ability to understand the focus of our inquiry requires that we describe and characterize the domain of discourse in ways that are properly scoped and related. How well we bound, characterize and signify our problem domains — that is, the speculative grammar — directly relates to how well we can reason and inquire over that space. It very much matters how we describe, relate and define what we analyze and manipulate.

Let’s take a couple of examples to illustrate this. First, imagine van Leeuwenhoek first discovering “animalcules” under his early, advanced microscopes. New terms and concepts like flagella, cells, and vacuoles needed to be coined and systematized in order for further advances in microorganisms to be described. Or, second, imagine “action at a distance” phenomena such as magnetic repulsion or static electricity causing hair to stand on end. For centuries these phenomena were assumed to be caused by atomistic particles too small to see or discover. Only when Hertz confirmed Maxwell’s equations of electromagnetism, in the late 1800s, were the concepts and vocabulary of waves and fields sufficiently developed to begin to unravel electromagnetic theory in earnest. Progress required the right concepts and terminology.

For Peirce, the triadic nature of the sign — and its relation between the sign, its object and its interpretant — was the speculative grammar breakthrough that then allowed him to better describe the process of signmaking and its role in the logic of inquiry and truth-testing (semiosis). Because he recognized it in his own work, Peirce understood a conceptual “grammar” appropriate to the inquiry at hand is essential to further discovery and validation.

How Might a Speculative Grammar Apply to Knowledge Bases?

Perhaps one way to understand what is intended by a speculative grammar is to define one. In this instance, let’s aim high and posit a grammar for knowledge bases.

Since we had been moving steadily to a typology design for our entities and were looking at all other aspects of structure and organization of the knowledge base (KB) [5], we decided to apply this idea of a speculative grammar to the quest. We consciously chose to do this from two perspectives. First, in keeping with Peirce’s sign trichotomy, we wanted to keep the interpretant of an agent doing artificial intelligence front-and-center. This meant that the evaluative lens we wanted to apply to how we conceptualized, organized and provided a vocabulary for the knowledge space was to be done from the viewpoint of machine learning and its requirements. Once we posit the intelligent agent as the interpretant, the importance of a rich vocabulary and text (NLP), a well-formed structure and hierarchy (logical inference), and a rich feature set (structure and coherent characterizations), becomes clear.

Second, we wanted to look at the basis for organizing the KB concepts into the Peircean mindset of Firstness (potentials), Secondness (particulars) and Thirdness (generals). Our hypothesis was that conforming to Peirce’s trichotomous splits would help guide us in deciding the myriad possibilities of how to arrange and structure a knowledge base.

We looked as well at what language to use to write these specifications. Our current languages of RDF, SKOS and OWL could capture all first-order logic imperatives, but the OWL annotation, object and datatype properties did not exactly conform to the splits we saw [11]. On the other hand, tooling such as Protégé and the OWL API were immensely helpful, and there are quite a few supporting tools. Formalisms like conceptual graphs were richer and handled higher-order logics, but also lacked tooling and widespread use. Since we knew we could adapt to OWL, we stuck with our original language set.

Guarino, in some of the earliest (1992) writings leading to semantic technologies, had posited knowledge bases split into concepts, attributes and relations [6]. This was close to our thinking, and provided comfort that such splits were also being considered from the earliest days of the semantic Web. Besides Peirce, we studied many philosophers across history regarding fundamental concepts in knowledge organization. Aristotle’s categories were influential, and have mostly stood the test of time and figured prominently in our thinking. We also reviewed efforts such as Sarbo’s to apply Peirce to knowledge bases [7], as well as most other approaches claiming Peirce in relation to KBs that we could discover [8-10].

Because the intent was to create a feature-rich logic machine for AI, we of course wanted the resulting system to be coherent, sound, consistent, and relatively complete. Though the intended interpretant is artificial agents, training them and vetting results still must be overseen by humans in order to ensure quality. We thus wanted all features and structural aspects to be described and labeled sufficiently to be understood and interpreted by humans. These aspects further had to be suitable for direct translation into any human language. For interchange purposes, we try to use canonical forms.

Once these preliminaries are out of the way, the task at hand can finally focus on the fundamental divisions within the knowledge base. In accordance with the Peircean categories, we saw these splits:

  • Firstness — these ‘potentials’ include base concepts, attributes, and relations in the abstract
  • Secondness — these are the ‘particular’ things in the domain, including entities, events and activities
  • Thirdness — the ‘generals’ in the knowledge base include classes, types, topics and processes.

Ultimately, feature richness was felt to be of overriding importance, with features explicitly understood to include structure and text.

A Speculative Grammar for Knowledge Bases

With these considerations in mind, we are now able to define the basic vocabulary of our knowledge base, one of the first components of the speculative grammar. This base vocabulary is:

  • Attributes are the ways to characterize the entities or things within the knowledge base; while the attribute values and options may be quite complex, the relationship is monadic to the subject at hand. These are intensional properties of the subject
  • Relations are the way we describe connections between two or more things; relations are external-facing, between the subject and another entity or concept; relations set the extensional structure of the knowledge graph [11]
  • Entities are the basic, real things in our domain of interest; they are nameable things or ideas that have identity, are defined in some manner, can be referenced, and should be related to types; entities are the bulk of the overall knowledge base
  • Events are nameable sequences of time, are described in some manner, can be referenced, and may be related to other time sequences or types
  • Activities are sustained actions over durations of time; activities may be organized into natural classes
  • Types are the hierarchical classification of natural kinds within all of the terms above
  • The Typology structure is not only a natural organization of natural classes, but it enables flexible interaction points with inferencing across its ‘accordion-like’ design (see further [5])
  • Base concepts are the vocabulary to the grammar and top-level concepts in the knowledge graph, organized according to Peircean-informed categories
  • Annotations are indexes and the metadata of the KB; these cannot be inferred over, but they can be searched, and their language features can be processed in other ways.

How these vocabulary terms relate to one another and the overall knowledge base is shown by this diagram:

A Knowledge Base Grammar

As part of the ongoing simplification of the TBox [12], we need to be able to distinguish and rationalize the various typologies used in the system: attributes, relations, entities, events and activities. Here are some starting rules:

  • RULE: Entities cannot be topics or types
  • RULE: Entities are not data types; these are handled under values processing
  • RULE: Events are like entities, except they have a discrete time beginning and end; they may be nameable
  • RULE: Activities (or actions) act upon entities, but do not require a discrete time beginning or end
  • RULE: Attributes are not metadata; they are characteristics or descriptors of an entity
  • RULE: Topics or base concepts do not have attributes
  • RULE: Entity types have the attributes of all type members
  • RULE: Relation types do not have attributes (also, relations do not have attributes)
  • RULE: Attribute types do not have attributes (also, attributes do not have attributes)
  • RULE: All types may have hierarchy
  • RULE: No attributes are provided on relations (as in E-R modeling), just annotations
  • RULE: Annotations are not typed, and cannot be inferred over.
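
As a rough illustration of how such rules might be enforced during a build, the toy sketch below checks a few of them over a simple in-memory representation. The data structures, the attribute names for event start and end, and the subset of rules chosen are my own assumptions, not Structured Dynamics' actual tooling.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    name: str
    kind: str                                  # "entity", "event", "relation", "attribute", "annotation", ...
    attributes: set = field(default_factory=set)
    parents: set = field(default_factory=set)  # hierarchical (subClassOf-like) links

def check_rules(terms):
    """Return violations of an illustrative subset of the governing rules."""
    violations = []
    for t in terms.values():
        # RULE: relations and attributes do not themselves carry attributes
        if t.kind in ("relation", "attribute") and t.attributes:
            violations.append(f"{t.name}: {t.kind}s may not have attributes")
        # RULE: annotations are not typed, so no hierarchy for them
        if t.kind == "annotation" and t.parents:
            violations.append(f"{t.name}: annotations may not sit in a type hierarchy")
        # RULE: events need a discrete time beginning and end (represented here as attributes)
        if t.kind == "event" and not {"startTime", "endTime"} <= t.attributes:
            violations.append(f"{t.name}: events need discrete start and end times")
    return violations

# Toy usage: one malformed event and one relation that wrongly carries an attribute
terms = {
    "Hiring":    Term("Hiring", "event", attributes={"startTime"}),
    "locatedIn": Term("locatedIn", "relation", attributes={"confidence"}),
}
for violation in check_rules(terms):
    print(violation)
```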

As we work further with this structure, we will continue to add to and refine these governing rules.

The columns in the figure above also roughly correspond to Peirce’s three universal categories. The first column and part of the second (attributes and relations) correspond to Firstness; the remainder of the second column corresponds to Secondness; and the third column corresponds to Thirdness. I’ll discuss these distinctions further in a later article.

In combination, this vocabulary and rules set, as allocated to Peirce’s categories, constitutes the current speculative grammar for our knowledge bases.

Recall that the interpretants for this design are artificial agents. It is unclear how the resulting structure will be embraced by humans, since we were not the guiding interpretant. But, like being able to readily discern whether an object is plumb or level, humans have the ability to recognize adaptive structure. I think what we are building here will therefore withstand scrutiny and be useful to all intelligent agents, artificial or human.

Conclusion

Generating new ideas and testing the truth of them is a logical process that can be formalized. Critical to this process is the proper bounding, definition and vocabulary upon which to conduct the inquiries. As Charles Peirce argued, the potentials central to the inquiries for a given topic need to be expressed through a suitable speculative grammar to make these inquiries productive. How we think about, organize and define our problem spaces is central to that process.

The guiding lens for how we do this thinking comes from the purpose or nature of the inquiries at hand. In the case of machine learning applied to knowledge bases, this lens, I have argued, needs to be grounded in Peirce’s categories of Firstness, Secondness and Thirdness, all geared to feature generation upon which machine learners may operate. The structure of the system should also be geared to enable (relatively quick and cheap) creation of positive and negative training sets upon which to train the learners. In the end, the nature of how to structure and define knowledge bases depends upon the uses we intend them to fulfill.


[1] M.K. Bergman, 2016. “A Foundational Mindset: Firstness, Secondness, Thirdness,” AI3:::Adaptive Information blog, March 21, 2016.
[2] Alessandro Isnenghi, 2008. “A Semiótica de CS Peirce e a Gramática Especulativa de Modistae” (or, “C.S. Peirce’s Semiotic And Modistae’s Grammatica Speculativa“), Cognitio-Estudos: revista eletrônica de filosofia. ISSN 1809-8428 5, no. 2 (2008).
[3] See the electronic edition of The Collected Papers of Charles Sanders Peirce, reproducing Vols. I-VI, Charles Hartshorne and Paul Weiss, eds., 1931-1935, Harvard University Press, Cambridge, Mass., and Arthur W. Burks, ed., 1958, Vols. VII-VIII, Harvard University Press, Cambridge, Mass. The citation scheme is volume number using Arabic numerals followed by section number from the collected papers, shown as, for example, CP 1.208.
[4]  This quote is drawn from Charles Sanders Peirce, 1870. “Description of a Notation for the Logic of Relatives, Resulting from an Amplification of the Conceptions of Boole’s Calculus of Logic”, Memoirs of the American Academy of Arts and Sciences 9 (1870), 317–378 (the “Logic of Relatives”), using the same numbering as from [3].
[5] M.K. Bergman, 2016. “Rationales for Typology Designs in Knowledge Bases,” AI3:::Adaptive Information blog, June 6, 2016.
[6] Nicola Guarino, 1992. “Concepts, Attributes and Arbitrary Relations: Some Linguistic and Ontological Criteria for Structuring Knowledge Bases.” Data & Knowledge Engineering 8, no. 3 (1992): 249-261. Continuing in the same vein, see also Nicola Guarino, 1997. “Some Organizing Principles for a Unified Top-level Ontology,” in AAAI Spring Symposium on Ontological Engineering, pp. 57-63. 1997. Also, for a general view of ontology at that time, see Thomas R. Gruber, 1993. “A Translation Approach to Portable Ontology Specifications.” Knowledge Acquisition 5, no. 2 (1993): 199-220.
[7] Auke JJ Van Breemen and Janos J. Sarbo, 2009. “The machine in the ghost: The syntax of mind.” Signs-International Journal of Semiotics 3 (2009): 135-184; and Janos J. Sarbo and József I. Farkas, 2002. “On the isomorphism of signs, logic and language.”
[8] Fritz Lehmann and Rudolf Wille, 1995. “A Triadic Approach to Formal Concept Analysis,” Springer Berlin Heidelberg, 1995.
[9] József István Farkas, 2008. “A Semiotically Oriented Cognitive Model of Knowledge Representation,” Ph.D. thesis, Radboud University of Nijmegen, April 23, 2008.
[10] John F. Sowa, 1995. “Top-level Ontological Categories,” International Journal of Human-computer Studies 43, no. 5 (1995): 669-685.
[11] Attributes, Relations and Annotations comprise OWL properties. In general, Attributes correspond to the OWL datatypes property; Relations to the OWL object property; and Annotations to the OWL annotation property. These specific OWL terms are not used in our speculative grammar, however, because some attributes may be drawn from controlled vocabularies, such as colors or shapes, that can be represented as one of a list of attribute choices. In these cases, such attributes are defined as object properties. Nonetheless, the mappings of our speculative grammar to existing OWL properties is quite close.
[12] As I earlier wrote, “Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.” In the diagram above, the middle column represents particulars, or the ABox components. The definition of items and the first and third columns represent TBox components.
Posted:June 13, 2016

Unveiling a New ‘Timeline of Information History’

Since I first posted it eight years ago, one of the popular features of my blog has been its Timeline of Information History. However, I’m embarrassed to say that the interactive JavaScript application has been broken now for some months. I had fixed the JS once before about four years ago, but some recent change to WordPress (or one of my plugins) again caused a JavaScript conflict of some nature. After a couple of my own attempts to fix it, I called in the cavalry, Fred Giasson, to provide more professional attention. I’m sure the application could eventually have been fixed, but tracking down fixes to undocumented code is ultimately a fool’s game (or one for those with too much time on their hands), so it wasn’t worth further investment to understand the older Timeline JS code and its dependencies on outdated libraries. So, for these past few months, the Timeline remained broken.

Until today. I found another timeline on the Web, Timeline JS3, that offers a hosted version driven by a Google spreadsheet. It operates differently and has a different look-and-feel, but it captures all of the original source information of the earlier Timeline and has a charm all its own:

New Timeline of Information History

I encourage you to investigate this new Timeline of Information History.

Installation Overview

TimelineJS (or JS3 in its latest version) is an open-source, cloud-hosted timeline tool from the Knight Lab at Northwestern University. The Lab provides a WordPress plugin, with calls via either shortcodes or a simple PHP statement. It also provides a dead-simple online form for generating your own timeline quickly, which it then hosts. If you prefer to fiddle with all of the dials and knobs, the Knight Lab also offers code and clear instructions for calling and using the libraries locally. Its tagline is “Easy-to-make, beautiful timelines,” and I agree.

In my own configuration of WordPress, my Timeline of Information History is not a post, but a separate page with its own PHP template. None of the direct WordPress options suggested by the Knight Lab worked for me, including the plugin, but I was able to paste the code generated by the online form directly into my current PHP template. (My guess is that simple WordPress blog sites and posts would work with the standard set-up.) If there are additional style changes I want to make down the road, I will need to bring in the Timeline JS3 libraries and host them locally (there is a small suite of canned, online options; further CSS or style changes require your own local install). For now, I am deferring this possible step.

With a bit of fiddling, it was not too difficult to convert the existing XML format of my prior Timeline to the CSV (Google spreadsheet with fixed columns) format of Timeline JS3 (which actually uses the JSON generated from the spreadsheet). I also did some editing and link corrections while doing the migration. It had been a while since I had used the Google online spreadsheet, and I found it much smoother and more responsive than in my last uses of it. Copying-and-pasting across environments also works nicely.
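
For anyone facing a similar migration, here is a hedged sketch of the conversion. It assumes the source is in a typical SIMILE Timeline XML form and that the target sheet uses Timeline JS3-style Year/Month/Day/Headline/Text columns; both the date format and the column names should be verified against your own data and the official spreadsheet template.

```python
import csv
import xml.etree.ElementTree as ET
from datetime import datetime

# Assumes a SIMILE-style source file with <event start="..." title="...">description</event>
# elements; adjust the file name, date format, and target columns to your own data.
tree = ET.parse("old_timeline.xml")           # hypothetical file name
rows = []
for ev in tree.getroot().iter("event"):
    start = datetime.strptime(ev.get("start"), "%b %d %Y %H:%M:%S GMT")
    rows.append({
        "Year": start.year,
        "Month": start.month,
        "Day": start.day,
        "Headline": ev.get("title", ""),
        "Text": (ev.text or "").strip(),
    })

with open("timeline_js3.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Year", "Month", "Day", "Headline", "Text"])
    writer.writeheader()
    writer.writerows(rows)
# The resulting CSV can then be pasted into the Google spreadsheet that drives the timeline.
```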

The <iframe> code generated by the online form worked as-is when embedded in my existing PHP template. The instructions for installing the libraries and hosting them locally appeared equally clear. In all, my experience installing the new Timeline and migrating its data was pleasingly smooth. But, no matter what, code, environment and data changes do take some effort. Now that it is done, I am happy with the results.

The End of My Simile Era

This marks the end of nearly a decade of use on my blog of the Simile tools from David Karger‘s shop at MIT. Two of this blog’s former tools, Exhibit and Timeline, were developed by the skilled and innovative David Huynh. David was a toolmaking fool during his MIT tenure, including Sifter and Solvent and Potluck, besides Exhibit and Timeline. After he went to Metaweb (and then Google) he developed the data wrangling tool Gridworks, which was renamed Google Refine and then Open Refine when it went open source. Any one of these tools is a notable contribution; the innovations across the whole body of Huynh’s work are quite remarkable.

I was the first to install Exhibit on WordPress outside of MIT in 2007. I installed Timeline for the Timeline of Information History in 2008. I upgraded both as libraries were updated and WP changed; especially difficult was WP’s incorporation of more JavaScript, causing JS conflicts. Though I had to retire Exhibit for these reasons in 2011, I was able to keep Timeline going until just recently. With today’s retirement of Timeline, the era of my blog’s working with Simile tools comes to an end.

I’d like to thank the Davids, Larry and others at the Simile program that made so many contributions at a pivotal time in the semantic Web and Web of data. Good job, all! We won’t soon see that era again.

WordPress is Likely Next on the Chopping Block

I have been a WordPress user, self-hosted, since Day One of this blog. My first big blog effort was a comprehensive guide to blogging with WP [1]. I’ve spent many years with WP, but it is distinctly feeling long in the tooth.

These JS conflicts that appear out of nowhere are, unfortunately, not an uncommon occurrence with WordPress. I routinely have spam, bot, and hacker attacks probing virtually every corner of my blog and trying to break into my WP administrative area. My counters for fending off malicious probes and attacks are in the millions.

For years, my site has occasionally hit race conditions that I have been unable to track down and correct. There are certainly conflicts between plugins, but the race conditions are rare enough that exactly what triggers them has yet to be discovered. Periodically trying to track down the root issue has been frustrating. Truthfully, I still do not know what the causes are.

Web and UI design has evolved substantially, especially over the years of growth and then dominance of mobile. In the case of this blog, AI3, which is solely authored and managed by me, I also do not need many of the features of a full-throated blog or publishing site.

WordPress also performs poorly and is difficult to optimize. Tuning Web servers is a nightmare for the amateur. I have tried various caching and CDN strategies; while they can improve performance, they come at the cost of their own glitches and added complexity.

These factors have caused me to look closely at static site generators. In my style of blogging, I make many edits while drafting and may have multiple drafts being worked on over weeks. But, once published, there are not many edits. From the standpoint of the Web, reading can be emphasized over writing.

Yet, despite its simplicity and better performance, there are also challenges in workflow and tooling in going to a static site design. There is also the daunting question of how best to convert my existing WP data to an accurate, readable form without massive manual changes. I would not make the change to a different blogging platform without carrying my prior writings forward. Looked at in this way, my WordPress continues to work with the occasional burst of effort, like this current one in switching out timelines. I’m still mulling the more fundamental change away from WP. The chicken is clucking the closer I get to the chopping block.


[1] An archival version of this document, Comprehensive Guide to a Professional Blog Site: A WordPress Example, can be downloaded in PDF. But realize this guide is now more than 11 years old, and not much is likely applicable to the current Web.

Posted:June 6, 2016

An 'Accordion-like' Design
Design is Aimed to Improve Computability

In the lead up to our most recent release of UMBEL, I began to describe our increasing reliance on the use of typologies. In this article, I’d like to expand on our reasons for this design and the benefits we see.

‘Typology’ is not a common term within semantic technologies, though it is used extensively in such fields as archaeology, urban planning, theology, linguistics, sociology, statistics, psychology, anthropology and others. In the base semantic technology language of RDF, a ‘type’ is what is used to declare an instance of a given class. This is in keeping with our usage, where an instance is a member of a type.

Strictly speaking, ‘typology’ is the study of types. However, as used within the fields noted, a ‘typology’ is the result of the classification of things according to their characteristics. As stated by Merriam Webster, a ‘typology’ is “a system used for putting things into groups according to how they are similar.” Though some have attempted to make learned distinctions between typologies and similar notions such as classifications or taxonomies [1], we think this idea of grouping by similarity is the best way to think of a typology.

In Structured Dynamics‘ usage as applied to UMBEL and elsewhere, we are less interested in the sense of ‘typology’ as comparisons across types and more interested in the classification of types that are closely related, what we have termed ‘SuperTypes’. In this classification, each of our SuperTypes gets its own typology. The idea of a SuperType, in fact, is exactly equivalent to a typology, wherein the multiple entity types with similar essences and characteristics are related to one another via a natural classification. I discuss elsewhere how we actually go about making these distinctions of natural kinds [2].

In this article, I want to stand back from how a typology is constructed and focus more on the use and benefits of typologies. Below I discuss the evolution of our typology design and the benefits that accrue from the ‘typology’ approach, and then conclude with some of the application areas for which this design is most useful. All of this discussion is in the context of our broader efforts in KBAI, or knowledge-based artificial intelligence.

Evolution of the Design

I wish we could claim superior intelligence or foresight in how our typology design came about, but it was truthfully an evolution driven by the pragmatic concerns of using UMBEL over the past near-decade. The typology design has arisen from the intersection of: 1) our efforts with SuperTypes, and creating a computable structure that uses powerful disjoint assertions; 2) an appreciation of the importance of entity types as a focus of knowledge base terminology; and 3) our efforts to segregate entities from other key constructs of knowledge bases, including attributes, relations and annotations. Though these insights may have resulted from serendipity and practical use, they have brought a new understanding of how best to organize knowledge bases for artificial intelligence uses.

The Initial Segregation into SuperTypes

We first introduced SuperTypes into UMBEL in 2009 [3]. The initiative arose because we observed that about 90% of the concepts in UMBEL were disjoint from one another. Disjoint assertions are computationally efficient and help better organize a knowledge graph. To maximize these benefits we did both top-down and bottom-up testing to derive our first groupings of SuperTypes into 29 mostly disjoint types, with four non-disjoint (or cross-cutting and shared) groups [3]. Besides computational efficiency and its potential for logical operations, we also observed that these SuperTypes could aid our ability to drive display widgets (such as displaying geolocational types on maps).

All entity classes within a given SuperType are thus organized under the SuperType (ST) itself as the root. The classes within that ST are then organized hierarchically, with children classes having a subClassOf relation to their parent. By the time of UMBEL’s last release [4], this configuration had evolved into the following 31 largely disjoint SuperTypes, organized into 10 or so clusters or “dimensions”:

  • Constituents: Natural Phenomena, Area or Region, Location or Place, Shapes, Forms, Situations
  • Time-related: Activities, Events, Times
  • Natural Matter: Atoms and Elements, Natural Substances, Chemistry
  • Organic Matter: Organic Chemistry, Biochemical Processes
  • Living Things: Prokaryotes, Protists & Fungus, Plants, Animals, Diseases
  • Agents: Persons, Organizations, Geopolitical
  • Artifacts: Products, Food or Drink, Drugs, Facilities
  • Information: Audio Info, Visual Info, Written Info, Structured Info
  • Social: Finance & Economy, Society

Current SuperType Structure of UMBEL

We also used the basis in SuperTypes to begin cleaving UMBEL into modules, with geolocational types being the first to be separated. We initially began splitting into modules as a way to handle UMBEL’s comparatively large size (~ 30K concepts). As we did so, however, we also observed that most of the SuperTypes could also be segregated into modules. This architectural view and its implications were another reason leading to the eventual typology design.

A Broadening Appreciation for the Pervasiveness of Entity Types

The SuperType tagging and possible segregation of STs into individual modules led us to review other segregations and tags. Given that the SuperTypes were all geared to entities and entity types — and further represented about 90% of all concepts in UMBEL — we began to look at entities as a category with more care and attention. This analysis took us back to the beginnings of entity recognition and tagging in natural language processing. We saw the progression of understanding from named entities and just a few entity types, to the more recent efforts in so-called fine-grained entity recognition [5].

What was blatantly obvious, but which had been previously overlooked by us and other researchers investigating entity types, was that most knowledge graphs (or upper ontologies) were themselves made up largely of entity types [5]. In retrospect, this should not be surprising. Most knowledge graphs deal with real things in the world, which, by definition, tend to be entities. Entities are the observable, often nameable, things in the world around us. And how we organize and refer to those entities — that is, the entity types — constitutes the bulk of the vocabulary for a knowledge graph.

We can see this progression of understanding moving from named entities and fine-grained entity types, all the way through to an entity typology — UMBEL’s SuperTypes — that then becomes the tie-in point for individual entities (the ABox):


Evolving Sophistication of Entity Types

The key transition is moving from the idea of discrete numbers of entity types to a system and design that supports continuous interoperability through an “accordion-like” typology structure.

The General Applicability of ‘Typing’ to All Aspects of Knowledge Bases

The “type-orientation” of a typology was also attractive because it offers a construct that can be applied to all other (non-entity) parts of the knowledge base. Actions can be typed; attributes can be typed; events can be typed; and relations can be typed. A mindset around natural kinds and types helps define the speculative grammar of KBAI (a topic of a next article). We can thus represent these overall structural components in a knowledge base as:

Typology View of a Knowledge Base

The shading indicates what is external to the scope of UMBEL.

The intersection of these three factors — SuperTypes, an “accordion” design for continuous entity types, and overall typing of the knowledge base — set the basis for how to formalize an overall typology design.

Formalizing the Typology Design

We have first applied this basis to typologies for entities, based on the SuperTypes. Each SuperType becomes its own typology with natural boundaries and a hierarchical structure. No instances are allowed in the typology; only types.

Initial construction of the typology first gathers the relevant types (concepts) and automatically evaluates those concepts for orphans (unconnected concepts) and fragments (connected portions missing intermediary parental concepts). For the initial analysis, there are likely multiple roots, multiple fragments, and multiple orphans. We want to get to a point where there is a single root and all concepts in the typology are connected. Source knowledge bases are queried for the missing concepts and evaluated again in a recursive manner. Candidate placements are then written to CSV files and evaluated with various utilities, including, crucially, manual inspection and vetting. (Because the build bootstraps what is already known and structured in the system, it is important to build the structure with coherent concepts and relations.)
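
A rough sketch of the orphan-and-fragment analysis is below; the simple parent-map representation and the breadth-first grouping are my own stand-ins for the actual build utilities.

```python
from collections import defaultdict, deque

def analyze_typology(parents):
    """Report orphans and fragments for a candidate typology.

    `parents` maps each type to the set of its parent types (subClassOf-style links).
    Orphans are unconnected concepts; fragments are separate connected groups, each
    missing the intermediary parental concepts that would join it to a single root.
    """
    neighbors = defaultdict(set)
    nodes = set(parents)
    for child, ps in parents.items():
        nodes |= ps
        for p in ps:
            neighbors[child].add(p)
            neighbors[p].add(child)

    orphans = {n for n in nodes if not neighbors[n]}

    # Group the connected, non-orphan concepts into components with a breadth-first walk
    fragments, seen = [], set()
    for start in nodes - orphans:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        while queue:
            n = queue.popleft()
            if n in seen:
                continue
            seen.add(n)
            component.add(n)
            queue.extend(neighbors[n] - seen)
        fragments.append(component)

    return {"orphans": orphans, "fragments": fragments}

# Toy usage: 'Reptiles' is an orphan; 'Plants'/'Trees' form a second fragment
report = analyze_typology({
    "Mammals": {"Animals"}, "Birds": {"Animals"},
    "Trees": {"Plants"}, "Reptiles": set(),
})
print(report["orphans"])          # {'Reptiles'}
print(len(report["fragments"]))   # 2, so concepts are still missing before a single root emerges
```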

Once the overall candidate structure is completed, it is then analyzed against prior assignments in the knowledge base. ST disjoint analysis, coherent inferencing, and logical placement tests again prompt the creation of CSV files that may be viewed and evaluated with various utilities but, again, are ultimately manually vetted.

The objective of the build process is a fully connected typology that passes all coherency, consistency, completeness and logic tests. If errors are subsequently discovered, the build process must be run again with possible updates to the processing scripts. Upon acceptance, each new type added to a typology should pass a completeness threshold, including a definition, synonyms, guideline annotations, and connections. The completed typology may be written out in both RDF and CSV formats. (The current UMBEL and its typologies are available here.)
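
A trivial sketch of that acceptance check might look like the following; the field names are illustrative stand-ins for however the completeness threshold is actually recorded.

```python
REQUIRED_FIELDS = ("definition", "synonyms", "guidelines", "connections")   # illustrative field names

def meets_threshold(type_record):
    """True only if a newly added type carries the minimum documentation for acceptance."""
    return all(type_record.get(field) for field in REQUIRED_FIELDS)

candidate = {
    "definition": "Warm-blooded vertebrates with hair or fur.",
    "synonyms": ["mammalia"],
    "guidelines": "",                       # guideline annotations still missing
    "connections": ["Animals"],
}
print(meets_threshold(candidate))           # False, so the type is not yet accepted
```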

Integral to the design must be build, testing and maintenance routines, scripts, and documentation. Knowledge bases are inherently open world [6], which means that the entities and their relationships and characteristics are constantly growing and changing due to new knowledge underlying the domain at hand. Such continuous processing and keeping track of the tests, learnings and workflow steps place a real premium on literate programming, about which Fred Giasson, SD’s CTO, is now writing [7].

Because of the very focused nature of each typology (as set by its root), each typology can be easily incorporated into or excised from a broader structure. Each typology is rather simple in scope and simple in structure, given its hierarchical nature. Each typology is readily built, tested and maintained. Typologies impose relatively small ontological commitments.

Benefits of the Design

The simple bounding and structure of the typology design makes each typology understandable merely by inspecting its structure. But the typologies can also be read into programs such as Protégé in order to inspect or check complete specifications and relationships.

Because each typology is (designed to be) coherent and consistent, new concepts or structures may be related to any part of its hierarchical design. This gives these typologies an “accordion-like” design, similar to the multiple levels and aggregation made possible by an accordion menu:

An ‘Accordion’ Design Accommodates Different Granularities

The combination of logical coherence with a flexible, accordion structure gives typologies a unique set of design benefits. Some have been mentioned before, but to recap they are:

Computable

Each type has a basis — ranging from attributes and characteristics to hierarchical placement and relationship to other types — that can inform computability and logic tests, potentially including neighbor concepts. Ensuring that type placements are accurate and meet these tests means that the now-placed types and their attributes may be used to test the placement and logic of subsequent candidates. The candidates need not be only internal typology types, but may also be used against external sources for classification, tagging or mapping.

Because the essential attributes or characteristics across typologies in an entire domain can differ broadly — such as entities v attributes, living v inanimate things, natural things v man-made things, ideas v physical objects, etc. — it is possible to make disjointedness assertions between entire groupings of natural classes. Disjoint assertions, combined with logical organization and inference, provide a typology design that lends itself to reasoning and tractability.
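
As a toy illustration of how such disjointedness assertions can be checked, the sketch below flags any instance typed into two typologies whose roots are declared disjoint. The SuperType names, the single-parent hierarchy, and the choice of disjoint pairs are all illustrative assumptions rather than UMBEL's actual assertions.

```python
# Pairs of typology roots asserted to be disjoint (illustrative subset)
DISJOINT_ROOTS = {frozenset({"LivingThings", "Artifacts"}),
                  frozenset({"Attributes", "Entities"})}

def typology_root(type_name, parent_of):
    """Walk parent links until the typology root is reached (assumes one parent per type)."""
    while type_name in parent_of:
        type_name = parent_of[type_name]
    return type_name

def disjointness_violations(assignments, parent_of):
    """Flag instances typed into two typologies whose roots are declared disjoint."""
    violations = []
    for instance, types in assignments.items():
        roots = {typology_root(t, parent_of) for t in types}
        for pair in DISJOINT_ROOTS:
            if pair <= roots:
                violations.append((instance, tuple(sorted(pair))))
    return violations

parent_of = {"Mammals": "Animals", "Animals": "LivingThings", "Products": "Artifacts"}
assignments = {"RoboticDog": {"Mammals", "Products"}}   # deliberately contradictory typing
print(disjointness_violations(assignments, parent_of))
# [('RoboticDog', ('Artifacts', 'LivingThings'))]
```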

The internal process to create these typologies also has the beneficial effect of testing placements in the knowledge graph and identifying gaps in the structure as informed by fragments and orphans. This computability of the structure is its foundational benefit, since it determines the accuracy of the typology itself and drives all other uses and services.

Pluggable and Modular

Since each typology has a single root, it is readily plugged into or removed from the broader structure. This means the scale and scope of the overall system may be easily adjusted, and the existing structure may be used as a source for extensions (see next). Unlike more interconnected knowledge graphs (which can have many network linkages), typologies are organized strictly along these lines of shared attributes, which is both simpler and also provides an orthogonal means for investigating type-class membership.

Interoperable

The idea of nested, hierarchical types organized into broad branches of different typologies also provides a very flexible design for interoperating with a diversity of world views and degrees of specificity. A typology design, logically organized and placed into a consistent grounding of attributes, can readily interoperate with these different world views. So far, with UMBEL, this interoperable basis is limited to concepts and things, since only the entity typologies have been initially completed. But, once done, the typologies for attributes and relations will extend this basis to include full data interoperability of attribute:value pairs.

Extensible

A typology design for organizing entities can thus be visualized as a kind of accordion or squeezebox, expandable when detail requires, or collapsed to a more coarse-grained view when relating to broader perspectives. The organization of entity types also has a different structure than the more graph-like organization of higher-level conceptual schema, or knowledge graphs. In the case of broad knowledge bases, such as UMBEL or Wikipedia, where 70 percent or more of the overall schema is related to entity types, more attention can now be devoted to aspects of concepts or relations.

Each class within the typology can become a tie-in point for external information, providing a collapsible or expandable scaffolding (the ‘accordion’ design). Via inferencing, multiple external sources may be related to the same typology, even though at different levels of specificity. Further, very detailed class structures can also be accommodated in this design for domain-specific purposes. Moreover, because of the single tie-in point for each typology at its root, it is also possible to swap out entire typology structures at once, should design needs require this flexibility.
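
The following minimal sketch, again in Python with rdflib and hypothetical umb:/dom: namespaces, shows the tie-in idea: an external, finer-grained domain class is attached beneath an existing typology class with a single subClassOf link, after which a simple walk up the hierarchy (or RDFS inferencing) relates it to the rest of the structure.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

UMB = Namespace("http://example.org/umbel/")    # hypothetical stand-in for a typology namespace
DOM = Namespace("http://example.org/mydomain/") # hypothetical external domain namespace

g = Graph()
# Existing typology fragment (coarse-grained)
g.add((UMB.Mammals, RDFS.subClassOf, UMB.Animals))
g.add((UMB.Animals, RDFS.subClassOf, UMB.LivingThings))

# External, finer-grained domain class plugged in at a single tie-in point
g.add((DOM.DairyCattle, RDFS.subClassOf, UMB.Mammals))

# Walking the hierarchy shows the external class now participates in the typology
for ancestor in g.transitive_objects(DOM.DairyCattle, RDFS.subClassOf):
    print(ancestor)   # the external class and its ancestors, up to the typology root
```

Swapping out a typology would then amount to removing the statements beneath its root and attaching a replacement at the same tie-in point.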

Testable and Maintainable

The only sane way to tackle knowledge bases at these structural levels is to seek consistent design patterns that are easier to test, maintain and update. Open world systems must embrace repeatable and largely automated workflow processes, plus a commitment to timely updates, to deal with the constant, underlying change in knowledge.

Listing of Broad Application Areas

Some of the more evident application areas for this design — and in keeping with current client and development activities for Structured Dynamics — are the following:

  • Domain extension — the existing typologies and their structure provide a ready basis for adding domain details and extensions;
  • Tagging — there are many varieties of tagging approaches that may be driven from these structures, including, with the logical relationships and inferencing, ontology-based information tagging;
  • Classification — the richness of the typology structures means that any type across all typologies may be a possible classification assignment when evaluating external content, if the overall system embracing the typologies is itself coherent;
  • Interoperating datasets — the design is based on interoperating concepts and datasets, and provides more semantic and inferential means for establishing MDM systems;
  • Machine learning (ML) training — the real driver behind this design is lowering the costs for supervised machine learning via more automated and cost-effective means of establishing positive and negative training sets. Further, the system’s feature richness (see next) lends itself to unsupervised ML techniques as well; and
  • Rich feature set — a design geared from the get-go to emphasize and expose meaningful knowledge base features [8] perhaps opens up many new fruitful avenues for machine learning and other AI. More expressed structure may help in the interpretability of latent feature layers in deep learning. In any case, more and coherent structure with testability can only be goodness for KBAI going forward.

One Building Block Among Many

The progressions and learning from the above were motivated by the benefits that could be seen with each structural change. Over nearly a decade, as we tried new things and structured more things, we discovered more and adapted our UMBEL design accordingly. The benefits we see from this learning are not just additive to benefits that might be obtained by other means; they are systemic. The ability to make knowledge bases computable — while simultaneously increasing the feature space for training machine learners — at much lower cost should be a keystone enabler at this particular point in AI’s development. Lowering the costs of creating vetted training sets is one way to improve this process.

Systems that can improve systems always have more leverage than individual innovations. The typology design outlined above is the result of the classification of things according to their shared attributes and essences. The idea is that the world is divided into real, discontinuous and immutable ‘kinds’. Expressed another way, in statistics, typology is a composite measure that involves the classification of observations in terms of their attributes on multiple variables. In the context of a global KB such as Wikipedia, about 25,000 entity types are sufficient to provide a home for the millions of individual articles in the system.

As our next article will discuss, Charles Sanders Peirce’s consistent belief that the real world can be logically conceived and organized provides guidance for how we can continue to structure our knowledge bases into computable form. We now have a coherent base for treating types and natural classes as an essential component to that thinking. These insights are but one part of the KB innovations suggested by Peirce’s work.


[1] See, for example, Alberto Marradi, 1990. “Classification, Typology, Taxonomy“, Quality & Quantity 24, no. 2 (1990): 129-157.
[2] M.K. Bergman, 2015. “‘Natural Classes’ in the Knowledge Web,” AI3:::Adaptive Information blog, July 13, 2015.
[3] M.K. Bergman, 2009. “‘SuperTypes’ and Logical Segmentation of Instances,” AI3:::Adaptive Information blog, September 2, 2009.
[4] umbel.org, “New, Major Upgrade of UMBEL Released,” UMBEL press release, May 10, 2016 (for UMBEL v. 1.50 release)
[5] M.K. Bergman, 2016. “How Fine Grained Can Entity Types Get?,” AI3:::Adaptive Information blog, March 8, 2016.
[6] M.K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Information blog, December 21, 2009.
[8] M.K. Bergman, 2015. “A (Partial) Taxonomy of Machine Learning Features,” AI3:::Adaptive Information blog, November 23, 2015.