I was pleasantly surprised to discover that a variety of my writings has been chosen for the syllabus of a text analytics course taught by Dr. Alianna Maren in Northwestern University’s master’s program in predictive analytics. Dr. Maren has chosen to feature some of my writings in NLP statistics, ontologies, and the open world assumption.
Dr. Maren has stated her intention is to present text analytics from a “top-down ontological perspective.” The syllabus looks very interesting.
I appreciate the recognition and wish the students and Dr. Maren a great course!
Semantics is a funny thing. All professionals come to know that communication with their peers and outside audiences requires accuracy in how to express things. Yet, even with such attentiveness, communications sometimes go awry. It turns out that background, perspective and context can all act to switch circuits at the point of communication. Despite, and probably because of, our predilection as a species to classify and describe things, all from different viewpoints, we can often exhort in earnest a thought that is communicated to others as something different from what we intended. Alas!
This reality is why, I suspect, we have embraced as a species things like dictionaries, thesauri, encyclopedias, specifications, standards, sacred tracts, and such, in order to help codify what our expressions mean in a given context. So, yes, while sometimes there is sloppiness in language and elocution, many misunderstandings between parties are also a result of difference in context and perspective.
It is important when we process information in order to identify relations or extract entities, to type them or classify them, or to fill out their attributes, that we have measures to gauge how well our algorithms and tests work, all attentive to providing adequate context and perspective. These very same measures can also tell us whether our attempts to improve them are working or not. These measures, in turn, also are the keys for establishing effective gold standards and creating positive and negative training sets for machine learning. Still, despite their importance, these measures are not always easy to explain or understand. And, truth is, sometimes these measures may also be misexplained or miscalculated. Aiding the understanding of important measures in improving the precision, completeness, and accuracy of communications is my purpose in this article.
The most common scoring methods for gauging the “accuracy” of natural language communications involve statistical tests based on the nomenclature of negatives and positives, true or false. It can be a bit confusing to interpret these terms, and the confusion is compounded by the kind of statistical environment at play. Let me try to first confuse, and then more simply explain these possible nuances.
Standard science is based on a branch of statistics known as statistical hypothesis testing. This is likely the statistics that you were taught in school. In hypothesis testing, we begin with a hypothesis about what might be going on with respect to a problem or issue, but for which we do not know the cause or truth. After reviewing some observations, we formulate a hypothesis that some factor A is affecting or influencing factor B. We then formulate a mirror-image null hypothesis that specifies that factor A does not affect factor B; this is what we will actually test. The null hypothesis is what we assume the world in our problem context looks like, absent our test. If the test of our formulated hypothesis does not affect that assumed distribution, then we reject our alternative (meaning our initial hypothesis fails, and we keep the null explanation).
We make assumptions from our sample about how the entire population is distributed, which enables us to choose a statistical model that captures the shape of assumed probable results for our measurement sample. These shapes or distributions may be normal (bell-shaped or Gaussian), binomial, power law, or many others. These assumptions about populations and distribution shapes then tell us what kind of statistical test(s) to perform. (Misunderstanding the true shape of the distribution of a population is one of the major sources of error in statistical analysis.) Different tests may also give us more or less statistical power to test the null hypothesis, which is that chance results will match the assumed distribution. Different tests may also give us more than one test statistic to measure variance from the null hypothesis.
We then apply our test and measure and collect our sample from the population, with random or other statistical sampling important so as not to skew results, and compare the distribution of these results to our assumed model and test statistic(s). The null hypothesis is confirmed or not by whether the shape of our sampled results matches the assumed distribution or not. The significance of the variance from the assumed shape, along with a confidence interval based on our sample size and the test at hand, provides the information necessary to either accept or reject the null hypothesis.
Rejection of the null hypothesis generally requires both significant difference from the expected shape in our sample and a high level of confidence. Absent those results, we likely need to accept the null hypothesis, thus rejecting the alternative hypothesis that some factor A is affecting or influencing factor B. Alternatively, with significant differences and a high level of confidence, we can reject the null hypothesis, thereby accepting the alternative hypothesis (our actual starting hypothesis, which prompted the null) that factor A is affecting or influencing factor B.
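As a minimal sketch of the accept-or-reject decision described above, the following uses an exact two-sided binomial test written with only the standard library. The coin-flip setting, the 0.05 significance threshold, and the function name `binomial_two_sided_p` are all assumptions chosen for illustration; they do not come from any particular study.

```python
from math import comb

def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided p-value under the null hypothesis of rate p_null.

    Sums the probability of every outcome that is no more likely than the
    one observed, assuming the binomial distribution implied by the null.
    """
    pmf = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))

ALPHA = 0.05  # assumed significance threshold

for heads in (60, 65):  # hypothetical observations out of 100 flips
    p_value = binomial_two_sided_p(heads, 100)
    verdict = "reject null" if p_value < ALPHA else "fail to reject null"
    print(f"{heads}/100 heads: p = {p_value:.4f} -> {verdict}")
```

The same difference in observed rate can land on either side of the threshold depending on sample size and the test chosen, which is why the choice of distribution and test matters so much.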
This is all well and good except for the fact that either the sampling method or our test may be in error. There are two types of errors that are possible: Type I errors, where a true null hypothesis is incorrectly rejected (a false positive result); and Type II errors, where a false null hypothesis is incorrectly not rejected (a false negative result).
We can combine all of these thoughts into what is the standard presentation for capturing these true and false, positive and negative, results [1]:
| Judgment of null hypothesis (H_{0}) | H_{0} is valid/true | H_{0} is invalid/false |
|---|---|---|
| Reject | False positive (Type I error) | True positive (correct inference) |
| Fail to reject (accept) | True negative (correct inference) | False negative (Type II error) |
Clear as mud, huh?
Fortunately, there are a couple of ways to sharpen this standard story in the context of information retrieval (IR), natural language processing (NLP) and machine learning (ML), the domains of direct interest to us at Structured Dynamics, to make understanding all of this much simpler. Statistical tests will always involve a trade-off between the level of false positives (in which a non-match is declared to be a match) and the level of false negatives (in which an actual match is not detected) [1]. Let’s see if we can simplify our recognition and understanding of these conditions.
First, let’s start with a recent explanation from the KDNuggets Web site [2]:
“Imagine there are 100 positive cases among 10,000 cases. You want to predict which ones are positive, and you pick 200 to have a better chance of catching many of the 100 positive cases. You record the IDs of your predictions, and when you get the actual results you sum up how many times you were right or wrong. There are four ways of being right or wrong:
The use of ‘case’ and ‘predictions’ helps, but is still a bit confusing. Let’s hear another explanation from Benjamin Roth from his recently completed thesis [3]:
“There are two error cases when extracting training data: false positive and false negative errors. A false positive match is produced if a sentence contains an entity pair for which a relation holds according to the knowledge base, but for which the sentence does not express the relation. The sentence is marked as a positive training example for the relation, however it does not contain a valid signal for it. False positives introduce errors in the training data from which the relational model is to be generalized. For most models false positive errors are the most critical error type, for qualitative and quantitative reasons, as will be explained in the following.
“A false negative error can occur if a sentence and argument pair is marked as a negative training example for a relation (the knowledge base does not contain the argument pair for that relation), but the sentence actually expresses the relation, and the knowledge base was incomplete. This type of error may negatively influence model learning by omitting potentially useful positive examples or by negatively weighting valid signals for a relation.”
In our context, we can see a couple of differences from traditional scientific hypothesis testing. First, the problems we are dealing with in IR, NLP and ML are all statistical classification problems, specifically in binary classification. For example, is a given text token an entity or not? What type amongst a discrete set is it? Does the token belong to a given classification or not? This makes it considerably easier to posit an alternative hypothesis and the shape of its distribution. What makes it binary is the decision as to whether a given result is correct or not. We now have a different set of distributions and tests from more common normal distributions.
Second, we can measure our correct ‘hits’ by applying our given tests to a “gold standard” of known results. This gold standard provides a representative sample of what our actual population looks like, one for which we have characterized in advance whether each result in the sample is true or not for the question at hand. Further, we can use this same gold standard over and over again to gauge improvements in our test procedures.
Combining these thoughts leads to a much simpler matrix, sometimes called a confusion matrix in this context, for laying out the true and false, positive and negative characterizations:
| Correctness | Test assertion: positive | Test assertion: negative |
|---|---|---|
| True | TP (true positive) | TN (true negative) |
| False | FP (false positive) | FN (false negative) |
As we can see, ‘positive’ and ‘negative’ are simply the assertions (predictions) arising from our test algorithm of whether or not there is a match or a ‘hit’. ‘True’ and ‘false’ merely indicate whether these assertions proved to be correct or not as determined by gold standards or training sets. A false positive is a false alarm, a “crying wolf”; a false negative is a missed result. Thus, all true results are correct; all false are incorrect.
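The four characterizations can be tallied mechanically once each item has a gold-standard label and a test assertion. The sketch below assumes both are available as parallel lists of booleans; the function name `confusion_counts` and the six-token example are hypothetical.

```python
def confusion_counts(gold, asserted):
    """Tally TP, FP, TN and FN from two parallel lists of booleans.

    gold[i]     -- True if item i is a real match per the gold standard
    asserted[i] -- True if the test asserted a match ('hit') for item i
    """
    if len(gold) != len(asserted):
        raise ValueError("gold and asserted must have the same length")
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for is_match, said_match in zip(gold, asserted):
        if said_match:
            # Positive assertion: true if the gold standard agrees.
            counts["TP" if is_match else "FP"] += 1
        else:
            # Negative assertion: false if the gold standard had a match.
            counts["FN" if is_match else "TN"] += 1
    return counts

# Hypothetical example: six tokens, three of which are real entities.
gold = [True, True, True, False, False, False]
asserted = [True, True, False, True, False, False]
counts = confusion_counts(gold, asserted)
print(counts)
```

Here the third token is a missed result (FN) and the fourth is a false alarm (FP).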
Armed with these four characterizations — true positive, false positive, true negative, false negative — we now have the ability to calculate some important statistical measures. Most of these IR measures also have exact analogs in standard statistics, which I also note.
The first metric captures the concept of coverage. In standard statistics, this measure is called sensitivity; in IR and NLP contexts it is called recall. Basically it measures the ‘hit’ rate for identifying true positives out of all potential positives, and is also called the true positive rate, or TPR:

$$\text{recall} = \frac{TP}{TP + FN}$$
Expressed as a fraction of 1.00 or a percentage, a high recall value means the test has a high “yield” for identifying positive results.
Precision is the complementary measure to recall, in that it measures how efficiently the test's positive identifications prove to be true:

$$\text{precision} = \frac{TP}{TP + FP}$$

Precision is something, then, of a “quality” measure, also expressed as a fraction of 1.00 or a percentage. It provides a positive predictive value, defined as the proportion of the true positives against all the positive results (both true positives and false positives).
So, we can see that recall gives us a measure as to the breadth of the hits captured, while precision is a statement of whether our hits are correct or not. We also see, as in the Roth quote above, why false positives need to be a focus of attention in test development, because they directly lower precision and efficiency of the test.
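To make the two measures concrete, here is a small sketch using the setting from the KDNuggets example (100 positive cases among 10,000, with 200 picked). The number of correct picks, 80, is an invented figure for illustration only; the function names are likewise assumptions.

```python
def recall(tp, fn):
    """True positives as a share of all actual positives (yield)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    """True positives as a share of all positive assertions (quality)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Assumed outcome: 80 of the 200 picks turn out to be real positives.
tp = 80
fp = 200 - tp   # picks that were wrong
fn = 100 - tp   # real positives that were never picked

print(f"recall    = {recall(tp, fn):.2f}")
print(f"precision = {precision(tp, fp):.2f}")
```

Under that assumption the test captures most of the real positives, but fewer than half of its positive assertions are correct, which is exactly the tension between yield and quality.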
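This recognition that precision and recall are complementary and linked is reflected in one of the preferred overall measures of IR and NLP statistics, the F-score, which is the beta-weighted harmonic mean of precision and recall. The general formula for positive real β is:

$$F_{\beta} = (1 + \beta^{2}) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^{2} \cdot \text{precision} + \text{recall}}$$

which can be expressed in terms of TP, FN and FP as:

$$F_{\beta} = \frac{(1 + \beta^{2}) \cdot TP}{(1 + \beta^{2}) \cdot TP + \beta^{2} \cdot FN + FP}$$

In many cases, the balanced harmonic mean is used, which means a beta of 1, which is called the F_{1} statistic:

$$F_{1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$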
But F1 displays a tension. Either precision or recall may be improved to achieve an improvement in F_{1}, but with divergent benefits or effects. What is more highly valued? Yield? Quality? These choices dictate what kinds of tests and areas of improvement need to receive focus. As a result, the weight of beta can be adjusted to favor either precision or recall. Two other commonly used F measures are the F_{2} measure, which weights recall higher than precision, and the F_{0.5} measure, which puts more emphasis on precision than recall [4].
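The effect of beta is easy to see numerically. This sketch computes the F-score both from precision and recall and directly from counts; the counts (TP = 80, FN = 20, FP = 120) are hypothetical, and the function names are assumptions.

```python
def f_beta(precision, recall, beta=1.0):
    """Beta-weighted harmonic mean of precision and recall."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denominator

def f_beta_from_counts(tp, fn, fp, beta=1.0):
    """The same score expressed directly in terms of TP, FN and FP."""
    denominator = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denominator if denominator else 0.0

# Hypothetical test results: precision 0.40, recall 0.80.
tp, fn, fp = 80, 20, 120
p, r = tp / (tp + fp), tp / (tp + fn)

for beta in (0.5, 1.0, 2.0):
    print(f"F_{beta}: {f_beta(p, r, beta):.3f}")
```

Because recall is the stronger of the two measures in this example, the score rises as beta moves from 0.5 to 2 and recall is weighted more heavily.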
Another metric can factor into this equation, though accuracy is a less referenced measure in the IR and NLP realm. Accuracy is the statistical measure of how well a binary classification test correctly identifies or excludes a condition:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
An accuracy of 100% means that the measured values are exactly the same as the given values.
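One possible reason accuracy gets less attention in IR and NLP is that true negatives often dominate the population. The sketch below compares two hypothetical tests on 10,000 cases with 100 real positives: a "picker" that asserts 200 positives (80 of them correct, an invented figure) and a "naysayer" that asserts nothing at all.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all assertions, positive or negative, that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

def recall(tp, fn):
    """True positives as a share of all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical results on 10,000 cases containing 100 real positives.
picker = dict(tp=80, fp=120, fn=20, tn=9780)
naysayer = dict(tp=0, fp=0, fn=100, tn=9900)

for name, c in (("picker", picker), ("naysayer", naysayer)):
    print(
        f"{name}: accuracy = {accuracy(**c):.3f}, "
        f"recall = {recall(c['tp'], c['fn']):.2f}"
    )
```

In this made-up case the naysayer scores the higher accuracy while finding nothing, so accuracy is best read alongside precision and recall.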
All of the measures above simply require the measurement of false and true, positive and negative, as do a variety of predictive values and likelihood ratios. Relevance, prevalence and specificity are some of the other notable measures that depend solely on these metrics in combination with total population.
By bringing in some other rather simple metrics, it is also possible to expand beyond this statistical base to cover such measures as information entropy, statistical inference, pointwise mutual information, variation of information, uncertainty coefficients, information gain, AUCs and ROCs. But we’ll leave discussion of some of those options until another day.
Courtesy of one of the major templates in Wikipedia in the statistics domain [5], for which I have taken liberties, expansions and deletions, we can envision the universe of statistical measures in IR and NLP, based solely on population and positives and negatives, true and false, as being:
Condition (as determined by “gold standard”):

| Total population | Condition positive | Condition negative | Prevalence = Σ Condition positive / Σ Total population | |
|---|---|---|---|---|
| Test assertion positive | TP: True positive | FP: False positive (Type I error) | Positive predictive value (PPV), Precision = Σ True positive / Σ Test outcome positive | False discovery rate (FDR) = Σ False positive / Σ Test outcome positive |
| Test assertion negative | FN: False negative (Type II error) | TN: True negative | False omission rate (FOR) = Σ False negative / Σ Test outcome negative | Negative predictive value (NPV) = Σ True negative / Σ Test outcome negative |
| Accuracy (ACC) = (Σ True positive + Σ True negative) / Σ Total population | True positive rate (TPR), Sensitivity, Recall = Σ True positive / Σ Condition positive | False positive rate (FPR), Fall-out = Σ False positive / Σ Condition negative | Positive likelihood ratio (LR+) = TPR / FPR | F-score (F_{1} case) = 2 × (Precision × Recall) / (Precision + Recall) |
| | False negative rate (FNR) = Σ False negative / Σ Condition positive | True negative rate (TNR), Specificity (SPC) = Σ True negative / Σ Condition negative | Negative likelihood ratio (LR−) = FNR / TNR | |
Please note that the order and location of TP, FP, FN and TN differs from my simple layout presented in the confusion matrix above. In the confusion matrix, we are gauging whether the assertion of the test is correct or not as established by the gold standard. In this current figure, we are instead using the positive or negative status of the gold standard as the organizing dimension. Use the shorthand identifiers of TP, etc., to make the cross reference between “correct” and “condition”.
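Since every measure in this figure depends only on the four counts, they can all be derived in one pass. The following is a sketch under that assumption; the function names and the example counts are hypothetical, and undefined ratios (zero denominators) are returned as NaN as a design choice.

```python
import math

def ratio(numerator, denominator):
    """Divide, returning NaN when the measure is undefined."""
    return numerator / denominator if denominator else math.nan

def derived_measures(tp, fp, fn, tn):
    """Derive the measures in the figure from the four basic counts."""
    total = tp + fp + fn + tn
    condition_pos, condition_neg = tp + fn, fp + tn
    asserted_pos, asserted_neg = tp + fp, fn + tn
    tpr, fpr = ratio(tp, condition_pos), ratio(fp, condition_neg)
    fnr, tnr = ratio(fn, condition_pos), ratio(tn, condition_neg)
    ppv = ratio(tp, asserted_pos)
    return {
        "prevalence": ratio(condition_pos, total),
        "precision (PPV)": ppv,
        "FDR": ratio(fp, asserted_pos),
        "FOR": ratio(fn, asserted_neg),
        "NPV": ratio(tn, asserted_neg),
        "accuracy": ratio(tp + tn, total),
        "recall (TPR)": tpr,
        "fall-out (FPR)": fpr,
        "FNR": fnr,
        "specificity (TNR)": tnr,
        "LR+": ratio(tpr, fpr),
        "LR-": ratio(fnr, tnr),
        "F1": ratio(2 * ppv * tpr, ppv + tpr),
    }

# Hypothetical counts: 10,000 cases, 100 of them condition positive.
measures = derived_measures(tp=80, fp=120, fn=20, tn=9780)
for name, value in measures.items():
    print(f"{name:>18}: {value:.4f}")
```

Note that several pairs are complements of one another (PPV and FDR, NPV and FOR, TPR and FNR, TNR and FPR), so each pair sums to 1.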
These basic measures and understandings have two further important roles beyond informing how to improve the accuracy and performance of IR and NLP algorithms and tests. The first is gold standards. The second is training sets.
Gold standards that themselves contain false positives and false negatives, by definition, immediately introduce errors. These errors make it difficult to test and refine existing IR and NLP algorithms, because the baseline is skewed. And, because gold standards also often inform training sets, errors there propagate into errors in machine learning. It is also important to include true negatives in a gold standard, in the likely ratio expected by the overall population, so that this complement of the accuracy measurement is not overlooked.
Once a gold standard is created, you then run your current test regime against it, just as you run those same tests against unknowns. Preferably, of course, the gold standard only includes true positives and true negatives (that is, the gold standard is the basis for judging “correctness”; see confusion matrix above). In the case of running an entity recognizer, your results against the gold standard can take one of three forms: you either have open slots (no entity asserted); slots with correct entities; or slots with incorrect entities. Thus, here is how you would create the basis for your statistical scores:
As noted before, these measures are sufficient to calculate the precision, recall, F-score and accuracy statistics. Also note that the F v T and P v N correspond to the gold standard “correctness” and what is asserted by the test(s), per the confusion matrix.
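The three forms of result for an entity recognizer can be turned into the four counts in a few lines. The mapping used here is one plausible convention and an assumption on my part: an asserted entity that matches the gold standard is a TP, an asserted entity that does not match (or where the gold standard has none) is an FP, an open slot where the gold standard has none is a TN, and an open slot where the gold standard has an entity is an FN. Slot names and entities are invented.

```python
def score_slots(gold, asserted):
    """Score recognizer output against a gold standard, slot by slot.

    Both arguments map a slot identifier to an entity name, or to None
    when the slot holds no entity (an 'open slot' on the test side).
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for slot, gold_entity in gold.items():
        test_entity = asserted.get(slot)  # missing slot counts as open
        if test_entity is None:
            counts["TN" if gold_entity is None else "FN"] += 1
        elif test_entity == gold_entity:
            counts["TP"] += 1
        else:
            # Wrong entity, or an entity asserted where there is none.
            counts["FP"] += 1
    return counts

# Hypothetical gold standard and recognizer output.
gold = {"s1": "Paris", "s2": None, "s3": "IBM", "s4": "Ohio", "s5": None}
asserted = {"s1": "Paris", "s2": "Apple", "s3": None, "s4": "Texas", "s5": None}

slot_counts = score_slots(gold, asserted)
print(slot_counts)
```

Other conventions exist (for example, also counting a wrong entity as a missed one), so whichever rule is chosen should be applied consistently across test runs.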
We can apply this same mindset to the second additional, important role in creating and evaluating training sets. Both positive and negative training sets are recommended for machine learning. Negative training sets are often overlooked. Again, if the learning is not based on true positives and negatives, then significant error may be introduced into the learning.
Clean, vetted gold standards and training sets are thus a critical component to improving our knowledge bases going forward [6]. The very practice of creating gold standards and training sets needs to receive as much attention as algorithm development because, without it, we are optimizing algorithms to fuzzy objectives.
The virtuous circle that occurs between more accurate standards and training sets and improved IR and ML algorithms is a central argument for knowledge-based artificial intelligence (KBAI). Continuing to iterate better knowledge bases and validation datasets is a driving factor in improving both the yield and quality from our rapidly expanding knowledge bases.
Mavlyutov et al. have posted a preprint [1] of their upcoming paper to be presented at ESWC at the end of the month covering the most efficient representation of URIs in information systems. All of us who do large-scale work with the semantic Web or linked data should be interested in these findings.
To my knowledge, the paper is the first one to explicitly evaluate common data structures for encoding, storing and retrieving URIs at scale. Because URIs are the unique identifiers for resources, there may be millions to billions of them needing to be stored and retrieved from triple stores or other database backends.
The authors compared a dozen different methods for storing URIs according to the standard needs to index, insert and retrieve URIs, including encoding and decoding, at scale. Memory and operation times were measured. The methods evaluated were specific RDF systems; various hash maps; various hash tables; binary search, B+, ART (adaptive radix), and lexicographic trees; and the HAT-trie.
Different operational needs may point to different methods. However, the authors conclude that “overall, the HAT-trie appears to be a good compromise taking into account all aspects, i.e., memory consumption, loading time, and lookups. ART also appears as an appealing structure, since it maintains the data in sorted order, which enables additional operations like range scans and prefix lookups, and since it still remains time and memory efficient.”
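To illustrate the operations being compared, here is a toy sketch of the two ideas at play: a hash-map dictionary that encodes URIs to integer IDs and decodes them back, and a prefix lookup that relies on sorted order. It is not the paper's implementation, nor a HAT-trie or ART; it uses plain Python containers, and the class name, function name and example URIs are all hypothetical.

```python
from bisect import bisect_left

class UriDictionary:
    """Toy URI <-> integer-ID dictionary (a hash map plus a list)."""

    def __init__(self):
        self._id_of = {}    # URI -> ID, for encoding
        self._uri_of = []   # ID -> URI, for decoding

    def encode(self, uri):
        """Return the ID for a URI, inserting it on first sight."""
        uri_id = self._id_of.get(uri)
        if uri_id is None:
            uri_id = len(self._uri_of)
            self._id_of[uri] = uri_id
            self._uri_of.append(uri)
        return uri_id

    def decode(self, uri_id):
        """Return the URI stored under an ID."""
        return self._uri_of[uri_id]

def prefix_scan(sorted_uris, prefix):
    """Find all URIs starting with prefix in an already sorted list."""
    start = bisect_left(sorted_uris, prefix)
    matches = []
    for uri in sorted_uris[start:]:
        if not uri.startswith(prefix):
            break  # sorted order keeps all matches contiguous
        matches.append(uri)
    return matches

uris = [
    "http://example.org/person/alice",
    "http://example.org/place/paris",
    "http://example.org/person/bob",
]
dictionary = UriDictionary()
ids = [dictionary.encode(u) for u in uris]
people = prefix_scan(sorted(uris), "http://example.org/person/")
print(ids, people)
```

A hash map gives fast exact lookups but no ordering, which is why the prefix scan here needs a separately sorted copy; structures that keep data sorted avoid that extra step.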
This paper should be a useful reference for any group that needs to manage URIs at scale.
The six months since the last major release of UMBEL (Upper Mapping and Binding Exchange Layer) have been spent in improving the coherence and broadening the usefulness of the ontology. Structured Dynamics is today releasing version 1.20 of the open source UMBEL.
UMBEL’s first purpose is to provide a general vocabulary of classes and predicates for describing domain ontologies, with the specific aim of promoting interoperability with external datasets and domains. UMBEL’s second purpose is to provide a coherent framework of reference subjects and topics for grounding relevant Web-accessible content. UMBEL presently has about 35,000 of these reference concepts drawn from the Cyc knowledge base, split into ‘core’ and a series of optional modules, which are organized into 32 mostly disjoint SuperTypes.
The key advances in this new 1.20 version of UMBEL include refinements to the UMBEL generator, improved tests for satisfiability and coherence, and additional mappings and structure to aid UMBEL’s role as a computing overlay for existing knowledge bases, such as Wikipedia. Part of the latter advance is being aided by the new addition of an Attributes Ontology to UMBEL, as described in the prior articles “An UMBEL Extension for Attributes” and “Conceptual and Practical Distinctions in the Attributes Ontology”.
These are the principal changes between the last public release, version 1.10, and this version 1.20:
- Entities SuperType, with 20,393 RCs designated. The Entities ST is by definition non-disjoint with UMBEL’s other SuperTypes
- Workplaces SuperType, and merged with the Facilities ST
- MarketIndustries SuperType, and merged with the Attributes ST
- Events and Activities SuperTypes was improved. See Annex Z for the updated ST assignment statistics

The Web and enterprises in general are characterized by growing, diverse and distributed information sources and data. Some of this information resides in structured databases; some resides in schema, standards, metadata, specifications and semi-structured sources; and some resides in general text or media where the content meaning is buried in unstructured form. Given these huge amounts of information, how can one bring together what subsets are relevant? And, then for candidate material that does appear relevant, how can it be usefully combined or related given its diversity? In short, how does one go about actually combining diverse information to make it interoperable and coherent?
UMBEL was conceived to provide a reference grounding to achieve these very aims. UMBEL’s vocabulary is designed to recognize that different sources of information have different contexts and different structures, and meaningful connections between sources are not always exact. UMBEL’s 35,000 reference concepts (drawn from the logically consistent Cyc knowledge base backed by 1,000 person-years of development and testing) provide a set of fixed references by which we can orient, map and navigate external content. These UMBEL reference concepts form a knowledge graph (you can see a big graph visualization of this structure) of subject nodes that may be related to external classes and individuals (instances and named entities). Via this coherent structure, we gain some important benefits:
UMBEL is being developed and refined via large-scale use cases. A number of improvements have been brought to the system to make it more testable, manageable, and flexible.
The first improvement was to introduce the so-called SuperTypes to UMBEL. All UMBEL reference concepts are assigned to one or more of 32 SuperTypes, organized into nine dimensions (details may be found here). The four SuperTypes of Attributes, Abstract-level, Entities and Topics/Categories are designed to be fully non-disjoint, and do not participate in any disjoint assertions. The remaining 28 SuperTypes are designed to be as disjoint as possible:
| Dimension | SuperTypes |
|---|---|
| Natural World | Natural Phenomena; Natural Substances; Earthscape; Extraterrestrial |
| Living Things | Prokaryotes; Protists & Fungus; Plants; Animals; Diseases; Person Types |
| Human Activities | Organizations; Finance & Economy; Society; Activities |
| Time-related | Events; Time |
| Human Works | Products; Food or Drink; Drugs |
| Human Places | Facilities; Geopolitical |
| Information | Chemistry (n.o.c); Audio Info; Visual Info; Written Info; Structured Info; Notations & References; Numbers |
| Descriptive | Attributes |
| Classificatory | Abstract-level; Entities; Topics/Categories |
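The value of disjoint SuperTypes for testing can be sketched as a simple consistency check: any reference concept assigned to two SuperTypes that have been declared disjoint is flagged for review. This is only an illustration and not the UMBEL generator's logic; the concept names, the declared disjoint pairs and the function name are all hypothetical.

```python
from itertools import combinations

# The four SuperTypes that take part in no disjoint assertions.
NON_DISJOINT = {"Attributes", "Abstract-level", "Entities", "Topics/Categories"}

def disjointness_violations(assignments, declared_disjoint):
    """List concepts assigned to a pair of SuperTypes declared disjoint.

    assignments       -- maps a concept name to its SuperTypes
    declared_disjoint -- set of frozensets, each a disjoint SuperType pair
    """
    violations = []
    for concept, supertypes in assignments.items():
        candidates = sorted(set(supertypes) - NON_DISJOINT)
        for st_a, st_b in combinations(candidates, 2):
            if frozenset((st_a, st_b)) in declared_disjoint:
                violations.append((concept, st_a, st_b))
    return violations

# Hypothetical disjoint assertions and concept assignments.
declared_disjoint = {
    frozenset(("Plants", "Animals")),
    frozenset(("Products", "Geopolitical")),
}
assignments = {
    "ExampleTree": ["Plants", "Entities"],
    "ExampleMisfit": ["Plants", "Animals"],
    "ExampleGadget": ["Products", "Topics/Categories"],
}

violations = disjointness_violations(assignments, declared_disjoint)
print(violations)
```

A flagged concept is not necessarily wrong; it may instead point to a disjoint assertion that is too strict, which is part of what such tests help uncover.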
To make UMBEL more tractable, we have also modularized it into ‘core’, ‘geo’, ‘entities’, and ‘attributes’ modules (the latter two modules being added in this new release). The modules can be swapped out with other external options or left out of analysis if not needed for a given domain interest. We also have formal mappings to other important external reference sets such as Wikipedia, OpenCyc, schema.org, the DBpedia ontology, GeoNames and PROTON. UMBEL’s GitHub site provides these mappings.
Beginning with version 1.10, we also added a new UMBEL generator written in Clojure that allows the entire system to be built and tested from a series of simple input files. We are now using this system aggressively to discover gaps and misassignments in the UMBEL structure, as well as to achieve balance in scope and coverage. The system ties into the OWL API for certain tests and capabilities (UMBEL is OWL 2-compliant).
Though UMBEL retains its same mission as when the system was first formulated eight years ago, we also see its role expanding. The two key areas of expansion are in UMBEL’s use to model and map instance data attributes and in acting as a computable overlay for Wikipedia (and other knowledge bases). These two areas of expansion are still a work in progress.
This UMBEL version 1.20 marks the first expression of the Attributes Ontology. While we have organized what already had existed in attribute concepts (that is, those concepts that capture the descriptive data related to how to characterize instance records), some gaps remain in both UMBEL and the source Cyc. Using the new ontology to map against the properties in the DBpedia and schema.org vocabularies is the next priority. These direct use cases are needed to ground the ontology in important, real-world information markup systems. We will also be looking at linking to an existing units and measurements ontology such as QUDT. There likely will need to be a series of releases over time to capture and test these uses.
The mapping to Wikipedia is now about 72% complete. While we are testing automated mapping mechanisms, because of its central role we also need to vet all UMBEL-Wikipedia mapping assignments. This effort is pointing out areas of UMBEL that are overspecified, underspecified, and sometimes duplicative. By placing UMBEL in an intermediate position between Cyc and Wikipedia we are finding differences and gaps on both ends, as well as gaps within UMBEL itself. Our goal is to get to a 100% coverage point with Wikipedia, and then to exercise the structure for machine learning and other tests against the KB. These efforts will enable us to enhance the semsets in UMBEL as well as to move toward multilingual versions. This effort, too, is still a work in progress.
Despite these desired enhancements, we are using all aspects of UMBEL and its mappings to both aid these expansions and to test the existing mappings and structure. These efforts are proving the virtuous circle of improvements that is at the heart of UMBEL’s purposes.
The UMBEL Web site provides various online tools and Web services for exploring and using UMBEL. The UMBEL GitHub site is where you can download the UMBEL Vocabulary or the UMBEL Reference Concept ontology, both under a Creative Commons Attribution 3.0 license. Other documents and backup are also available from that location.
Technical specifications for UMBEL and its various annexes are available from the UMBEL wiki site. You can also download a PDF version of the specifications from there. You are also welcome to participate on the UMBEL mailing list or LinkedIn group.