Posted: November 23, 2015

Knowledge Bases Enable a More Systematic Approach to Feature Engineering

The two most labor-intensive steps in machine learning for natural language are: 1) feature engineering; and 2) labeling of training sets. Supervised machine learning uses these training sets, in which every point is an input-output pair mapping an input (a set of feature values) to an output (the label). The machine learning consists of inferring (“learning”) a function that maps between these inputs and outputs with acceptable predictive power. This learned function can then be applied to previously unseen inputs in order to predict the output label. The technique is particularly suited to problems of regression or classification.

It is not surprising that the two most labor-intensive tasks in machine learning involve determining the suitable inputs (features) and correctly labeling the output training labels. Elsewhere in this series I discuss training sets and labeling in detail. For this current article, we focus on features.

“Features” are perhaps the least discussed aspect of machine learning. References are made to how to select them; how to construct, extract or learn them; or even how to engineer them overall. But little guidance is provided as to what features exactly are. There really is no listing or inventory of what “features” might even be considered in the various aspects of natural language or text understanding. In part, I think, because we lack this concrete feel for features, we also do not understand how to maximize and systematize the role of features in support of our learning tasks. This gap provides a compelling rationale for the advantages of properly constructed knowledge bases in support of artificial intelligence, what we have been terming KBAI in this series.

So, before we can understand how to best leverage features in our KBAI efforts, we need to first define and name the feature space. That effort, in turn, enables us to provide a bit of an inventory for what features might contribute to natural language or knowledge base learning. We then organize that inventory a bit to point out the structural and conceptual relationships among these features, which enables us to provide a lightweight taxonomy for the space.

Since many of these features have not been named or exposed before, we conclude the article with some discussion of what next-generation learners may gain by working against this structure. Of course, since much of this thinking is incipient, there are certainly forks and dead ends in what may unfold ahead, but there will also likely be unforeseen expansions and opportunities. A systematic view of machine learning in relation to knowledge and human language features — coupled with large-scale knowledge bases such as Wikipedia and Wikidata — can lead to faster and cheaper learners across a very broad range of NLP tasks [1].

What is a Feature?

A “feature” is “an individual measurable property of a phenomenon being observed” [2]. It is an input to a machine learner, an explanatory variable, sometimes in the form of a function. Features are sometimes equated to attributes, but this is not strictly true, since a feature may be a combination of other features, a statistical calculation, or an abstraction of other inputs. In any case, a feature must be expressed as a numeric value (including Boolean) upon which the machine learner can calculate its predictions. Machine learner predictions of the output can only be based on these numeric features, though they may be subject to rules and weights depending on the type of learner.

The importance of features and the fact they may be extracted or constructed from other inputs is emphasized in this quote from Pedro Domingos [3]:

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. . . . Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are. This is typically where most of the effort in a machine learning project goes. It is often also one of the most interesting parts, where intuition, creativity and ‘black art’ are as important as the technical stuff.”

Many experienced ML researchers make similar reference to the art or black art of features. In broad strokes, a feature may be a surface form, like terms or syntax or structure (such as hierarchy or connections); it may be derived (such as statistical, frequency, weighted or based on the ML model used); it may be semantic (in terms of meanings or relations); or it may be latent, as either something hidden or abstracted from feature layers below it. Unsupervised learning or deep learning features arise from the latent form.

For a given NLP problem domain, features can number into the millions or more. Concept classification, for example, could use features corresponding to all of the unique words or phrases in that domain. Relations between concepts could also be as numerous. To assign a value to such “high-dimensional” features, some form of vector relationship is calculated over, say, all of the terms in the space so that each term can be represented numerically [4]. Because learners may learn over multiple feature types, the potential combinations to be evaluated for the ML learner can literally be astronomical. This combinatorial problem has been known for decades, and has been termed the curse of dimensionality for more than 50 years [5].
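To make this concrete, here is a minimal sketch, assuming scikit-learn is available, of how terms become numeric features; each unique term adds a dimension, which is why feature counts grow so quickly. The documents are invented for illustration.

```python
# A minimal sketch of turning terms into numeric features; assumes
# scikit-learn (>= 1.0 for get_feature_names_out). Documents invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "knowledge bases support machine learning",
    "machine learning learns a function from features",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # one column per unique term
print(vectorizer.get_feature_names_out())   # the term feature space
print(X.toarray())                          # each document as a numeric vector
print(cosine_similarity(X)[0, 1])           # similarity between the two documents
```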

Of course, just because a feature exists says nothing about whether it is a piece of information that might be useful for ML predictions or not. Features may thus prove to be one of four kinds: 1) strongly relevant; 2) weakly relevant; 3) irrelevant; or 4) redundant [6]. Strongly relevant features should always be considered; weakly relevant may sometimes be combined to improve the overall relevancy. All irrelevant or redundant features should be removed from consideration. Generally, the fewer the features the better, so long as the features used are strongly relevant and orthogonal (that is, they capture different aspects of the prediction space).

A (Partial) Inventory and Taxonomy of Natural Language and KB Features

To make this discussion more tangible, we have assembled a taxonomy of feature types in the context of natural language and knowledge bases. This inventory is drawn from the limited literature on feature engineering and selection in the context of KBAI from the perspectives of ML learning in general [7, 8, 9], ML learning ontologies [10, 11, 12] and knowledge bases [13, 14, 15, 16, 17]. This listing is only partial, but does provide an inventory of more than 200 feature types applicable to natural language.

We have organized this inventory into eight (8) main areas, shown in non-italicized Bold, which tend to cluster into these four groupings:

  • Surface features — these are features that one can see within the source documents and knowledge bases. They include Lexical items for the terms and phrases in the domain corpus and knowledge base; Syntactical items that show the word order or syntax of the domain; Structural items that either split the documents and corpus into parts or reflect connections and organizations of the items, such as hierarchies and graphs; or Natural Language items that reflect how the content is expressed in the surface forms of various human languages
  • Derived features — are surface features that have been transformed or derived in some manner, such as the direct Statistical items or the Model-based ones reflecting the characteristics of the machine learners used
  • Semantic features — these are summarized under the Semantics area, and reflect what the various items mean or how they are conceptually related to one another, and
  • Latent features — these features are not observable from the source content. Rather, these are statistically derived abstractions of the features above that are one- to N-levels removed from the initial source features. These Latent items may either be individual features or entire layers of abstraction removed from the surface layer. These features result from applying unsupervised or deep learning machine learners.

Here is the taxonomy, again organized according to those same eight main areas:

Lexical
  • Cut-offs (top N)
  • Named entities
Syntactical
  • Complements (argument)
  • Dependency grammar
  • Head (linguistic)
  • Parts of speech (POS)
  • Word order
Statistical
  • Mutual information
  • Sample measures
  • Document frequencies
  • Frequencies (corpus)
  • String similarity
  • Cosine measures
  • Feature vectors
Structural
  • Node types
  • Document parts
  • Disambiguation pages
  • Discussion pages
  • Graphs (and ontologies)
  • Metrics (counts, averages, min/max)
  • Section hierarchy
  • Missing attributes
  • Missing values
  • Language versions
  • Linked data
  • See also
  • Dependency patterns
  • Surface patterns
  • Regular expressions
  • Source forms (blog posts, research articles, technical documents, microblogs (tweets), web pages)
  • Breadth measures
  • Depth measures
Semantics (most also subject to Syntactical and Statistical features above)
  • Alternative labels
  • Preferred labels
  • Association rules
  • See also
  • Attribute Types
  • Eponymous pages
  • Grouped concepts (topics)
  • Hypernym-based feature vectors
  • Entity Types
  • General semantic feature vectors
  • Relation Types
  • Logical conjunctions
  • Mereology (part of)
  • Viewpoint (World view)
Natural Languages
  • Word order
Latent
  • Many; dependent on method
Model-based
  • Decision tree
  • Tree measures
  • Feature characteristics
  • Standard deviation
  • Factor graphs
  • Learner accuracy
  • Method measures
  • Error rates
Table 1. A (Partial) Taxonomy of Machine Learning Features

This compilation exceeds any prior listings. In fact, most of the feature types shown have never been applied to NLP machine learning tasks. We now turn the discussion to why this is.

Mindset and Knowledge Bases

When one sees the breadth of impressive knowledge discovery tasks utilizing large-scale knowledge bases [18], exemplified by hundreds of research papers regarding NLP tasks utilizing Wikipedia [19], it is but a small stretch to envision a coherent knowledge base leveraging this content for the express purpose of making text-based machine learning systematic and less expensive. Expressed as an objective function, we now have clear guidance for how to re-organize and re-express the source content information (Wikipedia, among others) to better support a ML learning factory. This idea, and how it is driving Structured Dynamics’ contracts and research, is the mindset.

Rather than the singleton efforts to leverage knowledge bases for background knowledge, as has been the case to date, we can re-structure the knowledge source content underneath a coherent knowledge graph. This re-organization makes the entire knowledge structure computable and amenable to machine learning. It also enables the same learning capability to be turned upon itself (see image here), thereby working to improve the coverage and accuracy of the starting KB, all in a virtuous circle. Because of this mindset, we can also now look at the native structure of the KBs and work to expose still more features, providing still further grist to the next generation of ML learners. Fully 50% of the features listed in the inventory in Table 1 above arise from these unique KB aspects, especially in the areas of Semantics and Structural, including graph relationships.

Many, if not most, of these new feature possibilities may prove redundant or only weakly relevant. Not all features will prove useful, though some not generally used by broader learners, such as letter case, may be effectively employed for named entity or specialty extractions, such as for copyrights, unique IDs or data types. Because many of these KB features cover quite orthogonal aspects of the source knowledge bases, the likelihood of finding new, strongly relevant features is high. Further, except for the Latent and Model-based areas, each of these feature types may be used singly or in combination to create coherent slices for both positive and negative training sets, helping to reduce the effort for labor-intensive labeling as well. By extension, these capabilities can also be applied to more effectively bootstrap the creation of gold standards, useful when parameters are being tested for the various machine learners.
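As a rough illustration of this slicing idea, the sketch below uses typed entities and disjointness assertions to assemble positive and negative training sets. The entity_types dictionary and the disjointness table are hypothetical stand-ins for a real knowledge base.

```python
# A hedged sketch: derive positive and negative training sets from a
# KB's type assignments plus disjointness assertions (invented data).
entity_types = {
    "Calgary": "City",
    "Mississippi River": "River",
    "Abraham Lincoln": "Person",
    "Ottawa": "City",
}

# Disjointness assertions, e.g., derived from SuperTypes:
# a City is never a Person or a River.
disjoint_with = {"City": {"Person", "River"}}

def training_sets(target_type):
    positives = [e for e, t in entity_types.items() if t == target_type]
    negatives = [e for e, t in entity_types.items()
                 if t in disjoint_with.get(target_type, set())]
    return positives, negatives

pos, neg = training_sets("City")
print(pos)  # ['Calgary', 'Ottawa']
print(neg)  # ['Mississippi River', 'Abraham Lincoln']
```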

Though the literature most often points to classification as the primary use of knowledge bases as background knowledge supporting machine learners, in fact many NLP tasks may leverage KBs. Here is but a brief listing of application areas for KBAI:

  • Entity recognizers
  • Relation extractors
  • Classifiers
  • Q & A systems
  • Knowledge base mappings
  • Ontology development
  • Entity dictionaries
  • Data conversion and mapping
  • Master data management
  • Specialty extractors
Table 2. NLP Applications for Machine Learners Using KBs

Surely other applications will emerge as this more systematic KBAI approach to machine learning evolves over the coming years.

Feature Engineering is an Essential Component

As noted, this richness of feature types leads to the combinatorial problem of too many features. Feature engineering is important both to find the features of strongest relevance and to reduce the dimensionality of the feature space, which speeds the ML learning times.

Initial feature engineering tasks should be to transform the input data, regularize it if need be, and create numeric vectors for new features. These are basically preparation tasks to convert the source or target data to forms amenable to machine learning. This staging then enables us to discover the most relevant (“strong”) features for the given ML method under investigation.
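Here is a minimal sketch of that staging, assuming scikit-learn; the raw documents are invented, and the explicit normalization step is included only to make the “regularize” stage visible.

```python
# A minimal sketch of the staging steps: transform, regularize,
# vectorize. Assumes scikit-learn; documents are invented examples.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

prep = Pipeline([
    ("vectorize", TfidfVectorizer(lowercase=True, stop_words="english",
                                  norm=None)),   # raw TF-IDF vectors
    ("regularize", Normalizer(norm="l2")),       # scale to unit length
])

raw_docs = ["Knowledge bases enable systematic feature engineering.",
            "Labeling training sets is labor-intensive."]
X = prep.fit_transform(raw_docs)
print(X.shape)   # (n_documents, n_term_features)
```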

In a KB context, specific learning tasks as outlined in Table 2 are often highly patterned. The most effective features for training, say, an entity recognizer, will only involve a limited number of strongly relevant feature types. Moreover, the relevant feature types applicable to a given entity type should mostly apply to other entity types, even though the specific weights and individually important features will differ. This patterned aspect means that once a given ML learner is trained for a given entity type, its relevant feature types should be approximately applicable to other related entity types. The lengthy process of initial feature selection can be reduced as training proceeds for similar types. It appears that combinations of feature types, specific ML learners and methods to create training sets and gold standards may be discovered for entire classes of learning tasks. These combinations can be discovered, tested and repeated for new specific tasks within a given application cluster.

Probably the most time-consuming and demanding aspect of these patterned approaches resides in feature selection and feature extraction.

Feature selection is the process of finding a subset of the available feature types that provide the highest predictive value while not overfitting [20]. Feature selection is typically split into three main approaches [6, 21, 22]:

  • Filter — select the N most promising features based on a ranking from some form of proxy measure, such as mutual information (a measure of the information gain from using a given feature type) or the Pearson correlation coefficient
  • Wrapper — wherein feature subsets are tested through a greedy search heuristic that either starts with an empty set and adds features (forward selection) keeping the “strongest” ones, or starts with a full set and gradually removes the “weakest” ones (backward selection); the wrapper approach may be computationally expensive, or
  • Embedded — wherein feature selection is a part of model construction.

For high-dimensional features, such as terms and term vectors, one may apply stoplists or cut-offs (only considering the top N most frequent terms, for example) to reduce dimensionality. Part of the “art” portion resides in knowing which feature candidates may warrant formal selection or not; this learning can be codified and reused for similar applications. Extractions and some unsupervised learning tests may also be applied at this point in order to discover additional “strong” features.
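A hedged sketch of the filter approach with a top-N cut-off follows, again assuming scikit-learn; the documents and labels are invented.

```python
# Filter-style selection: rank term features by mutual information
# and keep the top N. Assumes scikit-learn; data is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

docs = ["the keel-billed toucan", "digital SLR camera",
        "toucan species range", "camera lens models"]
labels = [0, 1, 0, 1]   # e.g., animal vs. product

X = CountVectorizer().fit_transform(docs)
selector = SelectKBest(mutual_info_classif, k=4)   # top-N cut-off
X_reduced = selector.fit_transform(X, labels)
print(X_reduced.shape)   # only the 4 strongest features remain
```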

Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. Functions create new features in the form of Latent variables, which are not directly observable. Also, because these are statistically derived values, many input features are reduced to a few synthetic measures, which naturally causes a reduction in dimensionality. Advantages from a reduction in dimensionality include:

  1. Often a better feature set (resulting in better predictions) [23]
  2. Faster computation and smaller storage
  3. Reduction in collinearity due to reduction in weakly interacting inputs
  4. Easier graphing and visualization.

On the other hand, latent features are abstractions, and so are not as easily understood as literal ones.
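For illustration, here is a minimal sketch of feature extraction into a few latent dimensions using truncated SVD (the basis of latent semantic analysis), assuming scikit-learn; the documents are invented.

```python
# Feature extraction into latent variables via truncated SVD.
# Assumes scikit-learn; documents are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["entity types bridge instances to the schema",
        "attribute types characterize individual entities",
        "relation types connect entities to one another"]

X = TfidfVectorizer().fit_transform(docs)   # high-dimensional term space
svd = TruncatedSVD(n_components=2)          # a few latent dimensions
X_latent = svd.fit_transform(X)
print(X_latent.shape)                       # (3, 2): abstractions, not terms
```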

In deep learning, multiple layers of these latent features are generated as the system learns. But latent passes may also be combined with observable features, which is one way that evaluations of what a document means can be applied across multiple input forms of the content.

Of course, it is also possible to combine the predictions from multiple ML methods, which raises the question of ensemble scoring. Surely we will also see these more systematic approaches to machine learning themselves become subject to self-learning (that is, metalearning), such that the overall learning process can proceed in a more automated way.
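A brief sketch of one such ensemble follows, assuming scikit-learn; the estimator choices are arbitrary and the training data is left as a placeholder.

```python
# Combining predictions from multiple learners via voting.
# Assumes scikit-learn; X_train / y_train are placeholders.
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("nb", MultinomialNB()),
    ],
    voting="soft",   # average predicted probabilities for ensemble scoring
)
# ensemble.fit(X_train, y_train); ensemble.predict(X_new)
```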

Considerations for a Feature Science

In supervised learning, it is clear that more time and attention have been given to the labeling of the data, that is, what the desired output of the model should be. Much less time and attention have been devoted to features, the input side of the equation. As a result, much needs to be done. The purposeful use of knowledge bases, and structuring them properly, is one of the ways progress will be made.

But progress also requires answers to some basic questions. A scientific approach to the feature space would likely need to consider, among other objectives:

  • Full understanding of surface, derived and latent features
  • Relating various use cases and problems to specific machine learners and classes of learners
  • Relating specific machine learners to the usefulness of particular features (see also hyperparameter optimization and model selection)
  • Improved methods for feature engineering and construction
  • Improved methods for feature selection
  • A better understanding of how to select and match supervised and unsupervised ML.

Some tools and utilities would also help to promote this progress. Some of these capabilities include:

  • Feature inventories — how to create and document taxonomies of feature types
  • Feature generation — methods for codification of leading recipes
  • Feature transformations — the same for transformations, up to and including vector creation
  • Feature validation — ways to test feature sets in standard ways.

Role of a Platform

The object of these efforts is to systematize how knowledge bases, combined with machine learners, can speed the deployment and lower the cost of creating tailored artificial intelligence applications of natural language for specific domains. This installment in our KBAI series has focused on the role and importance of features. There is an abundance of opportunity in this area, and an abundance of work required, but little systematization.

The good news is that platforms are possible that can build, manage, and grow the knowledge bases and knowledge graphs supporting machine learning. Machine learners can be applied in a pipeline manner to these KBs, including orchestrating the data flows in generating and testing features, running and testing learners, creating positive and negative training sets, and establishing gold standards. The heart of the platform must be an appropriately structured knowledge base organized according to a coherent knowledge graph; this is the present focus of Structured Dynamics’ efforts.

In the real world, engagements always demand unique scope and unique use cases. Platforms should be engineered that enable ready access, extensions, configurations, and learners. It is important to structure the KBs such that slices and modules can be specified, and all surface attributes may be selected and queried. Mapping to external schema is also essential. Background knowledge from a coherent knowledge base is the way to fuel this.

[1] Features apply to any form of machine learning, including for things like image, speech and pattern recognition. However, this article is limited to the context of natural language, unstructured data and knowledge bases.
[2] See the feature entry from Wikipedia, which itself is based upon Christopher Bishop, 2006. Pattern Recognition and Machine Learning. Berlin: Springer. ISBN 0-387-31073-8.
[3] Pedro Domingos, 2012. “A Few Useful Things to Know About Machine Learning,” Communications of the ACM 55, no. 10 (2012): 78-87.
[4] For example, in the term or phrase space, the vectors might be constructed from counts, frequencies, cosine relationships between representative documents, distance functions between terms, etc.
[5] Richard Ernest Bellman, 1957. Dynamic Programming, Rand Corporation, Princeton University Press, ISBN 978-0-691-07951-6, as republished as Richard Ernest Bellman, 2003. Dynamic Programming, Courier Dover Publications, ISBN 978-0-486-42809-3.
[6] Isabelle Guyon and André Elisseeff, 2006. “An Introduction to Feature Extraction,” in Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi A. Zadeh, eds., Feature Extraction: Foundations and Applications, pp. 1-25. Springer Berlin Heidelberg, 2006.
[7] Haussler, David, 1999. Convolution Kernels on Discrete Structures, Vol. 646. Technical Report UCSC-CRL-99-10, Department of Computer Science, University of California at Santa Cruz, 38 pp., July 8, 1999.
[8] Reif, Matthias, Faisal Shafait, Markus Goldstein, Thomas Breuel, and Andreas Dengel, 2014. “Automatic Classifier Selection for Non-experts,” Pattern Analysis and Applications 17, no. 1 (2014): 83-96.
[9] Tang, Jiliang, Salem Alelyani, and Huan Liu, 2014. “Feature Selection for Classification: A Review.” Data Classification: Algorithms and Applications (2014): 37.
[10] Melanie Hilario, Phong Nguyen, Huyen Do, Adam Woznica, and Alexandros Kalousis, 2011 “Ontology-based Meta-mining of Knowledge Discovery Workflows,” in Meta-Learning in Computational Intelligence, pp. 273-315. Springer Berlin Heidelberg, 2011.
[11] Panče Panov, Larisa Soldatova, and Sašo Džeroski, 2014. “Ontology of Core Data Mining Entities,” Data Mining and Knowledge Discovery 28, no. 5-6 (2014): 1222-1265.
[12] See the general KBAI category entries on M.K. Bergman, AI3:::Adaptive Information blog, various dates.
[13] Ivo Anastacio, Bruno Martins and Pavel Calado, 2011. “Supervised Learning for Linking Named Entities to Knowledge Base Entries,” in Proceedings of the Text Analysis Conference (TAC2011).
[14] Weiwei Cheng, Gjergji Kasneci, Thore Graepel, David Stern, and Ralf Herbrich, 2011. “Automated Feature Generation from Structured Knowledge,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 1395-1404. ACM, 2011.
[15] Lan Huang, David Milne, Eibe Frank, and Ian H. Witten, 2012. “Learning a Concept‐based Document Similarity Measure.” Journal of the American Society for Information Science and Technology 63, no. 8 (2012): 1593-1608.
[16] Olena Medelyan, Catherine Legg, David Milne and Ian H. Witten, 2008. Mining Meaning from Wikipedia, Working Paper Series ISSN 1177-777X, Department of Computer Science, The University of Waikato (New Zealand), September 2008, 82 pp.
[17] Hui Shen, Mika Chen, Razvan Bunescu, and Rada Mihalcea, 2012. “Wikipedia Taxonomic Relation Extraction using Wikipedia Distant Supervision,” Ann Arbor 1001: 48109.
[18] Conventional knowledge bases have also been supplemented with massive-scale statistical bases, most often created from major search engine indexes; see the section on ‘Statistical Corpora’ in M.K. Bergman, 2014. “Knowledge-based Artificial Intelligence,” AI3:::Adaptive Information blog, November 14, 2014.
[19] See M.K. Bergman, “SWEETpedia,” listing of Wikipedia research articles, on AI3:::Adaptive Information blog, January 25, 2010. The listing as of its last update included 246 articles; also, see Wikipedia’s own “Wikipedia in Academic Studies.”
[20] Overfitting is where a statistical model, such as a machine learner, describes random error or noise instead of the underlying relationship. It is particularly a problem in high-dimensional spaces, a common outcome of employing too many features.
[21] George H. John, Ron Kohavi, and Karl Pfleger, 1994. “Irrelevant features and the subset selection problem.” In Machine Learning: Proceedings of the Eleventh International Conference, pp. 121-129. 1994.
[22] See especially slide #11 in Zdeněk Žabokrtský, 2015. “Feature Engineering in Machine Learning,” Machine Learning Methods course, Institute of Formal and Applied Linguistics, Charles University in Prague, Czech Republic.
[23] If constructed properly, deep learning models can be effective feature extractors over high-dimensional data; see Geoffrey E. Hinton, 2009. “Deep Belief Networks,” Scholarpedia 4 (5): 5947, which references an earlier paper, Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. “A Fast Learning Algorithm for Deep Belief Nets,” Neural Computation 18, no. 7 (2006): 1527-1554.
Posted: November 4, 2015

AI3 Pulse

It’s Time to Keep Some Powder Dry

Here is an excerpt from yesterday’s post by M.N. Gordon* dealing with the concept, familiar to computer scientists, of ‘garbage in, garbage out’ (GIGO) as applied to monetary policy:

“The Fed believes that by fixing the price of money artificially low, they’ll increase something they call ‘aggregate demand.’  The thesis is that cheap credit will compel individuals and businesses to borrow more and consume more.  Before you know it, the good times will be here again.  Profits will increase.  Jobs will be created.  Wages will rise.  A new cycle of expansion will take root.  Sounds great, doesn’t it?

“In practice, however, the results are destructive.  While cheap credit may have a stimulative influence on an economy with moderate debt levels, once an economy has reached total debt saturation, where the economy can no longer support its debt overhang, the cheap credit trick no longer works to stimulate the economy.  Like applying additional fertilizer to an already overstimulated crop field, the marginal return of each unit of additional credit in terms of new growth diminishes to nothing.  In fact, the additional credit, and its counterpart debt, actually strangles future growth.

“The experience following the Great Recession is that the abundance of cheap credit floods not into the economy, but into asset prices…grossly distorting them in the process.  The simple fact is solving the problem of too much debt by pushing more debt doesn’t solve the problem at all.  It makes it worse.

“. . . .  The point is the radical monetary policy interventions being employed by the Fed to somehow improve the economy are being guided by garbage.”

I don’t often comment on economic or political matters, but I have deep near-term concerns about debt and monetary policy.

* M.N. Gordon, 2015. “Garbage In Garbage Out Economics,” from Economic Prism blog, November 3, 2015.

Posted: November 2, 2015

Knowledge-based Artificial Intelligence Provides a Systematic Basis for Machine Learning

Inherent in Structured Dynamics’ discussions about knowledge-based artificial intelligence (KBAI) have been some embedded premises. Some of my prior articles — and future ones to come — elaborate more fully on one or more of these points:

  1. The electronic availability of content-rich knowledge bases has been the most important catalyst for recent AI advances in natural language and information processing
  2. Wikipedia, and its DBpedia and now Wikidata derivatives, has been the most important source of concept and entity information for these purposes
  3. None of these sources is coherently organized; attempts to use lexical relationships (WordNet) or Wikipedia itself (DBpedia ontology) to re-organize the content are also not coherent
  4. Despite this incoherence, these knowledge bases have already been used to train many distant supervised machine learning applications; but, in efforts to date, each application has been manually trained, which is inefficient and time consuming
  5. Fortunately, these knowledge bases can be mapped to a coherent structure; there are perhaps several options; we have chosen Cyc
  6. Once the potential role of KBs to inform machine learning is understood, the usefulness of re-expressing the KBs to maximize the features available for machine learning becomes obvious, including disjointedness assertions to enable selection of positive and negative training sets
  7. Specific aspects of the KBs for which such re-organization is appropriate include concepts, types, entities, relations, events, attributes and statements
  8. Therefore, a systematic re-organization of these KBs to support feature and training set generation can help automate and lower the cost of machine learning pipelines
  9. Once these features and aspects are established, the result becomes a grounding structure, which can facilitate mappings to other knowledge structures, data interoperability and information integration
  10. These same principles can be applied to existing or new knowledge bases, thereby increasing the scope and usefulness of the knowledge structure in a virtuous circle.

Precise definitions for all of the italicized terms are provided in the related glossary.

Posted: September 21, 2015

Practical and Reusable Designs to Make Knowledge Bases Computable

Wikipedia is a common denominator in question answering and commercial natural language applications that leverage artificial intelligence. Witness Siri, Watson, Cortana and Google Now, among others. DBpedia is a structured data representation of Wikipedia that makes much of this content machine readable. Wikidata is a multilingual repository of 18 million structured entities now feeding the Wikipedia ecosystem. The availability of these sources is remaking and accelerating the role of knowledge bases in powering the next generation of artificial intelligence applications. But much, much more is possible.

All of these noted knowledge bases lack a comprehensive and coherent knowledge structure. They are not computable, nor can they be reasoned over or used for inference. While they are valuable resources for structured data and content, the vast potential in these storehouses remains locked up. Yet the relevance of these sources to drive an artificial intelligence platform geared to data and content is profound.

And what makes this potential profound? Well, properly structured, knowledge bases can provide the features, and the generation of positive and negative training sets, useful to machine learning. Coherent organization of the knowledge graph within the KB’s domain enables various forms of reasoning and inference, further useful to making fine-grained recognizers, extractors and classifiers applicable to external knowledge. As I have pointed out before with regard to knowledge-based artificial intelligence (or KBAI) [1], these capabilities can work to extract still more accurate structure and knowledge from the knowledge base, creating a virtuous circle of still further improvements.

In all fairness, the Wikipedia ecosystem was not designed to be a computable one. But the free and open access to content in the Wikipedia ecosystem has sparked an explosion of academic and commercial interest in using this knowledge, often in DBpedia machine-readable form. Yet, despite this interest and more than 500 research papers in areas leveraging Wikipedia for AI and NLP purposes [2], the efforts remain piecemeal and unconnected. Yes, there is valuable structure and content within the knowledge bases; yes, they are being exploited both for high-value bespoke applications and for small research projects; but, across the board, these sources are not being used or leveraged in anything approaching a systematic nature. Each distinct project requires anew its own preparations and staging.

And it is not only Wikipedia that is neglected as a general resource for AI and semantic technology applications. One is hard-pressed to identify any large-scale knowledge base, available in electronic form, that is being sufficiently and systematically exploited for AI or semantic technology purposes [3]. This gap is really rather perplexing. Why the huge disconnect between potential and reality? Could this gap somehow be related to also why the semantic technology community continues to bemoan the lack of “killer apps” in the space? Is there something possibly even more fundamental going on here?

I think there is.

We have spent eight years so far on the development and refinement of UMBEL [4]. It was intended initially to be a bridge between unruly Web content and reasoning capabilities in Cyc to enable information interoperability on the Web [5]; an objective it still retains. Naturally, Wikipedia was the first target for mapping to UMBEL [6]. Through our stumbling and bumbling and just serendipity, we have learned much about the specifics of Wikipedia [6], aspects of knowledge bases in general, and the interface of these resources to semantic technologies and artificial intelligence. The potential marriage between Cyc, UMBEL and Wikipedia has emerged as a first-class objective in its own right.

What we have learned is that it is not any single thing, but multiple things, that is preventing knowledge bases from living up to their potential as resources for artificial intelligence. As I trace some of the sources of our learning below, note that it is a combination of conceptual issues, architectural issues, and terminological issues that need to be overcome in order to see our way to a simpler and more responsive approach.

The Learning Process Began with UMBEL’s SuperTypes

Shortly after the initial release of UMBEL we undertook an effort in 2009 to split it into a number (initially 33) of mostly disjoint “super types” [7]. This logical segmentation was done for practical reasons of efficiency and modularity. It forced us to consider what is a “concept” and what is an “entity”, among other logical splits. It caused us to inspect the entire UMBEL knowledge space, and to organize and arrange and inspect the various parts and roles of the space.

We began to distinguish “attributes” and “relations” as separate from “concepts” and “entities”. Within the clustering of “entities” we could also see that some things were distinct individuals or entity instances, while other terms represented “types” or classes of like entities. At that time, “named entity” was a more important term of art than is practiced today. In looking at this idea we noted [7]:

The intuition surrounding “named entity” and nameable “things” was that they were discrete and disjoint. A rock is not a person and is not a chemical or an event. … some classes of things could also be treated as more-or-less distinct nameable “things”: beetles are not the same as frogs and are not the same as rocks. While some of these “things” might be a true individual with a discrete name, such as Kermit the Frog, or The Rock at Northwestern University, most instances of such things are unnamed. . . . The “nameability” (or logical categorization) of things is perhaps best kept separate from other epistemological issues of distinguishing sets, collections, or classes from individuals, members or instances.

Because we were mapping Cyc and Wikipedia using UMBEL as the intermediary, we noticed that some things were characterized as a class in one system, while being an instance in the other [8]. In essence, we were learning the vocabulary of knowledge bases, and beginning to see that terminology was by no means consistent across systems or viewpoints.

This reminds me of my experience as an undergraduate, learning plant taxonomy. We had to learn literally hundreds of strange terms such as glabrous or hirsute or pinnate, all terms of art for how to describe leaves, their shapes, their hairiness, fruits and flowers and such. What happens, though, when one learns the terminology of a domain is that one’s eyes are opened to see and distinguish more. What had previously been for me a field of view merely of various shades of green and shrubs and trees, emerged as distinct species of plants and individual variants that I could clearly discern and identify. As I learned nuanced distinctions I begin to be able to see with greater clarity. In a similar way, the naming and distinguishing of things in our UMBEL SuperTypes was opening up our eyes to finer and more nuanced distinctions in the knowledge base. All of this striving was in order to be able to map the millions and millions of things within Wikipedia to a different, coherent structure provided by Cyc and UMBEL.

ABox – TBox and Architectural Basics

One of the clearest distinctions that emerged was the split between the TBox and the ABox in the knowledge base, the difference between schema and instances. Through the years I have written many articles on this subject [9]. It is fundamental to understand the differences in representation and work between these two key portions of a knowledge base.

Instances are the specific individual things in the KB that are relevant to the domain. Instances can be many or few, as in the millions within Wikipedia, accounting for more than 90% of its total articles. Instances are characterized by various types of structured data, provided as key attribute-value pairs, and which may be explained through long or short text descriptions, may have multiple aliases and synonyms, may be related to other instances via type or kind or other relations, and may be described in multiple languages. This is the predominant form of content within most knowledge bases, perhaps best exemplified by Wikipedia.

The TBox, on the other hand, needs to be a coherent structural description of the domain, which expresses itself as a knowledge graph with meaningful and consistent connections across its concepts. Somewhat irrespective of the number of instances (the ABox) in the knowledge base, the TBox is relatively constant in size given a desired level of descriptive scope for the domain. (In other words, the logical model of the domain is mostly independent from the number of instances in the domain.)

For a reference structure such as UMBEL, then, the size of its ontology (TBox) can be much smaller and defined with focus, while still being able to refer to and incorporate millions of instances, as is the case for Wikipedia (or virtually any large knowledge base). Two critical aspects for the TBox thus emerge. First, it must be a coherent and reasonable “brain” for capturing the desired dynamics and relationships of the domain. And, second, it must provide a robust, flexible, and expandable means for incorporating instance records. This latter “bridging” purpose is the topic of the next sub-section.

The TBox-ABox segregation, and how it should work logically and pragmatically, requires much study and focus. It is easy to read the words and even sometimes to write them, but it has taken us many years of practice and much thought and experimentation to derive workable information architectures for realizing and exploiting this segregation.

I have previously spelled out seven benefits from the TBox-ABox split [10], but there is another one that arises from working through the practical aspects of this segregation. Namely, an effective ABox-TBox split compels an understanding of the roles and architecture of the TBox. It is the realization of this benefit that is the central basis for the insights provided in this article.
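To make the segregation concrete, here is a simplified sketch using Python and rdflib (a choice for illustration only; UMBEL itself is expressed in OWL 2). The namespace and all names are invented, borrowing the toucan example used later in this article.

```python
# A simplified sketch of the TBox-ABox split; assumes rdflib.
# Namespace and names are illustrative only.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.org/kb/")
g = Graph()

# TBox: schema-level statements -- the conceptual "brain"
g.add((EX.Toucan, RDFS.subClassOf, EX.Bird))
g.add((EX.Bird, RDFS.subClassOf, EX.Animal))

# ABox: instance-level statements -- the individual things
g.add((EX.PrettyBird, RDF.type, EX.Toucan))
g.add((EX.PrettyBird, RDFS.label, Literal("Pretty Bird")))

# Simple computation across the bridge: walk the type hierarchy
for cls in g.transitive_objects(EX.Toucan, RDFS.subClassOf):
    print(cls)   # Toucan, Bird, Animal
```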

We’ll be spelling out more of these specifics in the sections below. These understandings help us define the composition and architecture of the TBox. In the case of the current development version of UMBEL [11], here are the broad portions of the TBox:

Distribution of Types in the UMBEL TBox

Structures (types) for organizing the entities in the domain constitute nearly 90% of the work of the TBox. This reflects the extreme importance of entity types to the “bridging” function of the TBox to the ABox.

Probing the Concept of ‘Entities’

Most of the instances in the ABox are entities, but what is an “entity”? Unfortunately, that is not universally agreed. In our parlance, an “entity” and related terms are:

  • The basic, real things in our domain of interest: entities
  • The way we characterize and describe those individual things: attributes
  • The way we describe connections between two or more of those things: relations, and
  • Aggregations or collections or classes of similar entities, which also share some essence: entity types.

We no longer use the term named entities, though nouns with proper names are almost always entities. By definition, entities cannot be topics or types, and entities are not data types. Some of the earlier typologies by others, such as Sekine [12], also mix the ideas of attributes and entities; we do not. Lastly, by definition, entity types have the same attribute “slots” as all type members, even if no data is assigned in many or most cases. The glossary presents a complete compilation of terms and acronyms used.

The label “entity” can also refer to what is known as the root node in some systems such as SUMO [13]. In the OWL language and RDF data model we use, the root node is known as “thing”. Clearly, our use of the term “entity” is much different from SUMO’s and resides at a subsidiary place in the overall TBox hierarchy. In this case, and frankly for most semantic matches, equivalences should be judged with care, with context the crucial deciding factor.

Nonetheless, most practitioners do not use “entity” in a root manner. Some of the first uses were in the Message Understanding Conferences, especially MUC-6 and MUC-7 in 1995 and 1997, where competitions for finding “named entities” were begun, as well as the practice of in-line tagging [14]. However, even the original MUC conferences conflated proper names and quantities of interest under “named entities.” For example, MUC-6 defined person, organization, and location names, all of which are indeed entity types, but also included dates, times, percentages, and monetary amounts, which we define as attribute types.

It did not take long for various groups and researchers to want more entity types, more distinctions. BBN categories, proposed in 2002, were used for question answering and consisted of 29 types and 64 subtypes [15]. Sekine put forward and refined over many years his Extended Entity Types, which grew to about 200 types [12]. But some of these accepted ‘named entities’ are also written in lower case, with examples such as rocks (‘gneiss’) or common animals or plants (‘daisy’) or chemicals (‘ozone’) or minerals (‘mica’) or drugs (‘aspirin’) or foods (‘sushi’) or whatever. Some deference was given to the idea of Kripke’s “rigid designators” as providing guidance for how to identify entities; rigid designators include proper names as well as certain natural kind terms like biological species and substances. Because of these blurrings, the nomenclature of “named entities” began to fade away.

Within but a few years, though, the demand arose for “fine-grained entity” recognition, and the scope and number of types continued to creep upward. Here are some additional data points to those already mentioned:

  • DBpedia Ontology: 738 types [16]
  • 636 types [17]
  • YAGO: 505 types; see also HYENA [18]
  • Lee et al.: 147 types [19]
  • FIGER: 112 types [20]
  • Gillick: 86 types [21]
  • OpenCalais: 42 types [22]
  • GeoNames: 654 “feature codes” [23]
  • Nadeau: ~100 types [24].

Lastly, the new version of UMBEL has 25,000 entity types, in keeping with this growth trend and for the “bridging” reasons discussed below.

We can plot this out over time on a log scale to see that the proposed entity types have been growing exponentially:

Growth in Recognition of Entity Types
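Because the underlying data points are given in the text, the chart can be roughly reproduced as follows, assuming matplotlib; the dates and counts are approximations drawn from the sources cited above.

```python
# A rough reconstruction of the trend chart; assumes matplotlib.
# Dates and counts approximate the sources cited in the text.
import matplotlib.pyplot as plt

points = {1997: 7,      # MUC-7 named entity types
          2002: 29,     # BBN types
          2006: 147,    # Lee et al.
          2012: 112,    # FIGER
          2015: 25000}  # UMBEL entity types

plt.semilogy(list(points.keys()), list(points.values()), "o-")
plt.xlabel("Year")
plt.ylabel("Entity types (log scale)")
plt.title("Growth in Recognition of Entity Types")
plt.show()
```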

This growth in entity types comes from wanting to describe and organize things with more precision. No longer do we want to talk broadly about people, but we want to talk about astronauts or explorers. We don’t just simply want to talk about products, but categories of manufactured goods like cameras or sub-types like SLR cameras or further sub-types like digital SLR cameras or even specific models like the Canon EOS 7D Mark II (skipping over even more intermediate sub-types). With sufficient instances, it is possible to train recognizers for these different entity types.

What is appropriate for a given domain, enterprise or particular task may vary the depth and scope of what entity types should be considered, which we can also refer to as context. For example, the toucan has sometimes been used as an example of how to refer to or name a thing on the Web [25]. When we inspect what might be a definitive description of “toucan” on Wikipedia, we see that the term more broadly represents the family of Ramphastidae, which contains five genera and forty different species. The picture we display is but one of those forty species, that of the keel-billed toucan (Ramphastos sulfuratus). Viewing the images of the full list of toucan species shows just how physically divergent these various “toucans” are from one another. Across all species, average sizes vary by more than a factor of three with great variation in bill sizes, coloration and range. Further, if I assert that the picture of the toucan is actually that of my pet keel-billed toucan, Pretty Bird, then we can also understand that this representation is for a specific individual bird, and not the physical keel-billed toucan species as a whole.

The point is not a lesson on toucans, but an affirmation that distinctions between what we think we may be describing occurs over multiple levels. Just as there is no self-evident criteria as to what constitutes an “entity type”, there is also not a self-evident and fully defining set of criteria as to what the physical “toucan” bird should represent. The meaning of what we call a “toucan” bird is not embodied in its label or even its name, but in the accompanying referential information that place the given referent into a context that can be communicated and understood. A URI points to (“refers to”) something that causes us to conjure up an understanding of that thing, be it a general description of a toucan, a picture of a toucan, an understanding of a species of toucan, or a specific toucan bird. Our understanding or interpretation results from the context and surrounding information accompanying the reference.

Both in terms of historical usage and trying to provide a framework for how to relate these various understandings of entities and types, we can thus see this kind of relationship:

Evolving Sophistication of Entity Types

What we see is an entities typology that provides the “bridging” interface between specific entity records and the UMBEL reasoning layer. This entities typology is built around UMBEL’s existing SuperTypes. The typology is the result of the classification of things according to their shared attributes and essences. The idea is that the world is divided into real, discontinuous and immutable ‘kinds’. Expressed another way, in statistics, typology is a composite measure that involves the classification of observations in terms of their attributes on multiple variables. In the context of a global KB such as Wikipedia, about 25,000 entity types are sufficient to provide a home for the millions of individual articles in the system.

Each SuperType related to entities has its own typology, and is generally expressed as a taxonomy of 3-4 levels, though there are some cases where the depth is much greater (9-10 levels) [26]. There are two flexible aspects to this design. First, because the types are “natural” and nested [27], coarse entity schema can readily find a correspondence. Second, if external records have need for more specificity and depth, that can be accommodated through a mapped bridging hierarchy as well. In other words, the typology can expand and contract like a squeezebox to map a range of specificities.

The internal process to create these typologies also has the beneficial effect of testing placements in the knowledge graph and identifying gaps in the structure as informed by fragments and orphans. The typologies should optimally be fully connected in order to completely fulfill their bridging function.
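The following sketch illustrates the kind of connectivity check described above, over a toy typology; the types and hierarchy are invented and greatly simplified relative to UMBEL's actual structure.

```python
# A hedged sketch of gap-checking a typology: find orphan types that
# are not connected into the hierarchy. Toy data, for illustration.
typology = {
    # child type -> parent type
    "DigitalSLRCamera": "SLRCamera",
    "SLRCamera": "Camera",
    "Camera": "ManufacturedGood",
    "Toucan": "Bird",
    # "Bird" has no parent recorded here -- an orphan fragment
}

roots = {"ManufacturedGood"}   # the SuperType roots of this typology

def orphans(typology, roots):
    parents = set(typology.values())
    children = set(typology.keys())
    # a parent that is neither a declared root nor itself a child
    return {p for p in parents if p not in roots and p not in children}

print(orphans(typology, roots))   # {'Bird'}
```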

Extending the Mindset to Attributes and Relations

As with our defined terminology above [28], we can apply this same mindset to the characterizations (attributes) of entities and the relations between entities and TBox concepts or topics. Numerically, these two categories are much less prevalent than entity types. But, the construction and use of the typologies are roughly the same.

Since we are using RDF and OWL as our standard model and languages, one might ask why we are not relying on the distinction between object and datatype properties for these splits. Relations, it is true, by definition need to be object properties, since both subject and object need to be identified things. But attributes, in some cases such as rating systems or specific designators, may also refer to controlled vocabularies, which can (and, under best practice, should) be represented as object properties. So, while most attributes are built around datatype properties, not all are. Relations and attributes are a better cleaving, since we can use relations as patterns for fact extractions, and the organization of attributes gives us a cross-cutting means to understand the characteristics of things independent of entity type. These all become valuable potential features for machine learning, in addition to the standard text structure.
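As a brief sketch of this distinction, again using rdflib for illustration with invented names: most attributes take literal values (datatype properties), while an attribute drawn from a controlled vocabulary, like a relation between two things, is modeled as an object property.

```python
# Attributes vs. relations in RDF terms; assumes rdflib, invented names.
from rdflib import Graph, Namespace, Literal, XSD

EX = Namespace("http://example.org/kb/")
g = Graph()

# attribute as datatype property: the value is a literal
g.add((EX.CanonEOS7D, EX.sensorResolution,
       Literal(20.2, datatype=XSD.decimal)))

# attribute as object property: the value is a controlled-vocabulary term
g.add((EX.CanonEOS7D, EX.sensorFormat, EX.APSC))

# relation as object property: connects two identified things
g.add((EX.CanonEOS7D, EX.manufacturedBy, EX.Canon))
```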

Though, today, UMBEL is considerably more sophisticated in its entities typologies, we already have a start on an attributes typology by virtue of the prior work on the Attributes Ontology [29], which will be revised to conform to this newer typology model. We also have a start on a relations typology based on existing OWL and RDF predicates used in UMBEL, plus many candidates from the Activities SuperType. As with the entities typology, relation types and attribute types may also have hierarchy, enabling reasoning and the squeezebox design. As with the entities typology, the objective is to have a fully connected hierarchy, of perhaps no more than 3-4 levels depth, with no fragments and no orphans.

A Different Role for Annotations

Annotations, how we label things and how we assign metadata to things, reside at a different layer than what has been discussed to this point. Annotations cannot be reasoned over, but they can and do play pivotal roles. Annotations are an important means for tagging, matching and slicing-and-dicing the information space. Metadata can perform those roles, but may also be used to analyze provenance and reasoning, if the annotations are mapped to object or datatype properties.

Labels are the means to broaden the correspondence of real-world reference to match the true referents or objects in the knowledge base. This enables the referents to remain abstract; that is, not tied to any given label or string. In best practice we recommend annotations reflect all of the various ways a given object may be identified (synonyms, acronyms, slang, jargon, all by language type). These considerations improve the means for tagging, matching, and slicing-and-dicing, even if the annotations are not able to be reasoned over.

As a mental note for the simple design that follows, imagine a transparent overlay, not shown, upon which best-practice annotations reside.

A Simple Design Brings it All Together

The insights provided here have taken much time to discover; they have arisen from a practical drive to make knowledge bases computable and useful to artificial intelligence applications. Here is how we now see the basics of a knowledge base, properly configured to be computable, and open to integration with external records:

Boiling KBs Down to Basics

At the broadest perspective, we can organize our knowledge-base platform into a “brain” or organizer/reasoner, and the instances or specific things or entities within our domain of interest. We can decompose a KB to become computable by providing various type groupings for our desired instance mappings, and an upper reasoning layer. An interface layer of “types”, organized into three broad groupings, provides the interface, or “bridging” layer, between the TBox and ABox. We thus have an architectural design segregating:

  • Topics and upper level — the general organization and “brains” of the domain
  • Entity types — categorizations of the actual things in the space
  • Relation types — the ways that different things are related to, or act upon, one another
  • Attribute types — a structured organization of the ways that individual entities can be described
  • Instances — the individual entities of the domain, and
  • Properties — the source grist for annotations, relation types and attribute types.

Becoming familiar with this terminology helps to show how the conventional understanding of these terms and structure has led to overlooking key insights and focusing (sometimes) on the wrong matters. That is in part why so much of the simple logic of this design has escaped the attention of most practitioners. For us, personally, at Structured Dynamics, it had eluded us for years, even as we were actively seeking it.

Irrespective of terminology, the recognition of the role of types and their bridging function to actual instance data (records) is central to the computability of the structure. It also enables integration with any form of data record or record stores. The ability to understand relation types leads to improved relation extraction, a key means to mine entities and connections from general content and to extend assertions in the knowledge base. Entity types provide a flexible means for any entity to connect with the computable structure. And, the attribute types provide an orthogonal and inferential means to slice the information space by how objects get characterized.

Because of this architecture, the reference sources guiding its construction, its typologies, its ability to generate features and training sets, and its computability, we believe this overall design is suited to provide an array of AI and enterprise services:

Machine Intelligence Apps and Services
  • Entity recognizers
  • Relation extractors
  • Event extractors
  • Phrase identification
  • Classifiers
  • Q & A systems
  • Cognitive computing
  • Semantic publishing
  • Knowledge base mappings
  • Sub-graph extraction
  • Ontology development
  • Ontology mappers
  • Entity dictionaries
  • Entity linkers
  • Data conversion and mapping
  • Master data management
  • KB improvements
  • Attribute “slot filling”
  • Disambiguators
  • Duplicates removal
  • Inference and reasoning
  • Sentiment analysis
  • Semantic relatedness
  • Recommendation systems
  • Bespoke analysis
  • Bespoke platforms

By cutting through the clutter — conceptually and terminologically — it has been possible to derive a practical and repeatable design to make KBs computable. Being able to generate features and positive and negative training sets, almost at will, is proving to be an effective approach to machine learning at mass-produced prices.

[1] See M. K. Bergman, 2014. “Knowledge-based Artificial Intelligence,” from AI3:::Adaptive Information blog, November 17, 2014. For additional writings in the series, see
[2] See Fabian M. Suchanek and Gerhard Weikum, 2014. “Knowledge Bases in the Age of Big Data Analytics,” Proceedings of the VLDB Endowment 7, no. 13 (2014), and M.K. Bergman, “SWEETpedia,” listing of Wikipedia research articles, on AI3:::Adaptive Information blog, January 25, 2010; the listing as of its last update included 246 articles. Also, see Wikipedia’s own “Wikipedia in Academic Studies.”
[3] A possible exception to this observation is the biomedical community through its Open Biological and Biomedical Ontologies (OBO) initiative.
[4] See M.K. Bergman, 2007. “Announcing UMBEL: A Lightweight Subject Structure for the Web,” AI3:::Adaptive Information blog, July 12, 2007. Also see
[5] See M.K. Bergman, 2008. “Basing UMBEL’s Backbone on OpenCyc,” AI3:::Adaptive Information blog, April 2, 2008.
[6] See M.K. Bergman, 2015. “Shaping Wikipedia into a Computable Knowledge Base,” AI3:::Adaptive Information blog, March 31, 2015.
[7] M.K. Bergman, 2009. ‘SuperTypes’ and Logical Segmentation of Instances, AI3:::Adaptive Information blog, September 2, 2009.
[8] This possible use of an item as both a class and an instance through “punning” is a desirable feature of OWL 2, which is the language basis for UMBEL. You can learn more on this subject in M.K. Bergman, 2010. “Metamodeling in Domain Ontologies,” AI3:::Adaptive Information blog, September 20, 2010.
[9] For a listing of these, see the Google query. One of the 40 articles with the most relevant commentary to this article is M.K. Bergman, 2014. “Big Structure and Data Interoperability,” AI3:::Adaptive Information blog, August 14, 2014.
[10] M.K. Bergman, 2009. “Making Linked Data Reasonable using Description Logics, Part 1,” AI3:::Adaptive Information blog, February 11, 2009.
[11] The current development version of UMBEL is v 1.30. It is due for release before the end of 2015.
[12] See the Sekine Extended Entity Types; the listing also includes attributes info at bottom of source page.
[14] N. Chinchor, 1997. “Overview of MUC-7,” MUC-7 Proceedings, 1997.
[15] Ada Brunstein, 2002. “Annotation Guidelines for Answer Types”. LDC Catalog, Linguistic Data Consortium. Aug 3, 2002.
[16] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann, 2009. “DBpedia-A Crystallization Point for the Web of Data.” Web Semantics: science, services and agents on the world wide web 7, no. 3 (2009): 154-165; 170 classes in this paper. That has grown to more than 700; see and
[17] The listing is under some dynamic growth. This is the official count as of September 8, 2015, from Current updates are available from Github.
[18] Joanna Biega, Erdal Kuzey, and Fabian M. Suchanek, 2013. “Inside YAGO2: A Transparent Information Extraction Architecture,” in Proceedings of the 22nd international conference on World Wide Web, pp. 325-328. International World Wide Web Conferences Steering Committee, 2013. Also see Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, Gerhard Weikum, 2012. “HYENA: Hierarchical Type Classification for Entity Names,” in Proceedings of the 24th International Conference on Computational Linguistics, Coling 2012, Mumbai, India, 2012.
[19] Changki Lee, Yi-Gyu Hwang, Hyo-Jung Oh, Soojong Lim, Jeong Heo, Chung-Hee Lee, Hyeon-Jin Kim, Ji-Hyun Wang, and Myung-Gil Jang, 2006. “Fine-grained Named Entity Recognition using Conditional Random Fields for Question Answering,” in Information Retrieval Technology, pp. 581-587. Springer Berlin Heidelberg, 2006.
[20] Xiao Ling and Daniel S. Weld, 2012. “Fine-Grained Entity Recognition,” in AAAI. 2012.
[21] Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh, 2014. “Context-Dependent Fine-Grained Entity Type Tagging.” arXiv preprint arXiv:1412.1820 (2014).
[24] David Nadeau, 2007. “Semi-supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision.” PhD Thesis, School of Information Technology and Engineering, University of Ottawa, 2007.
[25] M.K. Bergman, 2012. ” Give Me a Sign: What Do Things Mean on the Semantic Web?,” AI3:::Adaptive Information blog, January 24, 2012.
[26] A good example of description and use of typologies is in the archaeology description on Wikipedia.
[27] M.K. Bergman, 2015. “‘Natural’ Classes in the Knowledge Web“, AI3:::Adaptive Information blog, July 13, 2015.
[28] Also see my Glossary for definitions of specific terminology used in this article.
[29] M.K. Bergman, 2015. “Conceptual and Practical Distinctions in the Attributes Ontology“, AI3:::Adaptive Information blog, March 3, 2015.
Posted: September 17, 2015

AI3 Pulse

In keeping with the expanding topics of knowledge-based artificial intelligence (KBAI), I have done a thorough update of my older Acronyms and Glossary page on this blog.

Besides correcting some errors and updating the listings, the major changes were to bring in an earlier post on a semantic technologies glossary and to greatly expand the glossary to include knowledge base and artificial intelligence topics.

Please let me know if you see any errors. I also welcome suggestions for new entries that should be added to the list.
