Posted:February 8, 2016

Pulse: The Starting Point for Feature Selection

A Needed Focus on the Inputs to Machine Learners

Features are the inputs to machine learners. The outputs of machine learners are predictions of outcomes, based on an inferred (or “learned”) model or function. In image recognition, as an example, the inputs are the characteristics of pixels and those adjacent to them; the output may be a prediction there is an image representation of “cat”. In NLP, as another case, the input might be the text, title and URL of emails; the output may be a prediction of “spam”. If we treat all ML learners as black boxes, features are what is fed to the box, and predicted labels or structures are what comes out.

As I recently argued, the importance of features has been overlooked in comparison to the choice of machine learners or how to lower the costs and efforts of labeling and creating training sets and standards. The complete picture needs to include feature extraction, feature selection, and feature engineering.

A recent review paper helps redress this imbalance. Feature Selection: A Data Perspective [1], surveys and provides a comprehensive and well-organized overview of recent advances in feature selection research. According to the authors, Li et al., “the objectives of feature selection include: building simpler and more comprehensible models, improving data mining performance, and helping prepare, clean, and understand data.” The practical 73-page review is accompanied by an open-source feature selection library that consists of most of the popular feature selection algorithms covered in the review, and a comprehensive performance analysis of the methods and their results.

The first nine pages of the review are devoted to a broad, accessible overview. The intro provides a clear explanation of features and their role in feature selection. It also explains why the high-dimensionality of features is a challenge in its own right.

The bulk of the document is devoted to a discussion of the various methods used in feature selection, organized according to:

generic data
structure features
heterogeneous data, and
streaming data.

Each of the methods is characterized as to whether it is applicable to supervised or unsupervised learning. While I have used a different classification of the feature space, that does not affect the usefulness of Li et al.’s [1] approach. Also, in keeping with a review article, there are more than 11 pages of references containing nearly 150 citations.

The combined review nature of the paper also means that various methods have been reduced to a common symbol set, which is a handy way to relate available features to multiple learners. This common treatment enables the authors to create the open source repository, scikit-feast, written in Python and available from Github, that provides a library of 25 of the methods covered. A separate Web site presents some test datasets and performance results. Here is one example of many of the available results:

This paper deserves a permanent place on anyone’s resource shelf who has a serious interest in machine learning. I would not be surprised to see the authors’ organizational structure of feature selection methods become a standard. It is always a pleasure to encounter papers that are well-written, understandable and comprehensive. Great job!

[1] Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, Huan Liu, 2016. “Feature Selection: A Data Perspective,” arXiv:1601.07996, 29 Jan 2016.

Posted:February 1, 2016

Pulse: The Biggest of the Big Pictures on Machine Learning

A Great Introduction to ML and Its Roots

I have to admit the first I heard the title of Pedro Domingos‘ recent book, The Master Algorithm, I was off-put, similar to the way I react negatively to the singularity made famous by Ray Kurzweil. I don’t tend to buy into single answers or theories of everything.

But as a recent talk by Domingos at Google shows, he has much more insight to share about the roots and “tribes” associated with machine learning. If you are new to ML and want to learn more about the big picture underlying its main approaches and tenets, the hour spent watching this video will prove valuable:

The strength of the talk is to describe what Domingos calls the five “tribes” underlying machine learning and the lead researchers, premises and approaches underlying each:

Symbolists — based in logic, this approach attempts to model the composition of knowledge by inverting the deductive process
Connectionists — also known as neural networks or deep learning, this mindset is grounded most in trying to mimic how the brain actually works
Evolutionists — the biological evolution of life of mixing genes through reproduction as altered by mutations and cross-overs guides these genetic algorithms
Bayesians — since the world is uncertain, likely outcomes are guided by statistical probabilities, which also change as new evidence is constantly brought to bear
Anagolizers — this tribe attempts to reason by analogy by looking for similarities to examples or closely related factors.

You can also see the slides here to Domingos’ talk.

As Domingos emphasizes, each of these approaches has its applications, strengths and weaknesses. He posits there are shared aspects and generalities underlying all of these methods that can help point the way to perhaps more universal approaches, the master algorithm.

I have argued elsewhere about the importance of knowledge bases to recent AI breakthroughs more than algorithms, but ultimately, of course, specific calculation methods need to underpin any learning approach. Though I’m not convinced there is a “master” algorithm, there is also great value in understanding the premises and mindsets behind these main approaches to machine learning.

Posted:January 27, 2016

If Big Data is One Answer to AI, What is the Question?

Part II in Our Series on the Resurgency of Artificial Intelligence

In Part I of this series we pointed to the importance of large electronic knowledge bases (Big Data) in the recent advances in AI for knowledge- and text-based systems. Amongst other factors such as speedy GPUs and algorithm advances, we noted that electronic knowledge bases are perhaps the most important factor in the resurgence of artificial intelligence.

But the real question is not what the most important factor in the resurgence of AI may be — though the answer to that points to vetted, reference training sources. The real question is: How can we systematize this understanding by improving the usefulness of knowledge bases to support AI machine learning? Knowing that knowledge bases are important is not enough. If we can better understand what promotes — and what hinders — KBs for machine learning, perhaps we can design KB approaches that are even quicker and more effective for AI. In short, what should we be coding to promote knowledge-based artificial intelligence (KBAI)?

Why is There Not a Systematic Approach to Using KBs in AI?

To probe this question, let’s take the case of Wikipedia, the most important of the knowledge bases for AI machine learning purposes. According to Google Scholar, there have been more than 15,000 research articles relating Wikipedia to machine learning or artificial intelligence [1]. Growth in articles noting the use of Wikipedia for AI has been particularly strong in the past five years [1].

But two things are remarkable about this use of Wikipedia. First, virtually every one of these papers is a one-off. Each project stages and uses Wikipedia in its own way and with its own methodology and focus. Second, Wikipedia is able to make its contributions despite the fact there are many weaknesses and gaps within the knowledge base itself. Clearly, despite weaknesses, the availability of such large-scale knowledge in electronic form still provides significant advantages to AI and machine learning in the realm of natural language applications and text understanding. How much better might our learners be if we fed them more complete and coherent information?

Readers of this blog will be familiar with my periodic criticisms of Wikipedia as a structured knowledge resource [2]. These criticisms are not related to the general scope and coverage of Wikipedia, which, overall, is remarkable and unprecedented in human endeavor. Rather, the criticisms relate to the use of Wikipedia as is for knowledge representation purposes. To recap, here are some of the weaknesses of Wikipedia as a knowledge resource for AI:

Incoherency — the category structure of Wikipedia is particularly problematic. More than 60% of existing Wikipedia categories are not true (or “natural”) categories at all, but represent groupings more of convenience or compound attributes (such as Films directed by Pedro Almodóvar or Ambassadors of the United States to Mexico) [3]
Incomplete structure — attributes, as presented in Wikipedia’s infoboxes, are incomplete within and across entities, and links also have gaps. Wikidata offers promise to help bring greater consistency, but much will need to be achieved with bots and provenance remains an issue
Incomplete coverage — the coverage and scope of Wikipedia are spotty, especially across language versions, and in any case the entities and concepts covered need to meet Wikipedia’s notability guidelines. For much domain analysis, Wikipedia’s domain coverage is inadequate. It would also be helpful if there were ways to extend the KB’s coverage for local or enterprise purposes
Inaccuracies — actually, given its crowdsourced nature, popular portions of Wikipedia are fairly vetted for accuracy. Peripheral or stub aspects of Wikipedia, however, may retain inaccuracies of coverage, tone or representation.

As the tremendous use of Wikipedia for research shows, none of these weaknesses is fatal, and none alone has prevented meaningful use of the knowledge base. Further, there is much active research in areas such as knowledge base population [4] that promise to aid solutions to some of these weaknesses. The recognition of the success of knowledge bases to train AI machine learners is also now increasing awareness that KB design for AI purposes is a worthwhile research topic in its own right. Much is happening leveraging AI in bot designs for both Wikipedia and Wikidata. A better understanding of how to test and ensure coherency, matched with a knowledge graph for inferencing and logic, should help promote better and faster AI learners. The same techniques for testing consistency and coherence may be applied to mapping external KBs into such a reference structure.

Thus, the real question again is: How can we systematize the usefulness of knowledge bases to support AI machine learning? Simply by asking this question we can alter our mindset to discover readily available ways to improve knowledge bases for KBAI purposes.

Working Backwards from the Needs of Machine Learners

The best perspective to take on how to optimize knowledge bases for artificial intelligence derives from the needs of the machine learners. Not all individual learners have all of these needs, but from the perspective of a “platform” or “factory” for machine learners, the knowledge base and supporting structures should consider all of these factors:

features — are the raw input to machine learners, and may be evident, such as attributes, syntax or semantics, or may be hidden or “latent” [5]. A knowledge base such as Wikipedia can expose literally hundreds of different feature sets [5]. Of course, only a few of these feature types are useful for a given learning task, and many duplicate ones provide nearly similar “signals”. But across multiple learning tasks, many different feature types are desirable and can be made available to learners
knowledge graph — is the schematic representation of the KB domain, and is the basis for setting the coherency and logic and inference structure for the represented knowledge. The knowledge graph, provided in the form of an ontology, is the means by which logical slices can be identified and cleaved for such areas as entity type selection or the segregation of training sets. In situations like Wikipedia where the existing category structure is often incoherent, re-expressing the existing knowledge into an existing and proven schema is one viable approach
positive and negative training sets — for supervised learning, positive training sets provide a group of labeled, desired outputs, while negative training sets are similar in most respects but do not meet the desired conditions. The training sets provide the labeled outputs to which the machine learner is trained. Provision of both negative and positive sets is helpful, and the accuracy of the learner is in part a function of how “correct” the labeled training sets are
reference (“gold”) standards — vetted reference results, which are really validated training sets and therefore more rigorous to produce, are important to test the precision, recall and accuracy of machine learners [6]. This verification is essential during the process of testing the usefulness of various input features as well as model parameters. Without known standards, it is hard to converge many learners for effective predictions

Purposeful Knowledge Bases

keeping the KBs current — the nature of knowledge is that is it is constantly changing and growing. As a result, knowledge bases used in KBAI are constantly in flux. The restructuring and feature set generation from the knowledge base must be updated on a periodic basis. Keeping KBs current means that the overall process of staging knowledge bases for KBAI purposes must be made systematic through the use of scripts, build routines and validation tests. General administrative and management capabilities are also critical, and
repeatability — all of these steps must be repeatable, since new iterations of the knowledge base must retain coherency and consistency.

Thus, an effective knowledge base to support KBAI should have a number of desirable aspects. It should have maximum structure and exposed features. It should be organized by a coherent knowledge graph, which can be effectively sliced and reasoned over. It must be testable via logic and consistency and performance tests, such that training and reference sets may be vetted and refined. And it must have repeatable and verifiable scripts for updating the underlying KB as it changes and to generate new, working feature sets. Moreover, means for generally managing and manipulating the knowledge base and knowledge graph are important. These desirable aspects constitute a triad of required functionality.

Guidance for a Research Agenda

Achieving a purposeful knowledge base for KBAI uses is neither a stretch nor technically risky. We see the broad outlines of the requirements in the discussion above.

Two-thirds of the triad are relatively straightforward. First, creating a platform for managing the KBs is a fairly standard requirement for knowledge and semantics purposes; many platforms presently exist that can access and manage knowledge graphs, knowledge bases, and instance data at the necessary scales. Second, the build and testing scripts do require systematic attention, but these are also not difficult tasks and are quite common in many settings. It is true that build and testing scripts can often prove brittle, so care needs to be placed into their design to facilitate maintainability. Fortunately, these are matters mostly of proper focus and good practice, and not conceptually difficult.

The major challenges reside in the third leg of the triad, namely in the twin needs to map the knowledge base into a coherent knowledge graph and into an underlying speculative grammar that is logically computable to support the expression of both feature and training sets. A variety of upper ontologies and lexical frameworks such as WordNet have been tried as the guiding graph structures for knowledge bases [7]. To my knowledge, none of these options has been scrutinized with the specific requirements of KBAI support in mind. With respect to the other twin need, that of a speculative grammar, our research to date [8] points to the importance of segregating the KB information into topics (concepts), relations and relation types, entities and entity types, and attributes and attribute types. The possible further distinctions, however, into possibly roles, annotations (metadata), events and expressions of mereology still require further research. The role, placement and use of rules also remain to be determined.

You can see Part I of
this series here.

[1] See, for example, this Google Scholar query: https://scholar.google.com/scholar?q=wikipedia+%22machine+learning%22+OR+%22artificial+intelligence%22. Growth data may be obtained by annual date range searches.

[2] Over the years I have addressed this topic in many articles. A recent sampling from my AI3:::Adaptive Information blog is “Shaping Wikipedia into a Computable Knowledge Base,” (March 31, 2015); and “Creating a Platform for Knowledge-based Machine Intelligence” (September 21, 2015) .

[3] See, for example, M.K. Bergman, 2015. “‘Natural Classes’ in the Knowledge Web,” AI3:::Adaptive Information blog, July 13, 2015.

[4] Knowledge base population, or KBP, first became a topic of research with the track by the same name starting at the Text Analysis Conference sponsored by NIST in 2009. The workshop track has continued annually ever since with greater prominence, and has been the initiative of many projects mining open sources for facts and assertions.

[5] See, for example, M.K. Bergman, 2015. “‘A (Partial) Taxonomy of Machine Learning Features,” AI3:::Adaptive Information blog, November 23, 2015.

[6] See, for example, M.K. Bergman, 2015. “‘A Primer on Knowledge Statistics,” AI3:::Adaptive Information blog, March 18, 2015.

[7] See, for example, M.K. Bergman, 2011. “‘In Search of ‘Gold Standards’ for the Semantic Web,” AI3:::Adaptive Information blog, February 28, 2011.

[8] Two primary articles that I have written on my AI3:::Adaptive Information blog on the information structure of knowledge bases bear on this question; see “Creating a Platform for Knowledge-based Machine Intelligence” (September 21, 2015), and “Conceptual and Practical Distinctions in the Attributes Ontology” (March 3, 2015).

Posted:January 25, 2016

Why the Resurgence in AI?

Artificial Intelligence is in Bloom; But it Was Not Always So

Anyone beyond a certain age may recall the waning and waxing of the idea of AI, artificial intelligence. In fact, the periodic dismal prospects and poor reputation of artificial intelligence have been severe enough at times so as to warrant its own label: the “AI winters.” Clearly, today, we are in a resurgence of AI. But why is this so? Is the newly re-found popularity of AI merely a change in fashion, or is it due to more fundamental factors? And if it is more fundamental, what might those factors be that have led to this resurgence?

We only need to look at the world around us to see that the resurgence in AI is due to real developments, not a mere change in fashion. From virtual assistants that we can instruct or question by voice command to self-driving cars and face recognition, many mundane or tedious tasks of the past are being innovated away. The breakthroughs are real and seemingly at an increasing pace.

As to the reasons behind this resurgence, more than a year ago, the technology futurist Kevin Kelly got it mostly right when he posited these three breakthroughs [1]:

1. Cheap parallel computation

Thinking is an inherently parallel process, billions of neurons firing simultaneously to create synchronous waves of cortical computation. To build a neural network—the primary architecture of AI software—also requires many different processes to take place simultaneously. Each node of a neural network loosely imitates a neuron in the brain—mutually interacting with its neighbors to make sense of the signals it receives. To recognize a spoken word, a program must be able to hear all the phonemes in relation to one another; to identify an image, it needs to see every pixel in the context of the pixels around it—both deeply parallel tasks. But until recently, the typical computer processor could only ping one thing at a time. . . . That began to change more than a decade ago, when a new kind of chip, called a graphics processing unit, or GPU, was devised for the intensely visual—and parallel—demands of videogames . . . .

2. Big Data

Every intelligence has to be taught. A human brain, which is genetically primed to categorize things, still needs to see a dozen examples before it can distinguish between cats and dogs. That’s even more true for artificial minds. Even the best-programmed computer has to play at least a thousand games of chess before it gets good. Part of the AI breakthrough lies in the incredible avalanche of collected data about our world, which provides the schooling that AIs need. Massive databases, self-tracking, web cookies, online footprints, terabytes of storage, decades of search results, Wikipedia, and the entire digital universe became the teachers making AI smart.

3. Better algorithms

Digital neural nets were invented in the 1950s, but it took decades for computer scientists to learn how to tame the astronomically huge combinatorial relationships between a million—or 100 million—neurons. The key was to organize neural nets into stacked layers. Take the relatively simple task of recognizing that a face is a face. When a group of bits in a neural net are found to trigger a pattern—the image of an eye, for instance—that result is moved up to another level in the neural net for further parsing. The next level might group two eyes together and pass that meaningful chunk onto another level of hierarchical structure that associates it with the pattern of a nose. It can take many millions of these nodes (each one producing a calculation feeding others around it), stacked up to 15 levels high, to recognize a human face. . . .

To these factors I would add a fourth: 4. Distributed architectures (beginning with MapReduce) and new performant datastores (NoSQL, graph DBs, and triplestores). These new technologies, plus some rediscovered, gave us the confidence to tackle larger and larger reference datasets, while also helping us innovate high-performance data representation structures, such as graphs, lists, key-value pairs, feature vectors and finite state transducers. In any case, Kelly also notes the interconnection amongst these factors in the cloud, itself a more general enabling factor. I suppose, too, one could add open source to the mix as another factor.

Still, even though these factors have all contributed, I have argued in my series on knowledge-based artificial intelligence (KBAI) the role of electronic data sets (Big Data) as the most important enabling factor [2]. These reference datasets may range from images for image recognition (such as ImageNet) to statistical compilations from text (such as N-grams or co-occurrences) to more formal representations (such as ontologies or knowledge bases). Knowledge graphs and knowledge bases are the key enablers for AI in the realm of knowledge management and representation.

Some also tout algorithms as the most important source of AI innovation, but Alexander Wissner-Gross in the Edge online magazine comes down squarely on the side of data in AI as the most interesting news in recent science [3]:

. . . perhaps many major AI breakthroughs have actually been constrained by the availability of high-quality training datasets, and not by algorithmic advances. For example, in 1994 the achievement of human-level spontaneous speech recognition relied on a variant of a hidden Markov model algorithm initially published ten years earlier, but used a dataset of spoken Wall Street Journal articles and other texts made available only three years earlier. In 1997, when IBM’s Deep Blue defeated Garry Kasparov to become the world’s top chess player, its core NegaScout planning algorithm was fourteen years old, whereas its key dataset of 700,000 Grandmaster chess games (known as the “The Extended Book”) was only six years old. In 2005, Google software achieved breakthrough performance at Arabic- and Chinese-to-English translation based on a variant of a statistical machine translation algorithm published seventeen years earlier, but used a dataset with more than 1.8 trillion tokens from Google Web and News pages gathered the same year. In 2011, IBM’s Watson became the world Jeopardy! champion using a variant of the mixture-of-experts algorithm published twenty years earlier, but utilized a dataset of 8.6 million documents from Wikipedia, Wiktionary, Wikiquote, and Project Gutenberg updated one year prior. In 2014, Google’s GoogLeNet software achieved near-human performance at object classification using a variant of the convolutional neural network algorithm proposed twenty-five years earlier, but was trained on the ImageNet corpus of approximately 1.5 million labeled images and 1,000 object categories first made available only four years earlier. Finally, in 2015, Google DeepMind announced its software had achieved human parity in playing twenty-nine Atari games by learning general control from video using a variant of the Q-learning algorithm published twenty-three years earlier, but the variant was trained on the Arcade Learning Environment dataset of over fifty Atari games made available only two years earlier.

Examining these advances collectively, the average elapsed time between key algorithm proposals and corresponding advances was about eighteen years, whereas the average elapsed time between key dataset availabilities and corresponding advances was less than three years, or about six times faster, suggesting that datasets might have been limiting factors in the advances.

Seeing these correlations only affirms the importance of looking at knowledge bases from the specific lens of how they may best support training AI machine learners. We see the correlation; it is now time to optimize the expression of these KB potentials. We need to organize the KBs via coherent knowledge graphs and express the KBs in types, entities, attributes and relations representing their inherent, latent knowledge structure. Properly expressed KBs can support creating positive and negative training sets, promote feature set generation and expression, and create reference standards for testing AI learners and model parameters.

Past AI winters arose from lofty claims that were not then realized. Perhaps today’s claims may meet a similar fate.

Yet somehow I don’t think so. The truth is, today, we are seeing rapid progress in AI tasks of increasing usefulness and value all around us. The benefits from what will continue to be seen as ubiquitous AI should now ensure an economic and innovation engine behind AI for many years to come. One way that the AI engine will continue to be fueled is through a systematic understanding of how knowledge bases and their features can work hand in hand with machine learning to more effectively automate and meet our needs.

You can see Part II of
this series here.

[1] Kevin Kelly, 2014. “The Three Breakthroughs That Have Finally Unleashed AI on the World,” in Wired.com, October 27, 2014.

[2] For example, from the perspective of hardware, see Jen-Hsun Huang, 2016. “ Accelerating AI with GPUs: A New Computing Model,” Nvidia blog, January 12, 2016.

[3] Alexander Wissner-Gross, 2016. “2016: What Do You Consider the Most Interesting Recent (Scientific) News? What Makes It Important?: Datasets Over Algorithms,” Edge.org, January 2, 2016.

Posted:November 23, 2015

A (Partial) Taxonomy of Machine Learning Features

Knowledge Bases Enable a More Systematic Approach to Feature Engineering

Download PDF

The two most labor-intensive steps in machine learning for natural language are: 1) feature engineering; and 2) labeling of training sets. Supervised machine learning uses these training sets where every point is an input-output pair, mapping an input, which is a feature, to an output, which is the label. The machine learning consists of inferring (“learning”) a function that maps between these inputs and outputs with acceptable predictive power. This learned function can then be applied to previously unseen inputs in order to predict the output label. The technique is particularly suited to problems of regression or of classification.

It is not surprising that the two most labor-intensive tasks in machine learning involve determining the suitable inputs (features) and correctly labeling the output training labels. Elsewhere in this series I discuss training sets and labeling in detail. For this current article, we focus on features.

“Features” are perhaps the least discussed aspect of machine learning. References are made to how to select them; how to construct, extract or learn them; or how even to overall engineer them. But little guidance is provided as to what features exactly are. There really is no listing or inventory for what “features” might even be considered in the various aspects of natural language or text understanding. In part, I think, because we do not have this concrete feel for features, we also don’t tend to understand how to maximize and systematize the role of features in support of our learning tasks. This aspect provides a compelling rationale for the advantages of properly constructed knowledge bases in support of artificial intelligence, what we have been terming as KBAI in this series.

So, before we can understand how to best leverage features in our KBAI efforts, we need to first define and name the feature space. That effort, in turn, enables us to provide a bit of an inventory for what features might contribute to natural language or knowledge base learning. We then organize that inventory a bit to point out the structural and conceptual relationships among these features, which enables us to provide a lightweight taxonomy for the space.

Since many of these features have not been named or exposed before, we conclude the article with some discussion about what next-generation learners may gain by working against this structure. Of course, since much of this thinking is incipient, there are certainly forks and deadends in what may unfold ahead, but there also will likely be unforeseen expansions and opportunities as well. A systematic view of machine learning in relation to knowledge and human language features — coupled with large-scale knowledge bases such as Wikipedia and Wikidata — can lead to faster and cheaper learners across a very broad range of NLP tasks [1].

What is a Feature?

A “feature is an individual measurable property of a phenomenon being observed” [2]. It is an input to a machine learner, an explanatory variable, sometimes in the form of a function. Features are sometimes equated to attributes, but this is not strictly true, since a feature may be a combination of other features, or a statistical calculation, or an abstraction of other inputs. In any case, a feature must be expressed as a numeric value (including Boolean) upon which the machine learner can calculate its predictions. Machine learner predictions of the output can only be based on these numeric features, though they can be subject to rules and weights depending on the type of learner.

The importance of features and the fact they may be extracted or constructed from other inputs is emphasized in this quote from Pedro Domingos [3]:

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. . . . Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are. This is typically where most of the effort in a machine learning project goes. It is often also one of the most interesting parts, where intuition, creativity and ‘black art’ are as important as the technical stuff.”

Many experienced ML researchers make similar reference to the art or black art of features. In broad strokes, a feature may be a surface form, like terms or syntax or structure (such as hierarchy or connections); it may be derived (such as statistical, frequency, weighted or based on the ML model used); it may be semantic (in terms of meanings or relations); or it may be latent, as either something hidden or abstracted from feature layers below it. Unsupervised learning or deep learning features arise from the latent form.

For a given NLP problem domain, features can number into the millions or more. Concept classification, for example, could use features corresponding to all of the unique words or phrases in that domain. Relations between concepts could also be as numerous. To assign a value to such “high-dimensional” features, some form of vector relationship is calculated over, say, all of the terms in the space so that each term can be represented numerically [4]. Because learners may learn over multiple feature types, the potential combinations to be evaluated for the ML learner can literally be astronomical. This combinatorial problem has been known for decades, and has been termed the curse of dimensionality for more than 50 years [5].

Of course, just because a feature exists says nothing about whether it is a piece of information that might be useful for ML predictions or not. Features may thus prove to be one of four kinds: 1) strongly relevant; 2) weakly relevant; 3) irrelevant; or 4) redundant [6]. Strongly relevant features should always be considered; weakly relevant may sometimes be combined to improve the overall relevancy. All irrelevant or redundant features should be removed from consideration. Generally, the fewer the features the better, so long as the features used are strongly relevant and orthogonal (that is, they capture different aspects of the prediction space).

A (Partial) Inventory and Taxonomy of Natural Language and KB Features

To make this discussion more tangible, we have assembled a taxonomy of feature types in the context of natural language and knowledge bases. This inventory is drawn from the limited literature on feature engineering and selection in the context of KBAI from the perspectives of ML learning in general [7, 8, 9], ML learning ontologies [10, 11, 12] and knowledge bases [13, 14, 15, 16, 17]. This listing is only partial, but does provide an inventory of more than 200 feature types applicable to natural language.

We have organized this inventory into eight (8) main areas, shown in non-italicized Bold, which tend to cluster into these four groupings:

Surface features — these are features that one can see within the source documents and knowledge bases. They include Lexical items for the terms and phrases in the domain corpus and knowledge base; Syntactical items that show the word order or syntax of the domain; Structural items that either split the documents and corpus into parts or reflect connections and organizations of the items, such as hierarchies and graphs; or Natural Language items that reflect how the content is expressed in the surface forms of various human languages
Derived features — are surface features that have been transformed or derived in some manner, such as the direct Statistical items or the Model-based ones reflecting the characteristics of the machine learners used
Semantic features — these are summarized under the Semantics area, and reflect what the various items mean or how they are conceptually related to one another, and
Latent features — these features are not observable from the source content. Rather, these are statistically derived abstractions of the features above that are one- to N-levels removed from the initial source features. These Latent items may either be individual features or entire layers of abstraction removed from the surface layer. These features result from applying unsupervised or deep learning machine learners.

Here is the taxonomy, again organized according to those same eight main areas:

Lexical
	Corpus
	Phrases
		Averages
		Counts
		N-grams
		Weights
	Words
		Averages
		Counts
		Cut-offs (top N)
		Dictionaries
		Named entities
		Stemming
		Stoplists
		Terms
		Weights
Syntactical
	Anaphora
	Cases
	Complements (argument)
	Co-references
	Decorations
	Dependency grammar
		Head (linguistic)
	Distances
	Gender
	Moods
	Paragraphs
	Parts of speech (POS)
	Patterns
	Plurality
	Phrases
	Sentences
	Tenses
	Word order
Statistical
	Articles
		Vectors
	Information-theoretic
		Entropy
		Mutual information
	Meta-features
		Correlations
		Eigenvalues
		Kurtosis
		Sample measures
			Accuracy
			F-1
				Precision
				Relevance
		Skewness
		Vectors
		Weights
	Phrases
		Document frequencies
		Frequencies (corpus)
		Ranks
		Vectors
	Words
		Document frequencies
		Frequencies (corpus)
		Ranks
		String similarity
		Vectors
			Cosine measures
			Feature vectors
Structural
	Documents
		Node types
			Depth
			Leaf
	Document parts
		Abstract
		Authors
		Body
		Captions
		Dates
		Headers
		Images
		Infoboxes
		Links
		Lists
		Metadata
		Templates
		Title
		Topics
	Captions
	Disambiguation pages
	Discussion pages
		Authors
		Body
		Dates
		Links
		Topics
	Formats
	Graphs (and ontologies)
		Acyclic
		Concepts
			Centrality
			Relatedness
		Directed
		Metrics (counts, averages, min/max)
			Attributes
			Axioms
			Children
			Classes
			Depth
			Individuals
			Parents
		Sub-graphs
	Headers
		Content
		Section hierarchy
	Infoboxes
		Attributes
		Missing attributes
		Missing values
		Templates
		Values
	Language versions
		Definitions
		Entities
		Labels
		Links
		Synsets
	Links
		Category
		Incoming
		Linked data
		Outgoing
		See also
	Lists
		Ordered
		Unordered
	Media
		Audio
		Images
		Video
	Metadata
		Authorship
		Dates
		Descriptions
		Formats
		Provenance
	Pagination
	Patterns
		Dependency patterns
		Surface patterns
			Regular expressions
	Revisions
		Authorship
		Dates
		Structure
			Document parts
				Captions
				Headers
				Infoboxes
				Links
				Lists
				Metadata
				Templates
				Titles
		Versions
	Source forms
		Advertisements
		Blog posts
		Documents
			Research articles
			Technical documents
		Emails
		Microblogs (tweets)
		News
		Technical
		Web pages
	Templates
	Titles
	Trees
		Breadth measures
		Counts
		Depth measures
	Web pages
		Advertisements
		Body
		Footer
		Header
		Images
		Lists
		Menus
		Metadata
		Tables
Semantics	[most also subject to Syntactical and Statistical features above)
	Annotations
		Alternative labels
		Notes
		Preferred labels
	Associations
		Association rules
		Co-occurrences
		See also
	Attribute Types
		Attributes
			Cardinality
			Descriptive
			Qualifiers
			Quantifiers
				Many
		Values
			Datatypes
			Many
	Categories
		Eponymous pages
	Concepts
		Definitions
		Grouped concepts (topics)
		Hypernyms
			Hypernym-based feature vectors
		Hyponyms
		Meanings
		Synsets
			Acronyms
			Epithets
			Jargon
			Misspellings
			Nicknames
			Pseudonyms
			Redirects
			Synonyms
	Entity Types
		Entities
		Events
		Locations
	General semantic feature vectors
	Relation Types
		Binary
		Identity
		Logical conjunctions
			Conjunctive
			Disjunctive
		Mereology (part of)
		Relations
			Domain
			Range
		Similarity
	Roles
	Voice
		Active/passive
		Gender
		Mood
		Sentiment
		Style
		Viewpoint (World view)
Natural Languages
	Morphology
	Nouns
	Syntax
	Verbs
	Word order
Latent
	Autoencoders
		Many; dependent on method
	Features
		Many; dependent on method
	Hidden
		Many; dependent on method
	Kernels
		Many; dependent on method
Model-based
	Decision tree
		Tree measures
	Dimensionality
	Feature characteristics
		Datatypes
		Max
		Mean
		Min
		Number
		Outliers
		Standard deviation
	Functions
		Factor graphs
		Functors
		Mappings
	Landmarking
		Learner accuracy
	Method measures
		Error rates

Table 1. A (Partial) Taxonomy of Machine Learning Features

This compilation exceeds any prior listings. In fact, most of the feature types shown have never been applied to NLP machine learning tasks. We now turn the discussion to why this is.

Mindset and Knowledge Bases

When one sees the breadth of impressive knowledge discovery tasks utilizing large-scale knowledge bases [18], exemplified by hundreds of research papers regarding NLP tasks utilizing Wikipedia [19], it is but a small stretch to envision a coherent knowledge base leveraging this content for the express purpose of making text-based machine learning systematic and less expensive. Expressed as an objective function, we now have clear guidance for how to re-organize and -express the source content information (Wikipedia, among others) to better support a ML learning factory. The idea of this and how it is driving Structured Dynamics‘ contracts and research is the mindset.

Rather than the singleton efforts to leverage knowledge bases for background knowledge, as has been the case to date, we can re-structure the knowledge source content underneath a coherent knowledge graph. This re-organization makes the entire knowledge structure computable and amenable to machine learning. It also enables the same learning capability to be turned upon itself (see image here), thereby working to improve the coverage and accuracy of the starting KB, all in a virtuous circle. Because of the mindset, we also can now look at the native structure of the KBs and work to expose still more features, providing still further grist to the next generation ML learners. Fully 50% of the features listed in the inventory in Table 1 above arise from these unique KB aspects, especially in the areas of Semantics and Structural, including graph relationships.

Many, if not most, of these new feature possibilities may prove redundant or only somewhat relevant. Not all features may ever prove useful, though some not generally used in many broader learners, such as case, may be effectively employed for named entity or specialty extractions, such as for copyrights or unique IDs or data types. Because many of these KB features cover quite orthogonal aspects of the source knowledge bases, the likelihood of finding new, strongly relevant features is high. Further, except for the Latent and Model-based areas, each of these feature types may be used singly or in combination to create coherent slices for both positive and negative training sets, helping to reduce the effort for labor-intensive labeling as well. By extension, these capabilities can also be applied to more effectively bootstrap the creation of gold standards, useful when parameters are being tested for the various machine learners.

Though the literature most often points to classification as the primary use of knowledge bases as background knowledge supporting machine learners, in fact many NLP tasks may leverage KBs. Here is but a brief listing of application areas for KBAI:

Entity recognizers
Relation extractors
Classifiers
Q & A systems
Knowledge base mappings

Ontology development
Entity dictionaries
Data conversion and mapping
Master data management
Specialty extractors

Table 2. NLP Applications for Machine Learners Using KBs

Surely other applications will emerge as this more systematic KBAI approach to machine learning evolves over the coming years.

Feature Engineering is an Essential Component

As noted, this richness of feature types leads to the combinatorial problem of too many features. Feature engineering is important both to help find the features of strongest relevance while reducing the feature space dimensionality in order to speed the ML learning times.

Initial feature engineering tasks should be to transform input data, regularize them if need be, and to create numeric vectors for new ones. These are basically preparation tasks to convert the source or target data to forms amenable to machine learning. This staging now enables us to discover the most relevant (“strong”) features for the given ML method under investigation.

In a KB context, specific learning tasks as outlined in Table 2 are often highly patterned. The most effective features for training, say, an entity recognizer, will only involve a limited number of strongly relevant feature types. Moreover, the relevant feature types applicable to a given entity type should mostly apply to other entity types, even though the specific weights and individually important features will differ. This patterned aspect means that once a given ML learner is trained for a given entity type, its relevant feature types should be approximately applicable to other related entity types. The lengthy process of initial feature selection can be reduced as training proceeds for similar types. It appears that combinations of feature types, specific ML learners and methods to create training sets and gold standards may be discovered for entire classes of learning tasks. These combinations can be discovered, tested and repeated for new specific tasks within a given application cluster.

Probably the most time-consuming and demanding aspect of these patterned approaches resides in feature selection and feature extraction.

Feature selection is the process of finding a subset of the available feature types that provide the highest predictive value while not overfitting [20]. Feature selection is typically split into three main approaches [6, 21, 22]:

Filter — select the N most promising features based on a ranking from some form of proxy measure, like mutual information or the Pearson correlation coefficient, which provides a measure of the information gain from using a given feature type
Wrapper — wherein feature subsets are tested through a greedy search heuristic that either starts with an empty set and adds features (forward selection) keeping the “strongest” ones, or starts with a full set and gradually removes the “weakest” ones (backward selection); the wrapper approach may be computationally expensive, or
Embedded — wherein feature selection is a part of model construction.

For high-dimensional features, such as terms and term vectors, one may apply stoplists or cut-offs (only considering the top N most frequent terms, for example) to reduce dimensionality. Part of the “art” portion resides in knowing which feature candidates may warrant formal selection or not; this learning can be codified and reused for similar applications. Extractions and some unsupervised learning tests may also be applied at this point in order to discover additional “strong” features.

Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. Functions create new features in the form of Latent variables, which are not directly observable. Also, because these are statistically derived values, many input features are reduced to the synthetic measure, which naturally causes a reduction in dimensionality. Advantages from a reduction in dimensionality include:

Often a better feature set (resulting in better predictions) [23]
Faster computation and smaller storage
Reduction in collinearity due to reduction in weakly interacting inputs
Easier graphing and visualization.

On the other hand, the latent features are abstractions, and so not easily understood as the literal.

In deep learning, multiple layers of these latent features are generated as the system learns. But latent passes may also be combined with observable features, which is one way that evaluations of what a document means can be applied across multiple input forms of the content.

Of course, it is also possible to combine the predictions from multiple ML methods, which then also raises the questions of ensemble scoring. Surely we will also see these more systematic approaches to machine learning themselves be subject to self-learning (that is, metalearning), such that the overall learning process can proceed in a more automated way.

Considerations for a Feature Science

In supervised learning, it is clear that more time and attention has been given to the labeling of the data, what the desired output of the model should be. Much less time and attention has been devoted to features, the input side of the equation. As a result, much needs to be done. The purposeful use of knowledge bases and structuring them properly is one of the ways progress will be made.

But progress also requires some answers to some basic questions. A scientific approach to the feature space would likely need to consider, among other objectives:

Full understanding of surface, derived and latent features
Relating various use cases and problems to specific machine learners and classes of learners
Relating specific machine learners to the usefulness of particular features (see also hyperparameter optimization and model selection)
Improved methods for feature engineering and construction
Improved methods for feature selection
A better understanding of how to select and match supervised and unsupervised ML.

Some tools and utilities would also help to promote this progress. Some of these capabilities include:

Feature inventories — how to create and document taxonomies of feature types
Feature generation — methods for codification of leading recipes
Feature transformations — the same for transformations, up to and including vector creation
Feature validation — ways to test feature sets in standard ways.

Role of a Platform

The object of these efforts is to systematize how knowledge bases, combined with machine learners, can speed the deployment and lower the cost of creating tailored artificial intelligence applications of natural language for specific domains. This installment in our KBAI series has focused on the role and importance of features. There is an abundance of opportunity in this area, and an abundance of work required, but little systematization.

The good news is that platforms are possible that can build, manage, and grow the knowledge bases and knowledge graphs supporting machine learning. Machine learners can be applied in a pipeline manner to these KBs, including orchestrating the data flows in generating and testing features, running and testing learners, creating positive and negative training sets, and establishing gold standards. The heart of the platform must be an appropriately structured knowledge base organized according to a coherent knowledge graph; this is the present focus of Structured Dynamics’ efforts.

In the real world, engagements always demand unique scope and unique use cases. Platforms should be engineered that enable ready access, extensions, configurations, and learners. It is important to structure the KBs such that slices and modules can be specified, and all surface attributes may be selected and queried. Mapping to external schema is also essential. Background knowledge from a coherent knowledge base is the way to fuel this.

[1] Features apply to any form of machine learning, including for things like image, speech and pattern recognition. However, this article is limited to the context of natural language, unstructured data and knowledge bases.

[2] See the feature entry from Wikipedia, which itself is based upon Christoper Bishop, 2006. Pattern Recognition and Machine Learning. Berlin: Springer. ISBN 0-387-31073-8.

[3] Pedro Domingos, 2012. “A Few Useful Things to Know About Machine Learning,” Communications of the ACM 55, no. 10 (2012): 78-87.

[4] For example, in the term or phrase space, the vectors might be constructed from counts, frequencies, cosine relationships between representative documents, distance functions between terms, etc.

[5] Richard Ernest Bellman, 1957. Dynamic Programming, Rand Corporation, Princeton University Press, ISBN 978-0-691-07951-6, as republished as Richard Ernest Bellman, 2003. Dynamic Programming, Courier Dover Publications, ISBN 978-0-486-42809-3.

[6] Isabell Guyon and André Elisseeff, 2006. “An Introduction to Feature Extraction,” in Guyon, Isabelle, Steve Gunn, Masoud Nikravesh, and Lofti A. Zadeh, eds. Feature Extraction: Foundations and Applications, pp. 1-25. Springer Berlin Heidelberg, 2006.

[7] Haussler, David, 1999. Convolution Kernels on Discrete Structures, Vol. 646. Technical Report UCSC-CRL-99-10, Department of Computer Science, University of California at Santa Cruz, 38 pp., July 8, 1999.

[8] Reif, Matthias, Faisal Shafait, Markus Goldstein, Thomas Breuel, and Andreas Dengel, 2014. “Automatic Classifier Selection for Non-experts,” Pattern Analysis and Applications 17, no. 1 (2014): 83-96.

[9] Tang, Jiliang, Salem Alelyani, and Huan Liu, 2014. “Feature Selection for Classification: A Review.” Data Classification: Algorithms and Applications (2014): 37

[10] Melanie Hilario, Phong Nguyen, Huyen Do, Adam Woznica, and Alexandros Kalousis, 2011 “Ontology-based Meta-mining of Knowledge Discovery Workflows,” in Meta-Learning in Computational Intelligence, pp. 273-315. Springer Berlin Heidelberg, 2011.

[11] Panče Panov, Larisa Soldatova, and Sašo Džeroski, 2014. “Ontology of Core Data Mining Entities,” Data Mining and Knowledge Discovery 28, no. 5-6 (2014): 1222-1265.

[12] See the general KBAI category entries on M.K. Bergman, AI3:::Adaptive Information blog, various dates.

[13] Ivo Anastacio, Bruno Martins and Pavel Calado, 2011. “Supervised Learning for Linking Named Entities to Knowledge Base Entries,” in Proceedings of the Text Analysis Conference (TAC2011).

[14] Weiwei Cheng, Gjergji Kasneci, Thore Graepel, David Stern, and Ralf Herbrich, 2011. “Automated Feature Generation from Structured Knowledge,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 1395-1404. ACM, 2011.

[15] Lan Huang, David Milne, Eibe Frank, and Ian H. Witten, 2012. “Learning a Concept‐based Document Similarity Measure.” Journal of the American Society for Information Science and Technology 63, no. 8 (2012): 1593-1608.

[16] Olena Medelyan, Catherine Legg, David Milne and Ian H. Witten, 2008. Mining Meaning from Wikipedia, Working Paper Series ISSN 1177-777X, Department of Computer Science, The University of Waikato (New Zealand), September 2008, 82 pp.

[17] Hui Shen, Mika Chen, Razvan Bunescu, and Rada Mihalcea, 2012. “Wikipedia Taxonomic Relation Extraction using Wikipedia Distant Supervision,” Ann Arbor 1001: 48109.

[18] Conventional knowledge bases have also been supplemented with massive-scale statistical bases, most often created from major search engine indexes; see the section on ‘Statistical Corpora’ in M.K. Bergman, 2014. “Knowledge-based Artificial Intelligence,” AI3:::Adaptive Information blog, November 14, 2014.

[19] See M.K. Bergman, “SWEETpedia,” listing of Wikipedia research articles, on AI3:::Adaptive Information blog, January 25, 2010. The listing as of its last update included 246 articles; also, see Wikipedia’s own “Wikipedia in Academic Studies.”

[20] Overfitting is where a statistical model, such as a machine learner, describes random error or noise instead of the underlying relationship. It is particularly a problem in high-dimensional spaces, a common outcome of employing too many features.

[21] George H. John, Ron Kohavi, and Karl Pfleger, 1994. “Irrelevant features and the subset selection problem.” In Machine Learning: Proceedings of the Eleventh International Conference, pp. 121-129. 1994.

[22] See especially slide #11 in Zdeněk Žabokrtský, 2015. “Feature Engineering in Machine Learning,” Machine Learning Methods course, Institute of Formal and Applied Linguistics, Charles University in Prague, Czech Republic.

[23] If constructed properly, deep learning models can be effective feature extractors over high-dimensional data; see Geoffrey E. Hinton, 2009. “Deep Belief Networks,” Scholarpedia 4 (5): 5947, which references an earlier paper, Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. “A Fast Learning Algorithm for Deep Belief Nets,” Neural Computation 18, no. 7 (2006): 1527-1554.

Main Links

Search

Author: Mike Bergman