Posted: March 14, 2016

AI3 Pulse: A New Era in Artificial Intelligence Will Open Pandora’s Box

Here’s a prediction: the new emphasis on artificial intelligence and robotics will occasion some new looks at knowledge representation. Until the past few years, many knowledge representation (KR) projects were more in the way of prototypes or games. But now that real robotics and knowledge-based AI activities are taking off, some of the prior warts and problems of leading KR approaches are becoming evident.

For example, for years major upper-level ontologies have tended to emphasize dichotomous splits in how to “model” the world, including:

  • abstract-physical — a split between what is fictional or conceptual and what is tangibly real
  • occurrent-continuant — a split between a “snapshot” view of the world and its entities versus a “spanning” view that is explicit about changes in things over time
  • perdurant-endurant — a split for how to regard the identity of individuals, either as a sequence of individuals distinguished by temporal parts (for example, childhood or adulthood) or as the individual enduring over time
  • dependent-independent — a split between accidents (which depend on some other entity) and substances (which are independent)
  • particulars-universals — a split between individuals in space and time that cannot be attributed to other entities versus abstract universals such as properties that may be assigned to anything
  • determinate-indeterminate

Since the mid-1980s, description logics have also tended to govern most KR languages; they are the basis of the semantic Web data model and languages, RDF and OWL. (Common logic and its dialects, however, are also used as a more complete representation of first-order logic.) The trade-off in KR language design is one of expressiveness versus complexity.
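
To make that trade-off concrete, here is a minimal sketch in Python using the rdflib library (the namespace and class names are invented for illustration). It shows the kind of cheap subclass reasoning a deliberately restricted language like RDFS affords — a simple transitive walk, rather than the open-ended theorem proving a full first-order representation can demand:

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")  # hypothetical vocabulary
g = Graph()
g.add((EX.Robot, RDFS.subClassOf, EX.Machine))
g.add((EX.Machine, RDFS.subClassOf, EX.PhysicalObject))
g.add((EX.r2d2, RDF.type, EX.Robot))

def types_of(graph, instance):
    """All classes of `instance`, via the transitive rdfs:subClassOf closure."""
    found = set(graph.objects(instance, RDF.type))
    frontier = set(found)
    while frontier:
        parents = {p for c in frontier
                   for p in graph.objects(c, RDFS.subClassOf)}
        frontier = parents - found
        found |= frontier
    return found

print(types_of(g, EX.r2d2))  # Robot, Machine, PhysicalObject
```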

Cyc was developed as one means to address a gap in standard KR approaches: how to capture and model common sense. Conceptual graphs, formally a part of common logic, were developed to handle n-ary relationships and the questions of sign processes (semiosis), fallibility and pragmatic learning.

Zhou offers a new take on an old strategy for KR, which is to use set theory as the underlying formalism [1]. This first paper deals with the representation itself; a later paper is planned on reasoning.

We do not live in a dichotomous world. And I personally find Charles Peirce’s semeiosis a more compelling grounding for what a KR design should look like. But as Zhou points out, and as is evident in current AI advances, robotics and the need for efficient, effective reasoning are testing today’s standards in knowledge representation as never before. I suspect we are in for a period of ferment and innovation as we work to get our KR languages up to the task.


[1] Yi Zhou, 2016. “A Set Theoretic Approach for Knowledge Representation: the Representation Part,” arXiv:1603.03511, 14 Mar 2016.
Posted: March 8, 2016

Fine-grained Entities: A Typology Design Aids Continuous, Logical Typing

Entity recognition or extraction is a key task in natural language processing and one of the most common uses for knowledge bases. Entities are the unique, individual things in the world, though the term is also sometimes used to characterize certain concepts [1]. Context plays an essential role in entity recognition. In general terms we may refer to a thing such as a camera; but a photographer may want finer-grained distinctions, such as SLR cameras, further sub-types like digital SLR cameras, specific models like the Canon EOS 7D Mark II, or even the name of a favorite camera, such as ‘Shutter Sue’. Capitalized names (the reference signal for named entity recognition) often indicate we are dealing with a true individual entity; but again, depending on context, a name such as Chevy Malibu may refer to a specific car or to the entire class of Malibu cars.

The “official” practice of named entity recognition began with the Message Understanding Conferences, especially MUC-6 and MUC-7, in 1995 and 1997. These conferences began competitions for finding “named entities” as well as the practice of in-line tagging [2]. Yet some of these accepted ‘named entities’ are also written in lower case, with examples such as rocks (‘gneiss’), common animals or plants (‘daisy’), chemicals (‘ozone’), minerals (‘mica’), drugs (‘aspirin’) and foods (‘sushi’). Some deference was given to Kripke’s idea of “rigid designators” as guidance for how to identify entities; rigid designators include proper names as well as certain natural kind terms, such as biological species and substances. Because of these blurrings, the nomenclature of “named entities” began to fade away. Some practitioners still use the term, though for some of the reasons outlined in this article, Structured Dynamics prefers simply entity.

Much has changed in the twenty years since the seminal MUC conferences regarding entity recognition and characterization. We are learning to adopt a very fine-grained approach to entity types and a typology design suited to interoperating (“bridging”) over a broad range of viewpoints and contexts. Most broadly, the idea of fine-grained entity types has led us to a logically grounded typology design.

The Growing Trend to Fine-Grained Entity Types

Beginning with the original MUC conferences, the initial entity types tested and recognized were person, organization, and location names [3]. However, it did not take long for various groups and researchers to want more entity types and more distinctions. BBN categories, proposed in 2002, were used for question answering and consisted of 29 types and 64 subtypes [4]. Sekine put forward his Extended Entity Types and refined them over many years; they grew to about 200 types [5], as shown in this figure:

Sekine Extended Entity Types

These ideas of extended entity types helped inform a variety of tagging services over the past decade, notably OpenCalais, Zemanta, AlchemyAPI, and OpenAmplify, among others. Moreover, the research community also expanded its efforts into more and more entity types, or what came to be known as fine-grained entities [6].

Some of these efforts produced more formal classifications of entity types. This one, from Ling and Weld, proposed 112 entity types in 2012 [7]:

Ling 112 Entity Types

Another, from Gillick et al. in 2014, proposed 86 entity types [8], organized in part according to the same person, organization, and location types from the earliest MUC conferences:

Gillick 86 Entity Types

These efforts are also notable because machine learners have been trained to recognize the types shown. Which entity types are covered, the conceptions of the world they embody, and how the types are organized all vary broadly across these references.

The complement to entity extraction from unstructured text is to label the text in the first place. For this, a number of schemas presently exist that provide vocabularies of entity types and standard means for tagging text. These include:

  • DBpedia Ontology: 738 types [9]
  • schema.org: 636 types [10]
  • YAGO: 505 types; see also HYENA [11]
  • GeoNames: 654 “feature codes” [12]

In Structured Dynamics’ own work, we have mapped the UMBEL knowledge graph against Wikipedia content and found that 25,000 nodes, or more than 70 percent of its 35,000 reference concepts, correspond to entity types [13]. These mappings provide typing connections for millions of Wikipedia articles. The typing and organization of entity types thus appears to be of enormous importance in modeling and leveraging the use of knowledge bases.

When we track the coverage of entity types over the past two decades we see roughly exponential growth [13]:

Growth in Recognition of Entity Types

This growth in entity types comes from the desire to describe and organize things with more precision. Tagging and extracting structured information from text are obvious key drivers. Yet, for a given enterprise, what is of interest — and at what depth — varies widely from task to task.

The fact that knowledge bases such as Wikipedia (the lesson applies to domain-specific ones as well) can supply entity-level information for literally thousands of entity types means that rich information is available for driving the finest of fine-grained entity extractors. To leverage this raw informational horsepower, it is essential to have a grounded understanding of what an entity is, how to organize entities into logical types, and an intensional understanding of the attributes and characteristics that allow inferencing to be conducted over those types. These understandings, in turn, point to the features that are useful to machine learners for artificial intelligence. They also inform a flexible design for accommodating entity types from coarse- to fine-grained, with variable depth depending on the domain of interest.

Natural Classes and Typologies

We take a realistic view of the world. That is, we believe that what we perceive in the world is real — not merely a construction of our minds [14] — and that there are forces and relationships in the world independent of us as selves. Realism is a longstanding tradition in philosophy that extends back to Aristotle and embraces, for example, the natural classification systems of living things espoused by taxonomists such as Agassiz and Linnaeus.

Charles Sanders Peirce, an American logician and scientist of the late 19th and early 20th centuries, embraced this realistic philosophy but embedded it in the belief that our understanding of the world is fallible and that we need to test our perceptions via logic (the scientific method) and shared consensus within the community. His overall approach is known as pragmatism and is firmly grounded in his views of logic and his theory of signs (called semiotics or semeiotics). While there is absolute truth, it acts more as a limit, which our accumulating knowledge and our clarity of communication continuously approximate. Through the scientific method and questioning we get closer and closer to the truth and to an ability to communicate it to one another. But new knowledge may change those understandings, which in any case will always remain proximate.

Peirce’s own words can better illustrate his perspective [15], some of which I have discussed elsewhere under his idea of “natural classes” [16]:

“Thought is not necessarily connected with a brain. It appears in the work of bees, of crystals, and throughout the purely physical world; and one can no more deny that it is really there, than that the colors, the shapes, etc., of objects are really there.” (Peirce CP 4.551)

“What if we try taking the term “natural,” or “real, class” to mean a class of which all the members owe their existence as members of the class to a common final cause? This is somewhat vague; but it is better to allow a term like this to remain vague, until we see our way to rational precision.” (Peirce CP 1.204)

“. . . it may be quite impossible to draw a sharp line of demarcation between two classes, although they are real and natural classes in strictest truth. Namely, this will happen when the form about which the individuals of one class cluster is not so unlike the form about which individuals of another class cluster but that variations from each middling form may precisely agree.” (Peirce CP 1.208)

“When one can lay one’s finger upon the purpose to which a class of things owes its origin, then indeed abstract definition may formulate that purpose. But when one cannot do that, but one can trace the genesis of a class and ascertain how several have been derived by different lines of descent from one less specialized form, this is the best route toward an understanding of what the natural classes are.” (Peirce CP 1.208)

“The descriptive definition of a natural class, according to what I have been saying, is not the essence of it. It is only an enumeration of tests by which the class may be recognized in any one of its members. A description of a natural class must be founded upon samples of it or typical examples.” (Peirce CP 1.223)

“Natural classes” thus are a testable means to organize the real objects in the world, the individual particulars of what we call “entities”. In Structured Dynamics’ usage, we define an entity as an individual object — either real or mental (such as an idea), either a part or a whole — that has (see the sketch following this list):

  • identity, which can be referred to via symbolic names
  • context in relation to other objects, and
  • characteristic attributes, with some expressing the essence of what type of object it is.
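
As a minimal sketch, this definition might be rendered as a simple data structure; the field names and example values below are hypothetical illustrations, not Structured Dynamics’ actual model:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    identity: str                  # symbolic name(s) by which it can be referred to
    context: list = field(default_factory=list)     # relations to other objects
    attributes: dict = field(default_factory=dict)  # characteristics, some essential

camera = Entity(
    identity="Canon EOS 7D Mark II",
    context=["instance_of:DigitalSLRCamera", "made_by:Canon"],
    attributes={"reflex_mirror": True, "sensor": "APS-C"},
)
```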

The key to classifying entities into categories (or “types”, as we use herein) is this intensional understanding of attributes. Further, Peirce was expansive in his recognition of what kinds of objects could be classified, specifically including ideas, with application to areas such as social classes, man-made objects, the sciences, chemical elements and living organisms [17]. Again, here are some of Peirce’s own words on the classification of entities [15]:

“All classification, whether artificial or natural, is the arrangement of objects according to ideas. A natural classification is the arrangement of them according to those ideas from which their existence results.” (Peirce CP 1.231)

“The natural classification of science must be based on the study of the history of science; and it is upon this same foundation that the alcove-classification of a library must be based.” (Peirce CP 1.268)

“All natural classification is then essentially, we may almost say, an attempt to find out the true genesis of the objects classified. But by genesis must be understood, not the efficient action which produces the whole by producing the parts, but the final action which produces the parts because they are needed to make the whole. Genesis is production from ideas. It may be difficult to understand how this is true in the biological world, though there is proof enough that it is so. But in regard to science it is a proposition easily enough intelligible. A science is defined by its problem; and its problem is clearly formulated on the basis of abstracter science.” (Peirce CP 1.227)

A natural classification system is one, then, that logically organizes entities with shared attributes into a hierarchy of types, with each type inheriting attributes from its parents and being distinguished by what Peirce calls its final cause, or purpose. This hierarchy of types is thus naturally termed a typology.

An individual that is a member of a natural class has the same kinds of attributes as other members, all of which share the essence of the final cause or purpose. We look to Peirce for guidance in this area because his method of classification is testable, based on discernible attributes, and grounded in logic. Further, that logic is itself grounded in his theory of signs, which ties these understandings ultimately to natural language.

Logic and the Typology Design

Unlike more interconnected knowledge graphs (which can have many network linkages), typologies are organized strictly along these lines of shared attributes, which both simplifies the structure and provides an orthogonal means for investigating type class membership. Further, because the essential attributes or characteristics of entities across an entire domain can differ broadly — living vs. inanimate things, natural vs. man-made things, ideas vs. physical objects, and so on — it is possible to make disjointness assertions between entire groupings of natural entity classes. Disjointness assertions, combined with logical organization and inference, yield a typology design that lends itself to reasoning and tractability.
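
Here is a minimal sketch of how such a typology could be organized in Python; the type names, attributes and disjointness table are hypothetical illustrations, not Structured Dynamics’ actual schema:

```python
class EntityType:
    def __init__(self, name, parent=None, attributes=()):
        self.name = name
        self.parent = parent
        # a type inherits all attributes of its ancestors
        inherited = parent.attributes if parent else frozenset()
        self.attributes = inherited | frozenset(attributes)

    def ancestry(self):
        node = self
        while node:
            yield node
            node = node.parent

def subsumes(general, specific):
    """True if `general` is `specific` or one of its ancestors."""
    return any(t is general for t in specific.ancestry())

# disjointness is asserted once, between whole branches of the typology
DISJOINT_PAIRS = [("LivingThing", "Artifact")]

def are_disjoint(a, b):
    names_a = {t.name for t in a.ancestry()}
    names_b = {t.name for t in b.ancestry()}
    return any((x in names_a and y in names_b) or
               (x in names_b and y in names_a)
               for x, y in DISJOINT_PAIRS)

thing    = EntityType("Thing")
living   = EntityType("LivingThing", thing, {"alive"})
artifact = EntityType("Artifact", thing, {"man_made"})
camera   = EntityType("Camera", artifact, {"captures_images"})
slr      = EntityType("SLRCamera", camera, {"reflex_mirror"})

assert subsumes(camera, slr)                              # an SLR is a camera
assert {"man_made", "captures_images"} <= slr.attributes  # attributes inherit downward
assert are_disjoint(slr, living)                          # branch-level disjointness applies
```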

The idea of nested, hierarchical types organized into broad branches of different entity typologies also provides a very flexible design for interoperating with a diversity of world views and degrees of specificity. The photographer, as I discussed above, is interested in different camera types and even how specific cameras can relate to a detailed entity typing structure. Another party more interested in products across the board may have a view to greater breadth, but lesser depth, about cameras and related equipment. A typology design, logically organized and placed into a consistent grounding of attributes, can readily interoperate with these different world views.

A typology design for organizing entities can thus be visualized as a kind of accordion or squeezebox, expandable when detail requires, or collapsed to a coarser grain when relating to broader views. The organization of entity types also has a different structure than the more graph-like organization of higher-level conceptual schema, or knowledge graphs. In the case of broad knowledge bases such as UMBEL or Wikipedia, where 70 percent or more of the overall schema relates to entity types, more attention can now be devoted to the aspects of concepts or relations.
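
To make the accordion image concrete, here is a minimal sketch (the type path is a hypothetical illustration): the same fine-grained lineage can be reported at whatever depth a given world view requires.

```python
def at_depth(type_path, depth):
    """Collapse a fine-grained type path to at most `depth` levels."""
    return type_path[:depth][-1]

path = ["Product", "Camera", "SLRCamera", "DigitalSLRCamera"]

print(at_depth(path, 2))   # Camera           -- the broad product view
print(at_depth(path, 10))  # DigitalSLRCamera -- the photographer's view
```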

The idea that knowledge bases can be purposefully crafted to support knowledge-based artificial intelligence, or KBAI, flows from these kinds of realizations. We begin to see that we can tease out different aspects of a knowledge base, each with its own logic and relation to the other aspects. Concepts, entities, attributes and relations — including the natural classes or types that can logically organize them — all deserve discrete attention and treatment.

Peirce’s consistent belief that the real world can be logically conceived and organized provides guidance for how we can continue to structure our knowledge bases into computable form. We now have a coherent base for treating entities and their natural classes as an essential component to that thinking. We can continue to be more fine-grained so long as there are unique essences to things that enable them to be grouped into natural classes.


[1] The label “entity” is also sometimes applied to the root node in systems such as SUMO (see also http://virtual.cvut.cz/kifb/en/toc/229.html). In the OWL language and RDF data model we use, the root node is known as “thing”. Clearly, our use of the term “entity” is much different from SUMO’s and resides at a subsidiary place in the overall TBox hierarchy. In this case, and frankly for most semantic matches, equivalences should be judged with care, with context the crucial deciding factor.
[2] N. Chinchor, 1997. “Overview of MUC-7,” MUC-7 Proceedings, 1997.
[3] While all of these are indeed entity types, the early MUCs also tested dates, times, percentages, and monetary amounts.
[4] Ada Brunstein, 2002. “Annotation Guidelines for Answer Types”. LDC Catalog, Linguistic Data Consortium. Aug 3, 2002.
[5] See the Sekine Extended Entity Types; the listing also includes attributes info at bottom of source page.
[6] For example, try this query, https://scholar.google.com/scholar?q=”fine-grained+entity”, also without quotes.
[7] Xiao Ling and Daniel S. Weld, 2012. “Fine-Grained Entity Recognition,” in AAAI. 2012.
[8] Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh, 2014. “Context-Dependent Fine-Grained Entity Type Tagging,” arXiv preprint arXiv:1412.1820 (2014).
[9] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann, 2009. “DBpedia-A Crystallization Point for the Web of Data.” Web Semantics: science, services and agents on the world wide web 7, no. 3 (2009): 154-165; 170 classes in this paper. That has grown to more than 700; see http://mappings.dbpedia.org/server/ontology/classes/ and http://wiki.dbpedia.org/services-resources/datasets/dataset-2015-04/dataset-2015-04-statistics.
[10] The listing is under some dynamic growth. This is the official count as of September 8, 2015, from http://schema.org/docs/full.html. Current updates are available from Github.
[11] Joanna Biega, Erdal Kuzey, and Fabian M. Suchanek, 2013. “Inside YAGO2: A Transparent Information Extraction Architecture,” in Proceedings of the 22nd international conference on World Wide Web, pp. 325-328. International World Wide Web Conferences Steering Committee, 2013. Also see Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, Gerhard Weikum, 2012. “HYENA: Hierarchical Type Classification for Entity Names,” in Proceedings of the 24th International Conference on Computational Linguistics, Coling 2012, Mumbai, India, 2012.
[12] See the GeoNames feature codes listing at http://www.geonames.org/export/codes.html.
[13] This figure and some of the accompanying text come from a prior article, M.K. Bergman, “Creating a Platform for Machine-based Artificial Intelligence“, AI3:::Adaptive Information blog, September 21, 2015.
[14] Realism is often contrasted to idealism, nominalism or conceptualism, wherein how the world exists is a function of how we think about or name things. Descartes, for example, summarized his conceptualist view with his aphorism “I think, therefore I am.”
[15] See the electronic edition of The Collected Papers of Charles Sanders Peirce, reproducing Vols. I-VI, Charles Hartshorne and Paul Weiss, eds., 1931-1935, Harvard University Press, Cambridge, Mass., and Arthur W. Burks, ed., 1958, Vols. VII-VIII, Harvard University Press, Cambridge, Mass. The citation scheme is volume number using Arabic numerals followed by section number from the collected papers, shown as, for example, CP 1.208.
[16] M.K. Bergman, 2015. “‘Natural’ Classes in the Knowledge Web“, AI3:::Adaptive Information blog, July 13, 2015.
[17] See, for example, Menno Hulswit, 2000. “Natural Classes and Causation“, in the online Digital Encyclopedia of Charles S. Peirce.
Posted: February 23, 2016

AI3 Pulse: Article Offers a Balanced View on AI and the Singularity

Possibly because we are sentient, intelligent beings, discussions about artificial intelligence often occupy the extremes of alarm or hyperbole about its potential. What makes us unique as humans, at least in our degree of intelligence, can be threatened when we start granting machines similar capabilities. Be it Skynet, Lt. Commander Data, military robots, or the singularity, it is pretty easy to grab attention by touting AI as the greatest threat to civilization, or as the dawning of a new age of super intelligence.

To be sure, we are seeing remarkable advances in things like intelligent personal assistants that answer our spoken questions, or services that can automatically recognize and tag our images, or many, many other applications. It is also appropriate to raise questions about autonomous intelligence and its possible role in warfare [1] or other areas of risk or harm. AI is undoubtedly an area of technology innovation on the rise. It will also be a constant in human affairs into the future.

That is why a recent article by Toby Walsh, The Singularity May Never Be Near [2], is worth a read. Though only four pages long, it presents a nice historical backdrop on AI and explains why artificial intelligence may not unfold as many suspect. As he summarizes the article:

There is both much optimism and pessimism around artificial intelligence (AI) today. The optimists are investing millions of dollars, and even in some cases billions of dollars into AI. The pessimists, on the other hand, predict that AI will end many things: jobs, warfare, and even the human race. Both the optimists and the pessimists often appeal to the idea of a technological singularity, a point in time where machine intelligence starts to run away, and a new, more intelligent species starts to inhabit the earth. If the optimists are right, this will be a moment that fundamentally changes our economy and our society. If the pessimists are right, this will be a moment that also fundamentally changes our economy and our society. It is therefore very worthwhile spending some time deciding if either of them might be right.

[1] Samuel Gibbs, 2015. “Musk, Wozniak and Hawking Urge Ban on Warfare AI and Autonomous Weapons,” The Guardian, 27 July 2015.
[2] Toby Walsh, 2016. “The Singularity May Never Be Near,” arXiv:1602.06462, 20 Feb 2016.

Posted: February 16, 2016

AI3 Pulse: Technical Debts Accrue from Dependencies, Adapting to Change, and Maintenance

Machine learning has entered a golden age of open source toolkits and a wealth of electronic, labeled data upon which to train them. The proliferation of applications and the relative ease of standing up a working instance — what one might call “first twitch” — have made machine learning a strong seductress.

But embedding machine learning into production environments that can be sustained as needs and knowledge change is another matter. The first part of the process means that data must be found (and labeled, if using supervised learning) and then tested against one or more machine learners. Knowing how to select and use features, plus systematic ways to leverage knowledge bases, is essential at this stage. Reference (or “gold”) standards are also essential as parameters and feature sets are tuned for the applicable learners. Only then can one produce enterprise-ready results.
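
As a minimal sketch of this visible set-up loop, using scikit-learn (the synthetic data and learner choices are toy stand-ins, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# synthetic labeled data stands in for a reference ("gold") standard
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# tune learner parameters by cross-validation against the training labels...
grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), grid, cv=5).fit(X_train, y_train)

# ...then check readiness against the held-out portion of the standard
print(search.best_params_, search.score(X_test, y_test))
```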

Those set-up efforts are the visible part of the iceberg. What lies underneath the surface, as a group of experienced Google researchers warns us in a recent paper, Hidden Technical Debt in Machine Learning Systems [1], dwarfs the initial development of production-grade results. Maintaining these systems over time is “difficult and expensive”, exposing ongoing requirements as technical debt. Like any kind of debt, these requirements must be serviced, with delays or lack of a systematic way to deal with the debt adding to the accrued cost.

ML code (small black box in middle) is but a fraction of total infrastructure required for machine learning; from [1]

The authors argue that ML installations incur larger-than-normal technical debt, since machine learning has to be deployed and maintained like traditional code, while the nature of ML imposes additional and unique costs. Some of these sources of hidden cost include:

  • Complex models with indeterminate boundaries — ML learners are entangled with multiple feature sets; “changing anything changes everything” (CACE), as the authors put it
  • Costly data dependencies — learning is attuned to the input data; as that data changes, learners may need to be re-trained, with input feature sets regenerated anew; existing features may cease to be significant
  • Feedback loops and interactions — the best performing systems may depend on multiple learners or less-than-obvious feedback loops, again leading to CACE
  • Sub-optimal systems — piecing together multiple open source components with “glue code” or using multi-purpose toolkits can lead to code and architectures that are not performant
  • Configuration debt — set-ups and workflows need to operate consistently as a system, but tuning and optimization are generally elusive to understand and measure
  • Infrastructure debt — efforts in creating standards, testing options, logging and monitoring, managing multiple models, and the like are all likely more demanding than for traditional systems, and
  • A constantly changing world — knowledge is always in flux; we learn more, facts and data change, and new perspectives need to be incorporated, all of which must percolate through the learning process and then be supported by the infrastructure.

The authors of the paper do not really offer any solutions or guidelines to these challenges. However, highlighting the nature of these challenges — as this paper does well — should forewarn any enterprise considering its own machine learning initiative. These costs can only be managed by anticipating and planning for them, preferably supported by systematic and repeatable utilities and workflows.

I recommend a close read of this paper before budgeting your own efforts.

(Hat tip to Mohan Bavirisetty for posting the paper link on LinkedIn.)


[1] Sculley, D., Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison, 2015. “Hidden Technical Debt in Machine Learning Systems.” In Advances in Neural Information Processing Systems, pp. 2494-2502.
Posted: February 8, 2016

AI3 Pulse: A Needed Focus on the Inputs to Machine Learners

Features are the inputs to machine learners. The outputs of machine learners are predictions of outcomes, based on an inferred (or “learned”) model or function. In image recognition, as an example, the inputs are the characteristics of pixels and those adjacent to them; the output may be a prediction that the image represents a “cat”. In NLP, as another case, the inputs might be the text, title and URL of emails; the output may be a prediction of “spam”. If we treat all ML learners as black boxes, features are what is fed to the box, and predicted labels or structures are what comes out.
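
A minimal sketch of this black-box view, using scikit-learn (the toy emails and labels are invented for illustration): term-weight features go in, a predicted label comes out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# a tiny labeled corpus; 1 = spam, 0 = not spam
texts  = ["win money now", "meeting at noon", "free offer click here",
          "project update attached", "cheap pills online", "lunch tomorrow?"]
labels = [1, 0, 1, 0, 1, 0]

vec = TfidfVectorizer()                    # features in: term weights per email
X   = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)  # the "black box"

# prediction out: a label for unseen text (likely [1], i.e., spam)
print(clf.predict(vec.transform(["click here to win money"])))
```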

As I recently argued, the importance of features has been overlooked in comparison to the choice of machine learners, or to lowering the costs and efforts of labeling and creating training sets and standards. The complete picture needs to include feature extraction, feature selection, and feature engineering.

A recent review paper helps redress this imbalance. Feature Selection: A Data Perspective [1] surveys and provides a comprehensive, well-organized overview of recent advances in feature selection research. According to the authors, Li et al., “the objectives of feature selection include: building simpler and more comprehensible models, improving data mining performance, and helping prepare, clean, and understand data.” The practical 73-page review is accompanied by an open-source feature selection library that implements most of the popular algorithms covered, along with a comprehensive performance analysis of the methods and their results.

The first nine pages of the review are devoted to a broad, accessible overview. The intro provides a clear explanation of features and their role in feature selection. It also explains why the high-dimensionality of features is a challenge in its own right.

The bulk of the document is devoted to a discussion of the various methods used in feature selection, organized according to:

  • generic data
  • structured features
  • heterogeneous data, and
  • streaming data.

Each method is characterized as to whether it applies to supervised or unsupervised learning. While I have used a different classification of the feature space, that does not affect the usefulness of Li et al.’s approach [1]. Also, in keeping with a review article, there are more than 11 pages of references containing nearly 150 citations.
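
As one concrete instance from the supervised, generic-data side of this organization, here is a minimal sketch using scikit-learn’s univariate chi-squared selection (the synthetic data are a stand-in; this is just one of many methods the review covers):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

X, y = make_classification(n_samples=100, n_features=30, n_informative=5,
                           random_state=0)
X = np.abs(X)  # chi-squared scoring requires non-negative feature values

# keep the 5 features that score highest against the labels
selector = SelectKBest(chi2, k=5).fit(X, y)
print(selector.get_support(indices=True))  # indices of the retained features
```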

The combined review nature of the paper also means that the various methods have been reduced to a common symbol set, which is a handy way to relate available features to multiple learners. This common treatment enables the authors to offer an open source repository, scikit-feature, written in Python and available from Github, that provides a library of 25 of the methods covered. A separate Web site presents test datasets and performance results for the methods.

This paper deserves a permanent place on the resource shelf of anyone with a serious interest in machine learning. I would not be surprised to see the authors’ organizational structure of feature selection methods become a standard. It is always a pleasure to encounter papers that are well written, understandable and comprehensive. Great job!


[1] Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, Huan Liu, 2016. “Feature Selection: A Data Perspective,” arXiv:1601.07996, 29 Jan 2016.