Posted: January 27, 2016

Part II in Our Series on the Resurgence of Artificial Intelligence

In Part I of this series we pointed to the importance of large electronic knowledge bases (Big Data) in the recent advances in AI for knowledge- and text-based systems. We noted that, amongst other factors such as speedy GPUs and algorithmic advances, electronic knowledge bases are perhaps the most important contributor to this resurgence of artificial intelligence.

But the real question is not what the most important factor in the resurgence of AI may be — though the answer to that points to vetted reference training sources. The real question is: How can we systematize this understanding by improving the usefulness of knowledge bases to support AI machine learning? Knowing that knowledge bases are important is not enough. If we can better understand what promotes — and what hinders — KBs for machine learning, perhaps we can design KB approaches that are even quicker and more effective for AI. In short, what should we be coding to promote knowledge-based artificial intelligence (KBAI)?

Why is There Not a Systematic Approach to Using KBs in AI?

To probe this question, let’s take the case of Wikipedia, the most important of the knowledge bases for AI machine learning purposes. According to Google Scholar, there have been more than 15,000 research articles relating Wikipedia to machine learning or artificial intelligence [1]. Growth in articles noting the use of Wikipedia for AI has been particularly strong in the past five years [1].

But two things are remarkable about this use of Wikipedia. First, virtually every one of these papers is a one-off. Each project stages and uses Wikipedia in its own way and with its own methodology and focus. Second, Wikipedia is able to make its contributions despite the fact that there are many weaknesses and gaps within the knowledge base itself. Clearly, despite these weaknesses, the availability of such large-scale knowledge in electronic form still provides significant advantages to AI and machine learning in the realm of natural language applications and text understanding. How much better might our learners be if we fed them more complete and coherent information?

Readers of this blog will be familiar with my periodic criticisms of Wikipedia as a structured knowledge resource [2]. These criticisms are not related to the general scope and coverage of Wikipedia, which, overall, is remarkable and unprecedented in human endeavor. Rather, the criticisms relate to the use of Wikipedia as is for knowledge representation purposes. To recap, here are some of the weaknesses of Wikipedia as a knowledge resource for AI:

  • Incoherency — the category structure of Wikipedia is particularly problematic. More than 60% of existing Wikipedia categories are not true (or “natural”) categories at all, but represent groupings more of convenience or compound attributes (such as Films directed by Pedro Almodóvar or Ambassadors of the United States to Mexico) [3]
  • Incomplete structure — attributes, as presented in Wikipedia’s infoboxes, are incomplete within and across entities, and links also have gaps. Wikidata offers promise to help bring greater consistency, but much will need to be achieved with bots, and provenance remains an issue
  • Incomplete coverage — the coverage and scope of Wikipedia are spotty, especially across language versions, and in any case the entities and concepts covered need to meet Wikipedia’s notability guidelines. For much domain analysis, Wikipedia’s domain coverage is inadequate. It would also be helpful if there were ways to extend the KB’s coverage for local or enterprise purposes
  • Inaccuracies — actually, given its crowdsourced nature, popular portions of Wikipedia are fairly vetted for accuracy. Peripheral or stub aspects of Wikipedia, however, may retain inaccuracies of coverage, tone or representation.
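
As an illustration of the category problem, a crude heuristic can flag many such compound, "of-convenience" category names by their prepositional construction. This is a sketch only; the marker list is an assumption for illustration, not a vetted classifier:

```python
import re

# Hypothetical heuristic: category names built around prepositional or
# participial phrases ("directed by", "of the United States to") often
# denote compound attributes rather than natural classes.
COMPOUND_MARKERS = re.compile(
    r"\b(by|from|in|of|to|at|with|established|born|directed)\b",
    re.IGNORECASE,
)

def is_compound_category(name: str) -> bool:
    """Flag a category name that looks like a grouping of convenience."""
    return bool(COMPOUND_MARKERS.search(name))

categories = [
    "Films directed by Pedro Almodóvar",
    "Ambassadors of the United States to Mexico",
    "Mammals",
    "Chemical elements",
]
flagged = [c for c in categories if is_compound_category(c)]
# The first two names are flagged; the two natural classes are not.
```

A production approach would of course need linguistic analysis well beyond a marker list, but even this rough filter conveys how pervasive the compound-category pattern is.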

As the tremendous use of Wikipedia for research shows, none of these weaknesses is fatal, and none alone has prevented meaningful use of the knowledge base. Further, much active research in areas such as knowledge base population [4] promises to aid solutions to some of these weaknesses. Recognition of the success of knowledge bases in training AI machine learners is also raising awareness that KB design for AI purposes is a worthwhile research topic in its own right. Much is happening in leveraging AI in bot designs for both Wikipedia and Wikidata. A better understanding of how to test and ensure coherency, matched with a knowledge graph for inferencing and logic, should help promote better and faster AI learners. The same techniques for testing consistency and coherence may be applied to mapping external KBs into such a reference structure.

Thus, the real question again is: How can we systematize the usefulness of knowledge bases to support AI machine learning? Simply by asking this question we can alter our mindset to discover readily available ways to improve knowledge bases for KBAI purposes.

Working Backwards from the Needs of Machine Learners

The best perspective to take on how to optimize knowledge bases for artificial intelligence derives from the needs of the machine learners. Not all individual learners have all of these needs, but from the perspective of a “platform” or “factory” for machine learners, the knowledge base and supporting structures should consider all of these factors:

  • features — are the raw input to machine learners, and may be evident, such as attributes, syntax or semantics, or may be hidden or “latent” [5]. A knowledge base such as Wikipedia can expose literally hundreds of different feature sets [5]. Of course, only a few of these feature types are useful for a given learning task, and many provide nearly duplicate “signals”. But across multiple learning tasks, many different feature types are desirable and can be made available to learners
  • knowledge graph — is the schematic representation of the KB domain, and is the basis for setting the coherency and logic and inference structure for the represented knowledge. The knowledge graph, provided in the form of an ontology, is the means by which logical slices can be identified and cleaved for such areas as entity type selection or the segregation of training sets. In situations like Wikipedia where the existing category structure is often incoherent, re-expressing the existing knowledge into an existing and proven schema is one viable approach
  • positive and negative training sets — for supervised learning, positive training sets provide a group of labeled, desired outputs, while negative training sets are similar in most respects but do not meet the desired conditions. The training sets provide the labeled outputs to which the machine learner is trained. Provision of both negative and positive sets is helpful, and the accuracy of the learner is in part a function of how “correct” the labeled training sets are
  • reference (“gold”) standards — vetted reference results, which are really validated training sets and therefore more rigorous to produce, are important to test the precision, recall and accuracy of machine learners [6]. This verification is essential during the process of testing the usefulness of various input features as well as model parameters. Without known standards, it is hard to converge many learners for effective predictions
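
To make the last point concrete, here is a minimal sketch of scoring a binary learner against a vetted gold standard, using the standard precision, recall and F1 definitions. The labels shown are illustrative placeholders, not real results:

```python
def score(gold: list, predicted: list) -> dict:
    """Compare binary predictions to gold labels (1 = positive class)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical gold labels and learner predictions over eight test items.
gold =      [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]
metrics = score(gold, predicted)
```

Without a vetted gold list of this kind, the precision and recall numbers used to tune features and model parameters have nothing to anchor them.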

Purposeful Knowledge Bases

A purposeful knowledge base for KBAI also imposes operational demands beyond the direct needs of the learners:

  • keeping the KBs current — the nature of knowledge is that it is constantly changing and growing. As a result, knowledge bases used in KBAI are constantly in flux. The restructuring and feature set generation from the knowledge base must be updated on a periodic basis. Keeping KBs current means that the overall process of staging knowledge bases for KBAI purposes must be made systematic through the use of scripts, build routines and validation tests. General administrative and management capabilities are also critical, and
  • repeatability — all of these steps must be repeatable, since new iterations of the knowledge base must retain coherency and consistency.
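
The flavor of such scripted staging and validation can be sketched in a few lines. The record format, checks and function names here are hypothetical, not a real toolchain:

```python
def stage_kb(raw_records):
    """Normalize raw KB records into (id, type, label) triples."""
    return [(r["id"], r["type"].lower(), r["label"].strip())
            for r in raw_records]

def validate(staged, allowed_types):
    """Return a list of validation errors; an empty list means the build passes."""
    errors = []
    seen = set()
    for rec_id, rec_type, label in staged:
        if rec_id in seen:
            errors.append(f"duplicate id: {rec_id}")
        seen.add(rec_id)
        if rec_type not in allowed_types:
            errors.append(f"unknown type: {rec_type} ({rec_id})")
        if not label:
            errors.append(f"empty label: {rec_id}")
    return errors

# Illustrative input with three deliberate problems: a duplicate id,
# an unknown type, and an empty label.
raw = [
    {"id": "Q1", "type": "Entity", "label": " Mammal "},
    {"id": "Q2", "type": "Attribute", "label": "habitat"},
    {"id": "Q2", "type": "Gadget", "label": ""},
]
errors = validate(stage_kb(raw), allowed_types={"entity", "attribute", "relation"})
```

The point is not the specific checks but that every KB refresh reruns the same staging steps and the same tests, so each new iteration can be verified before it is released to the learners.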

Thus, an effective knowledge base to support KBAI should have a number of desirable aspects. It should have maximum structure and exposed features. It should be organized by a coherent knowledge graph, which can be effectively sliced and reasoned over. It must be testable via logic and consistency and performance tests, such that training and reference sets may be vetted and refined. And it must have repeatable and verifiable scripts for updating the underlying KB as it changes and to generate new, working feature sets. Moreover, means for generally managing and manipulating the knowledge base and knowledge graph are important. These desirable aspects constitute a triad of required functionality.

Guidance for a Research Agenda

Achieving a purposeful knowledge base for KBAI uses is neither a stretch nor technically risky. We see the broad outlines of the requirements in the discussion above.

Two-thirds of the triad are relatively straightforward. First, creating a platform for managing the KBs is a fairly standard requirement for knowledge and semantics purposes; many platforms presently exist that can access and manage knowledge graphs, knowledge bases, and instance data at the necessary scales. Second, the build and testing scripts do require systematic attention, but these are not difficult tasks and are quite common in many settings. It is true that build and testing scripts can often prove brittle, so care needs to be taken in their design to facilitate maintainability. Fortunately, these are matters mostly of proper focus and good practice, and not conceptually difficult.

The major challenges reside in the third leg of the triad, namely in the twin needs to map the knowledge base into a coherent knowledge graph and into an underlying speculative grammar that is logically computable to support the expression of both feature and training sets. A variety of upper ontologies and lexical frameworks such as WordNet have been tried as the guiding graph structures for knowledge bases [7]. To my knowledge, none of these options has been scrutinized with the specific requirements of KBAI support in mind. With respect to the other twin need, that of a speculative grammar, our research to date [8] points to the importance of segregating the KB information into topics (concepts), relations and relation types, entities and entity types, and attributes and attribute types. Further distinctions, however, perhaps into roles, annotations (metadata), events and expressions of mereology, still require research. The role, placement and use of rules also remain to be determined.
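
The segregation just described can be sketched as a simple data structure. The class names and example items below are illustrative assumptions only, not the grammar itself:

```python
from dataclasses import dataclass
from enum import Enum

class Kind(Enum):
    """Top-level segregation of KB information, per the discussion above."""
    CONCEPT = "concept"        # topics
    RELATION = "relation"
    ENTITY = "entity"
    ATTRIBUTE = "attribute"

@dataclass
class KBItem:
    label: str
    kind: Kind
    item_type: str             # e.g., an entity type or attribute type

# Illustrative items showing each branch of the segregation.
items = [
    KBItem("Film", Kind.CONCEPT, "topic"),
    KBItem("directed by", Kind.RELATION, "agent relation"),
    KBItem("Pedro Almodóvar", Kind.ENTITY, "person"),
    KBItem("birth date", Kind.ATTRIBUTE, "temporal attribute"),
]

# Slicing the KB by kind, the kind of logical cleaving a knowledge
# graph enables for training-set or entity-type selection.
by_kind = {}
for item in items:
    by_kind.setdefault(item.kind, []).append(item.label)
```

Even a toy structure like this makes plain why the open research questions matter: roles, events and mereological relations do not fall cleanly into any of these four bins.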

You can see Part I of this series here.

[1] See, for example, this Google Scholar query: https://scholar.google.com/scholar?q=wikipedia+%22machine+learning%22+OR+%22artificial+intelligence%22. Growth data may be obtained by annual date range searches.
[2] Over the years I have addressed this topic in many articles. A recent sampling from my AI3:::Adaptive Information blog is “Shaping Wikipedia into a Computable Knowledge Base” (March 31, 2015); and “Creating a Platform for Knowledge-based Machine Intelligence” (September 21, 2015).
[3] See, for example, M.K. Bergman, 2015. “‘Natural Classes’ in the Knowledge Web,” AI3:::Adaptive Information blog, July 13, 2015.
[4] Knowledge base population, or KBP, first became a topic of research with the track by the same name starting at the Text Analysis Conference sponsored by NIST in 2009. The workshop track has continued annually ever since with greater prominence, and has been the initiative of many projects mining open sources for facts and assertions.
[5] See, for example, M.K. Bergman, 2015. “A (Partial) Taxonomy of Machine Learning Features,” AI3:::Adaptive Information blog, November 23, 2015.
[6] See, for example, M.K. Bergman, 2015. “A Primer on Knowledge Statistics,” AI3:::Adaptive Information blog, March 18, 2015.
[7] See, for example, M.K. Bergman, 2011. “In Search of ‘Gold Standards’ for the Semantic Web,” AI3:::Adaptive Information blog, February 28, 2011.
[8] Two primary articles that I have written on my AI3:::Adaptive Information blog on the information structure of knowledge bases bear on this question; see “Creating a Platform for Knowledge-based Machine Intelligence” (September 21, 2015), and “Conceptual and Practical Distinctions in the Attributes Ontology” (March 3, 2015).
Posted: January 25, 2016

Artificial Intelligence is in Bloom; But it Was Not Always So

Anyone beyond a certain age may recall the waning and waxing of the idea of AI, artificial intelligence. In fact, the periodic dismal prospects and poor reputation of artificial intelligence have been severe enough at times so as to warrant its own label: the “AI winters.” Clearly, today, we are in a resurgence of AI. But why is this so? Is the newly re-found popularity of AI merely a change in fashion, or is it due to more fundamental factors? And if it is more fundamental, what might those factors be that have led to this resurgence?

We only need to look at the world around us to see that the resurgence in AI is due to real developments, not a mere change in fashion. From virtual assistants that we can instruct or question by voice command to self-driving cars and face recognition, many mundane or tedious tasks of the past are being innovated away. The breakthroughs are real and seemingly at an increasing pace.

As to the reasons behind this resurgence, more than a year ago, the technology futurist Kevin Kelly got it mostly right when he posited these three breakthroughs [1]:

1. Cheap parallel computation
Thinking is an inherently parallel process, billions of neurons firing simultaneously to create synchronous waves of cortical computation. To build a neural network—the primary architecture of AI software—also requires many different processes to take place simultaneously. Each node of a neural network loosely imitates a neuron in the brain—mutually interacting with its neighbors to make sense of the signals it receives. To recognize a spoken word, a program must be able to hear all the phonemes in relation to one another; to identify an image, it needs to see every pixel in the context of the pixels around it—both deeply parallel tasks. But until recently, the typical computer processor could only ping one thing at a time. . . . That began to change more than a decade ago, when a new kind of chip, called a graphics processing unit, or GPU, was devised for the intensely visual—and parallel—demands of videogames . . . .
2. Big Data
Every intelligence has to be taught. A human brain, which is genetically primed to categorize things, still needs to see a dozen examples before it can distinguish between cats and dogs. That’s even more true for artificial minds. Even the best-programmed computer has to play at least a thousand games of chess before it gets good. Part of the AI breakthrough lies in the incredible avalanche of collected data about our world, which provides the schooling that AIs need. Massive databases, self-tracking, web cookies, online footprints, terabytes of storage, decades of search results, Wikipedia, and the entire digital universe became the teachers making AI smart.
3. Better algorithms
Digital neural nets were invented in the 1950s, but it took decades for computer scientists to learn how to tame the astronomically huge combinatorial relationships between a million—or 100 million—neurons. The key was to organize neural nets into stacked layers. Take the relatively simple task of recognizing that a face is a face. When a group of bits in a neural net are found to trigger a pattern—the image of an eye, for instance—that result is moved up to another level in the neural net for further parsing. The next level might group two eyes together and pass that meaningful chunk onto another level of hierarchical structure that associates it with the pattern of a nose. It can take many millions of these nodes (each one producing a calculation feeding others around it), stacked up to 15 levels high, to recognize a human face. . . .
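
The stacked-layer idea in the passage above can be illustrated with a toy forward pass, where each layer transforms the previous layer's outputs so that higher layers respond to combinations of lower-level patterns. The weights here are arbitrary illustrative numbers, not a trained recognizer:

```python
import math

def dense(inputs, weight_rows, biases):
    """One fully connected layer with a logistic activation per unit."""
    return [
        1.0 / (1.0 + math.exp(-(sum(i * w for i, w in zip(inputs, row)) + b)))
        for row, b in zip(weight_rows, biases)
    ]

pixels = [0.9, 0.1, 0.8, 0.2]         # raw low-level signals, "pixels"
h1 = dense(pixels,                    # layer 1: simple "edge"-like detectors
           [[1.0, -0.5, 0.8, 0.1],
            [-0.3, 0.9, 0.2, 0.7]],
           [-0.2, 0.1])
h2 = dense(h1, [[1.1, 0.9]], [-0.8])  # layer 2: combines layer-1 patterns
```

Real networks stack many such layers with millions of learned weights, but the composition principle (each level parsing the patterns found by the level below) is the same.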

To these factors I would add a fourth: 4. Distributed architectures (beginning with MapReduce) and new performant datastores (NoSQL, graph DBs, and triplestores). These new technologies, plus some rediscovered ones, gave us the confidence to tackle larger and larger reference datasets, while also helping us innovate high-performance data representation structures, such as graphs, lists, key-value pairs, feature vectors and finite state transducers. In any case, Kelly also notes the interconnection amongst these factors in the cloud, itself a more general enabling factor. I suppose, too, one could add open source to the mix as another factor.

Still, even though these factors have all contributed, I have argued in my series on knowledge-based artificial intelligence (KBAI) that electronic data sets (Big Data) are the most important enabling factor [2]. These reference datasets may range from images for image recognition (such as ImageNet) to statistical compilations from text (such as N-grams or co-occurrences) to more formal representations (such as ontologies or knowledge bases). Knowledge graphs and knowledge bases are the key enablers for AI in the realm of knowledge management and representation.

Some also tout algorithms as the most important source of AI innovation, but Alexander Wissner-Gross, writing in the Edge online magazine, comes down squarely on the side of data, naming its role in AI as the most interesting recent news in science [3]:

. . . perhaps many major AI breakthroughs have actually been constrained by the availability of high-quality training datasets, and not by algorithmic advances. For example, in 1994 the achievement of human-level spontaneous speech recognition relied on a variant of a hidden Markov model algorithm initially published ten years earlier, but used a dataset of spoken Wall Street Journal articles and other texts made available only three years earlier. In 1997, when IBM’s Deep Blue defeated Garry Kasparov to become the world’s top chess player, its core NegaScout planning algorithm was fourteen years old, whereas its key dataset of 700,000 Grandmaster chess games (known as the “The Extended Book”) was only six years old. In 2005, Google software achieved breakthrough performance at Arabic- and Chinese-to-English translation based on a variant of a statistical machine translation algorithm published seventeen years earlier, but used a dataset with more than 1.8 trillion tokens from Google Web and News pages gathered the same year. In 2011, IBM’s Watson became the world Jeopardy! champion using a variant of the mixture-of-experts algorithm published twenty years earlier, but utilized a dataset of 8.6 million documents from Wikipedia, Wiktionary, Wikiquote, and Project Gutenberg updated one year prior. In 2014, Google’s GoogLeNet software achieved near-human performance at object classification using a variant of the convolutional neural network algorithm proposed twenty-five years earlier, but was trained on the ImageNet corpus of approximately 1.5 million labeled images and 1,000 object categories first made available only four years earlier. 
Finally, in 2015, Google DeepMind announced its software had achieved human parity in playing twenty-nine Atari games by learning general control from video using a variant of the Q-learning algorithm published twenty-three years earlier, but the variant was trained on the Arcade Learning Environment dataset of over fifty Atari games made available only two years earlier.
Examining these advances collectively, the average elapsed time between key algorithm proposals and corresponding advances was about eighteen years, whereas the average elapsed time between key dataset availabilities and corresponding advances was less than three years, or about six times faster, suggesting that datasets might have been limiting factors in the advances.
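
As a quick check, these averages follow directly from the six algorithm and dataset ages quoted above, counting "the same year" as zero:

```python
# Ages (in years) of the key algorithm and key dataset at the time of each
# advance, taken from the six examples quoted above: speech recognition
# (1994), Deep Blue (1997), Google translation (2005), Watson (2011),
# GoogLeNet (2014), and DeepMind Atari (2015).
algorithm_ages = [10, 14, 17, 20, 25, 23]
dataset_ages = [3, 6, 0, 1, 4, 2]

avg_algorithm = sum(algorithm_ages) / len(algorithm_ages)  # about 18 years
avg_dataset = sum(dataset_ages) / len(dataset_ages)        # under 3 years
ratio = avg_algorithm / avg_dataset                        # roughly 6-7x
```

The arithmetic reproduces Wissner-Gross's figures: algorithms waited about eighteen years on average, datasets less than three.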

Seeing these correlations only affirms the importance of looking at knowledge bases from the specific lens of how they may best support training AI machine learners. We see the correlation; it is now time to optimize the expression of these KB potentials. We need to organize the KBs via coherent knowledge graphs and express the KBs in types, entities, attributes and relations representing their inherent, latent knowledge structure. Properly expressed KBs can support creating positive and negative training sets, promote feature set generation and expression, and create reference standards for testing AI learners and model parameters.

Past AI winters arose from lofty claims that were not then realized. Perhaps today’s claims may meet a similar fate.

Yet somehow I don’t think so. The truth is, today, we are seeing rapid progress in AI tasks of increasing usefulness and value all around us. The benefits from what will continue to be seen as ubiquitous AI should now ensure an economic and innovation engine behind AI for many years to come. One way that the AI engine will continue to be fueled is through a systematic understanding of how knowledge bases and their features can work hand in hand with machine learning to more effectively automate and meet our needs.

You can see Part II of this series here.

[1] Kevin Kelly, 2014. “The Three Breakthroughs That Have Finally Unleashed AI on the World,” in Wired.com, October 27, 2014.
[2] For example, from the perspective of hardware, see Jen-Hsun Huang, 2016. “Accelerating AI with GPUs: A New Computing Model,” Nvidia blog, January 12, 2016.
[3] Alexander Wissner-Gross, 2016. “Datasets Over Algorithms,” a response to the Edge.org 2016 Annual Question.