Posted:January 20, 2015

Forty Seminal Distant Supervision Articles

Some Annotated References in Relation to Knowledge-based Artificial Intelligence

Distant supervision, earlier or alternatively called self-supervision or weakly-supervised, is a method to use knowledge bases to label entities automatically in text, which is then used to extract features and train a machine learning classifier. The knowledge bases provide coherent positive training examples and avoid the high cost and effort of manual labelling. The method is generally more effective than unsupervised learning, though with similar reduced upfront effort. Large knowledge bases such as Wikipedia or Freebase are often used as the KB basis.

The first acknowledged use of distant supervision was Craven and Kumlien in 1999 (#11 below, though they used the term weak supervision); the first use of the formal term distant supervision was in Mintz et al. in 2009 (#21 below). Since then, the field has been a very active area of research.

Here are forty of the more seminal papers in distant supervision, with annotated comments for many of them:

Alan Akbik, Larysa Visengeriyeva, Priska Herger, Holmer Hemsen, and Alexander Löser, 2012. “Unsupervised Discovery of Relations and Discriminative Extraction Patterns,” in COLING, pp. 17-32. 2012. (Uses a method that discovers relations from unstructured text as well as finding a list of discriminative patterns for each discovered relation. An informed feature generation technique based on dependency trees can significantly improve clustering quality, as measured by the F-score. This paper uses Unsupervised Relation Extraction (URE), based on the latent relation hypothesis that states that pairs of words that co-occur in similar patterns tend to have similar relations. This paper discovers and ranks the patterns behind the relations.)
Marcel Ackermann, 2010. “Distant Supervised Relation Extraction with Wikipedia and Freebase,” internal teaching paper from TU Darmstadt.
Enriique Alfonesca, Katja Filippova, Jean-Yves Delort, and Guillermo Garrido, 2012. “Pattern Learning for Relation Extraction with a Hierarchical Topic Model,” inProceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pp. 54-59. Association for Computational Linguistics, 2012.
Alessio Palmero Aprosio, Claudio Giuliano, and Alberto Lavelli, 2013. “Extending the Coverage of DBpedia Properties using Distant Supervision over Wikipedia,.” In NLP-DBPEDIA@ ISWC. 2013. (Does not suggest amazing results.)
Isabelle Augenstein, Diana Maynard, and Fabio Ciravegna, 2014. “Distantly Supervised Web Relation Extraction for Knowledge Base Population,” in Semantic Web Journal (forthcoming). (The approach reduces the impact of data sparsity by making entity recognition tools more robust across domains and extracting relations across sentence boundaries using unsupervised co-reference resolution methods.) (Good definitions of supervised, unsupervised, semi-supervised and distant supervised.) (This paper aims to improve the state of the art in distant supervision for Web extraction by: 1) recognising named entities across domains on heterogeneous Web pages by using Web-based heuristics; 2) reporting results for extracting relations across sentence boundaries by relaxing the distant supervision assumption and using heuristic co-reference resolution methods; 3) proposing statistical measures for increasing the precision of distantly supervised systems by filtering ambiguous training data, 4) documenting an entitycentric approach for Web relation extraction using distant supervision; and 5) evaluating distant supervision as a knowledge base population approach and evaluating the impact of our different methods on information integration.)
Pedro HR Assis and Marco A. Casanova, 2014. “Distant Supervision for Relation Extraction using Ontology Class Hierarchy-Based Features,” in ESWC 2014. (Describes a multi-class classifier for relation extraction, constructed using the distant supervision approach, along with the class hierarchy of an ontology that, in conjunction with basic lexical features, improves accuracy and recall.) (Investigates how background data can be even further exploited by testing if simple statistical methods based on data already present in the knowledge base can help to filter unreliable training data.) (Uses DBpedia as source, Wikipedia as target. There is also a YouTube video that may be viewed.)
Isabelle Augenstein, 2014. “Joint Information Extraction from the Web using Linked Data, I. Augenstein’s Ph.D. proposal at the University of Sheffield.
Isabelle Augenstein, 2014. “Seed Selection for Distantly Supervised Web-Based Relation Extraction,” in Proceedings of SWAIE (2014). (Provides some methods for better seed determinations; also uses LOD for some sources.)
Justin Betteridge, Alan Ritter and Tom Mitchell, 2014. “Assuming Facts Are Expressed More Than Once,” in The Twenty-Seventh International Flairs Conference. 2014. (
R. Bunescu, R. Mooney., 2007. “Learning to Extract Relations from the Web Using Minimal Supervision,” in Annual Meeting for the Association for Computational Linguistics, 2007.
Mark Craven and Johan Kumlien. 1999. “Constructing Biological Knowledge Bases by Extracting Information from Text Sources,” in ISMB, vol. 1999, pp. 77-86. 1999. (Source of weak supervision term.)
Daniel Gerber and Axel-Cyrille Ngonga Ngomo, 2012. “Extracting Multilingual Natural-Language Patterns for RDF Predicates,” in Knowledge Engineering and Knowledge Management, pp. 87-96. Springer Berlin Heidelberg, 2012. (The idea behind BOA is to extract natural language patterns that represent predicates found on the Data Web from unstructured data by using background knowledge from the Data Web, specifically DBpedia. See further the code or demo.)
Edouard Grave, 2014. “Weakly Supervised Named Entity Classification,” in Workshop on Automated Knowledge Base Construction (AKBC), 2014. (Uses a novel PU (positive and unlabelled) method for weakly supervised named entity classification, based on discriminative clustering.) (Uses a simple string match between the seed list of named entities and unlabeled text from the specialized domain, it is easy to obtain positive examples of named entity mentions.)
Edouard Grave, 2014. “A Convex Relaxation for Weakly Supervised Relation Extraction,” in Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014. (Addressed the multiple label/learning problem. Seems to outperform other state-of-the-art extractors, though the author notes in conclusion that kernel methods should also be tried. See other Graves 2014 reference.)
Malcolm W. Greaves, 2014. “Relation Extraction using Distant Supervision, SVMs, and Probabilistic First Order Logic,” PhD dissertation, Carnegie Mellon University, 2014. (Useful literature review and pipeline is one example.)
Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld, 2011. “Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 541-550. Association for Computational Linguistics, 2011. (A novel approach for multi-instance learning with overlapping relations that combines a sentence-level extraction model with a simple, corpus-level component for aggregating the individual facts. (Uses a self-supervised, relation-specific IE system which learns 5025 relations.) (“Knowledge-based weak supervision, using structured data to heuristically label a training corpus, works towards this goal by enabling the automated learning of a potentially unbounded number of relation extractors.” ““weak” or “distant” supervision, creates its own training data by heuristically matching the contents of a database to corresponding text”.) (Also introduces MultiR)
Ander Intxaurrondo, Mihai Surdeanu, Oier Lopez de Lacalle, and Eneko Agirre, 2013. “Removing Noisy Mentions for Distant Supervision,” in Procesamiento del Lenguaje Natural 51 (2013): 41-48. (Suggests filter methods to remove some noisy potential assignments.)
Mitchell Koch, John Gilmer, Stephen Soderland, and Daniel S. Weld, 2014. “Type-Aware Distantly Supervised Relation Extraction with Linked Arguments,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1891–1901, October 25-29, 2014, Doha, Qatar. (Investigates four orthogonal improvements to distance supervision: 1) integrating named entity linking (NEL) and 2) coreference resolution into argument identification for training and extraction, 3) enforcing type constraints of linked arguments, and 4) partitioning the model by relation type signature.) (Enhances the MultiR basis; see http://cs.uw.edu/homes/mkoch/re for code and data.)
Yang Liu, Kang Liu, Liheng Xu, and Jun Zhao, 2014. “Exploring Fine-grained Entity Type Constraints for Distantly Supervised Relation Extraction,” in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2107–2116, Dublin, Ireland, August 23-29 2014. (More fine-grained entities produce better matching results.)
Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek, 2013. “Distant Supervision for Relation Extraction with an Incomplete Knowledge Base,” in HLT-NAACL, pp. 777-782. 2013. (Standard distant supervision does not properly account for the negative training examples.)
Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky, 2009. “Distant Supervision for Relation Extraction without Labeled Data,” in Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 1003–1011, Suntec, Singapore, 2-7 August 2009. (Because their algorithm is supervised by a database, rather than by labeled text, it does not suffer from the problems of overfitting and domain-dependence that plague supervised systems. First use of the ‘distant supervision’ approach.)
Ndapandula T. Nakashole, 2012. “Automatic Extraction of Facts, Relations, and Entities for Web-Scale Knowledge Base Population,” Ph.D. Dissertation for the University of Saarland, 2012. (Excellent overview and tutorial; introduces the tools Prospera, Patty and PEARL.)
Truc-Vien T. Nguyen and Alessandro Moschitti, 2011. “End-to-end Relation Extraction Using Distant Supervision from External Semantic Repositories,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pp. 277-282. Association for Computational Linguistics, 2011. (Shows standard Wikipedia text can also be a source for relations.)
Marius Paşca, 2007. “Weakly-Supervised Discovery of Named Entities Using Web Search Queries,” in Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, pp. 683-690. ACM, 2007.
Marius Paşca, 2009. “Outclassing Wikipedia in Open-Domain Information Extraction: Weakly-Supervised Acquisition of Attributes Over Conceptual Hierarchies,” inProceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 639-647. Association for Computational Linguistics, 2009.
Kevin Reschke, Martin Jankowiak, Mihai Surdeanu, Christopher D. Manning, and Daniel Jurafsky, 2012. “Event Extraction Using Distant Supervision,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik. 2014. (They demonstrate that the SEARN algorithm outperforms a linear-chain CRF and strong baselines with local inference.)
Sebastian Riedel, Limin Yao, and Andrew McCallum, 2010. “Modeling Relations and their Mentions without Labeled Text,” in Machine Learning and Knowledge Discovery in Databases, pp. 148-163. Springer Berlin Heidelberg, 2010. (They use a factor graph to determine if the two entities are related, then apply constraint-driven semi-supervision.)
Alan Ritter, Luke Zettlemoyer, Mausam, and Oren Etzioni, 2013. “Modeling Missing Data in Distant Supervision for Information Extraction,” TACL 1 (2013): 367-378. (Addresses the question of missing data in distant supervision.) (Appears to address many of the initial MultiR issues.)
Benjamin Roth and Dietrich Klakow, 2013. “Combining Generative and Discriminative Model Scores for Distant Supervision,” in EMNLP, pp. 24-29. 2013.(By combining the output of a discriminative at-least-one learner with that of a generative hierarchical topic model to reduce the noise in distant supervision data, the ranking quality of extracted facts is significantly increased and achieves state-of-the-art extraction performance in an end-to-end setting.)
Benjamin Rozenfeld and Ronen Feldman, 2008. “Self-Supervised Relation Extraction from the Web,” in Knowledge and Information Systems 17.1 (2008): 17-33.
Hui Shen, Mika Chen, Razvan Bunescu and Rada Mihalcea, 2014. “Wikipedia Taxonomic Relation Extraction using Wikipedia Distant Supervision.” (Negative examples based on Wikipedia revision history; perhaps problematic. Interesting recipes for sub-graph extractions. Focused on is-a relationship. See also http://florida.cs.ohio.edu/wpgraphdb/.)
Stephen Soderland, Brendan Roof, Bo Qin, Shi Xu, Mausam, and Oren Etzioni, 2010. “Adapting Open Information Extraction to Domain-Specific Relations,” in AI Magazine 31, no. 3 (2010): 93-102. (A bit more popular treatment; no new ground.)
Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning, 2012. “Multi-Instance Multi-Label Learning for Relation Extraction,” in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 455-465. Association for Computational Linguistics, 2012. (Provides means to find previously unknown relationships using a graph.)
Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa, 2012. “Reducing Wrong Labels in Distant Supervision for Relation Extraction,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 721-729. Association for Computational Linguistics, 2012. (Proposes a method to reduce the incidence of false labels.)
Bilyana Taneva and Gerhard Weikum, 2013. “Gem-based Entity-Knowledge Maintenance,” in Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management, pp. 149-158. ACM, 2013. (Methods to create the text snippets — GEMS — that are used to train the system.)
Andreas Vlachos and Stephen Clark, 2014. Application-Driven Relation Extraction with Limited Distant Supervision, in COLING 2014 (2014): 1. (Uses the Dagger learning algorithm.)
Wei Xu, Raphael Hoffmann, Le Zhao, and Ralph Grishman, 2013. “Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction,” in ACL (2), pp. 665-670. 2013. (Addresses the problem of false negative training examples mislabeled due to the incompleteness of knowledge bases.)
Wei Xu Ralph Grishman and Le Zhao, 2011. “Passage Retrieval for Information Extraction using Distant Supervision,” in Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1046–1054, Chiang Mai, Thailand, November 8 – 13, 2011. (Filtering of candidate passages improves quality.)
Y. Yan, N. Okazaki, Y. Matsuo, Z. Yang, M. Ishizuka, 2009. “Unsupervised Relation Extraction by Mining Wikipedia Texts Using Information from the Web,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009.
Xingxing Zhang, Jianwen Zhang, Junyu Zeng, Jun Yan, Zheng Chen, and Zhifang Sui, 2013. “Towards Accurate Distant Supervision for Relational Facts Extraction,” in ACL (2), pp. 810-815. 2013. (Three factors on how to improve the accuracy of distant supervision.)

Posted:January 12, 2015

The Era of Openness

The Internet Has Catalyzed Trends that are Creative, Destructive and Transformative

Download PDF

Something very broad and profound has been happening over the recent past. It is not something that can be tied to a single year or a single event. It is also something that is quite complex in that it is a matrix of forces, some causative and some derivative, all of which tend to reinforce one another to perpetuate the trend. The trend that I am referring to is openness, and it is a force that is both creative and destructive, and one that in retrospect is also inevitable given the forces and changes underlying it.

It is hard to gauge exactly when the blossoming of openness began, but by my lights the timing corresponds to the emergence of open source and the Internet. Early bulletin board systems (BBS) often were distributed with source code, and these systems foreshadowed the growth of the Internet. While the Internet itself may be dated to ARPA efforts from 1969, it is really more the development of the Web around 1991 that signaled the real growth of the medium.

Over the past quarter century, the written use of the term “open” has increased more than 40% in frequency in comparison to terms such as “near” or “close” [1], a pretty remarkable change in usage for more-or-less common terms, as this figure shows:

Though the idea of “openness” is less common than “open”, its change in written use has been even more spectacular, with its frequency more than doubling (112%) over the past 25 years. The change in growth slope appears to coincide with the mid-1980s.

Because “openness” is more of a mindset or force — a point of view, if you will — it is not itself a discrete thing, but an idea or concept. In contemplating this world of openness, we can see quite a few separate, yet sometimes related, strands that provide the weave of the “openness” definition [2]:

Open source — refers to a computer program in which the source code is available to the general public for use and/or modification from its original design. Open-source code is typically a collaborative effort where programmers improve upon the source code and share the changes within the community so that other members can help improve it further
Open standards — are standards and protocols that are fully defined and available for use without royalties or restrictions; open standards are often developed in a public, collaborative manner that enables stakeholders to suggest and modify features, with adoption generally subject to some open governance procedures
Open content — is a creative work, generally based on text, that others can copy or modify; open access publications are a special form of open content that provide unrestricted online access to peer-reviewed scholarly research
Open data — is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control; open data is a special form of open content
Open knowledge — is what open data becomes when it is useful, usable and used; according to the Open Knowledge Foundation, the key features of openness are availability and access wherein the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the Internet
Open knowledge bases — are open knowledge packaged in knowledge-base form
Open access to communications — means non-discriminatory means to access communications networks; because of the opennesss of access, additional features might emerge including the idea of crowdsourcing (obtaining content, services or ideas from a large group of people), including such major variants as citizen science or crowdfunding (raising funds from a large group of people)
Open rights — are an umbrella term to cover the ability to obtain content or data without copyright restrictions and gaining use and access to software or intellectual property via open licenses
Open logics — are the use of logical constructs, such as the open world assumption, which enable data and information to be added to existing systems without the need to re-architect the underlying data schema; such logics are important to knowledge management and the continuous additon of new information
Open architectures — are means to access existing software and platforms via such means as open APIs (application programming interfaces), open formats (published specifications for digital data) or open Web services
Open government — is a governing doctrine that holds citizens have the right to access the documents and proceedings of the government to allow for effective public oversight; it is generally accompanied by means for online access to government data and information
Open education — is an institutional practice or programmatic initiative that broadens access to the learning and training traditionally offered through formal education systems, generally to educational materials, curricula or course notes at low or no cost without copyright limitations
Open design — is the development of physical products, machines and systems through use of publicly shared design information, often via online collaboration
Open research — makes the methodology and results of research freely available via the Internet, and often invites online collaboration; if the research is scientific in nature, it is frequently referred to as open science, and
Open innovation — is the use and combination of open and public sources of ideas and innovations with those internal to the organization.

In looking at the factors above, we can ask two formative questions. First, is the given item above primarly a causative factor for “openness” or something that has derived from a more “open” environment? And, second, does the factor have an overall high or low impact on the question of openness. Here is my own plotting of these factors against these dimensions:

Early expressions of the “openness” idea help cause the conditions that lead to openness in other areas. As those areas also become more open, a positive reinforcement is passed back to earlier open factors, all leading to a virtuous circle of increased openness. Though perhaps not strictly “open,” other various and related factors such as the democratization of knowledge, broader access to goods and services, more competition, “long tail” access and phenomenon, and in truly open environments, more diversity and more participation, also could be plotted on this matrix.

Once viewed through the umbrella lens of “openness”, it starts to become clear that all of these various “open” aspects are totally remaking information technology and human interaction and commerce. The impacts on social norms and power and governance are just as profound. Though many innovations have uniquely shaped the course of human history — from literacy to mobility to communication to electrification or computerization — none appear to have matched the speed of penetration nor the impact of “openness”.

Separating the Chicken from the Egg

So, what is driving this phenomenon? From where did the concept of “openness” arise?

Actually, this same matrix helps us hypothesize one foundational story. Look at the question of what is causative and what might be its source. The conclusion appears to be the Internet, specifically the Web, as reinforced and enabled by open-source software.

Relatively open access to an environment of connectivity guided by standard ways to connect and contribute began to fuel still further connections and contributions. The positive values of access and connectivity via standard means, in turn, reinforced the understood value of “openness”, leading to still further connections and engagement. More openness is like the dropped sand grain that causes the entire sand dune to shift.

The Web with its open access and standards has become the magnet for open content and data, all working to promote derivative and reinforcing factors in open knowledge, education and government:

The engine of “openness” tends to reinforce the causative factors that created “openness” in the first place. More knowledge and open aspects of collaboration lead to still further content and standards that lead to further open derivatives. In this manner “openness” becomes a kind of engine that promotes further openness and innovation.

There is a kind of open logic (largely premised on the open world assumption) that lies at the heart of this engine. Since new connections and new items are constantly arising and fueling the openness engine, new understandings are constantly being bolted on to the original starting understandings. This accretive model of growth and development is similar to the depositive layers of pearls or the growth of crystals. The structures grow according to the factors governing the network effect [3], and the nature of the connected growth structures may be represented and modeled as graphs. “Openness” appears to be a natural force underlying the emerging age of graphs [4].

Openness is Both Creative and Destructive . . .

“Openness”, like the dynamism of capitalism, is both creative and destructive [5]. The effects are creative — actually transformative — because of the new means of collaboration that arise based on the new connections between new understandings or facts. “Open” graphs create entirely new understandings as well as provide a scaffolding for still further insights. The fire created from new understandings pulls in new understandings and contributions, all sucking in still more oxygen to keep the innovation cycle burning.

But the creative fire of openness is also destructive. Proprietary software, excessive software rents, silo’ed and stovepiped information stores, and much else are being consumed and destroyed in the wake of openness. Older business models — indeed, existing suppliers — are in the path of this open conflagration. Private and “closed” solultions are being swept before the openness firestorm. The massive storehouse of legacy kindling appears likely to fuel the openness flames for some time to come.

“Openness” becomes a form of adaptive life, changing the nature, value and dynamics of information and who has access to it. Though much of the old economy is — and, will be — swept away in this destructive fire, new and more fecund growth is replacing it. From the viewpoint of the practitioner on the ground, I have not seen a more fertile innovation environment in information technology in more than thirty years of experience.

. . . and Seemingly Inevitable

Once the proper conditions for “openness” were in place, it now seems inevitable that today’s open circumstances would unfold. The Internet, with its (generally) open access and standards, was a natural magnet to attract and promote open-source software and content. A hands-off, unregulated environment has allowed the Internet to innovate, grow, and adapt at an unbelievable rate. So much unconnected dry kindling exists to stoke the openness fire for some time to come.

Of course, coercive state regimes can control the Internet to varying degrees and have limited innovation in those circumstances. Also, any change to more “closed” and less “open” an Internet may also act over time to starve the openness fire. Examples of such means to slow openness include imposing Internet regulation, limiting access (technically, economically or by fiat), moving away from open standards, or limiting access to content. Any of these steps would starve the innovation fire of oxygen.

Adapting to the Era of Openness

The forces impelling openness are strong. But, these observations certainly provide no proof for cause-and-effect. The correspondence of “openness” to the Internet and open source may simply be coincidence. But my sense suggests a more causative role is likely. Further, these forces are strong, and are sweeping before them much in the way of past business practices and proprietary methods.

In all of these regards “openness” is a woven cord of forces changing the very nature and scope of information available to humanity. “Openness”, which has heretofore largely lurked in the background as some unseeing force, now emerges as a criterion by which to judge the wisdom of various choices. “Open” appears to contribute more and be better aligned with current forces. Business models based on proprietary methods or closed information generally are on the losing side of history.

For these forces to remain strong and continue to contribute material benefits, the Internet and its content in all manifestations needs to remain unregulated, open and generally free. The spirit of “open” remains just that, and dependent on open and equal access and rights to the Internet and content.

[1] The data is from Google book trends data based on this query (inspect the resulting page source to obtain the actual data); the years 2009 to 2014 were projected based on prior actuals to 1980; percentage term occurrences were converted to term frequencies by 1/n.

[2] All links and definitions in this section were derived from Wikipedia.

[3] See M.K. Bergman, 2014. “The Value of Connecting Things – Part I: A Foundation Based on the Network Effect,” AI3:::Adaptive Information blog, September 2, 2014.

[4] See M.K. Bergman, 2012. “The Age of the Graph,” AI3:::Adaptive Information blog, August 12, 2012; and John Edward Terrell, Termeh Shafie and Mark Golitko, 2014. “How Networks Are Revolutionizing Scientific (and Maybe Human) Thought,” Scientific American, December 12, 2014.

[5] Creative destruction is a term from the economist Joseph Schumpeter that describes the process of industrial change from within whereby old processes are incessantly destroyed and replaced by new ones, leading to a constant change of economic firms that are winners and losers.

Posted:December 10, 2014

Why Clojure?

Ten Management Reasons for Choosing Clojure for Adaptive Knowledge Apps

It is not unusual to see articles touting one programming language or listing the reasons for choosing another, but they are nearly always written from the perspective of the professional developer. As an executive with much experience directing software projects, I thought a management perspective could be a useful addition to the dialog. My particular perspective is software development in support of knowledge management, specifically leveraging artificial intelligence, semantic technologies, and data integration.

Context is important in guiding the selection of programming languages. C is sometimes a choice for performance reasons, such as in data indexing or transaction systems. Java is the predominant language for enterprise applications and enterprise-level systems. Scripting languages are useful for data migrations and translations and quick applications. Web-based languages of many flavors help in user interface development or data exchange. Every one of the hundreds of available programming languages has a context and rationale that is argued by advocates.

We at Structured Dynamics have recently made a corporate decision to emphasize the Clojure language in the specific context of knowledge management development. I’d like to offer our executive-level views for why this choice makes sense to us. Look to the writings of SD’s CTO, Fred Giasson, for arguments related to the perspective of the professional developer.

Some Basic Management Premises

I have overseen major transitions in programming languages from Fortran or Cobol to C, from C to C++ and then Java, and from Java to more Web-oriented approaches such as Flash, JavaScript and PHP [1]. In none of these cases, of course, was a sole language adopted by the company, but there always was a core language choice that drove hiring and development standards. Making these transitions is never easy, since it affects employee choices, business objectives and developer happiness and productivity. Because of these trade-offs, there is rarely a truly “correct” choice.

Languages wax and wane in popularity, and market expectations and requirements shift over time. Twenty years ago, Java brought forward a platform-independent design well-suited for client-server computing, and was (comparatively) quickly adopted by enterprises. At about the same time Web developments added browser scripting languages to the mix. Meanwhile, hardware improvements overcame many previous performance limitations in favor of easier to use syntaxes and tooling. No one hardly programs for an assembler anymore. Sometimes, like Flash, circumstances and competition may lead to a rapid (and unanticipated) abandonment.

The fact that such transitions naturally occur over time, and the fact that distributed and layered architectures are here to stay, has led to my design premise to emphasize modularity, interfaces and APIs [2]. From the browser, client and server sides we see differential timings of changes and options. It is important that piece parts be able to be swapped out in favor of better applications and alternatives. Irrespective of language, architectural design holds the trump card in adaptive IT systems.

Open source has also had a profound influence on these trends. Commercial and product offerings are no longer monolithic and proprietary. Rather, modern product development is often more based on assembly of a diversity of open source applications and libraries, likely written in a multitude of languages, which are then assembled and “glued” together, often with a heavy reliance on scripting languages. This approach has certainly been the case with Structured Dynamics’ Open Semantic Framework, but OSF is only one example of this current trend.

The trend to interoperating open source applications has also raised the importance of data interoperability (or ETL in various guises) plus reconciling semantic heterogeneities in the underlying schema and data definitions of the contributing sources. Language choices increasingly must recognize these heterogeneities.

I have believed strongly in the importance of interest and excitement by the professional developers in the choice of programming languages. The code writers — be it for scripting or integration or fundamental app development — know the problems at hand and read about trends and developments in programming languages. The developers I have worked with have always been the source of identifying new programming options and languages. Professional developers read much and keep current. The best programmers are always trying and testing new languages.

I believe it is important for management within software businesses to communicate anticipated product changes within the business to its developers. I believe it is important for management to signal openness and interest in hearing the views of its professional developers in language trends and options. No viable software development company can avoid new upgrades of its products, and choices as to architecture and language must always be at the forefront of version planning.

When developer interest and broader external trends conjoin, it is time to do serious due diligence about a possible change in programming language. Tooling is important, but not dispositive. Tooling rapidly catches up with trending and popular new languages. As important to tooling is the “fit” of the programming language to the business context and the excitement and productivity of the developers to achieve that fit.

“Fitness” is a measure of adaptiveness to a changing environment. Though I suppose some of this can be quantified — as it can in evolutionary biology — I also see “fit” as a qualitative, even aesthetic, thing. I recall sensing the importance of platform independence and modularity in Java when it first came out, results (along with tooling) that were soon borne out. Sometimes there is just a sense of “rightness” or alignment when contemplating a new programming language for an emerging context.

Such is the case for us at Structured Dynamics with our recent choice to emphasize Clojure as our core development language. This choice does not mean we are abandoning our current code base, just that our new developments will emphasize Clojure. Again, because of our architectural designs and use of APIs, we can make these transitions seamlessly as we move forward.

However, for this transition, unlike prior ones I have made noted above, I wanted to be explicit as to the reasons and justifications. Presenting these reasons for Clojure is the purpose of this article.

Brief Overview of Clojure

Clojure is a relatively new language, first released in 2007 [3]. Clojure is a dialect of Lisp, explicitly designed to address some Lisp weaknesses in a more modern package suitable to the Web and current, practical needs. Clojure is a functional programming language, which means it has roots in the lamdba calculus and functions are “first-class citizens” in that they can be passed as arguments to other functions, returned as values, or assigned as variables in data structures. These features make the language well suited to mathematical manipulations and the building up of more complicated functions from simpler ones.

Clojure was designed to run on the Java Virtual Machine (JVM) (now expanded to other environments such as ClojureScript and Clojure CLR) rather than any specific operating system. It was designed to support concurrency. Other modern features were added in relation to Web use and scalability. Some of these features are elaborated in the rationales noted below.

As we look to the management reasons for selecting Clojure, we can really lump them into two categories: a) those that arise mostly from Lisp, as the overall language basis; and b) specific aspects added to Clojure that overcome or enhance the basis of Lisp.

Reasons Deriving from Lisp

Lisp (defined as a list processing language) is one of the older computer languages around, dating back to 1958, and has evolved to become a family of languages. “Lisp” has many variants, with Common Lisp one of the most prevalent, and many dialects that have extended and evolved from it.

Lisp was invented as a language for expressing algorithms. Lisp has a very simple syntax and (in base form) comparatively few commands. Lisp syntax is notable (and sometimes criticized) for its common use of parentheses, in a nested manner, to specify the order of operations.

Lisp is often associated with artificial intelligence programming, since it was first specified by John McCarthy, the acknowledged father of AI, and was the favored language for many early AI applications. Many have questioned whether Lisp has any special usefulness to AI or not. Though it is hard to point to any specific reason why Lisp would be particularly suited to artificial intelligence, it does embody many aspects highly useful to knowledge management applications. It was these reasons that caused us to first look at Lisp and its variants when we were contemplating language alternatives:

Open — I have written often that knowledge management, given its nature grounded in the discovery of new facts and connections, is well suited to being treated under the open world assumption [4]. OWA is a logic that explicitly allows the addition of new assertions and connections, built upon a core of existing knowledge. In both optical and literal senses, the Lisp language is an open one that is consistent with OWA. The basis of Lisp has a “feel” much like RDF (Resource Description Framework), the prime language of OWA. RDF is built from simple statements, as is Lisp. New objects (data) can be added easily to both systems. Adding new predicates (functions) is similarly straightforward. Lisp’s nested parentheses also recall Boolean logic, another simple means for logically combining relationships. As with semantic technologies, Lisp just seems “right” as a framework for addressing knowledge problems;
Extensible — one of the manifestations of Lisp’s openness is its extensiblity via macros. Macros are small specifications that enable new functions to be written. This extensibility means that a “bottom up” programming style can be employed [5] that allows the syntax to be expanded with new language to suit the problem, leading in more complete cases to entirely new — and tailored to the problem — domain-specific languages (DSLs). As an expression of Lisp’s openness, this extensibility means that the language itself can morph and grow to address the knowledge problems at hand. And, as that knowledge continues to grow, as it will, so may the language by which to model and evaluate it;
Efficient — Lisp, then, by definition, has a minimum of extraneous code functions and, after an (often steep) initial learning curve, rapid development of appropriate ones for the domain. Since anything Lisp can do to a data structure it can do to code, it is an efficient language. When developing code, Lisp provides a read-eval-print-loop (REPL) environment that allows developers to enter single expressions (s-expressions) and evaluate them at the command line, leading to faster and more efficient ways to add features, fix bugs, and test code. Accessing and managing data is similarly efficient, along with code and data maintenance. A related efficiency with Lisp is lazy evaluation, wherein only the given code or data expression at hand is evaluated as its values are needed, rather than building evaluation structures in advance of execution;
Data-centric — the design aspect of Lisp that makes many of these advantages possible is its data-centric nature. This nature comes from Lisp’s grounding in the lambda calculus and its representation of all code objects via an abstract syntax tree. These aspects allow data to be immutable, and for data to act as code, and code to act as data [6]. Thus, while we can name and reference data or functions as in any other language, they are both manipulated and usable in the same way. Since knowledge management is the combination of schema with instance data, Lisp (or other homoiconic languages) is perfectly suited; and,
Malleable — a criticism of Lisp through the decades has been its proliferation of dialects, and lack of consistency between them. This is true, and Lisp is therefore not likely the best vehicle in its native form for interoperability. (It may also not be the best basis for large-scale, complicated applications with responsive user interfaces.) But for all of the reasons above, Lisp can be morphed into many forms, including the manipulation of syntax. In such malleable states, the dialect maintains its underlying Lisp advantages, but can be differently purposed for different uses. Such is the case with Clojure, discussed in the next section.

Reasons Specific to the Clojure Language

Clojure was invented by Rich Hickey, who knew explicitly what he wanted to accomplish leveraging Lisp for new, more contemporary uses [7]. (Though some in the Lisp community have bristled against the idea that dialects such as Common Lisp are not modern, the points below really make a different case.) Some of the design choices behind Clojure are unique and quite different from the Lisp legacy; others leverage and extend basic Lisp strengths. Thus, with Clojure, we can see both a better Lisp, at least for our stated context, and one designed for contemporary environments and circumstances.

Here are what I see as the unique advantages of Clojure, again in specific reference to the knowledge management context:

Virtual machine design (interoperability) — the masterstroke in Clojure is to base it upon the Java Virtual Machine. All of Lisp’s base functions were written to use the JVM. Three significant advantages accrue from this design. First, Clojure programs can be compiled as jar files and run interactively with any other Java program. In the context of knowledge management and semantic uses, fully 60% of existing applications can now interoperate with Clojure apps [8], an instant boon for leveraging many open source capabilities. Second, certain advantages from Java, including platform independence and the leverage of debugging and profiling tools, among others (see below), are gained. And, third, this same design approach has been applied to integrating with JavaScript (via ClojureScript) and the Common Language Runtime execution engine of Microsoft’s .Net Framework (via Clojure CLR), both highly useful for Web purposes and as exemplars for other integrations;
Scalable performance — by leveraging Java’s multithreaded and concurrent design, plus a form of caching called memoization in conjunction with the lazy evaluation noted above, Clojure gains significant performance advantages and scalability. The immutability of Clojure data has minimal data access conflicts (it is thread-safe), further adding to performance advantages. We (SD) have yet to require such advantages, but it is a comfort to know that large-scale datasets and problems likely have a ready programmatic solution when using Clojure;
More Lispiness — Clojure extends the basic treatment of Lisp’s s-expressions to maps and vectors, basically making all core constructs into extensible abstractions; Clojure is explicit in how metadata and reifications can be applied using Lisp principles, really useful for semantic applications; and Clojure EDN (Extensible Data Notation) was developed as a Lisp extension to provide an analog to JSON for data specification and exchange using ClojureScript [9]. SD, in turn, has taken this approach and extended it to work with the OSF Web services [10]. These extensions go to the core of the Lisp mindset, again reaffirming the malleability and extensibility of Lisp;
Process-oriented — many knowledge management tasks are themselves the results of processing pipelines, or are semi-automatic in nature and require interventions by users or subject matter experts in filtering or selecting results. Knowledge management tasks and the data pre-processing that precedes them often require step-wise processes or workflows. The immutability of data and functions in Lisp means that state is also immutable. Clojure takes advantage of these Lisp constructs to make explicit the treatment of time and state changes. Further, based on Hickey’s background in scheduling systems, a construct of transducers [11] is being introduced in version 1.70. The design of these systems is powerful for defining and manipulating workflows and rule-based branching. Data migration and transformations benefit from this design; and
Active and desirable — hey, developers want to work with this stuff and think it is fun. SD’s clients and partners are also desirous of working with program languages attuned to knowledge management problems, diverse native data, and workflows controlled and directed by knowledge workers themselves. These market interests are key affirmations that Clojure, or dialects similar to it, are the wave of the future for knowledge management programming. Combined with its active developer community, Clojure is our choice for KM applications for the foreseeable future.

Clojure for Knowledge Apps

I’m sure had this article been written from a developer’s perspective, different emphases and different features would have arisen. There is no perfect programming language and, even if there were, its utility would vary over time. The appropriateness of program languages is a matter of context. In our context of knowledge management and artificial intelligence applications, Clojure is our due diligence choice from a business-level perspective.

There are alternatives to the points raised herein, like Scheme, Erlang or Haskell. Scala offers some of the same JVM benefits as noted. Further, tooling for Clojure is still limited (though growing), and it requires Java to run and develop. Even with extensions and DSLs, there is still the initial awkwardness of learning Lisp’s mindset.

Yet, ultimately, the success of a programming language is based on its degree of use and longevity. We are already seeing very small code counts and productivity from our use of Clojure. We are pleased to see continued language dynamism from such developments as Transit [9] and transducers [11]. We think many problem areas in our space — from data transformations and lifting, to ontology mapping, and then machine learning and AI and integrations with knowledge bases, all under the control of knowledge workers versus developers — lend themselves to Clojure DSLs of various sorts. We have plans for these DSLs and look forward to contribute them to the community.

We are excited to now find an aesthetic programming fit with our efforts in knowledge management. We’d love to see Clojure become the go-to language for knowledge-based applications. We hope to work with many of you in helping to make this happen.

[1] I have also been involved with the development of two new languages, Promula and VIQ, and conducted due diligence on C#, Ruby and Python, but none of these languages were ultimately selected.

[2] Native apps on smartphones are likely going through the same transition.

[3] As of the date of this article, Clojure is in version 1.60.

[4] See M. K. Bergman, 2009. ” The Open World Assumption: Elephant in the Room,” December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems.

OWA is a formal logic assumption that the truth-value of a statement is independent of whether or not it is known by any single observer or agent to be true. OWA is used in knowledge representation to codify the informal notion that in general no single agent or observer has complete knowledge, and therefore cannot make the closed world assumption. The OWA limits the kinds of inference and deductions an agent can make to those that follow from statements that are known to the agent to be true. OWA is useful when we represent knowledge within a system as we discover it, and where we cannot guarantee that we have discovered or will discover complete information. In the OWA, statements about knowledge that are not included in or inferred from the knowledge explicitly recorded in the system may be considered unknown, rather than wrong or false. Semantic Web languages such as OWL make the open world assumption.

Also, you can search on OWA on this blog.

[5] Paul Graham, 1993. “Programming Bottom-Up,” is a re-cap on Graham’s blog related to some of his earlier writings on programming in Lisp. By “bottom up” Graham means “. . . changing the language to suit the problem. . . . Language and program evolve together.”

[6] A really nice explanation of this approach is in James Donelan, 2013. “Code is Data, Data is Code,” on the Mulesoft blog, September 26, 2013.

[7] Rich Hickey is a good public speaker. Two of his seminal videos related to Clojure are “Are We There Yet?” (2009) and “Simple Made Easy” (2011).

[8] My Sweet Tools listing of knowledge management software is dominated by Java, with about half of all apps in that language.

[9] See the Clojure EDN; also Transit, EDN’s apparent successor.

[10] structEDN is a straightforward RDF serialization in EDN format.

[11] For transducers in Clojure version 1.70, see this Hickey talk, “Transducers” (2014).

Posted:December 9, 2014

New OSF version 3.1 and OSF Web Site

Structured Dynamics has just released version 3.1 of the Open Semantic Framework and announced the update of the OSF Web site. OSF is an integrated software stack using semantic technologies for knowledge management. It has a layered architecture that combines existing open source software with additional open source components. OSF is made available under the Apache 2 license.

Enhancements to OSF version 3.1 include:

Upgrade to use the Virtuoso Open Source version 7.1.0 triple store;
A new API for Clojure developers: clj-osf;
A sandbox where each of the nearly 30 OSF Web services may be invoked and tested;
A variety of bug fixes and minor functional improvements; and
An updated and improved OSF installer.

OSF version 3.1 is available for download from GitHub.

More details on the release can be found on Frédérick Giasson’s blog. Fred is OSF’s lead developer. William (Bill) Anderson also made key contributions to this release.

Posted:November 17, 2014

Knowledge-based Artificial Intelligence

What Goes Around, Comes Around, Only Now with Real Knowledge

Download PDF

A recent interview with a noted researcher, IEEE Fellow Michael I. Jordan, Pehong Chen Distinguished Professor at the University of California, Berkeley, provided a downplayed view of recent AI hype. Jordan was particularly critical of AI metaphors to real brain function and took the air out of the balloon about algorithm advances, pointing out that most current methods have roots that are decades long [1]. In fact, the roots of knowledge-based artificial intelligence (KBAI), the subject of this article, also extend back decades.

Yet the real point, only briefly touched upon by Jordan in his lauding of Amazon’s recommendation service, is that the dynamo in recent AI progress has come from advances in the knowledge and statistical bases driving these algorithms. The improved digital knowledge bases behind KBAI have been the power behind these advances.

Knowledge bases are finally being effectively combined with AI, a dynamic synergy that is only now being recognized, let alone leveraged. As this realization increases, many forms of useful information structure in the wild will begin to be mapped to these knowledge bases, which will further extend the benefits we are are now seeing from KBAI.

Knowledge-based artificial intelligence, or KBAI, is the use of large statistical or knowledge bases to inform feature selection for machine-based learning algorithms used in AI. The use of knowledge bases to train the features of AI algorithms improves the accuracy, recall and precision of these methods. This improvement leads to perceptibly better results to information queries, including pattern recognition. Further, in a virtuous circle, KBAI techniques can also be applied to identify additional possible facts within the knowledge bases themselves, improving them further still for KBAI purposes.

It is thus, in my view, the combination of KB + AI that has led to the notable AI breakthroughs of the past, say, decade. It is in this combination that we gain the seeds for sowing AI benefits in other areas, from tagging and disambiguation to the complete integration of text with conventional data systems. And, oh, by the way, the structure of all of these systems can be made inherently multi-lingual, meaning that context and interpretation across languages can be brought to our understanding of concepts.

Structured Dynamics is working to democratize a vision of KBAI that brings its benefits to any enterprise, using the same approaches that the behemoths of the industry have used to innovate knowledge-based artificial intelligence in the first place. How and where the benefits of such KBAI may apply is the subject of this article.

A Brief History of Knowledge-based Systems

Knowledge-based artificial intelligence is not a new idea. Its roots extend back perhaps to one of the first AI applications, Dendral. In 1965, nearly a half century ago, Edward Feigenbaum initiated Dendral, which became a ten-year effort to develop software to deduce the molecular structure of organic compounds using scientific instrument data. Dendral was the first expert system and used mass spectra or other experimental data together with a knowledge base of chemistry to produce a set of possible chemical structures. This set the outline for what came to be known as knowledge-based systems, which are one or more computer programs that reason and use knowledge bases to solve complex problems.

Indeed, it was in the area of expert systems that AI first came to the attention of most enterprises. According to Wikipedia,

Expert systems were designed to solve complex problems by reasoning about knowledge, represented primarily as if–then rules rather than through conventional procedural code. The first expert systems were created in the 1970s and then proliferated in the 1980s. Expert systems were among the first truly successful forms of AI software.

Expert systems spawned the idea of knowledge engineers, whose role was to interview and codify the logic of the chosen experts. But, expert systems proved to be expensive to build and difficult to maintain and tune. As the influence of expert systems waned, another branch emerged, that of knowledge-based engineering and their support for CAD– and CASE-type systems. Still, overall penetration to date of most knowledge-based systems can most charitably be described as disappointing.

The specific identification of “KBAI” was (to my knowledge) first made in a Carnegie-Mellon University report to DARPA in 1975 [2]. The source knowledge bases were broadly construed, including listings of hypotheses. The first known patent citing knowledge-based artificial intelligence is from 1992 [3]. Within the next ten years there were dedicated graduate-level course offerings on KBAI at many universities, including at least Indiana University, SUNY Buffalo, and Georgia Tech.

In 2007, Bossé et al. devoted a chapter to KBAI in their book on information fusion, but still, at that time, the references were more generic [4]. However, by 2013, the situation was changing fast, as this quote from Hovy et al. indicates [5]:

“Recently, however, this stalemate [the so-called ‘knowledge acquisition bottleneck] has begun to loosen up. The availability of large amounts of wide-coverage semantic knowledge, and the ability to extract it using powerful statistical methods, are enabling significant advances in applications requiring deep understanding capabilities, such as information retrieval and question-answering engines. Thus, although the well-known problems of high cost and scalability discouraged the development of knowledge-based approaches in the past, more recently the increasing availability of online collaborative knowledge resources has made it possible to tackle the knowledge acquisition bottleneck by means of massive collaboration within large online communities. This, in turn, has made these online resources one of the main driving forces behind the renaissance of knowledge-rich approaches in AI and NLP – namely, approaches that exploit large amounts of machine-readable knowledge to perform tasks requiring human intelligence.” (citations removed)

The waxing and waning of knowledge-based systems and its evolution over fifty years have led to a pretty well-defined space, even if not all component areas have achieved their commercial potential. Besides areas already mentioned, knowledge-based systems also include:

Knowledge models — formalisms for knowledge representation and reasoning, and
Reasoning systems — software that generates conclusions from available knowledge using logical techniques such as deduction and induction.

We can organize these subdomains as follows. Note particularly that the branch of KBAI (knowledge-based artificial intelligence) has two main denizens: recognized knowledge bases, such as Wikipedia, and statistical corpora. The former are familiar and evident around us; the latter are largely proprietary and not (generally) publicly accessible:

Some prominent knowledge bases and statistical corpora are identified below. Knowledge bases are coherently organized information with instance data for the concepts and relationships covered by the domain at hand, all accessible in some manner electronically. Knowledge bases can extend from the nearly global, such as Wikipedia, to very specific topic-oriented ones, such as restaurant reviews or animal guides. Some electronic knowledge bases are designed explicitly to support digital consumption, in which case they are fairly structured with defined schema and standard data formats and, increasingly, APIs. Others may be electronically accessible and highly relevant, but the data is not staged in a easily-consumable way, thereby requiring extraction and processing prior to use.

The use and role of statistical corpora is harder to discern. Statistical corpora are organized statistical relationships or rankings that facilitate the processing of (mostly) textual information. Uses can range from entity extraction to machine language translation. Extremely large sources, such as search engine indexes or massive crawls of the Web, are most often the sources for these knowledge sets. But, most are applied internally by those Web properties that control this big data.

The Web is the reason these sources — both statistical corpora and knowledge bases — have proliferated, so the major means of consuming them is via Web services with the information defined and linked to URIs.

My major thesis has been that it is the availability of electronically accessible knowledge bases, exemplified and stimulated by Wikipedia [6], that has been the telling factor in recent artificial intelligence advances. For example, there are at least 500 different papers that cite using Wikipedia for various natural language processing, artificial intelligence, or knowledge base purposes [7]. These papers began to stream into conferences about 2005 to 2006, and have not abated since. In turn, the various techniques innovated for extracting more and more structure and information from Wikipedia are being applied to other semi-structured knowledge bases, resulting in a true renaissance of knowledge-based processing for AI purposes. These knowledge bases are emerging as the information substrate under many recent computational advances.

Knowledge Bases in Relation to Overall Artificial Intelligence

A few months ago I pulled together a bit of an interaction diagram to show the relationships between major branches of artificial intelligence and structures arising from big data, knowledge bases, and other organizational schema for information:

What we are seeing is a system emerging whereby multiple portions of this diagram interact to produce innovations. Take, for example, Apple‘s Siri [8], or Google’s Google Now or the many similar systems that have emerged on smartphones. Spoken instructions are decoded to text, which is then parsed and evaluated for intent and meaning and then posed to a general knowledge base. The text results are then modulated back to speech with the answer in the smartphone’s speakers. The pattern recognition at the front and back end of this workflow has been made better though statistical datasets derived from phonemes and text. The text understanding is processed via natural language processing and semantic technologies [16], with the question understanding and answer formulation coming from one or more knowledge bases.

This remarkable chain of processing is now almost taken for granted, though its commercial use is less than five years old. For different purposes with different workflows we see effective question answering and diagnosis with systems like IBM’s Watson [9] and structured search results from Google’s Knowledge Graph [10]. Try posing some questions to Wolfram Alpha and then stand back and be impressed with the data visualization. Behind the scenes, pattern recognition from faces to general images or thumbprints further is eroding the distinction between man and machine. Google Translate now covers language translation between 60 human languages [11] — and pretty damn effectively, too. All major Web players are active in these areas, from Amazon’s recommendation system [12] to Facebook [42], Microsoft [13], Twitter [14] or Baidu [15].

Though not universal, most all recent AI advances leveraging knowledge bases have utilized Wikipedia in one way or another. Even Freebase, the core of Google’s Knowledge Graph, did not really blossom as a separate data crowdsourcing concern until its former owner, Metaweb, decided to bring Wikipedia into its system. Many other knowledge bases, as noted below, are also derivatives or enhancements to Wikipedia in one way or another.

I believe the reasons for Wikipedia’s influence have arisen from its nearly global scope, its mix of semi-structured data and text, its nearly 200 language versions, and its completely open and accessible nature. Regardless, it is also certainly true that techniques honed with Wikipedia are now being applied to a diversity of knowledge bases. We are also seeing an appreciation start to grow in how knowledge bases can enhance the overall AI effort.

Useful Statistical and Knowledge Sources

The diagram on knowledge-based systems above shows two kinds of databases contributing to KBAI: statistical corpora or databases and true knowledge bases. The statistical corpora tend to be hidden behind proprietary curtains, and also more limited in role and usefulness than general knowledge bases.

Statistical Corpora

The statistical corpora or databases tend to be of a very specific nature. While lists of text corpora and many other things may contribute to this category, the ones actually in commercial use tend to be quite focused in scope, very large, and designed for bespoke functionality. A good example, and one that has been contributed for public use, is the Web 1T 5-gram data set [17]. This data set, contributed by Google for public use in 2006, contains English word n-grams and their observed frequency counts. N-grams capture word tokens that often coincide with one another, from single words to phrases. The length of the n-grams ranges from unigrams (single words) to five-grams. The database was generated from approximately 1 trillion word tokens of text from publicly accessible Web pages.

Another example of statistical corpora are what is used in Google’s Translate capabilities [11]. According to Franz Josef Och, who was the lead manager at Google for its translation activities and an articulate spokesperson for statistical machine translation, a solid base for developing a usable language translation system for a new pair of languages should consist of a bilingual text corpus of more than a million words, plus two monolingual corpora each of more than a billion words. Statistical frequencies of word associations form the basis of these reference sets. Google originally seeded its first language translators with multiple language texts from the United Nations [18].

Such lookup or frequency tables in fact can shade into what may be termed a knowledge base as they gain more structure. NELL, for example (and see below), contains a relatively flat listing of assertions extracted from the Web for various entities; it goes beyond frequency counts or relatedness, but does not have the full structure of a general knowledge base like Wikipedia [19]. We thus can see that statistical corpora and knowledge bases in fact reside on a continuum of structure, with no bright line to demark the two categories.

Nonetheless, most statistical corpora will never be seen publicly. Building them requires large amounts of input information. And, once built, they can offer significant commercial value to their developers to drive various machine learning systems and for general lookup.

Knowledge Bases

There are literally hundreds of knowledge bases useful to artificial intelligence, most of a restricted domain nature. Listed below, partially informed by Suchanek and Weikum’s work [20], are some of the broadest ones available. Note that many leverage or are derivatives of or extensions to Wikipedia:

BabelNet — is a multilingual lexicalized semantic network and ontology automatically created by linking Wikipedia to WordNet [21]
Biperpedia — is an ontology with 1.6M (class, attribute) pairs and 67K distinct attribute names, a totally unique resource, but one that is not publicly available [22]
ConceptNet — is a semantic network with concepts as nodes and edges that are assertions of common sense about these concepts [23]
Cyc — is an artificial intelligence project that attempts to assemble a comprehensive ontology and knowledge base of everyday common sense knowledge, with the goal of enabling AI applications to perform human-like reasoning [24]
DBpedia — extracts structured content from the information created as part of the Wikipedia, principally from its infoboxes [25]
DeepDive — employs statistical learning and inference to combine diverse data resources and best-of-breed algorithms in order to construct knowledge bases from hundreds of millions of Web pages [26]
EntityCube — is a knowledge base built from the statistical extraction of structured entities, named entities, entity facts and relations from the Web [27]
Freebase — is a large collaborative knowledge base consisting of metadata composed mainly by its community members, but centered initially on Wikipedia; Freebase is a key input component to Google’s Knowledge Graph [28]
GeoNames — is a geographical database that contains over 10,000,000 geographical names corresponding to over 7,500,000 unique features [29]
ImageNet — is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images [30]
KnowItAll — is a variety of domain-independent systems that extract information from the Web in an autonomous, scalable manner [31]
Knowledge Vault (Google) — is a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories; it is not publicly available [32]
LEVAN (Learning About Anything) — is a fully-automated approach for learning extensive models for a wide range of variations (e.g., actions, interactions, attributes and beyond) from images for any concept, leveraging the vast resources of online books [43]
NELL (Never-Ending Language Learning system) — is a semantic machine learning system developed at Carnegie Mellon University that identifies a basic set of fundamental semantic relationships between a few hundred predefined categories of data, such as cities, companies, emotions and sports teams [19]
Probase — is a universal, probabilistic taxonomy that contains 2.7 million concepts harvested from a corpus of 1.68 billion web pages [33]
UMBEL — is an upper ontology of about 28,000 reference concepts and a vocabulary for aiding that ontology mapping, including expressions of likelihood relationships [34]
Wikidata — is a common source of certain data types (for example, birth dates) which can be used by Wikimedia projects such as Wikipedia [35]
WikiNet — is a multilingual extension of the facts found in the multiple language versions of Wikipedia [36]
WikiTaxonomy — is a large-scale taxonomy derived from the category relationships in Wikipedia [37]
Wolfram Alpha — is a computational knowledge engine made available by subscription as an online service that answers factual queries directly by computing the answer from externally sourced “curated data”
WordNet — is a lexical database for the English language that groups words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members [38]
YAGO — is extracted from Wikipedia (e.g., categories, redirects, infoboxes), WordNet (e.g., synsets, hyponymy), and GeoNames [39].

What Work is Being Done and the Future

It is instructive to inspect what kinds of work or knowledge these bases are contributing to the AI enterprise. The most important contribution, in my mind, is structure. This structure can relate to the subsumption (is-a) or part of (mereology) relationships between concepts. This form of structure contributes most to understanding the taxonomy or schema of a domain; that is, its scaffolding of concepts. This structure helps orient the instance data and other external structures, generally through some form of mapping.

The next rung of contribution from these knowledge bases is in the nature of the relations between concepts and their instances. These form the predicates or nature of the relationships between things. This kind of contribution is also closely related to the attributes of the concepts and the properties of the things that populate the structure. This kind of information tends to be the kind of characteristics that one sees in a data record: a specific thing and the values for the fields by which it is described.

Another contribution from knowledge bases comes from identity and disamgibuation. Identity works in that we can point to authoritative references (with associated Web identifiers) for all of the individual things and properties in our relevant domain. We can use these identities to decide the “canonical” form, which also gives us a common reference for referring to the same things across information sources or data sets. We also gain the means for capturing the various ways that anything can be described, that is the synonyms, jargon, slang, acronyms or insults that might be associated with something. That understanding helps us identify the core item at hand. When we extend these ideas to the concepts or types that populate our relevant domain, we can also begin to establish context and other relationships to individual things. When we encounter the person “John Smith” we can use co-occurring concepts to help distinguish John Smith the plumber from John Smith the politician or John Smith the policeman. As more definition and structure is added, our ability to discriminate and disambiguate goes up.

Some of this work resides in the interface between schema (concepts) and the instances (individuals) that populate that schema, what I elsewhere have described as the work between the T-Box and A-Box of knowledge bases [40]. In any case, with richer understandings of how we describe and discern things, we can now begin to do new work, not possible when these understandings were lacking. We can now, for example, do semantic search where we can relate multiple expressions for the same things or infer relationships or facets that either allow us to find more relevant items or better narrow our search interests.

With true knowledge bases and logical approaches for working with them and their structure, we can begin doing direct question answering. With more structure and more relationships, we can also do so in rather sophisticated ways, such as identifying items with multiple shared characteristics or within certain ranges or combinations of attributes.

Structured information and the means to query it now gives us a powerful, virtuous circle whereby our knowledge bases can drive the feature selection of AI algorithms, while those very same algorithms can help find still more features and structure in our knowledge bases. The interaction between AI and the KBs means we can add still further structure and refinement to the knowledge bases, which then makes them still better sources of features for informing the AI algorithms:

Once this threshold of feature generation is reached, we now have a virtuous dynamo for knowledge discovery and management. We can use our AI techniques to refine and improve our knowledge bases, which then makes it easier to improve our AI algorithms and incorporate still further external information. Effectively utilized KBAI thus becomes a generator of new information and structure.

This virtuous circle has not yet been widely applied beyond the early phases of, say, adding more facts to Wikipedia, as some of our examples above show. But these same basic techniques can be applied to the very infrastructural foundations of KBAI systems in such areas as data integration, mapping to new external structure and information, hypothesis testing, diagnostics and predictions, and the myriad of other uses to which AI has been hoped to contribute for decades. The virtuous circle between knowledge bases and AIs does not require us to make leaps and bounds improvements in our core AI algorithms. Rather, we need only stoke our existing AI engines with more structure and knowledge fuel in order to keep the engine chugging.

The vision of a growing nexus of KBAI should also prove that efficiencies and benefits also increase through a power function of the network effect, similar to what I earlier described in the Viking algorithm [41]. We know how we can extract further structure and benefit from Wikipedia. We can see how such a seed catalyst can also be the means for mapping and relating more specific domain knowledge bases and structure. The beauty of this vision is that we already can see the threshold benefits from a decade of KBAI development. Each new effort — and there are many — will only act to add to these benefits, with each new increment contributing more than the increment that came before. That sounds to me like productivity, and a true basis for wealth creation.

[1] See Lee Gomes, 2014. “Machine-Learning Maestro Michael Jordan on the Delusions of Big Data and Other Huge Engineering Efforts,” in IEEE Spectrum, 20 Oct 2014. See also http://www.kdnuggets.com/2014/11/berkeley-michael-jordan-big-data-transformative-not-delusion.html.

[2] Lee D. Erman and Victor R. Lesser, 1975. A Multi-level Organization for Problem Solving Using Many Diverse, Cooperating Sources of Knowledge, DARPA Report AD-AO12-919, Carnegie-Mellon University, Pittsburgh, Pa. 15213 , March, 1975, 24 pp. See http://www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA012916. The authors followed this up with a 1978 book, System engineering techniques for artificial intelligence systems. Academic Press, 1978.

[3] See “Method for converting a programmable logic controller hardware configuration and corresponding control program for use on a first programmable logic controller to use on a second programmable logic controller,” US Patent No 5142469 A, August 25, 1992.

[4] Éloi Bossé, Jean Roy, Steve Wark, Eds., 2007. Concepts, Models, and Tools for Information Fusion, Artech House Publishers, 2007-02-28, SKU-13/ISBN: 9781596939349. Chapter 11 is devoted to “Knowledge-Based and Artificial Intelligence Systems.”

[5] Eduard Hovy, Roberto Navigli, and Simone Paolo Ponzetto, 2013. “Collaboratively Built Semi-Structured Content and Artificial Intelligence: The Story So Far,” Artificial Intelligence 194 (2013): 2-27. See http://wwwusers.di.uniroma1.it/~navigli/pubs/AIJ_2012_Hovy_Navigli_Ponzetto.pdf.

[6] See M.K. Bergman, “SWEETpedia,” listing of Wikipedia research articles, on AI3:::Adaptive Information blog, January 25, 2010. The listing as of its last update included 246 articles.

[7] See the combined references between [6] and [20]; also, see Wikipedia’s own “Wikipedia in Academic Studies.”

[8] Wade Roush, 2010. “The Story of Siri, from Birth at SRI to Acquisition by Apple–Virtual Personal Assistants Go Mobile.” Xconomy. com 14 (2010).

[9] Special Issue on “This is Watson“, 2012. IBM Journal of Research and Development 56(3/4), May/June 2012. See also this intro to Watson video.

[10] Amit Singhal, 2012. ” Introducing the Knowledge Graph: Things, not Strings,” Google Blog, May 16, 2012

[11] Thomas Schulz, 2013. “Google’s Quest to End the Language Barrier,” September 13, 2013, Spiegel Online International. See http://www.spiegel.de/international/europe/google-translate-has-ambitious-goals-for-machine-translation-a-921646.html. Also see, Alon Halevy, Peter Norvig, and Fernando Pereira, 2009. “The Unreasonable Effectiveness of Data,” in IEEE Intelligent Systems, March/April 2009, pp 8-12.

[12] Greg Linden, Brent Smith, and Jeremy York, 2003. “Amazon. com Recommendations: Item-to-item Collaborative Filtering,” Internet Computing, IEEE 7, no. 1 (2003): 76-80.

[13] Athima Chansanchai, 2014. “Microsoft Research Shows off Advances in Artificial Intelligence with Project Adam,” on Next at Microsoft blog, July 14, 2014. Also, see [33].

[14] Jimmy Lin and Alek Kolcz, 2012. “Large-Scale Machine Learning at Twitter,” SIGMOD, May 20–24, 2012, Scottsdale, Arizona, US. See http://www.dcs.bbk.ac.uk/~DELL/teaching/cc/paper/sigmod12/p793-lin.pdf.

[15] Robert Hof, 2014. “Interview: Inside Google Brain Founder Andrew Ng’s Plans To Transform Baidu,” Forbes Online, August 28, 2014.

[16] Alok Prasad and Lee Feigenbaum, 2014, “How Semantic Web Tech Can Make Big Data Smarter,” in CMSwire, Oct 6, 2014.

[17] The Web 1T 5-gram data set is available from the Linguistic Data Corporation, University of Pennsylvania.

[18] Franz Josef Och, 2005. “Statistical Machine Translation: Foundations and Recent Advances,” The Tenth Machine Translation Summit, Phuket, Thailand, September 12, 2005.

[19] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr, and Tom M. Mitchell, 2010. “Toward an Architecture for Never-Ending Language Learning,” in AAAI, vol. 5, p. 3. 2010.

[20] Fabian M. Suchanek and Gerhard Weikum, 2014. “Knowledge Bases in the Age of Big Data Analytics,” Proceedings of the VLDB Endowment 7, no. 13 (2014).

[21] Roberto Navigli and Simone Paolo Ponzetto, 2012. “BabelNet: The Automatic Construction, Evaluation and Application of a Wide-coverage Multilingual Semantic Network,” Artificial Intelligence 193 (2012): 217-250.

[22] Rahul Gupta, Alon Halevy, Xuezhi Wang, Steven Euijong Whang, and Fei Wu, 2014. “Biperpedia: An Ontology for Search Applications,” . Proceedings of the VLDB Endowment 7(7), 2014.

[23] Robert Speer and Catherine Havasi, “Representing General Relational Knowledge in ConceptNet 5,” LREC 2012, pp. 3679-3686.

[24] Douglas B. Lenat and R. V. Guha, 1990. Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project, Addison-Wesley, 1990 ISBN 0-201-51752-3.

[25] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives 2007. “DBpedia: A Nucleus for a Web of Open Data,” presented at ISWC 2007, Springer Berlin Heidelberg, 2007.

[26] Feng Niu, Ce Zhang, Christopher Ré, and Jude W. Shavlik, 2012. “DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference,” in VLDS, pp. 25-28. 2012.

[27] Zaiqing Nie, J-R. Wen, and Wei-Ying Ma, 2012. ” Statistical Entity Extraction From the Web,” in Proceedings of the IEEE 100, no. 9 (2012): 2675-2687.

[28] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor, 2008. “Freebase: a Collaboratively Created Graph Database for Structuring Human Knowledge,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1247-1250. ACM, 2008.

[29] Mark Wick and Bernard Vatant, 2012. “The GeoNames Geographical Database,” available from the World Wide Web at http://geonames.org (2012).

[30] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, 2009. ” ImageNet: A Large-scale Hierarchical Image Database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248-255.

[31] Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates, 2005. “Unsupervised Named-entity Extraction from the Web: An Experimental Study,” Artificial Intelligence 165, no. 1 (2005): 91-134.

[32] Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun and Wei Zhang, 2014. “Knowledge Vault: a Web-scale Approach to Probabilistic Knowledge Fusion,” KDD 2014.

[33] Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu, 2012. “Probase: A Probabilistic Taxonomy for Text Understanding,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 481-492. ACM, 2012.

[34] M.K. Bergman, 2008. “UMBEL: A Subject Concepts Reference Layer for the Web”, Slideshare, July 2008.

[35] Denny Vrandečić and Markus Krötzsch, 2014. “Wikidata: A Free Collaborative Knowledgebase,” Communications of the ACM 57, no. 10 (2014): 78-85.

[36] Vivi Nastase and Michael Strube. 2013. “Transforming Wikipedia into a Large Scale Multilingual Concept Network,” Artificial Intelligence 194 (2013): 62-85.

[37] Simone Paolo Ponzetto, and Michael Strube, 2007. “Deriving a Large Scale Taxonomy from Wikipedia,” in AAAI, vol. 7, pp. 1440-1445. 2007.

[38] Christiane Fellbaum and George Miller, eds., 1998. WordNet: An Electronic Lexical Database, MIT Press, 1998.

[39] Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum, 2013. “YAGO2: a Spatially and Temporally Enhanced Knowledge Base from Wikipedia,” Artificial Intelligence 194 (2013): 28-61; and Fabien M. Suchanek, Gjergji Kasneci, Gerhard Weikum, 2007. “YAGO: a Core of Semantic Knowledge,” in Proceedings of the 16th international Conference on World Wide Web, pp. 697-706. ACM, 2007.

[40] See the T-Box and A-Box discussion with particular reference to artificial intelligence on M.K. Bergman, 2014. “Big Structure and Data Interoperability,” on AI3:::Adaptive Information blog, August 18, 2014.

[41] M.K. Bergman, 2014. “The Value of Connecting Things – Part II: The Viking Algorithm,” on AI3:::Adaptive Information blog, September 14, 2014.

[42] Cade Metz, 2013. “Facebook Taps ‘Deep Learninig’ Giant for New AI Lab,” on Wired.com, December 9, 2013.

[43] Santosh Divvala, Ali Farhadi and Carlos Guestrin, 2014. “Learning Everything about Anything: Webly-Supervised Visual Concept Learning,” in CVPR 2014.

Main Links

Search

Author: Mike Bergman