CWPK #64: Embeddings, Summarization and Entity Recognition

AI3:::Adaptive Information

Some Machine Learning Applied to NLP Problems

Cooking with KBpedia

In the last installment of Cooking with Python and KBpedia we collected roughly 45,000 articles from Wikipedia that match KBpedia reference concepts. We then did some pre-processing of the text using the gensim package to lower case the text, remove stop words, and identify bi-gram and tri-gram phrases. These types of functions and extractions are a strength of gensim, which should be part of your pre-processing arsenal.

It is now time for us to process the natural language of this text and to use it for creating word and document embedding models. For the latter, we will continue to use gensim. For the former, however, we will introduce and use a very powerful NLP package, spaCy. As noted earlier, spaCy has a clear function orientation to NLP and also has some powerful extension mechanisms.

Our plan of attack in this installment is to finish the word embeddings with gensim, and then move on to explore the spaCy package. We will not explore all aspects of this package, but will focus on text summarization and (named) entity recognition, using both model-based and rule-based approaches.

Word and Document Embedding

As we have noted in CWPK #61, there exist many pre-calculated word and document embedding models. However, because of the clean scope of KBpedia and our interest in manipulating embeddings for various learning models, we want to create our own embeddings.

There are many scopes and methods for creating embeddings. Embeddings, you may recall, are a vector representation of information in a reduced dimensional space. Embeddings are a proven way to represent sparse matrix information like language (meaning many dimensions of words and phrases matched to one another) in a more efficient coding format usable by a computer. Embedding scopes may range from words, phrases, sentences, paragraphs, sections of documents, or documents, as well as senses, topics, sentiments, categories or other relations that might cut across a given corpus. Methods may range from sequences to counts to simple statistics or all the way up to deep learning with neural nets. Of late, a combination method of converging encoders and decoders called ‘transformers’ has been the rage, with BERT and GPT two prominent instantiations.

Because we have already been exercising the gensim package, we decided to proceed with it for our own word embedding and document embedding models. Following the gensim documentation, we first set up a word2vec model:

NOTE: Due to GitHub’s file size limits, the various text file inputs referenced in this installment may be found on the KBpedia site as zipped files (for example, https://kbpedia.org/cwpk-text/wikipedia-trigram.zip for the input file mentioned next). Due to their very large sizes, you will need to create locally all of the models mentioned in this installment (with *.vec or *.model extensions).

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from smart_open import smart_open

# Tri-gram corpus created in the last installment, plus the output locations
in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-trigram.txt'
out_model = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-w2v.model'
out_vec = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-w2v.vec'

# Stream the corpus line by line rather than loading it all into memory
corpus = smart_open(in_f, 'r', encoding='utf-8')
walks = LineSentence(corpus)
# Note: gensim 4.x renames 'size' to 'vector_size' and 'iter' to 'epochs'
model = Word2Vec(walks, size=300, window=5, min_count=3, hs=1, sg=1, workers=5, negative=5, iter=10)
# Save the full model, then the word vectors alone in plain-text format
model.save(out_model)
model.wv.save_word2vec_format(out_vec, binary=False)

This works pretty slick and only requires a few lines of code. The model run takes about 2:15 hrs on my laptop; to process the entire Wikipedia English corpus reportedly takes about 9 hrs. Note we begin the training process with our tri-gram input corpus created in the last installment.

A few of the model parameters deserve some commentary. The size parameter is one of the most important. It sets the number of dimensions over which you want to capture a correspondence statistic, which is the actual dimension reduction at the core of the entire exercise. Remember, a collocation matrix for natural language is a very sparse one. Given how I have set up the Wikipedia pages from KBpedia so far, with stoplists, trigrams and such, our current corpus has 1.3 million tokens, which is really sparse when you extend the second dimension by this same amount. Setting the size parameter beyond a few hundred dimensions greatly increases the computation time in training and (perhaps paradoxically) can lower accuracy. The window parameter is the word count to either side of the current token for which adjacency is calculated, so that a window of five actually encompasses a string of eleven tokens: the subject token and five to either side. min_count is the minimum number of occurrences required for a given token (including phrases or n-grams as individual tokens) to be retained. sg=1 invokes the ‘skip-gram’ method, as opposed to the other commonly used method, ‘cbow’ (continuous bag of words).

As with any central Python function, you should study this one to learn more about the other settable parameters. What is most important, however, is to learn about these settings, test those you deem critical, and realize that fine-tuning such parameters is likely the key to successful results with your machine learning efforts. It is an open secret that success with machine learning depends on setting up and then tweaking the parameters that go into any particular method.
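To make the window arithmetic concrete, here is a minimal pure-Python sketch of how skip-gram pairs a center token with its neighbors. The function name skipgram_pairs and the toy token list are hypothetical and are not part of gensim; real implementations add sampling and weighting details that are omitted here.

```python
def skipgram_pairs(tokens, window=5):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # clip the window at the start of the text
        hi = min(len(tokens), i + window + 1)  # clip the window at the end of the text
        for j in range(lo, hi):
            if j != i:                         # a token is not its own context
                yield center, tokens[j]


# Toy, made-up token sequence (phrases already joined, as in the tri-gram corpus)
tokens = ["knowledge_graph", "embeddings", "capture", "context",
          "around", "each", "token"]

pairs = list(skipgram_pairs(tokens, window=2))
contexts_of_capture = [ctx for center, ctx in pairs if center == "capture"]
print(contexts_of_capture)
```

With window=2 each interior token sees at most four neighbors; with window=5 the span grows to the eleven tokens described above (the center plus five on each side).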

We can take this same code block above and set up the doc2vec method:

from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument
from smart_open import smart_open

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-trigram.txt'
out_model = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-d2v.model'
out_vec = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-d2v.vec'
corpus = smart_open(in_f, 'r', encoding='utf-8')

# Each line of the input becomes one tagged document (tag = line number)
documents = TaggedLineDocument(corpus)
training_src = list(documents)
print(training_src[:1])
# Build the vocabulary first, then train for the configured number of epochs
model = Doc2Vec(vector_size=300, min_count=15, epochs=30)
model.build_vocab(training_src)
model.train(training_src, total_examples=model.corpus_count, epochs=model.epochs)
model.save(out_model)
model.save_word2vec_format(out_vec, binary=False)
# Infer a vector for a new, unseen list of tokens
print(model.infer_vector(['violent', 'means', 'to', 'destroy', 'the','organization']))

The doc2vec method has a similar setup. The main difference is that the vector calculation is now based on each full line of the input, treated as a tagged document, rather than on individual words. We also increase the min_count parameter, from 3 to 15. We’ll see the results of this training in the next section.
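As a rough illustration of what the tagging step does, the sketch below mimics it in plain Python: each input line is split on whitespace and tagged with its line number. TaggedDoc and tagged_lines are hypothetical stand-ins written for this example, not gensim objects, and the two sample lines are made up.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for a tagged document: a token list plus a list of tags
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def tagged_lines(stream):
    """Turn each line of a text stream into a document tagged by its line number."""
    for line_no, line in enumerate(stream):
        yield TaggedDoc(words=line.split(), tags=[line_no])


# Two made-up corpus lines standing in for the tri-gram input file
sample = io.StringIO(
    "machine_learning uses training data\n"
    "knowledge_graph links concepts\n"
)
docs = list(tagged_lines(sample))
print(docs[1])
```

The tag is what lets doc2vec learn one vector per document alongside the word vectors, which is why the input file needs to hold one document per line.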

gensim also has methods to train FastText. Please consult the documentation for this method as well as to understand better the various training parameters.

Similarity Analysis

A good way to see the effect of embedding vectors is through similarity analysis. The calculations are based on the adjacency of vectors in the embedding space.

Our first two examples use word2vec for our newly created KBpedia-Wikipedia corpus. The first example calculates the relatedness between two entered terms:

from gensim.models import Word2Vec

path = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-w2v.model'
model = Word2Vec.load(path)
model.wv.similarity('man', 'woman')

0.6882485
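The similarity score that gensim reports here is the cosine of the angle between the two word vectors. A minimal sketch of that calculation, using made-up three-dimensional vectors rather than the 300-dimension ones from our model, looks like this:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors for illustration only; they are not taken from the trained model
vec_a = [0.9, 0.1, 0.3]
vec_b = [0.8, 0.3, 0.4]
print(cosine_similarity(vec_a, vec_b))
```

Scores range from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), so the magnitude of the vectors does not matter, only their orientation.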

The second example retrieves the most closely related terms given an input term or phrase (in this case, machine_learning):

from gensim.models import Word2Vec

path = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-w2v.model'
model = Word2Vec.load(path)
w1 = ['machine_learning']
model.wv.most_similar(positive=w1, topn=6)

[('artificial_intelligence', 0.7738381624221802),
 ('data_mining', 0.7659739255905151),
 ('algorithms', 0.7430499792098999),
 ('natural_language_processing', 0.7429415583610535),
 ('computational', 0.7116029262542725),
 ('computational_linguistics', 0.6903550028800964)]

gensim offers a number of settings, including whether a model can be queried without further training (effectively a ‘read only’ option), the number of results returned, and so on.

We can also compare the doc2vec approach to word2vec:
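One way to think about the ‘read only’ case is that the saved *.vec file is just plain text: a header line with the vocabulary count and number of dimensions, followed by one token and its values per line. gensim's KeyedVectors.load_word2vec_format reads this format directly; the pure-Python sketch below, with a hypothetical read_vec helper and made-up sample values, only shows what is inside such a file.

```python
import io


def read_vec(stream):
    """Parse word2vec text format: 'count dims' header, then 'token v1 ... vN' rows."""
    count, dims = (int(x) for x in stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dims + 1:   # skip blank or malformed rows
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dims, vectors


# Made-up miniature file with two tokens and three dimensions
sample = io.StringIO(
    "2 3\n"
    "machine_learning 0.1 0.2 0.3\n"
    "data_mining 0.2 0.1 0.4\n"
)
count, dims, vectors = read_vec(sample)
print(count, dims, sorted(vectors))
```

Vectors loaded this way can be looked up and compared, but they carry none of the training state, so they cannot be trained further.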

from gensim.models import Doc2Vec

path = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-d2v.model'
model = Doc2Vec.load(path)
w1 = ['machine_learning']
model.wv.most_similar(positive=w1, topn=6)

[('quantitative_methods', 0.4042782783508301),
 ('artificial_intelligence', 0.3983246088027954),
 ('evolutionary_computation', 0.39264559745788574),
 ('information_retrieval', 0.38776731491088867),
 ('natural_language_processing', 0.38531848788261414),
 ('deep_learning', 0.37803560495376587)]

Note we get a similar listing of results, though the similarity scores in this doc2vec case are much lower.

These efforts conclude our embedding tests for the moment. We will be adding additional embeddings based on knowledge graph structure and annotations in CWPK #67.

Text Summarization

Let’s now switch gears and introduce our basic natural language processing package, spaCy. Out-of-the-box spaCy includes the standard NLP utilities of part-of-speech tagging, lemmatization, dependency parsing, named entity recognition, entity linking, tokenization, merging and splitting, and sentence segmentation. Various vector embedding or rule-based processing methods may be layered on top of these utilities, and they may be combined into flexible NLP processing pipelines.

We are not doing anything special here, but I wanted to include text summarization because it nicely combines many functions and utilities provided by the spaCy package. Here is an example using an existing spaCy model, en_core_web_sm, which provides pre-trained POS and NER components based on the English OntoNotes 5 corpus. (You will need to separately download and install these existing models.) The text to be evaluated was copied and pasted from CWPK #61:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
import en_core_web_sm

extra_words = list(STOP_WORDS) + list(punctuation) + ['\n']
nlp = en_core_web_sm.load()
doc = """With this installment of the Cooking with Python and KBpedia series we move into Part VI of seven parts, a part with the bulk of the analytical and machine learning (that is, "data science") discussion, and the last part where significant code is developed and documented. At the conclusion of this part, which itself has 11 installments, we have four installments to wrap up the series and provide a consistent roadmap to the entire project.
      Knowledge graphs are unique information artifacts, and KBpedia is further unique in terms of its consistent and logical construction as well as its incorporation of significant text content via Wikipedia pages. These characteristics provide unique value for KBpedia, but it is also a combination not duplicated anywhere else in the data science ecosystem. One of the objectives, therefore, of this part of our CWPK series is the creation of some baseline knowledge representations useful to data science aims that capture these unique characteristics.
      KBpedia's (or any knowledge graph constructed in a similar manner) combination of characteristics make it a powerful resource in three areas of data science and machine learning. First, the nearly universal scope and degree of topic coverage with about 56,000 concepts, logically organized into typologies with a high degree of disjointedness, means that accurate 'slices' or training sets may be extracted from KBpedia nearly instantaneously. Labeled training sets are one of the most time consuming and expensive activities in doing supervised machine learning. We can extract these nearly for free from KBpedia. Further, with its links to tens of millions of entities in its mapped knowledge bases such as Wikidata, literally tens of thousands of conceptual entities in KBpedia can be the retrieval points to nucleate training sets for fine-grained entity recognition.
      Second, 80% of KBpedia's concepts are mapped to Wikipedia articles. While many Wikipedia-based word embedding models exist, the ones in KBpedia are logically categorized and have rough equivalence in terms of scope and prominence, hopefully providing cleaner topic 'signals'. To probe these assertions, we will create a unique KBpedia-based word embedding corpus that also leverages labels for items of structural importance, such as typology membership. We will use this corpus in many of our tests and as a general focus in our training sets.
      And, third, perhaps the most important area, knowledge graphs offer unique structures and challenges for machine learning, especially innovations in geometric, heterogeneous methods for deep learning. The first generation of deep machine learning was designed for grid-patterned data and matrices through approaches such as deep neural networks, convolutional neural networks (CNN ), or recurrent neural networks (RNN). The 'deep' appelation comes from having multiple calculated, intermediate layers of transformations between the grid inputs and outputs for the model. Graphs, on the other hand, are heterogeneous between nodes and edges. They may be directed (subsumptive) in nature. And, for knowledge graphs, they have much labeling and annotation, including varying degrees of attribute completeness. Language embedding, itself often a product of deep learning, enables the efficient incorporation of text. It is only in the past five years that concerted attention has been devoted to better capturing this feature richness for knowledge graphs.
      The eleven installments in this part will look in more depth at networks and graphs, focus on how to create training sets and embeddings for the learners, discuss some natural language packages and uses, and then look in depth at 'standard' machine learners and deep learners. We will install the first generation of deep graph learners and then explore some on the cutting edge. We will test many use cases, but will also try to invoke classifiers across this spectrum so that we can draw some general conclusions.
      The material below introduces and tees up these topics. We describe leading Python packages for data science, and how we have architected our own approach, We have picked a particular Python machine learning framework, PyTorch, to which we will then tie four different NLP and deep learning libraries. We devote two installments each to these four libraries. The use cases we document across these installments are in addition to the existing ones we have in Clojure posted online.
      So, we think we have an interesting suite of benefits to cover in this part, some arising from being based on KBpedia and some arising from the nature of knowledge graphs. On the other hand, due to the relative immaturity of the field, we are still actively learning and innovating around the juncture of AI and knowledge graphs. Thus, one of the reasons we emphasize Python 'ecosystems' and 'frameworks' in this part is to be better prepared to incorporate those innovations and learnings to come.
      One of the first prototypes of machine learning comes from the statistician Ronald Fisher in the 1930s regarding how to classify Iris species based on the attributes of their flowers. It was a multivariate data example using the method we today call linear discriminant analysis. This classic example is still taught. But many dozens of new algorithms and combined approaches have joined the machine learning field since then.
 Figure 1 below is one way to characterize the field, with ML standing for machine learning and DL for deep learning, with this one oriented to sub-fields in which some Python package already exists:  There are many possible diagrams that one might prepare to show the machine learning landscape, including ones with a larger emphasis on text and knowledge graphs. Most all schematics of the field show a basic split between supervised learning and unsupervised learning, (sometimes with reinforcement learning as another main branch), with the main difference being that supervised approaches iterate to achieve statistical fit with pre-determined labels, whereas unsupervised is unlabeled. Accurate labeling can be costly and time consuming. Note that the idea of 'classification' is a supervised one, 'clustering' a notion of unsupervised.
 will include a 'standard' machine learning library in our proposed toolkit, the selection of which I discuss below. However, the most evaluation time I spent in researching these installments was directed to the idea of knowledge representation and embeddings applicable to graphs. Graphs pose a number of differences and challenges to standard machine learning. They have only been a recent (5 yr) focus in machine learning, which is also rapidly changing over time.
 All machine learners need to operate on their feature spaces in numerical representations. Text is a tricky form because language is difficult and complex, and how to represent the tokens within our language usable by a computer needs to consider, what? Parts-of-speech, the word itself, sentence construction, semantic meaning, context, adjacency, entity recognition or characterization? These may all figure into how one might represent text. Machine learning has brought us unsupervised methods for converting words to sentences to documents and, now, graphs, to a reduced, numeric representation known as "embeddings." The embedding method may capture one or more of these textual or structural aspects.
 Much of the first interest in machine learning based on graphs was driven by these interests in embeddings for language text. Standard machine classifiers with deep learning using neural networks have given us word2vec, and more recently BERT and its dozens of variants have reinforced the usefulness of deep learning to create pre-trained text representations.
 Indeed, embeddings do figure prominently in knowledge graph representation, but only as one among many useful features. Knowledge graphs with hierarchical (subsumption) relationships, as might be found in any taxonomy, become directed. Knowledge graphs are asymmetrical, and often multi-typed and sometimes multi-modal. There is heterogeneity among nodes and links or edges. Not all knowledge graphs are created equal and some of these aspects may not apply. Whether there is an accompanying richness of text description that accompanies the node or edges is another wrinkle. None of the early CNN or RNN or simple neural net approaches match well with these structures.
 The general category that appears to have emerged for this scope is geometric deep learning, which applies to all forms of graphs and manifolds. There are other nuances in this area, for example whether a static representation is the basis for analysis or one that is dynamic, essentially allowing learning parameters to be changed as the deep learning progresses through its layers. But GDL has the theoretical potential to address and incorporate all of the wrinkles associated with heterogeneous knowledge graphs.
      So, this discussion helps define our desired scope. We want to be able to embrace Python packages that range from simple statistics to simple machine learning, throwing in natural language processing and creating embedding representations, that can then range all the way through deep learning to the cutting-edge aspects of geometric or graph deep learning.
      This background provides the necessary context for our investigations of Python packages, frameworks, or libraries that may fulfill the data science objectives of this part. Our new components often build upon and need to play nice with some of the other requisite packages introduced in earlier installments, including pandas ([CWPK #55]), NetworkX ([CWPK #56]), and PyViz ([CWPK #55]). NumPy has been installed, but not discussed.
      It is not fair to say that natural language processing has become a 'commodity' in the data science space, but it is also true there is a wealth of capable, complete packages within Python. There are standard NLP requirements like text cleaning, tokenization, parts-of-speech identification, parsing, lemmatization, phrase identification, and so forth. We want these general text processing capabilities since they are often building blocks and sometimes needed in their own right. We also would like to add to this baseline such considerations as interoperability, creating embeddings, or other special functions.
 Another key area is language embedding. Language embeddings are means to translate language into a numerical representation for use in downstream analysis, with great variety in what aspects of language are captured and how to craft them. The simplest and still widely-used representation is tf-idf (term frequency–inverse document frequency) statistical measure. A common variant after that was the vector space model. We also have latent (unsupervised) models such as LDA. A more easily calculated option is explicit semantic analysis (ESA). At the word level, two of the prominent options are word2vec and gloVe, which is used directly in spaCy. These have arisen from deep learning models. We also have similar approaches to represent topics (topicvec), sentences (sentence2vec), categories and paragraphs (Category2Vec), documents (doc2vec), node2vec or entire languages (BERT and variants and GPT-3 and related methods). In all of these cases, the embedding consists of reducing the dimensionality of the input text, which is then represented in numeric form.
 There are internal methods for creating embeddings in multiple machine learning libraries. Some packages are more dedicated, such as fastText, which is a library for learning of word embeddings and text classification created by Facebook's AI Research (FAIR) lab. Another option is TextBrewer, which is an open-source knowledge distillation toolkit based on PyTorch and which uses (among others) BERT to provide text classification, reading comprehension, NER or sequence labeling.
 Closely related to how we represent text are corpora and datasets that may be used either for reference or training purposes. These need to be assembled and tested as well as software packages. The availability of corpora to different packages is a useful evaluation criterion. But, the picking of specific corpora depends on the ultimate Python packages used and the task at hand. We will return to this topic in CWPK #63.
 Of course, nearly all of the Python packages mentioned in this Part VI have some relation to machine learning in one form or another. I call out this category separately because, like for NLP, I think it makes sense to have a general machine learning library not devoted to deep learning but providing a repository of classic learning methods.
 There really is no general option that compares with scikit-learn. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN data clustering, and is designed to interoperate with NumPy and SciPy. The project is extremely active with good documentation and examples.
 Deep learning is characterized by many options, methods and philosophies, all in a fast-changing area of knowledge. New methods need to be compared on numerous grounds from feature and training set selection to testing, parameter tuning, and performance comparisons. These realities have put a premium on libraries and frameworks that wrap methods in repeatable interfaces and provide abstract functions for setting up and managing various deep (and other) learning algorithms.
 The space of deep learning thus embraces many individual methods and forms, often expressed through a governing ecosystem of other tools and packages. These demands lead to a confusing and overlapping and non-intersecting space of Python options that are hard to describe and comparatively evaluate. Here are some of the libraries and packages that fit within the deep and machine learning space, including abstraction frameworks:
 Theano is a Python library and optimizing compiler for manipulating and evaluating mathematical expressions, especially matrix-valued ones; it is tightly integrated with NumPy, and uses it at the lowest level.
 Keras is increasingly aligning with TensorFlow and some, like Chainer and CNTK, are being deprecated in favor of the two leading gorillas, PyTorch and TensorFlow. One approach to improve interoperability is the Open Neural Network Exchange (ONNX) with the repository available on GitHub. There are existing converters to ONNX for Keras, TensorFlow, PyTorch and scikit-learn.
 A key development from deep learning of the past three years has been the usefulness of Transformers, a technique that marries encoders and decoders converging on the same representation. The technique is particularly helpful to sequential data and NLP, with state-of-the-art performance to date for: next-sentence prediction, question answering, reading comprehension, sentiment analysis, and paraphrasing.
      Both BERT and GPT are pre-trained products that utilize this method. Both TensorFlow and PyTorch contain Transformer capabilities.
 As noted, most of my research for this Part VI has resided in the area of a subset of deep graph learning applicable to knowledge graphs. The leading deep learning libraries do not, in general, provide support for this area of representational learning, sometimes called knowledge representation learning (KRL) or knowledge graph embedding (KGE). Within this rather limited scope, most options also seem oriented to link prediction and knowledge graph completion (KGC), rather than the heterogeneous aspects with text and OWL2 orientation characteristic of KBpedia.
 We thus see this rough hierachy: machine learning → deep learning → geometric deep learning → graph (R) learning → KG learning
 Lastly, more broadly, there is the recently announced KGTK, which is a generalized toolkit with broader capabilities for large scale knowledge graph manipulation and analysis. KGTK also puts forward a standard KG file format, among other tools.
 A Generalized Python Data Science Architecture
 With what we already have in hand, plus the libraries and packages described above, we have a pretty good inventory of candidates to choose from in proceeding with our next installments. Like our investigations around graphics and visualization (see [CWPK #55]), the broad areas of data science, machine learning, and deep learning have been evolving to one of comprehensive ecosystems. Figure 2 below presents a representation of the Python components that make sense for the machine learning and application environment. As noted, our local Windows machines lack separate GPUs (graphical processing units), so the hardware is based on a standard CPU (which has an integrated GPU that can not be separately targeted). We have already introduced and discusses some of the major Python packages and libraries, including pandas, NetworkX, and PyViz. Here is that representative data science architecture:
 Representative Python Components
 The defining architectural question for this Part VI is what general deep and machine learning framework we want (if any). I think using a framework makes sense over scripting together individual packages, though for some tests that still might be necessary. If I was to adopt a framework, I would also want one that has a broad set of tools in its ecosystem and common and simpler ways to define projects and manage the overall pipelines from data to results. As noted, the two major candidates appear to be TensorFlow and PyTorch.
 TensorFlow has been around the longest, has, today, the strongest ecosystem, and reportedly is better for commercial deployments. Google, besides being the sponsor, uses TensorFlow in most of its ML projects and has shown a commitment to compete with the upstart PyTorch by significantly re-designing and enhancing TensorFlow 2.0.
 On the other hand, I very much like the more 'application' orientation of PyTorch. Innovation has been fast and market share has been rising.
 Though some of the intriguing packages for TensorFlow are not apparently available for PyTorch, including Graph Nets, Keras, Plaid ML, and StellarGraph, PyTorch does have these other packages not yet mentioned that look potentially valuable down the road:
      One disappointment is that neither of these two leading packages directly ingest [RDFLib] graph files, though with PyTorch and DGL you can import or export a NetworkX graph directly. pandas is also a possible data exchange format.
      Consideration of all of these points has led us to select PyTorch as the initial data science framework. It is good to know, however, that a fairly comparable alternative also exists with TensorFlow and Keras.
      Finally, with respect to Figure 2 above, we have no plans at present to use the Dask package for parallelizing analytic calculations.
      With the PyTorch decision made, at least for the present, we are now clear to deal with specific additional packages and libraries. I highlight four of these in this section. Each of these four is the focus of two separate installments as we work to complete this Part VI. One of these four is in natural language processing (spaCy), one in general machine learning (scikit-learn), and two in deep learning with an emphasis on graphs (DGL and DGL-KE, and PyG). These choices again tend to reinforce the idea of evaluating whole ecosystems, as opposed to single packages. Note, of course, that more specifics on these four packages will be presented in the forthcoming installments.
      I find spaCy to be very impressive, with many potentially useful extensions including sense2vec, spacy-stanza, spacy-wordnet, torchtext, and GenSim.
      The major competitor is NLTK. The reputation of this package is stellar and it has proven itself for decades. It is a more disaggregate approach by scholars and researchers to enable users to build complex NLP functionality. It is therefore harder to use and configure, and is also less performant. The real differentiator, however, is the more object or application orientation of spaCy.
      Though NLTK appears to have good NLP tools for processing data pipelines using text, most of these functions appear to be in spaCy and there are also the Flair and PyTorch-NLP packages available in the PyTorch environment if needed. GenSim looks to be a very useful enhancement to the environment because of the advanced text evaluation modes it offers, including sentiment. Not all of these will be tested during this CWPK series, but it will be good to have these general capabilities resident in cowpoke.
      We earlier signaled our intent to embrace scikit-learn, principally to provide basic machine learning support. scikit-learn provides a unified API to these basic tasks, including crafting pipelines and meta-functions to integrate the data flow. scikit-learn works on any numeric data stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays such as pandas DataFrame are also acceptable.
      Some of the general ML methods, and there are about 40 supervised ones in the package, may be useful and applicable to specific circumstances include: dimensionality reduction, model testing, preprocessing, scoring methods, and principal component analysis (PCA).
      A real test of this package will be ease of creating (and then being able to easily modify) data processing and analysis pipelines. Another test will be ingesting, using, and exporting data formats useful to the KBpedia knowledge graph. We know that scikit-learn doesn't talk directly to NetworkX, though there may be recipes for the transfer; graphs are represented in scikit-learn as connectivity matrices. pandas can interface via common formats including CSV, Excel, JSON and SQL, or, with some processing, DataFrames. scikit-learn supports data formats from NumPy and SciPy, and it supports a datasets.load_files format that may be suitable for transferring many and longer text fields. One option that is intriguing is how to leverage the CSV flat-file orientation of our KG build and extract routines in cowpoke for data transfer and transformation.
      I also want to keep an eye on the possible use of skorch to better integrate with the overall PyTorch environment, or to add perhaps needed and missing functionality or ease of development. There is much to explore with these various packages and environments.
      For our basic, 'vanilla', deep graph analysis package we have chosen the eponymous Deep Graph Library for basic graph neural network operations, which may run on CPU or GPU machines or clusters. The better interface relevant to KBpedia is through DGL-KE, a high performance, reportedly easy-to-use, and scalable package for learning large-scale knowledge graph embeddings that extends DGL. DGL-KE also comes configured with the popular models of TransE, TransR, RESCAL, DistMult, ComplEx, and RotatE.
      PyTorch Geometric is closely tied to PyTorch, and most impressively has uniform wrappers to about 40 state-of-art graph neural net methods. The idea of "message passing" in the approach means that heterogeneous features such as structure and text may be combined and made dynamic in their interactions with one another. Many of these intrigued me on paper, and now it will be exciting to test and have the capability to inspect these new methods as they arise. DeepSNAP may provide a direct bridge between NetworkX and PyTorch Geometric.
      During the research on this Part VI I encountered a few leads that are either not ready for prime time or are off scope to the present CWPK series. A potentially powerful, but experimental approach that makes sense is to use SPARQL as the request-and-retrieval mechanism against the graph to feed the machine learners. RDFFrames provides an imperative Python API that gets internally translated to SPARQL, and it is integrated with the PyData machine learning software stack; see GitHub. Some methods above also use SPARQL. One of the benefits of a SPARQL approach, besides its sheer query and inferencing power, is the ability to keep the knowledge graph intact without data transform pipelines. It is quite available to serve up results in very flexible formats. The relative immaturity of the approach and performance considerations may be difficult challenges to overcome.
      I earlier mentioned KarateClub, a Python framework combining about 40 state-of-the-art unsupervised graph mining algorithms in the areas of node embedding, whole-graph embedding, and community detection. It builds on the packages of NetworkX, PyGSP, Gensim, NumPy, and SciPy. Unfortunately, the package does not support directed graphs, though plans to do so have been stated. This project is worth monitoring.
      A third intriguing area involves the use of quaternions based on Clifford algebras in their machine learning codes. Charles Peirce, the intellectual guide for the design of KBpedia, was a mathematician of some renown in his own time, and studied and applauded William Kingdon Clifford and his emerging algebra as a contemporary in the 1870s, shortly before Clifford's untimely death. Peirce scholars have often pointed to this influence in the development of Peirce's own algebras. I am personally interested in probing this approach to learn a bit more of Peirce's manifest interests.
      """
docx = nlp(doc)
all_words = [word.text for word in docx]
Freq_word = {}
for w in all_words:
    w1 = w.lower()
    # Count only alphabetic tokens that are not in the extra (stop) word list
    if w1 not in extra_words and w1.isalpha():
        if w1 in Freq_word:
            Freq_word[w1] += 1
        else:
            Freq_word[w1] = 1

These spaCy models have to be separately loaded. In our case, we first needed to download and install the model package:

Then, we needed to import the model as shown. To use a different model one would need to import and load that model separately.

Let’s go ahead and run this routine to get the word frequencies in the input text:

Freq_word

We can also get an estimate of the overall topic for our input text:

val=sorted(Freq_word.values())
max_freq = val[-3:]
print('Topic of document given :-')
for word,freq in Freq_word.items():
    if freq in max_freq:
        print(word ,end = ' ')
    else:
        continue

Topic of document given :-
machine learning deep
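
As a minimal sketch of an alternative, assuming only a plain list of tokens (the `tokens` list and the `top_words` helper are illustrative and not part of the routine above), the topic words can also be picked with `collections.Counter`. Because the routine above tests `freq in max_freq`, words tied on frequency can cause more than three words to be printed; `most_common` returns exactly the number requested, ranked by count:

```python
from collections import Counter

def top_words(tokens, stop_words=frozenset(), k=3):
    """Return the k most frequent alphabetic, non-stop-word tokens."""
    counts = Counter(
        t.lower() for t in tokens
        if t.isalpha() and t.lower() not in stop_words
    )
    return [word for word, _ in counts.most_common(k)]

# Hypothetical token list standing in for the spaCy tokens
tokens = ['Machine', 'learning', 'and', 'deep', 'learning', 'use',
          'machine', 'learning', 'models', 'and', 'deep', 'nets', 'machine']
print(top_words(tokens, stop_words={'and', 'use'}))
```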

We can now begin our text summarization by scoring the relevance of each sentence in the input text:

for word in Freq_word.keys():
    Freq_word[word] = (Freq_word[word] / max_freq[-1])
sent_strength = {}
for sent in docx.sents:
    sen_len = len(sent)
    if sen_len >= 8:
        for word in sent :
            if word.text.lower() in Freq_word.keys():
                if sent in sent_strength.keys():
                    sent_strength[sent] += Freq_word[word.text.lower()]
                else:
                    sent_strength[sent] = Freq_word[word.text.lower()]
            else: 
                continue
    else:
        sent_strength[sent] = 0
top_sentences = (sorted(sent_strength.values())[::-1])
top5percent_sentence = int(0.05 * len(top_sentences))
top_sent = top_sentences[:top5percent_sentence]

Then, based on those scores, we generate a summary from the top 5 percent of sentences in the input text:

summary = []
for sent,strength in sent_strength.items():
    if strength in top_sent:
        summary.append(sent)
    else:
        continue
for i in summary:
    print(i, end='')

And, third, perhaps the most important area, knowledge graphs offer unique structures and challenges for machine learning, especially innovations in geometric, heterogeneous methods for deep learning.We describe leading Python packages for data science, and how we have architected our own approach, We have picked a particular Python machine learning framework, PyTorch, to which we will then tie four different NLP and deep learning libraries.Standard machine classifiers with deep learning using neural networks have given us word2vec, and more recently BERT and its dozens of variants have reinforced the usefulness of deep learning to create pre-trained text representations.
      We want to be able to embrace Python packages that range from simple statistics to simple machine learning, throwing in natural language processing and creating embedding representations, that can then range all the way through deep learning to the cutting-edge aspects of geometric or graph deep learning.
      I call out this category separately because, like for NLP, I think it makes sense to have a general machine learning library not devoted to deep learning but providing a repository of classic learning methods.
      The leading deep learning libraries do not, in general, provide support for this area of representational learning, sometimes called knowledge representation learning (KRL) or knowledge graph embedding (KGE).→ deep learning → geometric deep learning → graph (R) learningLike our investigations around graphics and visualization (see [CWPK #55]), the broad areas of data science, machine learning, and deep learning have been evolving to one of comprehensive ecosystems.One of these four is in natural language processing (spaCy), one in general machine learning (scikit-learn), and two in deep learning with an emphasis on graphs (DGL and DGL-KE, and PyG).

We can vary the size of the summary by changing the percentage of sentences to include. Note that there are other methods for summarizing text with spaCy beyond simply ranking sentences by word frequencies; a standard search will turn them up.
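
To make the effect of the percentage concrete, here is a rough, spaCy-free sketch of the same frequency-based scoring. It assumes a naive regular-expression sentence splitter and counts only alphabetic words, so its sentence boundaries and lengths will differ from spaCy's; the `summarize` function and its parameters are illustrative names. Unlike the routine above, it always keeps at least one sentence (with 5 percent, a text of fewer than 20 sentences would otherwise yield an empty summary) and returns the sentences in their original order:

```python
import math
import re
from collections import Counter

def summarize(text, fraction=0.05, min_words=8, stop_words=frozenset()):
    """Return the top-scoring sentences of text, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption)
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    words = [w for w in re.findall(r'[a-z]+', text.lower()) if w not in stop_words]
    if not sentences or not words:
        return []
    counts = Counter(words)
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}  # normalize to 0..1

    scores = []
    for s in sentences:
        tokens = re.findall(r'[a-z]+', s.lower())
        # Sentences shorter than min_words score zero, as in the routine above
        if len(tokens) >= min_words:
            scores.append(sum(weights.get(t, 0.0) for t in tokens))
        else:
            scores.append(0.0)

    keep = max(1, math.ceil(fraction * len(sentences)))  # never an empty summary
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(ranked)]
```

Raising `fraction` lengthens the summary; lowering `min_words` lets shorter sentences compete.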

(Named) Entity Recognition – Update Model

Depending on the input model, spaCy provides pre-trained entity recognition models (the area is most often referred to as NER, but an actual category of entities need not be named with capitalization, which is why I prefer ‘entity recognition’). In the case of the pre-trained English OntoNotes 5 model, these entity tags are:

Type	Description
PERSON	People, including fictional.
NORP	Nationalities or religious or political groups.
FAC	Buildings, airports, highways, bridges, etc.
ORG	Companies, agencies, institutions, etc.
GPE	Countries, cities, states.
LOC	Non-GPE locations, mountain ranges, bodies of water.
PRODUCT	Objects, vehicles, foods, etc. (Not services.)
EVENT	Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART	Titles of books, songs, etc.
LAW	Named documents made into laws.
LANGUAGE	Any named language.
DATE	Absolute or relative dates or periods.
TIME	Times smaller than a day.
PERCENT	Percentage, including ‘%’.
MONEY	Monetary values, including unit.
QUANTITY	Measurements, as of weight or distance.
ORDINAL	“first”, “second”, etc.
CARDINAL	Numerals that do not fall under another type.

We would like to add our own new entity tag to this list for items related to ‘machine learning’, which we will give the tag of ‘ML’. spaCy provides two methods for extending entity recognition: 1) training and updating an existing model (in this case, en_core_web_sm); or 2) a rule-based approach.

The first option we will try is the updated model. Another major section below investigates the rule-based approach.

Get Entity Pages

We already have a number of ‘machine learning’-related reference concepts in KBpedia. Given our success in using the Wikipedia online API to get articles (see prior installment), I decided to explore that API to see if there are ways to get comprehensive listings of ‘machine learning’ topics. Happily, it turns out, there are!

This particular API call, https://www.mediawiki.org/wiki/API:Categorymembers, enables us to enter a category term, in this case ‘machine learning’, and to get all of the article titles subsumed in that category. Better still, the API documentation provides code snippets for interacting with it, including in Python. Here is the code listing that we get from the API:

#!/usr/bin/python3

"""
    get_category_items.py

    MediaWiki API Demos
    Derived from demo of `Categorymembers` module : List twenty items in a category

    MIT License
"""

import requests

S = requests.Session()

URL = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "action": "query",
    "cmtitle": "Category:Unsupervised learning",
    "cmlimit": "300",
    "list": "categorymembers",
    "format": "json"
}

R = S.get(url=URL, params=PARAMS)
DATA = R.json()

PAGES = DATA['query']['categorymembers']

for page in PAGES:
    print(page['title'])

For each appropriate sub-category under ‘machine learning’ we issue the query above to get a listing of possible articles on the topic in Wikipedia. We assemble this listing and then manually inspect it to remove things like Category:Machine learning researchers, since those are people related to machine learning but not ‘machine learning’ per se. The result of this retrieval, including all relevant sub-categories, yields 1027 pages, 853 of which are unique, and 846 of which actually process.
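
The assembly step can be sketched as follows, under a couple of assumptions. The MediaWiki API returns a `continue` block when a category has more members than the requested limit, and those tokens are sent back with the next request. The `fetch` argument here is a hypothetical callable standing in for the HTTP request (for instance, one wrapping the `S.get(url=URL, params=PARAMS).json()` call from the listing above), which keeps the sketch free of network access; `iter_category_titles` and `unique_titles` are illustrative names:

```python
def iter_category_titles(fetch, category, limit=500):
    """Yield every member title of a category, following API continuation.

    fetch is any callable that takes a dict of query parameters and returns
    the decoded JSON response as a dict (an assumption of this sketch).
    """
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category,
        'cmlimit': str(limit),
        'format': 'json',
    }
    while True:
        data = fetch(params)
        for page in data['query']['categorymembers']:
            yield page['title']
        if 'continue' not in data:
            break
        # Merge the continuation tokens into the next request
        params = {**params, **data['continue']}

def unique_titles(titles):
    """Drop duplicate titles while keeping first-seen order."""
    return list(dict.fromkeys(titles))
```

Running `unique_titles` over the combined output of several sub-category queries is the kind of step that reduces a raw listing (1027 pages in our case) to its unique members (853).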

This listing gives us two kinds of items. First, the page titles give us terms and phrases related to ‘machine learning’. Second, using the same procedures for Wikipedia page retrievals noted in CWPK #63, we retrieve the actual XML pages, clean them in the same way we did for the general KBpedia corpus, and then create bigrams and trigrams. We now have our specialty ‘machine learning’ corpus in the exact same format as that for KBpedia, which we keep and save as wp_machine_learning.txt.

Chunk Text into ‘Sentences’

Since the spaCy NER trainer relies on sentence-length snippets for its training (see below), our next step is to chop up this text corpus into sentence-length chunks. The code below, including the textwrap package import, enables us to iterate document-by-document through our wp_machine_learning.txt file and to break it into sentence-length chunks. The example below chunks into snippets that are at most 48 characters long, breaking on whitespace:

import textwrap
from smart_open import smart_open

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wp_machine_learning.txt'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wp_machine_learning_sentences_48.txt'

documents = smart_open(in_f, 'r', encoding='utf-8')

with open(out_f, 'w', encoding='utf-8') as output:
    i = 0
    for line in documents:
        try:
            line = str(line)
            # Wrap each article into chunks of at most 48 characters
            sentences = textwrap.wrap(line, 48)
            # Store the chunks as the string form of a Python list, one article per line
            sentences = str(sentences)
            sentences = sentences + '\n'
            output.write(sentences)
        except Exception as e:
            print('Exception error: ' + str(e))
        i = i + 1
    # No explicit close is needed; the with block closes the output file
    print('Split lines into sentences for ' + str(i) + ' articles;')
    print('Processing complete!')

In my various experiments, I set chunk sizes ranging from 48 characters to 180 characters, as I discuss in later results.
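
For a quick look at what the chunking produces, here is a small stand-alone sketch; the `chunk_text` helper and the sample string are illustrative. `textwrap.wrap` breaks on whitespace and returns chunks no longer than the stated width (by default it will also split any single word that is longer than the width):

```python
import textwrap

def chunk_text(text, width=48):
    """Split text into chunks of at most width characters, breaking on whitespace."""
    return textwrap.wrap(text, width)

# Hypothetical sample standing in for one cleaned article line
sample = ('a generative model it is now known as a '
          'probabilistic approach to unsupervised learning of representations')
for chunk in chunk_text(sample, 48):
    print(len(chunk), chunk)
```

Passing 180 instead of 48 gives the longer snippets used in the other experiments.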

Extract and Offset Entities

The titles we extracted from Wikipedia ‘machine learning’ articles give us the set of terms and phrases for setting up the labeled examples expected by the spaCy NER trainer. Here is one example:

"a generative model it is now known as a", {"entities": [(2, 18, "ML")]}

This labeled example shows a text pattern in which the entity (‘generative model’ in this case) is embedded, with the starting and ending character offsets specified for that entity, as well as its ‘ML’ label. (Note that the offset counter begins at zero given the Python convention.) One needs to provide hundreds of such labeled examples to properly train the entity recognizer.
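
To make the offset convention concrete, here is a minimal sketch that locates a phrase with plain regular expressions rather than spaCy (the `find_entity_offsets` helper is illustrative only). The end offset is exclusive, so slicing the text with the two offsets returns the entity itself:

```python
import re

def find_entity_offsets(text, phrases, label='ML'):
    """Return (start, end, label) for each whole-word occurrence of a phrase."""
    found = []
    for phrase in phrases:
        # Require non-word characters (or the string edges) around the phrase
        pattern = r'(?<!\w)' + re.escape(phrase) + r'(?!\w)'
        for m in re.finditer(pattern, text):
            found.append((m.start(), m.end(), label))
    return sorted(found)

text = "a generative model it is now known as a"
entities = find_entity_offsets(text, ['generative model'])
print(entities)

# Slicing with the offsets recovers the entity text
start, end, _ = entities[0]
print(text[start:end])
```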

Manually labeling these snippets is an intense and time-consuming task. To make it efficient, we use a spaCy function called PhraseMatcher that inspects each snippet, identifies whether a stipulated entity occurs there, and reports where it matches, from which we calculate the character offsets. Thus, in the code below, we first list out the 800 or so ‘machine learning’ titles we have already identified, and then parse those against the sentence snippets we generated from our article texts:

# adapted from https://stackoverflow.com/questions/47638877/using-phrasematcher-in-spacy-to-find-multiple-match-types
# https://thinkinfi.com/prepare-training-data-and-train-custom-ner-using-spacy-python/
# https://adagrad.ai/blog/ner-using-spacy.html

import os
import spacy
import random
from spacy.matcher import PhraseMatcher
from spacy.tokenizer import Tokenizer
from smart_open import smart_open
import en_core_web_sm

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wp_machine_learning_sentences_48.txt'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wp_machine_learning_training_data_48.json'
documents = smart_open(in_f, 'r', encoding='utf-8')
nlp = en_core_web_sm.load()

ml_patterns = [nlp(text) for text in ('(1+ε)-approximate nearest neighbor search', 
     '80 million tiny images', 'ablation', 'absorbing markov chain')]
# See the full listing under '(Named) Entity Recognition - Rule-Based' section below                                     


matcher = PhraseMatcher(nlp.vocab)
matcher.add('ML', None, *ml_patterns)
with open(out_f, 'w', encoding='utf-8') as output:
    x = 0
    for line in documents:
        line = str(line)
        sublist = line.split(', ')
        for sentence in sublist:
            sentence = str(sentence)
            sentence = sentence.replace(']','')
            sentence = sentence.replace("'", "")
            doc = nlp(sentence)
            sen_length = len(sentence)
            matches = matcher(doc)
            start = 0
            s_start = 0
            s_length = 0
            for match_id, start, end in matches:  # iterate over the entities
                label = nlp.vocab.strings[match_id]
                span = doc[start:end]
                label = str(label)
                span = str(span)
                length = len(span)
                start = sentence.index(span)
                end = start + length
                start = str(start)
                end = str(end)
                if s_start != start and s_length != length:
                    s_start = start
                    s_length = length
                    train_it = ('("' + sentence + '", {"entities": [(' + start + ', ' + end 
                           + ', "' + label + '")]}),')
                    output.write(train_it)
                else:
                    continue
    output.close()
    print('Got this far!')

The code block at the end of this routine calculates the starting and ending character offsets for the matched entity, and then constructs a new string that matches the form expected by spaCy. Note this training data needs to be in JSON format. There are online JSON syntax checkers (here is one) to make sure you are constructing this training example in the correct form. It takes about 8 min to do the conversion above when parsed against our machine learning corpus.
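
As a hedged alternative to building these strings by concatenation, the sketch below lets the standard `json` module do the quoting, which avoids malformed output when a snippet itself contains a double quote. It also groups multiple matches for the same snippet into a single example. The helper names and the one-JSON-array-per-line layout are assumptions of this sketch, not the format used above:

```python
import json

def merge_examples(examples):
    """Group (text, (start, end, label)) pairs so each text appears once."""
    merged = {}
    for text, entity in examples:
        spans = merged.setdefault(text, [])
        if entity not in spans:
            spans.append(entity)
    return [(text, {'entities': sorted(spans)}) for text, spans in merged.items()]

def to_json_lines(train_data):
    """Serialize each example as one JSON array per line."""
    return '\n'.join(json.dumps([text, ann], ensure_ascii=False)
                     for text, ann in train_data)

def from_json_lines(blob):
    """Rebuild the (text, {'entities': [(start, end, label), ...]}) tuples."""
    data = []
    for line in blob.splitlines():
        text, ann = json.loads(line)
        # JSON has no tuple type, so convert the lists back to tuples
        data.append((text, {'entities': [tuple(e) for e in ann['entities']]}))
    return data
```

The list returned by `from_json_lines` has the same shape as the TRAIN_DATA list used for training later in this installment.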

Train the Recognizer

Re-training an existing spaCy NER model means importing the existing model and updating it using the training example snippets. However, if not done properly, one can experience what is called the ‘catastrophic forgetting problem’, which means that existing trained labels get forgotten as the new ones are learned. Two steps are recommended to limit this problem. First, the training should be limited to about 20 iterations, since repeated iterations risk more forgetting. The second recommended step is to include existing label snippets in the training corpus when training begins. This way the existing labels are seen again, and the degree of ‘forgetting’ is lessened.
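
The second recommendation can be sketched as below. This is a minimal illustration under stated assumptions: both inputs are lists of `(text, annotations)` tuples, and `mix_training_data`, `revision_fraction`, and `seed` are hypothetical names. The default share of 0.1 is only a starting point to experiment with:

```python
import random

def mix_training_data(new_examples, revision_examples, revision_fraction=0.1, seed=0):
    """Combine new-label examples with a random sample of existing-label examples."""
    rng = random.Random(seed)
    # Size the revision sample relative to the number of new-label examples
    n_revision = min(len(revision_examples),
                     round(revision_fraction * len(new_examples)))
    mixed = list(new_examples) + rng.sample(list(revision_examples), n_revision)
    rng.shuffle(mixed)
    return mixed
```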

I looked in vain for the training examples originally used by spaCy for its existing entity labels. Not finding any, I decided to create my own snippets with existing labels.

To do so, I repeated steps similar to those outlined above, only now using the existing labels rather than new ones. As before, we also need to construct the training examples in the proper JSON form:

import os
import spacy
import random
from spacy.gold import GoldParse

from smart_open import smart_open
import en_core_web_sm

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wp_machine_learning_sentences_48.txt'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\results\standard_training_data_48.json'
documents = smart_open(in_f, 'r', encoding='utf-8')
nlp = en_core_web_sm.load()
revision_data = []
with open(out_f, 'w', encoding='utf-8') as output:
#    for doc in nlp.pipe(revision_texts):
    for line in documents:
        line = str(line)
        sublist = line.split(', ')
        for sentence in sublist:
            sentence = str(sentence)
            sentence = sentence.replace(']','')
            sentence = sentence.replace("'", "")
            sentence = sentence.replace('[','')
            sentence = sentence.replace('\n', '')
            length = len(sentence)
            if length < 40:
                continue
            else:
                doc = nlp(sentence)
#        tags = [w.tag_ for w in doc]
#        heads = [w.head.i for w in doc]
#        deps = [w.dep_ for w in doc]
                entities = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
#        revision_data.append((doc, GoldParse(doc, tags=tags, heads=heads,
#                                            deps=deps, entities=entities)))
                revision_data.append((doc, GoldParse(doc, entities=entities)))
#        print(revision_data)
                doc = str(doc)
                entities = str(entities)
#            revision_str = (doc + ', ' + entities + '\n')
                revision_str = ('("' + doc + '", {"entities": ' + entities + '}),\n')
                output.write(revision_str)
    output.close()
    print('Got this far!')

This second pass using the existing NER labels took about 19 min.

The last step in our prep for updating the existing NER model is to remove short stubs from our training examples, as this code achieves:

from smart_open import smart_open

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\results\standard_training_data_48.json'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\results\standard_training_data.json'

documents = smart_open(in_f, 'r', encoding='utf-8')

with open(out_f, 'w', encoding='utf-8') as output:
    i = 0
    for line in documents:
        try:
            line = str(line)
            length = len(line)
            if i == length or length < 40:
                continue
            else:
                i = length
                output.write(line)
        except Exception as e:
            print ('Exception error: ' + str(e))
    output.close()
    print('Processing complete!')

Continue Training the Recognizer

We add about 10% of the existing-label training examples to those we have already generated for the ‘ML’ training set to counter the ‘catastrophic forgetting’ problem. With this new input deck, we are now ready to run and update the NER recognizer, making sure to keep our iterations to about 20 epochs. Here is the code, including some of the 14 K training examples:

# adapted from https://www.machinelearningplus.com/nlp/training-custom-ner-model-in-spacy/

# Import and load the spacy model
import spacy
import en_core_web_sm
import random
from spacy.util import minibatch, compounding

output_dir = r'C:\1-PythonProjects\kbpedia\v300\models\results\en_core_ml'
 
nlp = en_core_web_sm.load()

# Get the ner component
ner=nlp.get_pipe('ner')

# Add new label
LABEL = 'ML'

# Add the label to ner
ner.add_label(LABEL)

# Load training data 
TRAIN_DATA = [("ökonomik path dependence in spatial networks the", {"entities": [(9, 24, "ML")]}),
("β with respect to the loss function v if the", {"entities": [(22, 35, "ML")]}),
("ε approximate nearest neighbor search include kd", {"entities": [(14, 37, "ML")]}),
("ξ i e φ x is the feature vector produced for a", {"entities": [(17, 24, "ML")]}),
("a a survey on concept drift adaptation acm", {"entities": [(14, 27, "ML")]}),
("a bayesian gaussian mixture model is commonly", {"entities": [(20, 33, "ML")]}),
("a beginner s guide to factor analysis", {"entities": [(22, 37, "ML")]}),
("a boltzmann machine with a few weights labeled", {"entities": [(2, 19, "ML")]}),
# See 'C:\1-PythonProjects\kbpedia\v300\models\inputs\ml_training_data.json' for full listing
]

# Retrieve labels
for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])
            
# Resume training (since we are extending an existing model)
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List pipes needed for training
pipe_exceptions = ['ner', 'trf_wordpiecer', 'trf_tok2vec']

# List all other pipes
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]  

# Begin training on restricted pipeline components
with nlp.disable_pipes(*other_pipes):
    random.seed(0)
    sizes = compounding(1.0, 4.0, 1.001)
# Iterate over training set 20 times
    for itn in range(20):
# Shuffle examples
        random.shuffle(TRAIN_DATA)
# Grab a batch of examples for training
        batches = minibatch(TRAIN_DATA, size=sizes)
# Set empty dictionary
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
# Call update() over epoch; see narrative
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
            print('Losses: ', losses)
# Save model to disk
    nlp.meta['name'] = 'ml_sm'  # rename model
    nlp.to_disk(output_dir)
    print('Saved model to: ', output_dir)

When run, the losses are reported for each training batch, and the full pass over the training examples is repeated for the number of iterations specified (20 in this case). This takes about 4:15 hrs to run on my laptop. Also note that we save the newly trained model, en_core_ml, at the completion of training.
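
To get a feel for the batch schedule set by `compounding(1.0, 4.0, 1.001)`, here is a pure-Python sketch. It is a re-implementation written for illustration, assumed to mirror how the spaCy utility behaves for an increasing schedule (start below stop); it is not spaCy's own code. The `count_batches` helper is hypothetical, and the 14,000 figure is the approximate training set size mentioned above:

```python
import itertools

def compounding(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound

def count_batches(n_examples, sizes):
    """Count the batches needed to consume n_examples with the given size schedule."""
    batches, used = 0, 0
    while used < n_examples:
        # Batch sizes are truncated to whole numbers of examples
        used += max(1, int(next(sizes)))
        batches += 1
    return batches

# A fast-growing schedule, just to show the shape of the series
print(list(itertools.islice(compounding(1.0, 4.0, 1.5), 5)))

# Approximate number of batches in a first pass over about 14 K examples
print(count_batches(14000, compounding(1.0, 4.0, 1.001)))
```

With a growth factor as small as 1.001, the batch size stays at a single example for several hundred batches, so early updates are effectively made one example at a time.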

Unfortunately, though my testing showed we were picking up the ‘ML’ tag, it was not doing so for many of the input training examples. To see if I could improve on this performance, I tested the following options, all to no material benefit:

Reducing the number of training examples
Reducing the number of entity patterns
Varying the sentence snippet size, from 48 characters (more typical of online examples) to 180 characters (more typical of actual sentence length)
Dialing back the number of iterations (even a single iteration showed the forgetting behavior)
Testing the drop setting between 0.5 and 0.0
Changing the relative contribution of existing NER labels to the training set
Using or not negative (no entity matches) training examples.

In all cases, I continued to see the ‘forgetting’ problem and observed that larger numbers of iterations adversely affected the number of ‘ML’ labels assigned. These results are not in accordance with the documentation I have found. Possible reasons for this continued poor performance might include:

Too many diverse patterns in the training set
Issues possibly arising from the synthetic sentence snippets I generated
An inadequate percentage of existing label training snippets, or
A coding error.
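
One way to put numbers on both symptoms (missed ‘ML’ labels and forgotten existing labels) is to compare gold and predicted spans label by label, before and after the update. The sketch below is a minimal, spaCy-free illustration with hypothetical helper names. It assumes spans are `(start, end, label)` tuples, such as those built from `doc.ents` in the listing above, and counts only exact matches:

```python
def score_entities(gold, predicted):
    """Return (precision, recall) over exact (start, end, label) matches."""
    gold_set, pred_set = set(gold), set(predicted)
    hits = len(gold_set & pred_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    return precision, recall

def recall_by_label(examples):
    """examples: iterable of (gold_spans, predicted_spans); returns {label: recall}."""
    found, total = {}, {}
    for gold, predicted in examples:
        pred_set = set(predicted)
        for span in gold:
            label = span[2]
            total[label] = total.get(label, 0) + 1
            found[label] = found.get(label, 0) + (1 if span in pred_set else 0)
    return {label: found[label] / total[label] for label in total}
```

A drop in recall for labels such as ORG or PERSON after training would point to forgetting, while low recall for ‘ML’ on held-out snippets would point to the new label not being learned.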

(Named) Entity Recognition – Rule-Based

Unlike certain named entities like persons or organizations or locations, the number of ‘machine learning’ instances is more bounded and finite. Since I had already captured a nearly complete listing of the entities in this space through the thousand Wikipedia examples, perhaps I did not need a trained model, but one based more on rules and set patterns.

Fortunately, spaCy has such a capability called EntityRuler. It provides rule-based matching for assigning label tags to text. Given that I had been unable to better tune the training model, I decided to test this option as well.

Like the trained models, the rule-based approach relies on a set of known input patterns to match against the text and then label it. So, aside from the need to assemble the known patterns, which we had already done above, the actual code to invoke this option is rather simple and straightforward:

from spacy.pipeline import EntityRuler
import en_core_web_sm

output_dir = r'C:\1-PythonProjects\kbpedia\v300\models\results\en_core_ml'
 
nlp = en_core_web_sm.load()

ruler = EntityRuler(nlp, overwrite_ents=True)

patterns = [{'label': 'ML', 'pattern': 'nearest neighbor search'},
{'label': 'ML', 'pattern': '80 million tiny images'},
{'label': 'ML', 'pattern': 'ablation'},
{'label': 'ML', 'pattern': 'absorbing markov chain'},
{'label': 'ML', 'pattern': 'action model learning'},
{'label': 'ML', 'pattern': 'activation function'},
{'label': 'ML', 'pattern': 'active learning'},
{'label': 'ML', 'pattern': 'activity recognition'},
{'label': 'ML', 'pattern': 'adaboost'},
{'label': 'ML', 'pattern': 'adagrad'},
{'label': 'ML', 'pattern': 'adaline'},
{'label': 'ML', 'pattern': 'adaptive neuro fuzzy inference system'},
{'label': 'ML', 'pattern': 'adaptive resonance theory'},
{'label': 'ML', 'pattern': 'additive smoothing'},
{'label': 'ML', 'pattern': 'adversarial machine learning'},
{'label': 'ML', 'pattern': 'aixi'},
{'label': 'ML', 'pattern': 'alchemyapi'},
{'label': 'ML', 'pattern': 'alexnet'},
{'label': 'ML', 'pattern': 'algorithm selection'},
{'label': 'ML', 'pattern': 'algorithmic bias'},
{'label': 'ML', 'pattern': 'algorithmic composition'},
{'label': 'ML', 'pattern': 'algorithmic inference'},
{'label': 'ML', 'pattern': 'algorithmic learning theory'},
{'label': 'ML', 'pattern': 'algorithms of oppression'},
{'label': 'ML', 'pattern': 'almeida–pineda recurrent backpropagation'},
{'label': 'ML', 'pattern': 'alopex'},
{'label': 'ML', 'pattern': 'alphago'},
{'label': 'ML', 'pattern': 'alphago zero'},
{'label': 'ML', 'pattern': 'alphastar'},
{'label': 'ML', 'pattern': 'alphazero'},
{'label': 'ML', 'pattern': 'alterego'},
{'label': 'ML', 'pattern': 'alternating decision tree'},
{'label': 'ML', 'pattern': 'analogical modeling'},
{'label': 'ML', 'pattern': 'anomaly detection'},
{'label': 'ML', 'pattern': 'anti-unification'},
{'label': 'ML', 'pattern': 'apprenticeship learning'},
{'label': 'ML', 'pattern': 'archetypal analysis'},
{'label': 'ML', 'pattern': 'artificial development'},
{'label': 'ML', 'pattern': 'artificial intelligence system'},
{'label': 'ML', 'pattern': 'artificial neural network'},
{'label': 'ML', 'pattern': 'artificial neuron'},
{'label': 'ML', 'pattern': 'artisto'},
{'label': 'ML', 'pattern': 'associative classifier'},
{'label': 'ML', 'pattern': 'astrostatistics'},
{'label': 'ML', 'pattern': 'augmented analytics'},
{'label': 'ML', 'pattern': 'autoassociative memory'},
{'label': 'ML', 'pattern': 'autoencoder'},
{'label': 'ML', 'pattern': 'automated machine learning'},
{'label': 'ML', 'pattern': 'automated pain recognition'},
{'label': 'ML', 'pattern': 'averaged one-dependence estimators'},
{'label': 'ML', 'pattern': 'backpropagation'},
{'label': 'ML', 'pattern': 'bag-of-words'},
{'label': 'ML', 'pattern': 'ball tree'},
{'label': 'ML', 'pattern': 'base rate'},
{'label': 'ML', 'pattern': 'baum–welch algorithm'},
{'label': 'ML', 'pattern': 'bayesian hierarchical modeling'},
{'label': 'ML', 'pattern': 'bayesian interpretation of kernel regularization'},
{'label': 'ML', 'pattern': 'bayesian network'},
{'label': 'ML', 'pattern': 'bayesian optimization'},
{'label': 'ML', 'pattern': 'bayesian regret'},
{'label': 'ML', 'pattern': 'bayesian structural time series'},
{'label': 'ML', 'pattern': 'bcpnn'},
{'label': 'ML', 'pattern': 'behavioral clustering'},
{'label': 'ML', 'pattern': 'bernoulli scheme'},
{'label': 'ML', 'pattern': 'bias–variance tradeoff'},
{'label': 'ML', 'pattern': 'biclustering'},
{'label': 'ML', 'pattern': 'bidirectional associative memory'},
{'label': 'ML', 'pattern': 'bidirectional recurrent neural networks'},
{'label': 'ML', 'pattern': 'binary classification'},
{'label': 'ML', 'pattern': 'bioz'},
{'label': 'ML', 'pattern': 'boltzmann machine'},
{'label': 'ML', 'pattern': 'bondys theorem'},
{'label': 'ML', 'pattern': 'bongard problem'},
{'label': 'ML', 'pattern': 'boosting'},
{'label': 'ML', 'pattern': 'bootstrap aggregating'},
{'label': 'ML', 'pattern': 'bradley–terry model'},
{'label': 'ML', 'pattern': 'brown clustering'},
{'label': 'ML', 'pattern': 'brownboost'},
{'label': 'ML', 'pattern': 'burst error'},
{'label': 'ML', 'pattern': 'c4.5 algorithm'},
{'label': 'ML', 'pattern': 'calibration'},
{'label': 'ML', 'pattern': 'canonical correspondence analysis'},
{'label': 'ML', 'pattern': 'capsule neural network'},
{'label': 'ML', 'pattern': 'cartesian genetic programming'},
{'label': 'ML', 'pattern': 'cascading classifiers'},
{'label': 'ML', 'pattern': 'case-based reasoning'},
{'label': 'ML', 'pattern': 'catastrophic interference'},
{'label': 'ML', 'pattern': 'category utility'},
{'label': 'ML', 'pattern': 'causal markov condition'},
{'label': 'ML', 'pattern': 'cellular evolutionary algorithm'},
{'label': 'ML', 'pattern': 'cellular neural network'},
{'label': 'ML', 'pattern': 'cerebellar model articulation controller'},
{'label': 'ML', 'pattern': 'chainer'},
{'label': 'ML', 'pattern': 'chi-square automatic interaction detection'},
{'label': 'ML', 'pattern': 'classifier chains'},
{'label': 'ML', 'pattern': 'clever score'},
{'label': 'ML', 'pattern': 'cluster analysis'},
{'label': 'ML', 'pattern': 'clustering high-dimensional data'},
{'label': 'ML', 'pattern': 'clustering illusion'},
{'label': 'ML', 'pattern': 'cma-es'},
{'label': 'ML', 'pattern': 'cn2 algorithm'},
{'label': 'ML', 'pattern': 'co-training'},
{'label': 'ML', 'pattern': 'coboosting'},
{'label': 'ML', 'pattern': 'codi'},
{'label': 'ML', 'pattern': 'cognitive computer'},
{'label': 'ML', 'pattern': 'cognitive robotics'},
{'label': 'ML', 'pattern': 'collostructional analysis'},
{'label': 'ML', 'pattern': 'committee machine'},
{'label': 'ML', 'pattern': 'common-method variance'},
{'label': 'ML', 'pattern': 'competitive learning'},
{'label': 'ML', 'pattern': 'compositional pattern-producing network'},
{'label': 'ML', 'pattern': 'computational cybernetics'},
{'label': 'ML', 'pattern': 'computational learning theory'},
{'label': 'ML', 'pattern': 'computational neurogenetic modeling'},
{'label': 'ML', 'pattern': 'computer-automated design'},
{'label': 'ML', 'pattern': 'concept class'},
{'label': 'ML', 'pattern': 'concept drift'},
{'label': 'ML', 'pattern': 'concept learning'},
{'label': 'ML', 'pattern': 'conceptual clustering'},
{'label': 'ML', 'pattern': 'conditional random field'},
{'label': 'ML', 'pattern': 'confabulation'},
{'label': 'ML', 'pattern': 'confusion matrix'},
{'label': 'ML', 'pattern': 'connectionist temporal classification'},
{'label': 'ML', 'pattern': 'consensus clustering'},
{'label': 'ML', 'pattern': 'constellation model'},
{'label': 'ML', 'pattern': 'constrained clustering'},
{'label': 'ML', 'pattern': 'constrained conditional model'},
{'label': 'ML', 'pattern': 'constructing skill trees'},
{'label': 'ML', 'pattern': 'constructive cooperative coevolution'},
{'label': 'ML', 'pattern': 'conversica'},
{'label': 'ML', 'pattern': 'convolutional deep belief network'},
{'label': 'ML', 'pattern': 'convolutional neural network', 'id': 'cnn'},
{'label': 'ML', 'pattern': 'CNN', 'id': 'cnn'},
{'label': 'ML', 'pattern': 'correlation clustering'},
{'label': 'ML', 'pattern': 'correspondence analysis'},
{'label': 'ML', 'pattern': 'count sketch'},
{'label': 'ML', 'pattern': 'coupled pattern learner'},
{'label': 'ML', 'pattern': 'covers theorem'},
{'label': 'ML', 'pattern': 'cross entropy'},
{'label': 'ML', 'pattern': 'cross-validation'},
{'label': 'ML', 'pattern': 'cultural algorithm'},
{'label': 'ML', 'pattern': 'curse of dimensionality'},
{'label': 'ML', 'pattern': 'darkforest'},
{'label': 'ML', 'pattern': 'darpa lagr program'},
{'label': 'ML', 'pattern': 'darwintunes'},
{'label': 'ML', 'pattern': 'data augmentation'},
{'label': 'ML', 'pattern': 'data exploration'},
{'label': 'ML', 'pattern': 'data pre-processing'},
{'label': 'ML', 'pattern': 'datasets.load'},
{'label': 'ML', 'pattern': 'decision boundary'},
{'label': 'ML', 'pattern': 'decision list'},
{'label': 'ML', 'pattern': 'decision tree learning'},
{'label': 'ML', 'pattern': 'decision tree pruning'},
{'label': 'ML', 'pattern': 'deductive classifier'},
{'label': 'ML', 'pattern': 'deep belief network'},
{'label': 'ML', 'pattern': 'deep image prior'},
{'label': 'ML', 'pattern': 'deep instinct'},
{'label': 'ML', 'pattern': 'deep lambertian networks'},
{'label': 'ML', 'pattern': 'deep learning'},
{'label': 'ML', 'pattern': 'deep learning processor'},
{'label': 'ML', 'pattern': 'deep learning studio'},
{'label': 'ML', 'pattern': 'deep reinforcement learning'},
{'label': 'ML', 'pattern': 'deepfake'},
{'label': 'ML', 'pattern': 'deepfake pornography'},
{'label': 'ML', 'pattern': 'deeplearning4j'},
{'label': 'ML', 'pattern': 'deepmind'},
{'label': 'ML', 'pattern': 'deepnude'},
{'label': 'ML', 'pattern': 'deepspeed'},
{'label': 'ML', 'pattern': 'dehaene–changeux model'},
{'label': 'ML', 'pattern': 'delta rule'},
{'label': 'ML', 'pattern': 'dendrogram'},
{'label': 'ML', 'pattern': 'dependability state model'},
{'label': 'ML', 'pattern': 'detailed balance'},
{'label': 'ML', 'pattern': 'detrended correspondence analysis'},
{'label': 'ML', 'pattern': 'developmental robotics'},
{'label': 'ML', 'pattern': 'dexnet'},
{'label': 'ML', 'pattern': 'diffbot'},
{'label': 'ML', 'pattern': 'differentiable neural computer'},
{'label': 'ML', 'pattern': 'differential evolution'},
{'label': 'ML', 'pattern': 'diffusion map'},
{'label': 'ML', 'pattern': 'dimensionality reduction'},
{'label': 'ML', 'pattern': 'discrete phase-type distribution'},
{'label': 'ML', 'pattern': 'discriminative model'},
{'label': 'ML', 'pattern': 'dispersive flies optimisation'},
{'label': 'ML', 'pattern': 'dissociated press'},
{'label': 'ML', 'pattern': 'distribution learning theory'},
{'label': 'ML', 'pattern': 'document classification'},
{'label': 'ML', 'pattern': 'domain adaptation'},
{'label': 'ML', 'pattern': 'dominance-based rough set approach'},
{'label': 'ML', 'pattern': 'doubly stochastic model'},
{'label': 'ML', 'pattern': 'dynamic bayesian network'},
{'label': 'ML', 'pattern': 'dynamic markov compression'},
{'label': 'ML', 'pattern': 'dynamic time warping'},
{'label': 'ML', 'pattern': 'dynamic topic model'},
{'label': 'ML', 'pattern': 'dynamic unobserved effects model'},
{'label': 'ML', 'pattern': 'eager learning'},
{'label': 'ML', 'pattern': 'early stopping'},
{'label': 'ML', 'pattern': 'echo state network'},
{'label': 'ML', 'pattern': 'effective fitness'},
{'label': 'ML', 'pattern': 'elastic map'},
{'label': 'ML', 'pattern': 'elastic matching'},
{'label': 'ML', 'pattern': 'elastic net regularization'},
{'label': 'ML', 'pattern': 'electricity price forecasting'},
{'label': 'ML', 'pattern': 'elmo'},
{'label': 'ML', 'pattern': 'em algorithm and gmm model'},
{'label': 'ML', 'pattern': 'empirical risk minimization'},
{'label': 'ML', 'pattern': 'end-to-end reinforcement learning'},
{'label': 'ML', 'pattern': 'ensemble learning'},
{'label': 'ML', 'pattern': 'entropy rate'},
{'label': 'ML', 'pattern': 'error tolerance'},
{'label': 'ML', 'pattern': 'error-driven learning'},
{'label': 'ML', 'pattern': 'eurisko'},
{'label': 'ML', 'pattern': 'european neural network society'},
{'label': 'ML', 'pattern': 'evaluation of binary classifiers'},
{'label': 'ML', 'pattern': 'evolution strategy'},
{'label': 'ML', 'pattern': 'evolution window'},
{'label': 'ML', 'pattern': 'evolutionary algorithm'},
{'label': 'ML', 'pattern': 'evolutionary art'},
{'label': 'ML', 'pattern': 'evolutionary multimodal optimization'},
{'label': 'ML', 'pattern': 'evolutionary music'},
{'label': 'ML', 'pattern': 'evolutionary programming'},
{'label': 'ML', 'pattern': 'evolvability'},
{'label': 'ML', 'pattern': 'evolved antenna'},
{'label': 'ML', 'pattern': 'evolving classification function'},
{'label': 'ML', 'pattern': 'examples of markov chains'},
{'label': 'ML', 'pattern': 'expectation propagation'},
{'label': 'ML', 'pattern': 'expectation–maximization algorithm'},
{'label': 'ML', 'pattern': 'explanation-based learning'},
{'label': 'ML', 'pattern': 'extension neural network'},
{'label': 'ML', 'pattern': 'extremal ensemble learning'},
{'label': 'ML', 'pattern': 'extreme learning machine'},
{'label': 'ML', 'pattern': 'f-score'},
{'label': 'ML', 'pattern': 'faceapp'},
{'label': 'ML', 'pattern': 'facial recognition system'},
{'label': 'ML', 'pattern': 'factor analysis'},
{'label': 'ML', 'pattern': 'factor regression model'},
{'label': 'ML', 'pattern': 'factored language model'},
{'label': 'ML', 'pattern': 'farthest-first traversal'},
{'label': 'ML', 'pattern': 'feature engineering'},
{'label': 'ML', 'pattern': 'feature extraction'},
{'label': 'ML', 'pattern': 'feature hashing'},
{'label': 'ML', 'pattern': 'feature learning'},
{'label': 'ML', 'pattern': 'feature scaling'},
{'label': 'ML', 'pattern': 'feature selection'},
{'label': 'ML', 'pattern': 'feature selection toolbox'},
{'label': 'ML', 'pattern': 'federated learning'},
{'label': 'ML', 'pattern': 'feed forward'},
{'label': 'ML', 'pattern': 'feedforward neural network'},
{'label': 'ML', 'pattern': 'feret database'},
{'label': 'ML', 'pattern': 'findface'},
{'label': 'ML', 'pattern': 'first-difference estimator'},
{'label': 'ML', 'pattern': 'first-order inductive learner'},
{'label': 'ML', 'pattern': 'fisher kernel'},
{'label': 'ML', 'pattern': 'fitness approximation'},
{'label': 'ML', 'pattern': 'fly algorithm'},
{'label': 'ML', 'pattern': 'formal concept analysis'},
{'label': 'ML', 'pattern': 'forward algorithm'},
{'label': 'ML', 'pattern': 'forward–backward algorithm'},
{'label': 'ML', 'pattern': 'frequent pattern discovery'},
{'label': 'ML', 'pattern': 'gated recurrent unit'},
{'label': 'ML', 'pattern': 'gaussian adaptation'},
{'label': 'ML', 'pattern': 'gaussian process'},
{'label': 'ML', 'pattern': 'gaussian process emulator'},
{'label': 'ML', 'pattern': 'gene expression programming'},
{'label': 'ML', 'pattern': 'gene prediction'},
{'label': 'ML', 'pattern': 'general regression neural network'},
{'label': 'ML', 'pattern': 'generalization error'},
{'label': 'ML', 'pattern': 'generalized canonical correlation'},
{'label': 'ML', 'pattern': 'generalized filtering'},
{'label': 'ML', 'pattern': 'generalized hebbian algorithm'},
{'label': 'ML', 'pattern': 'generalized iterative scaling'},
{'label': 'ML', 'pattern': 'generalized multidimensional scaling'},
{'label': 'ML', 'pattern': 'generative adversarial network'},
{'label': 'ML', 'pattern': 'generative model'},
{'label': 'ML', 'pattern': 'generative topographic map'},
{'label': 'ML', 'pattern': 'generec'},
{'label': 'ML', 'pattern': 'genetic algorithm'},
{'label': 'ML', 'pattern': 'genetic programming'},
{'label': 'ML', 'pattern': 'genetic representation'},
{'label': 'ML', 'pattern': 'geographical cluster'},
{'label': 'ML', 'pattern': 'gesture description language'},
{'label': 'ML', 'pattern': 'geworkbench'},
{'label': 'ML', 'pattern': 'glimmer'},
{'label': 'ML', 'pattern': 'glottochronology'},
{'label': 'ML', 'pattern': 'google brain'},
{'label': 'ML', 'pattern': 'google matrix'},
{'label': 'ML', 'pattern': 'google nest'},
{'label': 'ML', 'pattern': 'google neural machine translation'},
{'label': 'ML', 'pattern': 'gpt'},
{'label': 'ML', 'pattern': 'gpt-2'},
{'label': 'ML', 'pattern': 'gpt-3'},
{'label': 'ML', 'pattern': 'gradient boosting'},
{'label': 'ML', 'pattern': 'gramian matrix'},
{'label': 'ML', 'pattern': 'grammar induction'},
{'label': 'ML', 'pattern': 'grammatical evolution'},
{'label': 'ML', 'pattern': 'granular computing'},
{'label': 'ML', 'pattern': 'graph kernel'},
{'label': 'ML', 'pattern': 'grossberg network'},
{'label': 'ML', 'pattern': 'group method of data handling'},
{'label': 'ML', 'pattern': 'growing self-organizing map'},
{'label': 'ML', 'pattern': 'growth function'},
{'label': 'ML', 'pattern': 'handwriting recognition'},
{'label': 'ML', 'pattern': 'hard sigmoid'},
{'label': 'ML', 'pattern': 'hebbian theory'},
{'label': 'ML', 'pattern': 'helmholtz machine'},
{'label': 'ML', 'pattern': 'hidden markov model'},
{'label': 'ML', 'pattern': 'hierarchical classification'},
{'label': 'ML', 'pattern': 'hierarchical temporal memory'},
{'label': 'ML', 'pattern': 'hinge loss'},
{'label': 'ML', 'pattern': 'hopfield network'},
{'label': 'ML', 'pattern': 'horovod'},
{'label': 'ML', 'pattern': 'huber loss'},
{'label': 'ML', 'pattern': 'hybrid kohonen self-organizing map'},
{'label': 'ML', 'pattern': 'hybrid neural network'},
{'label': 'ML', 'pattern': 'hyper basis function network'},
{'label': 'ML', 'pattern': 'hyperneat'},
{'label': 'ML', 'pattern': 'hyperparameter'},
{'label': 'ML', 'pattern': 'hyperparameter optimization'},
{'label': 'ML', 'pattern': 'id3 algorithm'},
{'label': 'ML', 'pattern': 'idistance'},
{'label': 'ML', 'pattern': 'imagenets'},
{'label': 'ML', 'pattern': 'inauthentic text'},
{'label': 'ML', 'pattern': 'incremental learning'},
{'label': 'ML', 'pattern': 'independent component analysis'},
{'label': 'ML', 'pattern': 'induction of regular languages'},
{'label': 'ML', 'pattern': 'inductive bias'},
{'label': 'ML', 'pattern': 'inductive logic programming'},
{'label': 'ML', 'pattern': 'inductive probability'},
{'label': 'ML', 'pattern': 'inductive programming'},
{'label': 'ML', 'pattern': 'infer.net'},
{'label': 'ML', 'pattern': 'inferential theory of learning'},
{'label': 'ML', 'pattern': 'influence diagram'},
{'label': 'ML', 'pattern': 'infomax'},
{'label': 'ML', 'pattern': 'information fuzzy networks'},
{'label': 'ML', 'pattern': 'information gain in decision trees'},
{'label': 'ML', 'pattern': 'information gain ratio'},
{'label': 'ML', 'pattern': 'instance selection'},
{'label': 'ML', 'pattern': 'instance-based learning'},
{'label': 'ML', 'pattern': 'instantaneously trained neural networks'},
{'label': 'ML', 'pattern': 'intel realsense'},
{'label': 'ML', 'pattern': 'interacting particle system'},
{'label': 'ML', 'pattern': 'interactive activation and competition networks'},
{'label': 'ML', 'pattern': 'interactive machine translation'},
{'label': 'ML', 'pattern': 'inverted pendulum'},
{'label': 'ML', 'pattern': 'ipo underpricing algorithm'},
{'label': 'ML', 'pattern': 'ircf360'},
{'label': 'ML', 'pattern': 'isolation forest'},
{'label': 'ML', 'pattern': 'isotropic position'},
{'label': 'ML', 'pattern': 'item response theory'},
{'label': 'ML', 'pattern': 'iterative viterbi decoding'},
{'label': 'ML', 'pattern': 'java grammatical evolution'},
{'label': 'ML', 'pattern': 'jpred'},
{'label': 'ML', 'pattern': 'junction tree algorithm'},
{'label': 'ML', 'pattern': 'k-nearest neighbors'},
{'label': 'ML', 'pattern': 'kalman filter'},
{'label': 'ML', 'pattern': 'katzs back-off model'},
{'label': 'ML', 'pattern': 'KBpedia series'},
{'label': 'ML', 'pattern': 'keras'},
{'label': 'ML', 'pattern': 'kernel adaptive filter'},
{'label': 'ML', 'pattern': 'kernel density estimation'},
{'label': 'ML', 'pattern': 'kernel eigenvoice'},
{'label': 'ML', 'pattern': 'kernel embedding of distributions'},
{'label': 'ML', 'pattern': 'kernel method'},
{'label': 'ML', 'pattern': 'kernel perceptron'},
{'label': 'ML', 'pattern': 'kernel principal component analysis'},
{'label': 'ML', 'pattern': 'kinect'},
{'label': 'ML', 'pattern': 'knowledge distillation'},
{'label': 'ML', 'pattern': 'knowledge integration'},
{'label': 'ML', 'pattern': 'label propagation algorithm'},
{'label': 'ML', 'pattern': 'labeled data'},
{'label': 'ML', 'pattern': 'language acquisition device'},
{'label': 'ML', 'pattern': 'language model'},
{'label': 'ML', 'pattern': 'large margin nearest neighbor'},
{'label': 'ML', 'pattern': 'large memory storage and retrieval neural network'},
{'label': 'ML', 'pattern': 'latent class model'},
{'label': 'ML', 'pattern': 'latent dirichlet allocation'},
{'label': 'ML', 'pattern': 'latent semantic analysis'},
{'label': 'ML', 'pattern': 'latent variable'},
{'label': 'ML', 'pattern': 'latent variable model'},
{'label': 'ML', 'pattern': 'lazy learning'},
{'label': 'ML', 'pattern': 'leabra'},
{'label': 'ML', 'pattern': 'leakage'},
{'label': 'ML', 'pattern': 'learnable function class'},
{'label': 'ML', 'pattern': 'learning automaton'},
{'label': 'ML', 'pattern': 'learning classifier system'},
{'label': 'ML', 'pattern': 'learning curve'},
{'label': 'ML', 'pattern': 'learning rate'},
{'label': 'ML', 'pattern': 'learning rule'},
{'label': 'ML', 'pattern': 'learning to rank'},
{'label': 'ML', 'pattern': 'learning vector quantization'},
{'label': 'ML', 'pattern': 'learning with errors'},
{'label': 'ML', 'pattern': 'least-squares support-vector machine'},
{'label': 'ML', 'pattern': 'leave-one-out error'},
{'label': 'ML', 'pattern': 'leela chess zero'},
{'label': 'ML', 'pattern': 'leela zero'},
{'label': 'ML', 'pattern': 'lenet'},
{'label': 'ML', 'pattern': 'lernmatrix'},
{'label': 'ML', 'pattern': 'life-time of correlation'},
{'label': 'ML', 'pattern': 'lightgbm'},
{'label': 'ML', 'pattern': 'linde–buzo–gray algorithm'},
{'label': 'ML', 'pattern': 'linear classifier'},
{'label': 'ML', 'pattern': 'linear discriminant analysis'},
{'label': 'ML', 'pattern': 'linear genetic programming'},
{'label': 'ML', 'pattern': 'linear predictor function'},
{'label': 'ML', 'pattern': 'linear separability'},
{'label': 'ML', 'pattern': 'liquid state machine'},
{'label': 'ML', 'pattern': 'list of datasets for machine-learning research'},
{'label': 'ML', 'pattern': 'local case-control sampling'},
{'label': 'ML', 'pattern': 'local independence'},
{'label': 'ML', 'pattern': 'local outlier factor'},
{'label': 'ML', 'pattern': 'local tangent space alignment'},
{'label': 'ML', 'pattern': 'locality-sensitive hashing'},
{'label': 'ML', 'pattern': 'log-linear model'},
{'label': 'ML', 'pattern': 'logic learning machine'},
{'label': 'ML', 'pattern': 'logitboost'},
{'label': 'ML', 'pattern': 'long short-term memory'},
{'label': 'ML', 'pattern': 'loss function'},
{'label': 'ML', 'pattern': 'loss functions for classification'},
{'label': 'ML', 'pattern': 'low-rank approximation'},
{'label': 'ML', 'pattern': 'low-rank matrix approximations'},
{'label': 'ML', 'pattern': 'lpboost'},
{'label': 'ML', 'pattern': 'm-theory'},
{'label': 'ML', 'pattern': 'machine learning'},
{'label': 'ML', 'pattern': 'machine_learning'},
{'label': 'ML', 'pattern': 'manifold alignment'},
{'label': 'ML', 'pattern': 'manifold regularization'},
{'label': 'ML', 'pattern': 'margin classifier'},
{'label': 'ML', 'pattern': 'margin-infused relaxed algorithm'},
{'label': 'ML', 'pattern': 'markov blanket'},
{'label': 'ML', 'pattern': 'markov chain'},
{'label': 'ML', 'pattern': 'markov chain central limit theorem'},
{'label': 'ML', 'pattern': 'markov chain geostatistics'},
{'label': 'ML', 'pattern': 'markov chain monte carlo'},
{'label': 'ML', 'pattern': 'markov information source'},
{'label': 'ML', 'pattern': 'markov model'},
{'label': 'ML', 'pattern': 'markov partition'},
{'label': 'ML', 'pattern': 'markov property'},
{'label': 'ML', 'pattern': 'markov switching multifractal'},
{'label': 'ML', 'pattern': 'markovian discrimination'},
{'label': 'ML', 'pattern': 'matchbox educable noughts and crosses engine'},
{'label': 'ML', 'pattern': 'matrix regularization'},
{'label': 'ML', 'pattern': 'matthews correlation coefficient'},
{'label': 'ML', 'pattern': 'maximum-entropy markov model'},
{'label': 'ML', 'pattern': 'mean squared error'},
{'label': 'ML', 'pattern': 'mean squared prediction error'},
{'label': 'ML', 'pattern': 'measurement invariance'},
{'label': 'ML', 'pattern': 'medoid'},
{'label': 'ML', 'pattern': 'megahal'},
{'label': 'ML', 'pattern': 'melomics'},
{'label': 'ML', 'pattern': 'memetic algorithm'},
{'label': 'ML', 'pattern': 'memtransistor'},
{'label': 'ML', 'pattern': 'meta learning'},
{'label': 'ML', 'pattern': 'meta-optimization'},
{'label': 'ML', 'pattern': 'microsoft cognitive toolkit'},
{'label': 'ML', 'pattern': 'minimum population search'},
{'label': 'ML', 'pattern': 'minimum redundancy feature selection'},
{'label': 'ML', 'pattern': 'mixture model'},
{'label': 'ML', 'pattern': 'mixture of experts'},
{'label': 'ML', 'pattern': 'ml.net'},
{'label': 'ML', 'pattern': 'mlops'},
{'label': 'ML', 'pattern': 'model-free'},
{'label': 'ML', 'pattern': 'models of dna evolution'},
{'label': 'ML', 'pattern': 'modes of variation'},
{'label': 'ML', 'pattern': 'modular neural network'},
{'label': 'ML', 'pattern': 'moea framework'},
{'label': 'ML', 'pattern': 'mokken scale'},
{'label': 'ML', 'pattern': 'moneybee'},
{'label': 'ML', 'pattern': 'moral graph'},
{'label': 'ML', 'pattern': 'mountain car problem'},
{'label': 'ML', 'pattern': 'multi expression programming'},
{'label': 'ML', 'pattern': 'multi-agent learning'},
{'label': 'ML', 'pattern': 'multi-armed bandit'},
{'label': 'ML', 'pattern': 'multi-label classification'},
{'label': 'ML', 'pattern': 'multi-objective reinforcement learning'},
{'label': 'ML', 'pattern': 'multi-surface method'},
{'label': 'ML', 'pattern': 'multi-task learning'},
{'label': 'ML', 'pattern': 'multiclass classification'},
{'label': 'ML', 'pattern': 'multidimensional analysis'},
{'label': 'ML', 'pattern': 'multidimensional scaling'},
{'label': 'ML', 'pattern': 'multifactor dimensionality reduction'},
{'label': 'ML', 'pattern': 'multilayer perceptron'},
{'label': 'ML', 'pattern': 'multilinear principal component analysis'},
{'label': 'ML', 'pattern': 'multilinear subspace learning'},
{'label': 'ML', 'pattern': 'multimodal learning'},
{'label': 'ML', 'pattern': 'multimodal sentiment analysis'},
{'label': 'ML', 'pattern': 'multinomial logistic regression'},
{'label': 'ML', 'pattern': 'multiple correspondence analysis'},
{'label': 'ML', 'pattern': 'multiple discriminant analysis'},
{'label': 'ML', 'pattern': 'multiple instance learning'},
{'label': 'ML', 'pattern': 'multiple kernel learning'},
{'label': 'ML', 'pattern': 'multiple sequence alignment'},
{'label': 'ML', 'pattern': 'multiple-instance learning'},
{'label': 'ML', 'pattern': 'multiplicative weight update method'},
{'label': 'ML', 'pattern': 'multispectral pattern recognition'},
{'label': 'ML', 'pattern': 'multitask optimization'},
{'label': 'ML', 'pattern': 'multivariate adaptive regression spline'},
{'label': 'ML', 'pattern': 'naive bayes classifier'},
{'label': 'ML', 'pattern': 'native-language identification'},
{'label': 'ML', 'pattern': 'natural evolution strategy'},
{'label': 'ML', 'pattern': 'natural language toolkit'},
{'label': 'ML', 'pattern': 'nature machine intelligence'},
{'label': 'ML', 'pattern': 'nearest centroid classifier'},
{'label': 'ML', 'pattern': 'nearest neighbor search'},
{'label': 'ML', 'pattern': 'neocognitron'},
{'label': 'ML', 'pattern': 'netomi'},
{'label': 'ML', 'pattern': 'nettalk'},
{'label': 'ML', 'pattern': 'neural cryptography'},
{'label': 'ML', 'pattern': 'neural designer'},
{'label': 'ML', 'pattern': 'neural gas'},
{'label': 'ML', 'pattern': 'neural modeling fields'},
{'label': 'ML', 'pattern': 'neural network gaussian process'},
{'label': 'ML', 'pattern': 'neural network intelligence'},
{'label': 'ML', 'pattern': 'neural network software'},
{'label': 'ML', 'pattern': 'neural network synchronization protocol'},
{'label': 'ML', 'pattern': 'neural networks'},
{'label': 'ML', 'pattern': 'neural style transfer'},
{'label': 'ML', 'pattern': 'neural tangent kernel'},
{'label': 'ML', 'pattern': 'neural turing machine'},
{'label': 'ML', 'pattern': 'neuroevolution'},
{'label': 'ML', 'pattern': 'neuroevolution of augmenting topologies'},
{'label': 'ML', 'pattern': 'ni1000'},
{'label': 'ML', 'pattern': 'niki.ai'},
{'label': 'ML', 'pattern': 'node2vec'},
{'label': 'ML', 'pattern': 'noisy channel model'},
{'label': 'ML', 'pattern': 'noisy text analytics'},
{'label': 'ML', 'pattern': 'non-negative matrix factorization'},
{'label': 'ML', 'pattern': 'nonlinear dimensionality reduction'},
{'label': 'ML', 'pattern': 'normal discriminant analysis'},
{'label': 'ML', 'pattern': 'novelty detection'},
{'label': 'ML', 'pattern': 'nuisance variable'},
{'label': 'ML', 'pattern': 'nvdla'},
{'label': 'ML', 'pattern': 'object co-segmentation'},
{'label': 'ML', 'pattern': 'occam learning'},
{'label': 'ML', 'pattern': 'offline learning'},
{'label': 'ML', 'pattern': 'ojas rule'},
{'label': 'ML', 'pattern': 'one-class classification'},
{'label': 'ML', 'pattern': 'one-shot learning'},
{'label': 'ML', 'pattern': 'online machine learning'},
{'label': 'ML', 'pattern': 'onnx'},
{'label': 'ML', 'pattern': 'ontology learning'},
{'label': 'ML', 'pattern': 'openai api'},
{'label': 'ML', 'pattern': 'openai five'},
{'label': 'ML', 'pattern': 'opennn'},
{'label': 'ML', 'pattern': 'openvino'},
{'label': 'ML', 'pattern': 'operational taxonomic unit'},
{'label': 'ML', 'pattern': 'optical character recognition'},
{'label': 'ML', 'pattern': 'optical neural network'},
{'label': 'ML', 'pattern': 'optimal discriminant analysis and classification tree analysis'},
{'label': 'ML', 'pattern': 'oscillatory neural network'},
{'label': 'ML', 'pattern': 'out-of-bag error'},
{'label': 'ML', 'pattern': 'outline of machine learning'},
{'label': 'ML', 'pattern': 'overfitting'},
{'label': 'ML', 'pattern': 'pachinko allocation'},
{'label': 'ML', 'pattern': 'pagerank'},
{'label': 'ML', 'pattern': 'paraphrasing'},
{'label': 'ML', 'pattern': 'parity benchmark'},
{'label': 'ML', 'pattern': 'parity learning'},
{'label': 'ML', 'pattern': 'part-of-speech tagging'},
{'label': 'ML', 'pattern': 'partial least squares regression'},
{'label': 'ML', 'pattern': 'particle swarm optimization'},
{'label': 'ML', 'pattern': 'path dependence'},
{'label': 'ML', 'pattern': 'pattern language'},
{'label': 'ML', 'pattern': 'pattern recognition'},
{'label': 'ML', 'pattern': 'perceptron'},
{'label': 'ML', 'pattern': 'physical neural network'},
{'label': 'ML', 'pattern': 'plate notation'},
{'label': 'ML', 'pattern': 'polynomial kernel'},
{'label': 'ML', 'pattern': 'pop music automation'},
{'label': 'ML', 'pattern': 'population process'},
{'label': 'ML', 'pattern': 'portable format for analytics'},
{'label': 'ML', 'pattern': 'predictive learning'},
{'label': 'ML', 'pattern': 'predictive model markup language'},
{'label': 'ML', 'pattern': 'predictive state representation'},
{'label': 'ML', 'pattern': 'preference learning'},
{'label': 'ML', 'pattern': 'preference regression'},
{'label': 'ML', 'pattern': 'prefrontal cortex basal ganglia working memory'},
{'label': 'ML', 'pattern': 'principal component analysis'},
{'label': 'ML', 'pattern': 'prior knowledge for pattern recognition'},
{'label': 'ML', 'pattern': 'proactive learning'},
{'label': 'ML', 'pattern': 'proaftn'},
{'label': 'ML', 'pattern': 'probabilistic context-free grammar'},
{'label': 'ML', 'pattern': 'probabilistic latent semantic analysis'},
{'label': 'ML', 'pattern': 'probabilistic neural network'},
{'label': 'ML', 'pattern': 'probability matching'},
{'label': 'ML', 'pattern': 'probably approximately correct learning'},
{'label': 'ML', 'pattern': 'probit model'},
{'label': 'ML', 'pattern': 'product of experts'},
{'label': 'ML', 'pattern': 'progol'},
{'label': 'ML', 'pattern': 'programming by example'},
{'label': 'ML', 'pattern': 'promoter based genetic algorithm'},
{'label': 'ML', 'pattern': 'proper generalized decomposition'},
{'label': 'ML', 'pattern': 'prototype methods'},
{'label': 'ML', 'pattern': 'proximal gradient method'},
{'label': 'ML', 'pattern': 'pulse-coupled networks'},
{'label': 'ML', 'pattern': 'pvlv'},
{'label': 'ML', 'pattern': 'q-learning'},
{'label': 'ML', 'pattern': 'quadratic classifier'},
{'label': 'ML', 'pattern': 'quadratic unconstrained binary optimization'},
{'label': 'ML', 'pattern': 'quantum machine learning'},
{'label': 'ML', 'pattern': 'quantum markov chain'},
{'label': 'ML', 'pattern': 'quantum neural network'},
{'label': 'ML', 'pattern': 'query-level feature'},
{'label': 'ML', 'pattern': 'question answering'},
{'label': 'ML', 'pattern': 'queueing theory'},
{'label': 'ML', 'pattern': 'quickprop'},
{'label': 'ML', 'pattern': 'rademacher complexity'},
{'label': 'ML', 'pattern': 'radial basis function'},
{'label': 'ML', 'pattern': 'radial basis function kernel'},
{'label': 'ML', 'pattern': 'radial basis function network'},
{'label': 'ML', 'pattern': 'ramnets'},
{'label': 'ML', 'pattern': 'random forest'},
{'label': 'ML', 'pattern': 'random indexing'},
{'label': 'ML', 'pattern': 'random neural network'},
{'label': 'ML', 'pattern': 'random projection'},
{'label': 'ML', 'pattern': 'random subspace method'},
{'label': 'ML', 'pattern': 'randomized weighted majority algorithm'},
{'label': 'ML', 'pattern': 'ranking svm'},
{'label': 'ML', 'pattern': 'reasoning system'},
{'label': 'ML', 'pattern': 'rectifier'},
{'label': 'ML', 'pattern': 'recurrent neural network'},
{'label': 'ML', 'pattern': 'recursive neural network'},
{'label': 'ML', 'pattern': 'region based convolutional neural networks'},
{'label': 'ML', 'pattern': 'reinforcement learning'},
{'label': 'ML', 'pattern': 'relation network'},
{'label': 'ML', 'pattern': 'relational data mining'},
{'label': 'ML', 'pattern': 'relationship square'},
{'label': 'ML', 'pattern': 'relevance vector machine'},
{'label': 'ML', 'pattern': 'representer theorem'},
{'label': 'ML', 'pattern': 'reservoir computing'},
{'label': 'ML', 'pattern': 'residual neural network'},
{'label': 'ML', 'pattern': 'restricted boltzmann machine'},
{'label': 'ML', 'pattern': 'revoscalepy'},
{'label': 'ML', 'pattern': 'revoscaler'},
{'label': 'ML', 'pattern': 'reward-based selection'},
{'label': 'ML', 'pattern': 'right to explanation'},
{'label': 'ML', 'pattern': 'rnn'},
{'label': 'ML', 'pattern': 'robot learning'},
{'label': 'ML', 'pattern': 'robotic process automation'},
{'label': 'ML', 'pattern': 'robust principal component analysis'},
{'label': 'ML', 'pattern': 'rprop'},
{'label': 'ML', 'pattern': 'rule induction'},
{'label': 'ML', 'pattern': 'rule-based machine learning'},
{'label': 'ML', 'pattern': 'rules extraction system family'},
{'label': 'ML', 'pattern': 'sammon mapping'},
{'label': 'ML', 'pattern': 'sample complexity'},
{'label': 'ML', 'pattern': 'sample exclusion dimension'},
{'label': 'ML', 'pattern': 'santa fe trail problem'},
{'label': 'ML', 'pattern': 'scale-invariant feature operator'},
{'label': 'ML', 'pattern': 'scikit-multiflow'},
{'label': 'ML', 'pattern': 'self-organizing map'},
{'label': 'ML', 'pattern': 'semantic analysis'},
{'label': 'ML', 'pattern': 'semantic folding'},
{'label': 'ML', 'pattern': 'semantic mapping'},
{'label': 'ML', 'pattern': 'semantic neural network'},
{'label': 'ML', 'pattern': 'semi-supervised learning'},
{'label': 'ML', 'pattern': 'semidefinite embedding'},
{'label': 'ML', 'pattern': 'sense networks'},
{'label': 'ML', 'pattern': 'sentence embedding'},
{'label': 'ML', 'pattern': 'seq2seq'},
{'label': 'ML', 'pattern': 'sequence labeling'},
{'label': 'ML', 'pattern': 'sequential minimal optimization'},
{'label': 'ML', 'pattern': 'shattered set'},
{'label': 'ML', 'pattern': 'siamese neural network'},
{'label': 'ML', 'pattern': 'sigmoid function'},
{'label': 'ML', 'pattern': 'similarity learning'},
{'label': 'ML', 'pattern': 'simultaneous localization and mapping'},
{'label': 'ML', 'pattern': 'sinkov statistic'},
{'label': 'ML', 'pattern': 'skill chaining'},
{'label': 'ML', 'pattern': 'sliced inverse regression'},
{'label': 'ML', 'pattern': 'soboleva modified hyperbolic tangent'},
{'label': 'ML', 'pattern': 'soft output viterbi algorithm'},
{'label': 'ML', 'pattern': 'softmax function'},
{'label': 'ML', 'pattern': 'solomonoffs theory of inductive inference'},
{'label': 'ML', 'pattern': 'sparse dictionary learning'},
{'label': 'ML', 'pattern': 'sparse pca'},
{'label': 'ML', 'pattern': 'speech recognition'},
{'label': 'ML', 'pattern': 'spike-and-slab regression'},
{'label': 'ML', 'pattern': 'spiking neural network'},
{'label': 'ML', 'pattern': 'spiral optimization algorithm'},
{'label': 'ML', 'pattern': 'squeezenet'},
{'label': 'ML', 'pattern': 'state–action–reward–state–action'},
{'label': 'ML', 'pattern': 'statistical classification'},
{'label': 'ML', 'pattern': 'statistical learning theory'},
{'label': 'ML', 'pattern': 'statistical machine translation'},
{'label': 'ML', 'pattern': 'statistical parsing'},
{'label': 'ML', 'pattern': 'statistical relational learning'},
{'label': 'ML', 'pattern': 'statistical semantics'},
{'label': 'ML', 'pattern': 'stochastic block model'},
{'label': 'ML', 'pattern': 'stochastic cellular automaton'},
{'label': 'ML', 'pattern': 'stochastic gradient descent'},
{'label': 'ML', 'pattern': 'stochastic grammar'},
{'label': 'ML', 'pattern': 'stochastic matrix'},
{'label': 'ML', 'pattern': 'stochastic neural analog reinforcement calculator'},
{'label': 'ML', 'pattern': 'stochastic neural network'},
{'label': 'ML', 'pattern': 'stress majorization'},
{'label': 'ML', 'pattern': 'string kernel'},
{'label': 'ML', 'pattern': 'structural equation modeling'},
{'label': 'ML', 'pattern': 'structural risk minimization'},
{'label': 'ML', 'pattern': 'structured knn'},
{'label': 'ML', 'pattern': 'structured prediction'},
{'label': 'ML', 'pattern': 'structured sparsity regularization'},
{'label': 'ML', 'pattern': 'structured support vector machine'},
{'label': 'ML', 'pattern': 'stylegan'},
{'label': 'ML', 'pattern': 'subclass reachability'},
{'label': 'ML', 'pattern': 'sufficient dimension reduction'},
{'label': 'ML', 'pattern': 'sukhotins algorithm'},
{'label': 'ML', 'pattern': 'sum of absolute differences'},
{'label': 'ML', 'pattern': 'sum of absolute transformed differences'},
{'label': 'ML', 'pattern': 'supervised learning'},
{'label': 'ML', 'pattern': 'support vector machine'},
{'label': 'ML', 'pattern': 'swish function'},
{'label': 'ML', 'pattern': 'switching kalman filter'},
{'label': 'ML', 'pattern': 'symbolic regression'},
{'label': 'ML', 'pattern': 'synaptic transistor'},
{'label': 'ML', 'pattern': 'synaptic weight'},
{'label': 'ML', 'pattern': 'synchronous context-free grammar'},
{'label': 'ML', 'pattern': 'syntactic pattern recognition'},
{'label': 'ML', 'pattern': 't-distributed stochastic neighbor embedding'},
{'label': 'ML', 'pattern': 'taguchi loss function'},
{'label': 'ML', 'pattern': 'tastedive'},
{'label': 'ML', 'pattern': 'td-gammon'},
{'label': 'ML', 'pattern': 'teaching dimension'},
{'label': 'ML', 'pattern': 'temporal difference learning'},
{'label': 'ML', 'pattern': 'tensor product network'},
{'label': 'ML', 'pattern': 'tensor sketch'},
{'label': 'ML', 'pattern': 'tensorflow'},
{'label': 'ML', 'pattern': 'text mining'},
{'label': 'ML', 'pattern': 'textual case-based reasoning'},
{'label': 'ML', 'pattern': 'tf–idf'},
{'label': 'ML', 'pattern': 'the emotion machine'},
{'label': 'ML', 'pattern': 'the master algorithm'},
{'label': 'ML', 'pattern': 'theano'},
{'label': 'ML', 'pattern': 'theory of conjoint measurement'},
{'label': 'ML', 'pattern': 'thurstonian model'},
{'label': 'ML', 'pattern': 'time aware long short-term memory'},
{'label': 'ML', 'pattern': 'time delay neural network'},
{'label': 'ML', 'pattern': 'time series'},
{'label': 'ML', 'pattern': 'timeline of machine learning'},
{'label': 'ML', 'pattern': 'topic model'},
{'label': 'ML', 'pattern': 'training, validation, and test sets'},
{'label': 'ML', 'pattern': 'transduction'},
{'label': 'ML', 'pattern': 'transfer learning'},
{'label': 'ML', 'pattern': 'transformer'},
{'label': 'ML', 'pattern': 'trigram tagger'},
{'label': 'ML', 'pattern': 'triplet loss'},
{'label': 'ML', 'pattern': 'tsetlin machine'},
{'label': 'ML', 'pattern': 'tucker decomposition'},
{'label': 'ML', 'pattern': 'types of artificial neural networks'},
{'label': 'ML', 'pattern': 'u-matrix'},
{'label': 'ML', 'pattern': 'u-net'},
{'label': 'ML', 'pattern': 'ugly duckling theorem'},
{'label': 'ML', 'pattern': 'uncertain data'},
{'label': 'ML', 'pattern': 'under-fitting'},
{'label': 'ML', 'pattern': 'underfitting'},
{'label': 'ML', 'pattern': 'uniform convergence in probability'},
{'label': 'ML', 'pattern': 'unique negative dimension'},
{'label': 'ML', 'pattern': 'universal approximation theorem'},
{'label': 'ML', 'pattern': 'universal portfolio algorithm'},
{'label': 'ML', 'pattern': 'unsupervised learning'},
{'label': 'ML', 'pattern': 'user behavior analytics'},
{'label': 'ML', 'pattern': 'validation set'},
{'label': 'ML', 'pattern': 'vanishing gradient problem'},
{'label': 'ML', 'pattern': 'vapnik–chervonenkis dimension'},
{'label': 'ML', 'pattern': 'vapnik–chervonenkis theory'},
{'label': 'ML', 'pattern': 'variable kernel density estimation'},
{'label': 'ML', 'pattern': 'variable-order bayesian network'},
{'label': 'ML', 'pattern': 'variable-order markov model'},
{'label': 'ML', 'pattern': 'variational message passing'},
{'label': 'ML', 'pattern': 'vector quantization'},
{'label': 'ML', 'pattern': 'version space learning'},
{'label': 'ML', 'pattern': 'visual temporal attention'},
{'label': 'ML', 'pattern': 'viterbi algorithm'},
{'label': 'ML', 'pattern': 'waca clustering algorithm'},
{'label': 'ML', 'pattern': 'waifu2x'},
{'label': 'ML', 'pattern': 'wake-sleep algorithm'},
{'label': 'ML', 'pattern': 'wavenet'},
{'label': 'ML', 'pattern': 'weak supervision'},
{'label': 'ML', 'pattern': 'weighted majority algorithm'},
{'label': 'ML', 'pattern': 'whitening transformation'},
{'label': 'ML', 'pattern': 'witness set'},
{'label': 'ML', 'pattern': 'word embedding'},
{'label': 'ML', 'pattern': 'word2vec'},
{'label': 'ML', 'pattern': 'writer invariant'},
{'label': 'ML', 'pattern': 'zero-shot learning'}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before='ner')

nlp.to_disk(output_dir)
print('Saved model to: ', output_dir)

Saved model to:  C:\1-PythonProjects\kbpedia\v300\models\results\en_core_ml
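As a rough illustration of what phrase patterns like these contribute, the following pure-Python sketch tags the longest matching phrase at each token position. It is not spaCy's EntityRuler: the whitespace tokenizer, the lower-casing, and the helper names `build_index` and `tag_phrases` are assumptions made only for illustration.

```python
# Illustrative only: a tiny longest-match phrase tagger, not spaCy's EntityRuler.
# A handful of patterns copied from the list above.
patterns = [
    {'label': 'ML', 'pattern': 'supervised learning'},
    {'label': 'ML', 'pattern': 'support vector machine'},
    {'label': 'ML', 'pattern': 'structured support vector machine'},
    {'label': 'ML', 'pattern': 'word embedding'},
]


def build_index(patterns):
    """Map each pattern, as a tuple of lower-cased tokens, to its label."""
    return {tuple(p['pattern'].lower().split()): p['label'] for p in patterns}


def tag_phrases(text, index):
    """Return (label, phrase) pairs, preferring the longest match at each position."""
    tokens = text.lower().split()          # assumed: naive whitespace tokenization
    max_len = max(len(key) for key in index)
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in index:
                found.append((index[candidate], ' '.join(candidate)))
                i += n                     # skip past the matched phrase
                break
        else:
            i += 1                         # no pattern starts here; move on
    return found


index = build_index(patterns)
sample = 'A structured support vector machine is a supervised learning method'
matches = tag_phrases(sample, index)
print(matches)
```

Preferring the longest match keeps 'structured support vector machine' from being reported only as the shorter 'support vector machine'. How spaCy resolves overlaps between the ruler and the statistical recognizer is governed by the pipeline order set above.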

The rule-based matcher is simply another step in the processing pipeline, so it can be readily combined with the existing NER recognizer. Having completed the routine above, we can now invoke our new model, en_core_ml, which combines the existing en_core_web_sm model with the new ruler pipeline step. The following code loads the new model and uses it to generate a listing of the tags found in our input text:

import spacy

# Path to the combined model saved by the routine above
model = r'C:\1-PythonProjects\kbpedia\v300\models\results\en_core_ml'

# The saved pipeline already contains the ruler step ahead of 'ner'
nlp = spacy.load(model)

# Grab the NER component so we can inspect it below
ner = nlp.get_pipe('ner')

text = """With this installment of the Cooking with Python and KBpedia series we move into Part VI of seven parts, a part with the bulk of the analytical and machine learning (that is, "data science") discussion, and the last part where significant code is developed and documented. At the conclusion of this part, which itself has 11 installments, we have four installments to wrap up the series and provide a consistent roadmap to the entire project.
      Knowledge graphs are unique information artifacts, and KBpedia is further unique in terms of its consistent and logical construction as well as its incorporation of significant text content via Wikipedia pages. These characteristics provide unique value for KBpedia, but it is also a combination not duplicated anywhere else in the data science ecosystem. One of the objectives, therefore, of this part of our CWPK series is the creation of some baseline knowledge representations useful to data science aims that capture these unique characteristics.
      KBpedia's (or any knowledge graph constructed in a similar manner) combination of characteristics make it a powerful resource in three areas of data science and machine learning. First, the nearly universal scope and degree of topic coverage with about 56,000 concepts, logically organized into typologies with a high degree of disjointedness, means that accurate 'slices' or training sets may be extracted from KBpedia nearly instantaneously. Labeled training sets are one of the most time consuming and expensive activities in doing supervised machine learning. We can extract these nearly for free from KBpedia. Further, with its links to tens of millions of entities in its mapped knowledge bases such as Wikidata, literally tens of thousands of conceptual entities in KBpedia can be the retrieval points to nucleate training sets for fine-grained entity recognition.
      Second, 80% of KBpedia's concepts are mapped to Wikipedia articles. While many Wikipedia-based word embedding models exist, the ones in KBpedia are logically categorized and have rough equivalence in terms of scope and prominence, hopefully providing cleaner topic 'signals'. To probe these assertions, we will create a unique KBpedia-based word embedding corpus that also leverages labels for items of structural importance, such as typology membership. We will use this corpus in many of our tests and as a general focus in our training sets.
      And, third, perhaps the most important area, knowledge graphs offer unique structures and challenges for machine learning, especially innovations in geometric, heterogeneous methods for deep learning. The first generation of deep machine learning was designed for grid-patterned data and matrices through approaches such as deep neural networks, convolutional neural networks (CNN ), or recurrent neural networks (RNN). The 'deep' appelation comes from having multiple calculated, intermediate layers of transformations between the grid inputs and outputs for the model. Graphs, on the other hand, are heterogeneous between nodes and edges. They may be directed (subsumptive) in nature. And, for knowledge graphs, they have much labeling and annotation, including varying degrees of attribute completeness. Language embedding, itself often a product of deep learning, enables the efficient incorporation of text. It is only in the past five years that concerted attention has been devoted to better capturing this feature richness for knowledge graphs.
      The eleven installments in this part will look in more depth at networks and graphs, focus on how to create training sets and embeddings for the learners, discuss some natural language packages and uses, and then look in depth at 'standard' machine learners and deep learners. We will install the first generation of deep graph learners and then explore some on the cutting edge. We will test many use cases, but will also try to invoke classifiers across this spectrum so that we can draw some general conclusions.
      The material below introduces and tees up these topics. We describe leading Python packages for data science, and how we have architected our own approach, We have picked a particular Python machine learning framework, PyTorch, to which we will then tie four different NLP and deep learning libraries. We devote two installments each to these four libraries. The use cases we document across these installments are in addition to the existing ones we have in Clojure posted online.
      So, we think we have an interesting suite of benefits to cover in this part, some arising from being based on KBpedia and some arising from the nature of knowledge graphs. On the other hand, due to the relative immaturity of the field, we are still actively learning and innovating around the juncture of AI and knowledge graphs. Thus, one of the reasons we emphasize Python 'ecosystems' and 'frameworks' in this part is to be better prepared to incorporate those innovations and learnings to come.
      Background
      One of the first prototypes of machine learning comes from the statistician Ronald Fisher in the 1930s regarding how to classify Iris species based on the attributes of their flowers. It was a multivariate data example using the method we today call linear discriminant analysis. This classic example is still taught. But many dozens of new algorithms and combined approaches have joined the machine learning field since then.
      Figure 1 below is one way to characterize the field, with ML standing for machine learning and DL for deep learning, with this one oriented to sub-fields in which some Python package already exists:
      Machine Learning Landscape
      Figure 1: Machine Learning Landscape (from S. Chen, "Machine Learning Algorithms For Beginners with Code Examples in Python", June 2020)
      There are many possible diagrams that one deep learning might prepare to show the machine learning landscape, including ones with a larger emphasis on text and knowledge graphs. Most all schematics of the field show a basic split between supervised learning and unsupervised learning, (sometimes with reinforcement learning as another main branch), with the main difference being that supervised approaches iterate to achieve statistical fit with pre-determined labels, whereas unsupervised is unlabeled. Accurate labeling can be costly and time consuming. Note that the idea of 'classification' is a supervised one, 'clustering' a notion of unsupervised.
      We will include a 'standard' machine learning library in our proposed toolkit, the selection of which I discuss below. However, the most evaluation time I spent in researching these installments was directed to the idea of knowledge representation and embeddings applicable to graphs. Graphs pose a number of differences and challenges to standard machine learning. They have only been a recent (5 yr) focus in machine learning, which is also rapidly changing over time.
      """

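# Sanity check that the loaded NER component exposes a consistent set of moves
move_names = list(ner.move_names)
assert nlp.get_pipe("ner").move_names == move_names

# Run the full pipeline and list each entity's label and text
doc = nlp(text)
for ent in doc.ents:
    print(ent.label_, ent.text)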

ORG the Cooking with Python
ML KBpedia series
LOC Part VI
CARDINAL seven
ML machine learning
CARDINAL 11
CARDINAL four
GPE roadmap
GPE KBpedia
GPE KBpedia
CARDINAL One
ORG KBpedia's
CARDINAL three
ML machine learning
ORDINAL First
CARDINAL about 56,000
GPE KBpedia
ML machine learning
GPE KBpedia
CARDINAL tens of millions
CARDINAL tens of thousands
GPE KBpedia
ORDINAL Second
PERCENT 80%
GPE KBpedia
ML word embedding
GPE KBpedia
GPE KBpedia
ML word embedding
ORDINAL third
ML machine learning
ML deep learning
ORDINAL first
ML machine learning
ML neural networks
ML neural networks
ML CNN
ML neural networks
ORG RNN
PERSON Graphs
ML deep learning
DATE the past five years
CARDINAL eleven
ORDINAL first
ORG Python
ML machine learning
PERSON PyTorch
CARDINAL four
ORG NLP
ML deep learning
CARDINAL two
CARDINAL four
GPE Clojure
GPE KBpedia
GPE AI
CARDINAL one
ORG Python
ORDINAL first
ML machine learning
PERSON Ronald Fisher
DATE the 1930s
GPE Iris
DATE today
ML linear discriminant analysis
CARDINAL dozens
ML machine learning
CARDINAL 1
CARDINAL one
PERSON ML
ML machine learning
NORP DL
ML deep learning
ORG Python
PERSON S. Chen
WORK_OF_ART Machine Learning Algorithms For Beginners with Code Examples in Python
DATE June 2020
CARDINAL one
ML deep learning
ML machine learning
ML supervised learning
ML unsupervised learning
ML reinforcement learning
ML machine learning
PRODUCT Graphs
ML machine learning
QUANTITY 5 yr
ML machine learning
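A listing this long is easier to scan as a tally. The sketch below counts labels over a few (label, text) pairs copied from the output above; the variable names are illustrative. With a live `doc`, the same pairs could be built from `doc.ents` using the `label_` and `text` attributes already used in the loop.

```python
from collections import Counter

# A few (label, text) pairs copied from the listing above; with a live spaCy
# doc these would come from [(ent.label_, ent.text) for ent in doc.ents].
pairs = [
    ('ML', 'machine learning'),
    ('GPE', 'KBpedia'),
    ('ML', 'deep learning'),
    ('PERSON', 'Ronald Fisher'),
    ('ML', 'machine learning'),
    ('GPE', 'KBpedia'),
    ('DATE', 'the 1930s'),
]

# Count how often each label occurs, most frequent first
label_counts = Counter(label for label, _ in pairs)
for label, count in label_counts.most_common():
    print(label, count)
```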

If we want to see these tags in the context of the original text, we can also invoke the visual annotator available in spaCy:

import spacy
from spacy import displacy

# Load the combined model saved earlier
model = r'C:\1-PythonProjects\kbpedia\v300\models\results\en_core_ml'
nlp = spacy.load(model)

text = """With this installment of the Cooking with Python and KBpedia series we move into Part VI of seven parts, a part with the bulk of the analytical and machine learning (that is, "data science") discussion, and the last part where significant code is developed and documented. At the conclusion of this part, which itself has 11 installments, we have four installments to wrap up the series and provide a consistent roadmap to the entire project.
      Knowledge graphs are unique information artifacts, and KBpedia is further unique in terms of its consistent and logical construction as well as its incorporation of significant text content via Wikipedia pages. These characteristics provide unique value for KBpedia, but it is also a combination not duplicated anywhere else in the data science ecosystem. One of the objectives, therefore, of this part of our CWPK series is the creation of some baseline knowledge representations useful to data science aims that capture these unique characteristics.
      KBpedia's (or any knowledge graph constructed in a similar manner) combination of characteristics make it a powerful resource in three areas of data science and machine learning. First, the nearly universal scope and degree of topic coverage with about 56,000 concepts, logically organized into typologies with a high degree of disjointedness, means that accurate 'slices' or training sets may be extracted from KBpedia nearly instantaneously. Labeled training sets are one of the most time consuming and expensive activities in doing supervised machine learning. We can extract these nearly for free from KBpedia. Further, with its links to tens of millions of entities in its mapped knowledge bases such as Wikidata, literally tens of thousands of conceptual entities in KBpedia can be the retrieval points to nucleate training sets for fine-grained entity recognition.
      Second, 80% of KBpedia's concepts are mapped to Wikipedia articles. While many Wikipedia-based word embedding models exist, the ones in KBpedia are logically categorized and have rough equivalence in terms of scope and prominence, hopefully providing cleaner topic 'signals'. To probe these assertions, we will create a unique KBpedia-based word embedding corpus that also leverages labels for items of structural importance, such as typology membership. We will use this corpus in many of our tests and as a general focus in our training sets.
      And, third, perhaps the most important area, knowledge graphs offer unique structures and challenges for machine learning, especially innovations in geometric, heterogeneous methods for deep learning. The first generation of deep machine learning was designed for grid-patterned data and matrices through approaches such as deep neural networks, convolutional neural networks (CNN ), or recurrent neural networks (RNN). The 'deep' appelation comes from having multiple calculated, intermediate layers of transformations between the grid inputs and outputs for the model. Graphs, on the other hand, are heterogeneous between nodes and edges. They may be directed (subsumptive) in nature. And, for knowledge graphs, they have much labeling and annotation, including varying degrees of attribute completeness. Language embedding, itself often a product of deep learning, enables the efficient incorporation of text. It is only in the past five years that concerted attention has been devoted to better capturing this feature richness for knowledge graphs.
      The eleven installments in this part will look in more depth at networks and graphs, focus on how to create training sets and embeddings for the learners, discuss some natural language packages and uses, and then look in depth at 'standard' machine learners and deep learners. We will install the first generation of deep graph learners and then explore some on the cutting edge. We will test many use cases, but will also try to invoke classifiers across this spectrum so that we can draw some general conclusions.
      The material below introduces and tees up these topics. We describe leading Python packages for data science, and how we have architected our own approach, We have picked a particular Python machine learning framework, PyTorch, to which we will then tie four different NLP and deep learning libraries. We devote two installments each to these four libraries. The use cases we document across these installments are in addition to the existing ones we have in Clojure posted online.
      So, we think we have an interesting suite of benefits to cover in this part, some arising from being based on KBpedia and some arising from the nature of knowledge graphs. On the other hand, due to the relative immaturity of the field, we are still actively learning and innovating around the juncture of AI and knowledge graphs. Thus, one of the reasons we emphasize Python 'ecosystems' and 'frameworks' in this part is to be better prepared to incorporate those innovations and learnings to come.
      Background
      One of the first prototypes of machine learning comes from the statistician Ronald Fisher in the 1930s regarding how to classify Iris species based on the attributes of their flowers. It was a multivariate data example using the method we today call linear discriminant analysis. This classic example is still taught. But many dozens of new algorithms and combined approaches have joined the machine learning field since then.
      Figure 1 below is one way to characterize the field, with ML standing for machine learning and DL for deep learning, with this one oriented to sub-fields in which some Python package already exists:
      Machine Learning Landscape
      Figure 1: Machine Learning Landscape (from S. Chen, "Machine Learning Algorithms For Beginners with Code Examples in Python", June 2020)
      There are many possible diagrams that one deep learning might prepare to show the machine learning landscape, including ones with a larger emphasis on text and knowledge graphs. Most all schematics of the field show a basic split between supervised learning and unsupervised learning, (sometimes with reinforcement learning as another main branch), with the main difference being that supervised approaches iterate to achieve statistical fit with pre-determined labels, whereas unsupervised is unlabeled. Accurate labeling can be costly and time consuming. Note that the idea of 'classification' is a supervised one, 'clustering' a notion of unsupervised.
      We will include a 'standard' machine learning library in our proposed toolkit, the selection of which I discuss below. However, the most evaluation time I spent in researching these installments was directed to the idea of knowledge representation and embeddings applicable to graphs. Graphs pose a number of differences and challenges to standard machine learning. They have only been a recent (5 yr) focus in machine learning, which is also rapidly changing over time.
      """

doc = nlp(text)
displacy.render(doc, style='ent', jupyter=True)

With this installment of the Cooking with Python ORG and KBpedia series ML we move into Part VI LOC of seven CARDINAL parts, a part with the bulk of the analytical and machine learning ML (that is, “data science”) discussion, and the last part where significant code is developed and documented. At the conclusion of this part, which itself has 11 CARDINAL installments, we have four CARDINAL installments to wrap up the series and provide a consistent roadmap GPE to the entire project. Knowledge graphs are unique information artifacts, and KBpedia GPE is further unique in terms of its consistent and logical construction as well as its incorporation of significant text content via Wikipedia pages. These characteristics provide unique value for KBpedia GPE , but it is also a combination not duplicated anywhere else in the data science ecosystem. One CARDINAL of the objectives, therefore, of this part of our CWPK series is the creation of some baseline knowledge representations useful to data science aims that capture these unique characteristics. KBpedia’s ORG (or any knowledge graph constructed in a similar manner) combination of characteristics make it a powerful resource in three CARDINAL areas of data science and machine learning ML . First ORDINAL , the nearly universal scope and degree of topic coverage with about 56,000 CARDINAL concepts, logically organized into typologies with a high degree of disjointedness, means that accurate ‘slices’ or training sets may be extracted from KBpedia GPE nearly instantaneously. Labeled training sets are one of the most time consuming and expensive activities in doing supervised machine learning ML . We can extract these nearly for free from KBpedia GPE . Further, with its links to tens of millions CARDINAL of entities in its mapped knowledge bases such as Wikidata, literally tens of thousands CARDINAL of conceptual entities in KBpedia GPE can be the retrieval points to nucleate training sets for fine-grained entity recognition.
Second ORDINAL , 80% PERCENT of KBpedia GPE ‘s concepts are mapped to Wikipedia articles. While many Wikipedia-based word embedding ML models exist, the ones in KBpedia GPE are logically categorized and have rough equivalence in terms of scope and prominence, hopefully providing cleaner topic ‘signals’. To probe these assertions, we will create a unique KBpedia GPE -based word embedding ML corpus that also leverages labels for items of structural importance, such as typology membership. We will use this corpus in many of our tests and as a general focus in our training sets. And, third ORDINAL , perhaps the most important area, knowledge graphs offer unique structures and challenges for machine learning ML , especially innovations in geometric, heterogeneous methods for deep learning ML . The first ORDINAL generation of deep machine learning ML was designed for grid-patterned data and matrices through approaches such as deep neural networks ML , convolutional neural networks ML ( CNN ML ), or recurrent neural networks ML ( RNN ORG ). The ‘deep’ appelation comes from having multiple calculated, intermediate layers of transformations between the grid inputs and outputs for the model. Graphs PERSON , on the other hand, are heterogeneous between nodes and edges. They may be directed (subsumptive) in nature. And, for knowledge graphs, they have much labeling and annotation, including varying degrees of attribute completeness. Language embedding, itself often a product of deep learning ML , enables the efficient incorporation of text. It is only in the past five years DATE that concerted attention has been devoted to better capturing this feature richness for knowledge graphs. The eleven CARDINAL installments in this part will look in more depth at networks and graphs, focus on how to create training sets and embeddings for the learners, discuss some natural language packages and uses, and then look in depth at ‘standard’ machine learners and deep learners. 
We will install the first ORDINAL generation of deep graph learners and then explore some on the cutting edge. We will test many use cases, but will also try to invoke classifiers across this spectrum so that we can draw some general conclusions. The material below introduces and tees up these topics. We describe leading Python ORG packages for data science, and how we have architected our own approach, We have picked a particular Python machine learning ML framework, PyTorch PERSON , to which we will then tie four CARDINAL different NLP ORG and deep learning ML libraries. We devote two CARDINAL installments each to these four CARDINAL libraries. The use cases we document across these installments are in addition to the existing ones we have in Clojure GPE posted online. So, we think we have an interesting suite of benefits to cover in this part, some arising from being based on KBpedia GPE and some arising from the nature of knowledge graphs. On the other hand, due to the relative immaturity of the field, we are still actively learning and innovating around the juncture of AI GPE and knowledge graphs. Thus, one CARDINAL of the reasons we emphasize Python ORG ‘ecosystems’ and ‘frameworks’ in this part is to be better prepared to incorporate those innovations and learnings to come. Background One of the first ORDINAL prototypes of machine learning ML comes from the statistician Ronald Fisher PERSON in the 1930s DATE regarding how to classify Iris GPE species based on the attributes of their flowers. It was a multivariate data example using the method we today DATE call linear discriminant analysis ML . This classic example is still taught. But many dozens CARDINAL of new algorithms and combined approaches have joined the machine learning ML field since then. 
Figure 1 CARDINAL below is one CARDINAL way to characterize the field, with ML PERSON standing for machine learning ML and DL NORP for deep learning ML , with this one oriented to sub-fields in which some Python ORG package already exists: Machine Learning Landscape Figure 1: Machine Learning Landscape (from S. Chen PERSON , “ Machine Learning Algorithms For Beginners with Code Examples in Python WORK_OF_ART ”, June 2020 DATE ) There are many possible diagrams that one CARDINAL deep learning ML might prepare to show the machine learning ML landscape, including ones with a larger emphasis on text and knowledge graphs. Most all schematics of the field show a basic split between supervised learning ML and unsupervised learning ML , (sometimes with reinforcement learning ML as another main branch), with the main difference being that supervised approaches iterate to achieve statistical fit with pre-determined labels, whereas unsupervised is unlabeled. Accurate labeling can be costly and time consuming. Note that the idea of ‘classification’ is a supervised one, ‘clustering’ a notion of unsupervised. We will include a ‘standard’ machine learning ML library in our proposed toolkit, the selection of which I discuss below. However, the most evaluation time I spent in researching these installments was directed to the idea of knowledge representation and embeddings applicable to graphs. Graphs PRODUCT pose a number of differences and challenges to standard machine learning ML . They have only been a recent ( 5 yr QUANTITY ) focus in machine learning ML , which is also rapidly changing over time.

We now see that our ‘ML’ tag has been added to the roster alongside the standard tags.
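Outside a notebook, a plain-text view of tags in context can be approximated by splicing labels into the text at character offsets. This is a minimal sketch, not displaCy itself: the function name `annotate`, the bracket format, and the hand-written spans are all assumptions for illustration. With a live `doc`, each entity's `start_char`, `end_char` and `label_` attributes would supply the same triples.

```python
def annotate(text, spans):
    """Insert ' [LABEL]' after each (start, end, label) span; spans must not overlap."""
    pieces = []
    cursor = 0
    for _start, end, label in sorted(spans):
        pieces.append(text[cursor:end])   # text up to and including the entity
        pieces.append(' [' + label + ']')
        cursor = end
    pieces.append(text[cursor:])          # remainder after the last entity
    return ''.join(pieces)


sentence = 'Machine learning is fun in Iowa.'
# Hypothetical spans written by hand: (start_char, end_char, label)
spans = [(0, 16, 'ML'), (27, 31, 'GPE')]
annotated = annotate(sentence, spans)
print(annotated)
```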

Were this a production version, I would spend more time updating the training examples to remove some of the misassignments (for example, ‘roadmap’ and ‘Clojure’ tagged as GPE, or ‘PyTorch’ tagged as PERSON) and would likely add some additional ML tags specific to our work with KBpedia (as opposed to the ones strictly from Wikipedia). Nonetheless, the rule-based approach appears to be the better one for a topic area like ‘machine learning’ when we have a rather complete enumeration of important instances.
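One quick way to surface candidate misassignments is to look for surface forms that received more than one label. The sketch below uses pairs copied from the entity listing earlier in this section; the helper name `conflicting_labels` is an assumption.

```python
from collections import defaultdict

# (label, text) pairs copied from the entity listing earlier in this section.
pairs = [
    ('PERSON', 'Graphs'),
    ('PRODUCT', 'Graphs'),
    ('ML', 'CNN'),
    ('ORG', 'RNN'),
    ('GPE', 'KBpedia'),
    ('GPE', 'KBpedia'),
    ('ML', 'machine learning'),
]


def conflicting_labels(pairs):
    """Return {text: sorted labels} for texts tagged with more than one label."""
    seen = defaultdict(set)
    for label, text in pairs:
        seen[text].add(label)
    return {text: sorted(labels) for text, labels in seen.items() if len(labels) > 1}


conflicts = conflicting_labels(pairs)
print(conflicts)
```

This only catches inconsistent tags. A consistently wrong one, such as ‘AI’ tagged as GPE, still needs review by eye.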

Other spaCy Functions

The spaCy package offers a wealth of additional functions that might be applied to KBpedia and its uses. For example, this simple routine shows the variety of tags and characterizations that might be retrieved from text:

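# spaCy v2.x import path; in spaCy v3 this function lives in spacy.training
from spacy.gold import docs_to_json

# Reuses the nlp object loaded above
doc = nlp("Machine learning is fun in Iowa.")
json_data = docs_to_json([doc])
print(json_data)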

{'id': 0, 'paragraphs': [{'raw': 'Machine learning is fun in Iowa.', 'sentences': [{'tokens': [{'id': 0, 'orth': 'Machine', 'tag': 'NN', 'head': 1, 'dep': 'compound', 'ner': 'O'}, {'id': 1, 'orth': 'learning', 'tag': 'NN', 'head': 1, 'dep': 'nsubj', 'ner': 'O'}, {'id': 2, 'orth': 'is', 'tag': 'VBZ', 'head': 0, 'dep': 'ROOT', 'ner': 'O'}, {'id': 3, 'orth': 'fun', 'tag': 'JJ', 'head': -1, 'dep': 'attr', 'ner': 'O'}, {'id': 4, 'orth': 'in', 'tag': 'IN', 'head': -2, 'dep': 'prep', 'ner': 'O'}, {'id': 5, 'orth': 'Iowa', 'tag': 'NNP', 'head': -1, 'dep': 'pobj', 'ner': 'U-GPE'}, {'id': 6, 'orth': '.', 'tag': '.', 'head': -4, 'dep': 'punct', 'ner': 'O'}], 'brackets': []}], 'cats': []}]}
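The `ner` field in this output uses BILUO-style tags (B = begin, I = inside, L = last, U = unit, O = outside), so `U-GPE` marks a single-token GPE entity. The sketch below assumes the token layout printed above and uses the hypothetical helper name `biluo_to_entities` to collect those tags back into (label, text) pairs:

```python
def biluo_to_entities(tokens):
    """Collect (label, text) pairs from token dicts carrying BILUO 'ner' tags."""
    entities = []
    current = []                               # tokens of an entity in progress
    for tok in tokens:
        tag = tok['ner']
        if tag == 'O' or '-' not in tag:       # outside any entity (or untagged)
            current = []
            continue
        prefix, label = tag.split('-', 1)
        if prefix == 'U':                      # single-token entity
            entities.append((label, tok['orth']))
        elif prefix == 'B':                    # start of a multi-token entity
            current = [tok['orth']]
        elif prefix == 'I':                    # continuation
            current.append(tok['orth'])
        elif prefix == 'L':                    # last token closes the entity
            current.append(tok['orth'])
            entities.append((label, ' '.join(current)))
            current = []
    return entities


# Tokens abridged from the output above (only 'orth' and 'ner' are kept).
tokens = [
    {'orth': 'Machine', 'ner': 'O'},
    {'orth': 'learning', 'ner': 'O'},
    {'orth': 'is', 'ner': 'O'},
    {'orth': 'fun', 'ner': 'O'},
    {'orth': 'in', 'ner': 'O'},
    {'orth': 'Iowa', 'ner': 'U-GPE'},
    {'orth': '.', 'ner': 'O'},
]
entities = biluo_to_entities(tokens)
print(entities)
```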

Its rich functionality and pipeline abilities make spaCy a very capable NLP package. We could devote more write-ups to applications like topic modeling, word sense disambiguation, or relation extraction, but we need to move on in the next installment to classic machine learning.

Additional Documentation

Here is additional documentation in support of this installment.

Embeddings and Transformers

What Are Word Embeddings for Text?
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings of Structured Data: “Rather than focussing on distances, knowledge graph embeddings focus on establishing a correspondence between the relations of the knowledge graph and geometric relationships in the latent space”
Word Embeddings in Python with Spacy and Gensim: helpful, with many code examples
Tutorial: Build your own Skip-gram Embeddings and use them in a Neural Network: uses gensim, skip-gram, and word2vec, but is based on NLTK and Keras, not our environment. Still, very informative
Transformers from Scratch is an excellent discussion of transformers with much accompanying PyTorch code
spaCy meets Transformers: Fine-tune BERT, XLNet and GPT-2
What is Word Embedding | Word2Vec | GloVe.

Text Summarization

Text Generation from Knowledge Graphs with Graph Transformers and its accompanying GraphWriter code
Knowledge Graph Empowered Entity Description Generation – structure to description generator; does not seem that easily repeatable.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.

NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure to pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.

I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Schema.org Markup

headline: CWPK #64: Embeddings, Summarization and Entity Recognition

alternativeHeadline: Some Machine Learning Applied to NLP Problems

author: Mike Bergman

image: https://www.mkbergman.com/wp-content/uploads/2020/07/cooking-with-kbpedia-785.png

description: In this CWPK installment we process natural language text and use it for creating word and document embedding models using gensim and a very powerful NLP package, spaCy. We will not explore all aspects of NLP, but will focus on text summarization, and (named) entity recognition using both models and rule-based methods.

articleBody: see above

datePublished: November 12, 2020

Posted: November 12, 2020
