Posted: November 23, 2020

Welcome to a Lengthy Installment on Feature Engineering

We devote the next two installments of Cooking with Python and KBpedia to the venerable Python machine learning package, scikit-learn. Also known as ‘sklearn’, this package offers a wealth of classic machine learning methods and utilities, along with abilities to construct machine learning pipelines and collect and present results via a rich set of statistical measures.

Though it may be a distinction without a difference, this installment marks our transition in this Part VI to packages devoted to machine learning. Though our earlier discussions of gensim and spaCy highlighted packages that employ much machine learning in their capabilities, their focus is not strictly machine learning in the same way as our remaining packages. After two installments on classic machine learning with scikit-learn, the remaining five installments in this Part VI are dedicated to deep learning with knowledge graphs.

In this installment we will set up scikit-learn and prep a master data file as input to it (and to subsequent packages). The efforts we detail in this installment, one of the longest in our CWPK series, fall into the discipline known as ‘feature engineering’, the process of extracting and crafting numeric representations staged properly for machine learning. In this installment, we load all of the necessary pieces and then proceed through a pipeline of data wrangling steps to create our vector file of numeric representations directly usable by the machine learners. This information then prepares us for the following installment, where we will set up our experimental training sets, establish a ‘gold standard’ by which we calculate statistical performance, and begin some initial classification. We also set up our framework for reporting and comparing results in all of our machine learning installments.

Install scikit-learn

Because of our initial use of Anaconda, we already have scikit-learn installed. We can confirm this with our standard command:

conda list

We will have occasion to add some extensions, but will do so in context as the need arises.

Basic Intro to sklearn

scikit-learn has been an actively used and developed Python package since its initial release in 2007. It is a veritable Swiss army knife of machine learning algorithms and utilities. It has extensive and clear documentation and many, many online examples to help guide the way in the use of the package. That said, there is much to learn about this package, and much work is needed to set up raw data for proper machine learning use.

scikit-learn’s API documentation is the best introductory source to gain an appreciation for the scope of the package’s capabilities. Here are some of the major categories of sklearn’s capabilities:

methods                   other functions             utilities
ensemble methods          dimensionality reduction    warnings and exceptions
learning methods          kernel operations           metrics
classification methods    dataset functions           preprocessing
neural networks           feature selection           pipelines
SVM                       feature extraction          sample generators
decision trees            decomposition               plotting
linear models             regressions                 splitting
manifold learning         validation                  normalization
nearest neighbor          Bayesian statistics         randomization

sklearn offers a diversity of supervised and unsupervised learning methods, as well as dataset transformations and data loading routines. There are about a dozen supervised methods in the package, and nearly ten unsupervised methods. Multiple utilities exist for transferring datasets to other Python data science packages, including the pandas, gensim and PyTorch ones used in CWPK. Other prominent data formats readable by sklearn include SciPy and its binary formats, NumPy arrays, the libSVM sparse format, and common formats including CSV, Excel, JSON and SQL. sklearn provides converters for translating string and categorical formats into numeric representations usable by the learning methods.

A user guide provides examples of how to use most of these capabilities, and related projects list dozens of other Python packages that work closely with sklearn. Online searches turn up tens to hundreds of examples of how to work with the package. We only touch upon a few of scikit-learn’s capabilities in our series. But, clearly, the package’s scope warrants having it as an essential part of your data science toolkit.

Prepping the Text Data

In earlier installments we introduced the three main portions of KBpedia data in structure, annotations and pages that can contribute features to our machine learning efforts. What we now want to do is to consolidate these parts into a single master that may form the reference source for our learning efforts moving forward. Mixing and matching various parts of this master data will enable us to generate a variety of dataset configurations that we may test and compare.

Overall, there are about 15 steps in this process of creating a master input file. This is one of the most time-consuming tasks in the entire CWPK effort. There is a good reason why data preparation is given such prominent recognition in most discussions of machine learning. However, it is also the case that stepwise examples of exactly how to conduct such preparations are not well documented. As a result, we try to provide more than the standard details below.

Here are the major steps we undertake to prepare our KBpedia data for machine learning:

1. Assemble contributing data parts

To refresh our memories, here are the three key input files to our master data compilation:

  • structure: C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv
  • annotations: C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv
  • pages: C:/1-PythonProjects/kbpedia/v300/models/inputs/wikipedia-trigram.txt

We can inspect these three input files in turn, in the order listed (correct for your own file references):

import pandas as pd

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')

df

import pandas as pd

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv')

df

In the case of the pages file we need to convert its .txt extension to .csv, add a header row of id, text as its first row, and remove any commas from its ID field. (These steps were not needed for the previous two files since they were already in a CSV form that pandas reads directly.)

import pandas as pd

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/wikipedia-trigram.csv')

df

2. Map files

We inspect each of the three files and create a master lookup table for what each includes (not shown). We see that only the annotations file has the correct number of reference concepts. It also has the largest number of fields (columns). As a result, we will use that file as our target for incorporating the information in the other two files. (By the way, entries marked with ‘NaN’ are empty.) We will use our IDs (called ‘source’ in the structure file) as the index key for matching the extended information.

Prior to mapping, we note that the annotations file does not have the updated underscores in place of the initial hyphens (see CWPK #48), so we load up that file, and manually make that change to the id, superclass, and subclass fields. We also delete two of the fields in the annotations file that provide no use to our learning efforts, editorialNote and isDefinedBy. After these changes, we name the file C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv. This is the master file we will use for consolidating our input information for all learning efforts moving forward.

NOTE: During development of these routines I typically use temporary master files for saving each interim step, which provides the opportunity to inspect each transitional step before moving forward. I only provide the concluding versions of these steps in C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv, which is listed on GitHub under the https://github.com/Cognonto/CWPK/tree/master/sandbox/models/inputs directory, along with many of the other inputs noted below. If you want to inspect interim versions as outlined in the steps below, you will need to reconstruct the steps locally.

3. Prepare structure file

In the structure file, only two fields occur that are not already in the annotations file, namely the count of direct subclass children (‘weight’) and the SuperType (ST). The ST field is many-to-one: a reference concept may carry several SuperTypes, each on its own row, which means we need to loop over that field and combine instances into a single cell. We will look to CWPK #44 for some example looping code by iterating instances, only now using the standard ',' separator in place of the double pipes '||' used earlier.

Since we need to remove the ‘kko:’ prefix from the structure file (to correspond to the convention in the annotations file), we make a copy of the file and rename and save it to C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_struct_temp.csv. Since we now have an altered file, we can also remove the target column and rename source to id and weight to count. With these changes we then pull in our earlier code and create the routine that will put all of the SuperTypes for a given reference concept (RC) on the same line, separated by the standard ',' separator. We will name the new output file C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_structure.csv. Here is the code (note the input file is sorted on id):

import csv

in_file = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_struct_temp.csv'
out_file = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_structure.csv'

with open(in_file, 'r', encoding='utf8') as f_in:
    # Explicit fieldnames mean the file's own header row is read as data
    # and passes through as the first row of the output
    reader = csv.DictReader(f_in, delimiter=',', fieldnames=['id', 'count', 'SuperType'])
    with open(out_file, mode='w', encoding='utf8', newline='') as f_out:
        csv_out = csv.writer(f_out)
        x_id = ''
        x_cnt = ''
        x_st = ''
        row_out = ()
        for row in reader:
            r_id = row['id']
            r_cnt = row['count']
            r_st = row['SuperType']
            if r_id != x_id:                               #Note 1
                if row_out:                                # nothing to write before the first ID
                    csv_out.writerow(row_out)
                x_id = r_id                                #Note 1
                x_cnt = r_cnt
                x_st = r_st
            else:
                x_st = x_st + ',' + r_st                   #Note 2
            row_out = (x_id, x_cnt, x_st)
        if row_out:                                        # write the final ID after the loop ends
            csv_out.writerow(row_out)
# The 'with' blocks close both files automatically
print('KBpedia SuperType flattening is complete . . .')

The check on row_out keeps the routine from writing an empty line at the top of the file, and the write after the loop flushes the final reference concept. Because the field names are passed explicitly, the header row of the input file is read as data and passes through as the first line of the output. This basic routine looks for the change in the reference concept ID (1) that signals a new series is starting. The SuperTypes that share the same RC but have different names are added to a single list string (2).

To see the resulting file, here’s the statement:

import pandas as pd

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_structure.csv')

df
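
For comparison, the same flattening can be sketched with a pandas groupby. This is a minimal illustration on a small in-memory table, not the routine used in this series; the toy rows and the variable names toy and flat are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the structure file: one row per (id, SuperType) pair
toy = pd.DataFrame({
    'id': ['Dog', 'Dog', 'Oak'],
    'count': [3, 3, 1],
    'SuperType': ['Animals', 'Eukaryotes', 'Plants'],
})

# Group on id, keep the first count, and join the SuperTypes with ','
flat = (toy.groupby('id', sort=False)
           .agg({'count': 'first', 'SuperType': ','.join})
           .reset_index())

print(flat)
```

Unlike the line-by-line routine, this form does not depend on the input being sorted on id, at the cost of holding the whole table in memory.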

4. Prepare pages file

We are dealing with files here that strain the capabilities of standard spreadsheets. Large files become hard to load and sluggish (like minutes of processing with LibreOffice) even when they do. I have needed to find some alternative file editors to handle large files.

NOTE: I have tried numerous large file editors that I discuss further in the CWPK #75 conclusion to this series, but at present I am using the free ‘hackable’ Atom editor. Atom is a highly configurable editor from GitHub suitable for large files. It has a rich ecosystem of ‘packages’ that provide specific functionality, including one (tablr) that provides CSV table editing and manipulation. If Atom proves inadequate, which would likely force me to purchase an alternative, my current evaluations point to EmEditor, which also has a limited free option and a 30-day trial of the paid version ($40 first year, $20 yearly thereafter). Finally, in order to use Atom properly on CSV files it turns out I also needed to fix a known tablr JS issue.

Because we want to convert our pages file to a single vector representation by RC (the same listing as in the annotations file), we will wait on finishing preparing this file for the moment. We return to this question below.

5. Create merge routine

OK, so we now have properly formatted files for incorporation into our new master. Our files have reached a size where combining or joining them should be done programmatically, not via spreadsheet utilities.

pandas offers a number of join and merge routines, some based on row concatenations, others based on joins such as inner or outer patterned on SQL. In our case, we want to do what is called a ‘left join’ wherein the left-specified file retains all of its rows, while the right file matches where it can.

Clearly, there are many needs and ways for merging files, but here is one example for a ‘left join’:

import pandas as pd

file_a = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_1.csv'
file_b = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_structure.csv'
file_out = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_2.csv'

df_a = pd.read_csv(file_a)

df_b = pd.read_csv(file_b)

# how='left' keeps every row of df_a; without it, pd.merge defaults to an inner join
merged_left = pd.merge(left=df_a, right=df_b, how='left', left_on='id', right_on='id')
# Since both sources share the 'id' column name, we could skip the 'left_on' and 'right_on' parameters

# What's the size of the output data?
merged_left.shape
merged_left

merged_left.to_csv(file_out)

This works great, too. This is the pattern we will use for other merges. We have the green light to continue with our data preparations.
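
To check which rows of the left file found no partner, pandas can flag the origin of each merged row. The sketch below uses two small in-memory tables in place of the master and structure files; the toy values and the names master, structure and unmatched are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the master (left) and structure (right) files
master = pd.DataFrame({'id': ['Dog', 'Oak', 'Rock'],
                       'prefLabel': ['dog', 'oak', 'rock']})
structure = pd.DataFrame({'id': ['Dog', 'Oak'],
                          'count': [3, 1]})

# Left join: every master row is retained; indicator=True adds a '_merge' column
merged = pd.merge(master, structure, how='left', on='id', indicator=True)

# Rows present only in the left table have NaN in the right-hand columns
unmatched = merged[merged['_merge'] == 'left_only']

print(merged.shape)
print(unmatched['id'].tolist())
```

Inspecting the unmatched rows before saving is a cheap way to catch ID mismatches, such as hyphens that were not converted to underscores.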

6. Fix columns in master file

We now return our attention to the master file. There are many small steps that must be taken to get the file into shape for encoding it for use by the scikit-learn machine learners. Preferably, of course, these steps get combined (once they are developed and tested) into an overall processing pipeline. We will discuss pipelines in later installments. But, for now, in order to understand many of the particulars involved, we will proceed baby step by baby step through these preparations. This does mean, however, that we need to wrap each of these steps into the standard routines of opening and closing files and identifying columns with the active data.

Generally, at the top of each baby step routine, all of which use pandas, we open the master from the prior step and then save the results under a new master name. This enables us to run the routine multiple times as we work out the kinks and to return to prior versions should we later discover processing issues:

import pandas as pd

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv')
out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_1.csv'

df

We can always invoke some of the standard pandas calls to see what the current version of the master file contains:

df.info()
cols = list(df.columns.values)
print(cols)

We can also manipulate, drop, or copy columns or change their display order:

df.drop('CleanDef', axis=1, inplace=True)
df = df[['Unnamed: 0', 'id', 'prefLabel', 'subClassOf', 'count', 'superClassOf', 'SuperType', 'altLabel', 'def_final']]
df.to_csv(out_f)
# Copies contents in 'SuperType' column to a new column location, 'pSuperType'
df.loc[:, 'pSuperType'] = df['SuperType']
df.to_csv(out_f)

We will first focus on the ‘definitions’ column in the master file. In addition to duplicating input files under an alternative output name, we also will move any changes in column data to new columns. This makes it easier to see changes and overwrites and to recover prior data. Again, once the routines are running correctly, it is possible to collapse this overkill into pipelines.

One of our earlier preparatory steps was to ensure that the Cyc hrefs first mentioned in CWPK #48 were removed from certain KBpedia definitions that had been obtained from OpenCyc. As before, we use the amazing BeautifulSoup HTML parser:

# Removal of Cyc hrefs
from bs4 import BeautifulSoup 

cleandef = []
for column in df[['definition']]:
    columnContent = df[column]
    for row in columnContent:
        line = str(row)
        soup = BeautifulSoup(line, 'html.parser')                # name the parser explicitly to avoid the bs4 parser warning
        tags = soup.select("a[href^='http://sw.opencyc.org/']")  
        if tags != []:
            for item in tags:                                    
                item.unwrap()                                    
                item_text = soup.get_text()                      
        else:
            item_text = line
        cleandef.append(item_text)

df['CleanDef'] = cleandef        
df.to_csv(out_f)
print('File written and closed.')

This example also shows us how to loop over a pandas column. The routine finds the matching href, picks out the open and closing tags, and retains the link label while it removes the link HTML.

In reviewing our data, we also observe that we have some string quoting issues and a few instances of ‘smart quotes’ embedded within our definitions:

# Forcing quotes around all definitions
cleandef = []
quote = '"'
for column in df[['CleanDef']]:
    columnContent = df[column]
    for row in columnContent:
        line = str(row)
        if line[0] != quote:
            line = quote + line + quote
            line = line.replace('"""', '"')
#        elif line[-1] != quote:
#            line = line + quote
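        # Lines that already start with a quote fall through unchanged,
        # so every row is appended and the new column matches the row count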
        cleandef.append(line)

df['definition'] = cleandef
df.drop('Unnamed: 0', axis=1, inplace=True)
df.drop('CleanDef', axis=1, inplace=True)
df.to_csv(out_f)
print('File written and closed.')

We also need to ‘normalize’ the definitions text by converting it to lower case, removing punctuation and stop words, and other refinements. There are many useful text processing techniques in the following code block:

# Normalize the definitions
from gensim.parsing.preprocessing import remove_stopwords
from string import punctuation
from gensim import utils
import pandas as pd

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_1.csv')
out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_3.csv'

more_stops = ['b', 'c', 'category', 'com', 'd', 'f', 'formatnum', 'g', 'gave', 'gov', 'h', 
              'htm', 'html', 'http', 'https', 'id', 'isbn', 'j', 'k', 'l', 'loc', 'm', 'n', 
              'need', 'needed', 'org', 'p', 'properties', 'q', 'r', 's', 'took', 'url', 'use', 
              'v', 'w', 'www', 'y', 'z'] 

def is_digit(word):
    try:
        int(word)
        return True
    except ValueError:
        return False
    
tokendef = []
quote = '"'
for column in df[['definition']]:
    columnContent = df[column]
    i = 0
    for row in columnContent:
        line = str(row)
        try:
            # Lowercase the text
            line = line.lower()
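            # Preliminary cleanup first, so hyphens and dashes become spaces
            # before the ASCII punctuation (which includes '-') is stripped
            line = line.replace("‘", "").replace("’", "").replace('-', ' ').replace('–', ' ').replace('↑', '')
            # Remove punctuation ('punctuation' is imported directly from string above)
            line = line.translate(str.maketrans('', '', punctuation))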
            # Remove stopwords            
            line = remove_stopwords(line)
            splitwords = line.split()
            goodwords = [word for word in splitwords if word not in more_stops]
            line = ' '.join(goodwords)
            # Remove number strings (but not alphanumerics)
            new_line = []
            for word in line.split():
                if not is_digit(word):
                    new_line.append(word)
            line = ' '.join(new_line) 
#            print(line) 
        except Exception as e:
            print ('Exception error: ' + str(e))
        tokendef.append(line)  
        i = i + 1
df['tokendef'] = tokendef
df.drop('Unnamed: 0', axis=1, inplace=True)
df.drop('definition', axis=1, inplace=True)
df.to_csv(out_f)
print('Normalization applied to ' + str(i) + ' texts')
print('File written and closed.')

To be consistent with our page processing steps, we also will extract bigrams from the definition text. We had earlier worked out this routine (see CWPK #63) that we generalize here. This also provides the core routines for preprocessing input text for evaluations:

# Phrase extract bigrams, from CWPK #63
from gensim.models.phrases import Phraser, Phrases
import pandas as pd

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_definition_short.csv')
out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_definition_bigram.csv'

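# Tokenize every definition; Phrases expects an iterable of token lists,
# not a flat list of words
sentences = [str(row).split() for row in df['definition']]
common_terms = ['aka']
# Train the phrase model once over all definitions. These are gensim 3.x
# arguments; gensim 4.x uses delimiter='_' and connector_words instead
ngram = Phrases(sentences, min_count=2, threshold=10, max_vocab_size=80000000,
                delimiter=b'_', common_terms=common_terms)
ngram = Phraser(ngram)

definition = []
i = 0
for splitwords in sentences:
    try:
        # Apply the trained model; detected bigrams are joined with '_'
        line = ' '.join(ngram[splitwords])
        # Drop stray possessive 's' tokens without fusing the neighboring words
        line = line.replace(' s ', ' ')
#        print(line)
    except Exception as e:
        print('Exception error: ' + str(e))
        line = ' '.join(splitwords)
    definition.append(line)
    i = i + 1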
df['def_bigram'] = definition
df.drop('Unnamed: 0', axis=1, inplace=True)
df.drop('definition', axis=1, inplace=True)
df.to_csv(out_f)
print('Phrase extraction applied to ' + str(i) + ' texts')
print('File written and closed.')
Phrase extraction applied to 58069 texts
File written and closed.

Some of the other desired changes could be done in the open spreadsheet, so no programmatic code is provided here. We wanted to update the use of underscores in all URI identifiers, retain case, and separate multiple entries by blank space rather than commas or double pipes. We made these changes via simple find-and-replace for the subClassOf, superClassOf, and SuperType columns. (The id column is already a single token, so it was only checked for underscores.)

Unlike the pages and definitions, which are all normalized and lowercased, for the prefLabel and altLabel columns we wanted to remove punctuation and separate entries by spaces but retain case. Again, simple find-and-replace was used here.

Thus, aside from the pages that we still need to merge in vector form (see below), these baby steps complete all of the text preprocessing in our master file. For these columns, we now have the inputs to the later vectorizing routines.

7. ‘Push down’ SuperTypes

We discussed in CWPK #56 how we wanted to evaluate the SuperType assignments for a given KBpedia reference concept (RC). Our main desire is to give each RC its most specific SuperType assignments. Some STs are general, higher-level categories that provide limited or no discriminatory power.

We term the process of narrowing SuperType assignments to the lowest and most specific levels for a given RC a ‘push down’. The process we have defined begins by pruning any mention of a more general category within an RC’s current SuperType listing, unless doing so would leave the RC without an ST assignment. We supplement this approach with a second pass where we iterate one by one over some common STs and remove them unless it would leave the RC without an ST assignment. Here is that code, which we write into a new column so that we do not lose the starting listing:

# Narrow SuperTypes, per #CWPK #56
import pandas as pd

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv')
out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_1.csv'

parent_set = ['SocialSystems','Products','Methodeutic','Eukaryotes','ConceptualSystems',
              'AVInfo','Systems','Places', 'OrganicChemistry','MediativeRelations',
              'LivingThings', 'Information','CopulativeRelations','Artifacts','Agents',
              'TimeTypes','Symbolic','SpaceTypes','RepresentationTypes', 'RelationTypes',
              'LocationPlace', 'OrganicMatter','NaturalMatter', 'AttributeTypes','Predications',
              'Manifestations', 'Constituents', 'AdjunctualAttributes', 'IntrinsicAttributes',
              'ContextualAttributes', 'DirectRelations', 'Concepts', 'KnowledgeDomains', 'Shapes',
              'SituationTypes', 'Forms', 'Associatives', 'Denotatives', 'TopicsCategories',
              'Indexes', 'ActionTypes', 'AreaRegion']

second_pass = ['KnowledgeDomains', 'SituationTypes', 'Forms', 'Concepts', 'ActionTypes', 
               'AreaRegion', 'Artifacts', 'EventTypes']

clean_st = []
quote = '"'
for column in df[['SuperType']]:
    columnContent = df[column]
    i = 0
    for row in columnContent:
        line = str(row)
        try:
            line = line.replace(', ', ' ')
            splitclass = line.split()
            # Remove duplicates because dict only has uniques
            splitclass = list(dict.fromkeys(splitclass))
            line = ' '.join(splitclass)
            goodclass = [word for word in splitclass if word not in parent_set]
            test_line = ' '.join(goodclass)
            if test_line == '':
                clean_st.append(line)
            else:
                line = test_line
                clean_st.append(line)
        except Exception as e:
            print ('Exception error: ' + str(e))
        i = i + 1
print('First pass count: ' + str(i))

# Second pass
print('Starting second pass . . . .')
clean2_st = []
i = 0
length = len(clean_st)
print('Length of clean_st: ', length)
ln = len(second_pass)
for row in clean_st:
    line = str(row)
    try_line = line
    for item in second_pass:
        word = str(item)
        try_line = str(try_line)
        try_line = try_line.strip()
        try_line = try_line.replace(word, '')
        try_line = try_line.strip()
        try_line = try_line.replace('  ', ' ')
        char = len(try_line)
        if char < 6:
            try_line = line
            line = line
        else:
            line = try_line
    clean2_st.append(line)                    
    print('line: ' + str(i) + ' ' + line) 
    i = i + 1

df['clean_ST'] = clean2_st
df.drop('Unnamed: 0', axis=1, inplace=True)
df = df[['id', 'prefLabel', 'subClassOf', 'count', 'superClassOf', 'SuperType', 'clean_ST', 'altLabel', 'def_final']]
df.to_csv(out_f, encoding='utf-8')
print('ST reduction applied to ' + str(i) + ' texts')
print('File written and closed.')

Depending on the number of print statements one might include, the listing above can produce a very long listing!
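
The two-pass logic can also be sketched on whole tokens rather than substrings, which avoids accidentally trimming an ST name that merely contains another. This is a minimal illustration, not the routine used above; the function name push_down and the sample ST strings are assumptions for the example.

```python
def push_down(st_string, general, second_pass):
    """Narrow a SuperType string to its most specific members (hypothetical helper)."""
    # Split on commas or spaces and drop duplicates while keeping order
    tokens = list(dict.fromkeys(st_string.replace(',', ' ').split()))
    # First pass: drop general STs unless that would leave nothing
    specific = [t for t in tokens if t not in general]
    if specific:
        tokens = specific
    # Second pass: drop common STs one at a time, again never emptying the list
    for st in second_pass:
        remaining = [t for t in tokens if t != st]
        if remaining:
            tokens = remaining
    return ' '.join(tokens)


general = {'Eukaryotes', 'LivingThings', 'Products'}
second = ['Artifacts']

print(push_down('Animals, Eukaryotes, LivingThings', general, second))
print(push_down('Artifacts Products', general, second))
```

Because each test is on a whole token, the length threshold used in the second pass above is not needed here; an empty result is detected directly.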

8. Final text revisions

We have a few minor changes to attend to prior to the numeric encoding of our data. The first revision is based on the fact that a minor portion of both altLabels and definitions are much longer than the others. We analyzed min, max, mean and median for these two text sets. We roughly doubled the size of the rough mean and median for each set, and trimmed the strings to a maximum length of 150 and 300 characters, respectively. We employed the textwrap package to make the splits at word boundaries:

import pandas as pd
import textwrap

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv')
out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_1.csv'

texts = df['altLabel']
new_texts = []
for line in texts:
    line = str(line)
    length = len(line)
    if length == 0:
        new_line = line
    else:
        new_line = textwrap.shorten(line, width=150, placeholder='')
    new_texts.append(new_line)
    
df['altLabel'] = new_texts

df.drop('Unnamed: 0', axis=1, inplace=True)
#df.drop('def_final', axis=1, inplace=True)
df.to_csv(out_f, encoding='utf-8')
print('Definitions shortened.')
print('File written and closed.')

We repeated this routine for both text sets and also did some minor punctuation clean up for the altLabels. We have now completed all text modifications and have a clean text master. We name this file kbpedia_master_text.csv and keep it for archive.
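
The length analysis that drives the cutoff can be sketched with the pandas string methods. This is a toy illustration on three made-up labels, assuming a cutoff of roughly twice the median; the actual cutoffs used above (150 and 300 characters) came from the full data.

```python
import textwrap
import pandas as pd

# Toy stand-in for the altLabel column
labels = pd.Series(['dog',
                    'domestic dog canine',
                    'a very long list of alternative labels for one concept'])

# Character length of each entry, then min, max, mean and median in one call
lengths = labels.str.len()
print(lengths.describe())

# Assumed rule of thumb for this sketch: roughly double the median
cutoff = int(2 * lengths.median())

# textwrap.shorten truncates at a word boundary and never exceeds the width
trimmed = labels.apply(lambda s: textwrap.shorten(s, width=cutoff, placeholder=''))
print(trimmed.tolist())
```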

Prepare Numeric Encodings

We now shift our gears from prepping and staging the text to encoding all information to a proper numeric form given its content.

9. Plan the numeric encoding

Machine learning models accept only numeric inputs. Here is one place where sklearn really shines with its utility functions.

The manner by which we encode fields needs to be geared to the information content we can and want to convey, so context and scale (or its reciprocal, reduction) play prominent roles in helping decide what form of encoding works best. Text and language encodings, which are among the most challenging, can range from naive unique numeric identifiers to adjacency or transformed or learned representations for given contexts, including relations, categories, sentences, paragraphs or documents. A finite set of ‘categories’, or other fields with a discrete number of targets, can be the targets of learning representations that encompass a range of members.

A sentence or document or image, for example, is often reduced to a fixed number of dimensions, sometimes ranging into the hundreds, that are represented by arrays of numbers. Initial encoding representations may be trained against a desired labeled set to adjust or transform those arrays to come into correspondence with their intended targets. Items with many dimensions can occupy sparse matrices where most or many values are zero (non-presence). To reduce the size of large arrays we may also undergo further compression or dimension reduction through techniques like principal component analysis, or PCA. Many other clustering or dimension reduction techniques exist.

The art or skill in machine learning often resides at the interface between raw input data and how it is transformed into these vectors recognizable by the computer. There are some automated ways to evaluate options and to help make parameter and model choice decisions, such as grid search. sklearn, again, has many useful methods and tools in these general areas. I am drinking from a fire hose, and you will too if you poke much in this area.

A general problem area, then, is often characterized by data that is heterogeneous in terms of what it captures, corresponding, if you will, to the different fields or columns one might find in a spreadsheet. We may have some numeric values that need to be normalized, some text information ranging from single category terms or identifiers to definitions or complete texts, or complex arrays that are themselves a transformation of some underlying information. At minimum, we can say that multiple techniques may be necessary for encoding multiple different input data columns.

In terms of big data, pandas is the rough analog to the spreadsheet. A nice tutorial provides some guidance on working jointly with pandas and sklearn. One early alternative to capture this need to apply different transformations to different input data columns was the independent sklearn-pandas. I have looked closely at this option and I like its syntax and approach, but scikit-learn eventually adopted its own ColumnTransformer methods, and they have become the standard and more actively developed option.

ColumnTransformer is a sklearn estimator for applying different transformers to individual columns of a pandas data set, which suits heterogeneous data, and it can be combined into pipelines. The design is well suited to mixed data types and to transforming multiple columns based on pandas inputs, and sklearn can also render the resulting pipelines as HTML. There is much to learn in this area, but perhaps start with this pretty good overview and then how these techniques might be combined into pipelines with custom transformers.

We will provide some ColumnTransformer examples, but I will also try to explain each baby step. I’ll talk more about individual techniques, but here are the encoding options we have identified to transform our KBpedia master file:

Encoding Type | Applicable Fields
one-hot | clean_ST (also target)
count vector | id, prefLabel, subClassOf, superClassOf
tfidf | altLabel, definition
doc2vec | page

We will cover these in the baby steps to follow.

10. Category (one-hot) encode

Category encoding takes a column listing of strings (generally one column, though there may be several) and converts each category string to a unique number. sklearn has a class called LabelEncoder for this purpose. Since a single column of unique numbers might imply order or hierarchy to some learners, this approach may be followed by a OneHotEncoder, where each category is given its own column with a binary match (1) or not (0) assigned to each column depending on what categories the row has. Depending on the number of categories, this column array can grow to quite a large size and pose memory issues. The category approach is definitely appropriate for a score of items at the low end, and perhaps up to 500 at the upper end. Since our active SuperType categories number about 70, we first explore this option.
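To make the two encoders concrete, here is a small sketch on made-up category labels. The labels are illustrative only. Note that the scikit-learn documentation describes LabelEncoder as intended for target values rather than input features.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labels = np.array(['Plants', 'Animals', 'Products', 'Animals'])

# LabelEncoder: one integer per category, assigned in sorted order
le = LabelEncoder()
codes = le.fit_transform(labels)
print(le.classes_)
print(codes)

# OneHotEncoder: one binary column per category; it expects 2-D input
ohe = OneHotEncoder()
onehot = ohe.fit_transform(labels.reshape(-1, 1)).toarray()
print(onehot)
```

The integer codes are compact but carry an implied ordering, while the one-hot array has exactly one 1 per row and no ordering.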

scikit-learn has been actively developed of late, and this initial approach has been updated with an improved OneHotEncoder that works directly with strings paired with the ColumnTransformer estimator. If you research category encoding online you might be confused about these earlier and later descriptions. Know that the same general approach applies here of assigning an array of SuperType categories to each row entry in our master data.

However, as we saw before for our cleaned STs, some rows (reference concepts, or RCs) may have more than one category assigned. Though it is possible to run ColumnTransformer over multiple columns at once, sklearn produces a new output column for each input column. I was not able to find a way to convert multiple ST strings in a single column to their matching category values. I am pretty sure there is a way for experts to figure out this problem, but I was not able to do so.

Fortunately, in the process of investigating these matters I encountered the pandas function of get_dummies that does one-hot encoding directly. More fortunately still there is also a pandas string function that allows multiple values to be split into individual ones, that can be applied as str.get_dummies. Further, with a bit of other Python magic, we can take the synthetic headers derived from the unique SuperType classes and give them a st_ prefix and combine them (concatenate) into a new resulting pandas dataframe. The net result is a very slick and short routine for category encoding our clean SuperTypes:

import pandas as pd

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv')
out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_cleanst.csv'

df_2 = pd.concat([df, df['clean_ST'].str.get_dummies(sep=', ').rename(lambda x: 'st_' + x, axis='columns')], axis=1)

df_2.drop('Unnamed: 0', axis=1, inplace=True)

df_2.to_csv(out_f, encoding='utf-8')
print('Categories one-hot encoded.')
print('File written and closed.')
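To see what the routine above does on a small scale, this sketch runs the same `str.get_dummies` call on three made-up rows, one of which carries two SuperTypes. The ids and SuperType names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'id': ['Mammal', 'Oak', 'Zoo'],
    'clean_ST': ['Animals', 'Plants', 'Animals, Facilities'],
})

# Split each cell on ', ' and create one 0/1 column per distinct value
dummies = toy['clean_ST'].str.get_dummies(sep=', ')
# Prefix the generated headers so they are recognizable in the master file
dummies = dummies.rename(lambda x: 'st_' + x, axis='columns')

toy_2 = pd.concat([toy, dummies], axis=1)
print(toy_2)
```

The 'Zoo' row ends up with a 1 in both `st_Animals` and `st_Facilities`, which is the multi-category behavior we were after.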

The result of this routine can be seen in its get_dummies dataframe:

import pandas as pd

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_1.csv')

df.info()

df

We will work in the idea of sklearn pipelines at the end of this installment and the beginning of the next, which argues for keeping everything within this environment for grid search and other preprocessing and consistency tasks. But, we are also looking for the simplest methods and will also be using master files to drive a host of subsequent tasks. In this regard, we are not solely focused on keeping the analysis pipeline totally within scikit-learn, but establishing robust starting points for a diversity of machine learning platforms. In this regard, the use of the get_dummies approach may make some sense.

11. Fill in missing values and

12. Other categorical text (vocabulary) encode

scikit-learn machine learners generally do not work on incomplete datasets, that is, ones with missing values. If one column has x number of items, other comparative columns need to have the same. One can pare down the training sets to the lowest common number, or one may provide estimates or ‘fill-ins’ for the open items. Helpfully, sklearn has a number of useful utilities including imputation for filling in missing values.

Since, in the case of our KBpedia master, all missing values relate to text (especially for the altLabels category), we can simply assign a blank space as our fill in using standard pandas utilities. However, one can also do replacements, conditional fill-ins, averages and the like depending on circumstances. If you have missing data, you should consult the package documentation. In our code example below (see note (1)), however, we limit ourselves to the filling in with blank spaces.
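A minimal sketch of the two kinds of fill-in mentioned here, using standard pandas. The `score` column is a hypothetical numeric field added only to show an average-based fill; our master file needs only the text case.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'altLabel': ['big cat', np.nan, 'tree'],
    'score': [1.0, np.nan, 3.0],              # hypothetical numeric column
})

# Text column: replace missing entries with a single blank space
toy['altLabel'] = toy['altLabel'].fillna(' ')

# Numeric column: replace missing entries with the column mean
toy['score'] = toy['score'].fillna(toy['score'].mean())

print(toy)
```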

There are a number of preprocessing encoding methods in sklearn useful to text, including CountVectorizer, HashingVectorizer, TfidfTransformer, and TfidfVectorizer. The CountVectorizer creates a matrix of text tokens and their counts. The HashingVectorizer uses the hash trick to generate unique numbers for text tokens with the least amount of memory but no ability to later look up the contributing tokens. The TfidfVectorizer calculates both term frequency and inverse document frequency, while the TfidfTransformer first requires the CountVectorizer to calculate term frequency.
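The relationship between these methods can be checked directly. In this sketch on three toy documents, running CountVectorizer followed by TfidfTransformer gives the same matrix as running TfidfVectorizer alone, assuming default parameters for all three.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ['red fox', 'red red wolf', 'grey wolf']

# Two steps: raw token counts first, then TF/IDF weighting of those counts
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both internally
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(counts.shape, same)
```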

The fields we need to encode include the prefLabel, the id, the subClassOf (parents), and the superClassOf (children) entries. The latter three all are based on the same listing of 58 K KBpedia reference concepts (RCs). The prefLabel has major overlap with these RCs, but the terms are not concatenated and some synonyms and other qualifiers appear in the list. Nonetheless, because there is a great deal of overlap in all four of these fields, it appears best that we use a combined vocabulary across all four fields.

The term frequency/inverse document frequency (TF/IDF) method is a proven statistical way to indicate the importance of a term in relation to an entire corpus. Though we are dealing with a limited vocabulary for our RCs, some RCs are more often in relationships with other RCs and some RCs are more frequently used than others. While the other methods would give us our needed numerical encoding, we first pick TF/IDF to test because it appears to retain the most useful information in its encoding.

After taking care of missing items, we want our coding routine, then, to construct a common vocabulary across our four subject text fields. This common vocabulary and its frequency counts is what we will use to calculate the document frequencies across all four columns. For this purpose we will use the CountVectorizer method. Then, for each of the four individual columns that comprise this vocabulary we will use the TfidfTransformer method to get the term frequencies for each entry and to construct its overall TF/IDF scores. We will need to construct this routine using the ColumnTransformer method described above.

There are many choices and nuances to learn with this approach. While doing so, I came to realize some significant drawbacks. The number of unique tokens across all four columns is about 80,000. When parsed against the 58 K RCs it results in a huge matrix, but a very sparse one. There is an average of about four tokens across all four columns for each RC, and rarely does an RC have more than seven. This unusual profile, though, does make sense since we are dealing with a controlled vocabulary of 58 K tokens for three of the columns, and close matches and synonyms with some type designators for the prefLabel field. So, anytime our routines needed to access information in one of these four columns, we would incur a very large memory penalty. Since memory is a limiting factor in all cases, but especially so with my office-grade equipment, this is a heavy anchor to be dragging into the picture.
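A back-of-the-envelope calculation shows just how sparse this matrix is. The figures are the rounded ones from the paragraph above, so treat the result as an order-of-magnitude estimate only.

```python
n_rcs = 58_000            # reference concepts (rows), rounded
n_tokens = 80_000         # unique tokens (columns), rounded
avg_tokens_per_rc = 4     # typical non-zero cells per row

dense_cells = n_rcs * n_tokens                 # cells in the full matrix
nonzero_cells = n_rcs * avg_tokens_per_rc      # cells that hold a value
density = nonzero_cells / dense_cells

print(f'{dense_cells:,} cells in total')
print(f'{nonzero_cells:,} non-zero cells')
print(f'density: {density:.4%}')
```

In other words, only a tiny fraction of one percent of the cells carry any information.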

Our first need to create a shared vocabulary across all four input columns brought the first insight to bypass this large matrix conundrum. The CountVectorizer produces a tuple listing index and unique token ID. The token ID can be linked back to its text key, and the sequence of the token IDs is in text alphabetical order. Rather than needing to capture a large sparse matrix, we only need to record the IDs for the few matching terms. What is really cool about this structure is that we can reduce our memory demand by more than 14,000 times while being fully lossless with lookup to the actual text. Further, this vocabulary structure can be saved separately and incorporated in other learners and utilities.
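Here is a small sketch of this compact representation on toy documents: each row keeps only the IDs of its matching tokens, and an inverted copy of the vocabulary recovers the text. The variable names `id_to_token` and `compact` are my own for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['mammal animal', 'oak tree plant', 'animal']
cv = CountVectorizer().fit(docs)

# vocabulary_ maps token text -> token ID; invert it for lookups
id_to_token = {idx: tok for tok, idx in cv.vocabulary_.items()}

matrix = cv.transform(docs)                    # scipy sparse matrix

# Keep only the column indices (token IDs) of the non-zero cells per row
compact = [sorted(int(i) for i in matrix[r].indices)
           for r in range(matrix.shape[0])]

print(compact)
print([[id_to_token[i] for i in row] for row in compact])
```

The `compact` list is all we need to store per RC, and the lookup back to the original tokens remains lossless.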

So, our initial routine lays out how we can combine the contents of our four columns, which then becomes the basis for fitting the TfidfVectorizer. (The dictionary creation part of this routine is based on the TfidfVectorizer, so either may be used.) Let’s first look at this structure derived from our KBpedia master file:

# Concatenation code adapted from https://github.com/scikit-learn/scikit-learn/issues/16148

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv')

df['altLabel'] = df['altLabel'].fillna(' ')              # Note 1
df['superClassOf'] = df['superClassOf'].fillna(' ')

def concat_cols(df):
    assert len(df.columns) >= 2
    res = df.iloc[:, 0].astype('str')
    for key in df.columns[1:]:
       res = res + ' ' + df[key]
    return res 

tf = TfidfVectorizer().fit(concat_cols(df[['id', 'prefLabel', 'subClassOf', 'superClassOf']]))
print(tf.vocabulary_)

This gives us a vocabulary and then a lookup basis for linking our individual columns in this group of four. Note that some of these terms are quite long, since they are the product of an already concatenated identifier creation. That also makes them very strong signals for possible text matches.

Nearly all of these types of machine learners first require the data to be ‘fit‘ and then ‘transformed‘. Fit means to make a new numeric representation for the input data including any format or validation checks. Transform means to convert that new representation to a form most useful to the learner at hand, such as a floating decimal for a frequency value, that may also be the result of some conversion or algorithmic changes. Each machine learner has its own approaches to fit and transform, and parameters that may be set when these functions are called may tweak methods further. These approaches may be combined together into a slightly more efficient ‘fit-transform‘ step, which is the approach we take in this example:

from sklearn.compose import make_column_transformer
token_gen = make_column_transformer(
    (TfidfVectorizer(vocabulary=tf.vocabulary_),'prefLabel')
)
tfidf_array = token_gen.fit_transform(df)
print(tfidf_array)

So, we surmise, then, that we can loop over these result sets for each of the four columns, loop over matching token IDs for each unique ID, and then write out a simple array for each RC entry. In the case of id there will be a single matching token ID. For the other three columns, there are at most a few entries. This promises to provide a very efficient encoding that we can also tie into external routines as appropriate.

Alas, like much else that appears simple on the face of it, what one sees when printing this output is not the data form presented when saving it to file. For example, if one does a type(tfidf_array) we see that the object is actually a pretty complicated data structure, a scipy.sparse.csr_matrix. (scikit-learn is itself based on SciPy, which is itself based on NumPy.) We get hints we might be working with a different data structure when we see print statements that produce truncated listings in terms of rows and columns. We cannot do our typical tricks on this structure, like converting it to a string or list, prior to standard string processing. What we first need to do is to get it into a manipulable form, such as a pandas CSV form. We need to do this for each of the four columns separately:

# Adapted from https://www.biostars.org/p/452028/

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import make_column_transformer
import numpy as np
from scipy.sparse import csr_matrix

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master.csv')
out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_preflabel.csv'
# out_file = open(out_f, 'w', encoding='utf8')   # Not needed: to_csv() below opens and closes the file itself

#cols = ['id', 'prefLabel', 'subClassOf', 'superClassOf']
cols = ['prefLabel']

df['altLabel'] = df['altLabel'].fillna(' ')              
df['superClassOf'] = df['superClassOf'].fillna(' ')

def concat_cols(df):
    assert len(df.columns) >= 2
    res = df.iloc[:, 0].astype('str')
    for key in df.columns[1:]:
       res = res + ' ' + df[key]
    return res 

tf = TfidfVectorizer().fit(concat_cols(df[['id', 'prefLabel', 'subClassOf', 'superClassOf']]))

for c in cols:                                          # Using 'prefLabel' as our example
    c_label = str(c)
    print(c_label)
    token_gen = make_column_transformer(
         (TfidfVectorizer(vocabulary=tf.vocabulary_),c_label)
         )
    tokens = token_gen.fit_transform(df)
    print(tokens)
df_2 = pd.DataFrame(data=tokens)
df_2.to_csv(out_f, encoding='utf-8')
print('File written and closed.')

Here is a sample of what such a file output looks like:

0,"  (0, 66038)	0.6551037573323365
  (0, 42860)	0.7555389249595651"
1,"  (0, 75910)	1.0"
2,"  (0, 50502)	1.0"
3,"  (0, 55394)	0.7704414465093152
  (0, 53041)	0.637510766576247"
4,"  (0, 75414)	0.4035084561691855
  (0, 35644)	0.5053985169471683
  (0, 13178)	0.47182153985726
  (0, 9754)	0.5992809853435092"
5,"  (0, 50446)	0.7232652656552668
  (0, 5844)	0.6905703117689149"
6,"  (0, 41964)	0.5266799956122452
  (0, 12319)	0.8500636342191596"
7,"  (0, 67750)	0.7206261499559882
  (0, 47278)	0.45256791509595035
  (0, 27998)	0.5252430239663488"

Inspection of the scipy.sparse.csr_matrix files shows that the frequency values are separated from the index and key by a tab separator, with sometimes multiples of values. This form can certainly be processed with Python, but we can also open the files as tab-delimited in a local spreadsheet, and then delete the frequency column to get a much simpler form to wrangle. Since this only takes a few minutes, we take this path.
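For those who would rather stay in Python, here is one possible alternative to the spreadsheet step, sketched on toy documents. It converts the sparse matrix to COO form, which exposes the row index, token ID and frequency as three parallel arrays. This is not the path taken in this installment, and the names `long_df` and `ids_per_row` are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['mammal animal', 'oak tree plant', 'animal']
tokens = TfidfVectorizer().fit_transform(docs)     # scipy sparse matrix

coo = tokens.tocoo()                               # row, col and data arrays
long_df = pd.DataFrame({'row': coo.row, 'token': coo.col, 'tfidf': coo.data})

# One list of token IDs per row, dropping the frequency values
ids_per_row = long_df.groupby('row')['token'].apply(sorted).tolist()
print(ids_per_row)
```

The `long_df` frame can be written with `to_csv` as an ordinary three-column file, which avoids the embedded tabs and line breaks seen in the sample output.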

This is the basis, then, that we need to clean up for our “simple” vectors, reflected in this generic routine, and we slightly change our file names to account for the difference:

import pandas as pd
import csv
import re                                 # If we want to use regex; we don't here

in_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_superclassof.csv'
out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_superclassof_ok.csv'

#out_file = open(out_f, 'w', encoding='utf8')

cols = ['id', 'token']

with open(in_f, 'r', encoding = 'utf8') as infile, open(out_f, 'w', encoding='utf8', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, delimiter=' ', quoting=csv.QUOTE_NONE, escapechar='\\')
    new_row = []
    for row in reader:
        row = str(row)
        row = row.replace('"', '')
        row = row.replace(",  (0, ",  "','")       
        row = row.replace("'  (0', ' ", "@'")      
        row = row.replace(")", "")
        row = str(row)[1 : -1]                    # A nifty for removing square brackets around a list
        new_row.append(row)                       # Transition here because we need to find-replace across rows
    new_row = str(new_row)
    new_row = new_row.replace('"', '')            # Notice how bracketing quotes change depending on target text
    new_row = new_row.replace("', @'", ",")       # Notice how bracketing quotes change depending on target text
    new_row = new_row.replace("', '", "'\n'")
    new_row = new_row.replace("''\n", "")
    print(new_row)                                                           
    writer.writerow([new_row])
print('Matrix conversion complete.')

The basic approach is to baby step a change, review the output, and plan the next substitution. It is a hack, but it is also pretty remarkable to be able to bridge such disparate data structures. The benefits from my attempts with Python are now really paying off. I’ll always be interested in figuring out more efficient ways. But the entry point to doing so is getting real stuff done.
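One such possibly more compact route is a regular expression. This sketch pulls the token IDs from a single cell in the saved format, using two lines copied from the sample output shown earlier. It illustrates the pattern only and is not a drop-in replacement for the routine above.

```python
import re

# One cell in the saved format: "(row, token ID)<tab>weight" on each line
cell = "  (0, 75414)\t0.4035084561691855\n  (0, 35644)\t0.5053985169471683"

# Capture the token ID inside each "(row, N)" pair and ignore the weight
token_ids = [int(n) for n in re.findall(r'\(\d+,\s*(\d+)\)', cell)]
print(token_ids)   # → [75414, 35644]
```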

Since our major focus here is on data wrangling, known as feature engineering in a machine learning context, it is not enough to write these routines and cursorily test outputs. Though not needed for every intermediate step, after sufficiently accumulating changes and interim files, it is advisable to visually inspect your current work product. Automated routines can easily mask edge cases that are wrong. Since we are nearly at the point of committing to the file vector representations our learners will work from, this marks a good time to manually confirm results.

In inspecting the kbpedia_master_preflabel_ok.csv example, I found about 25 of the 58 K reference concepts with either a format or representation problem. That inspection by scrolling through the entire listing looking for visual triggers took about thirty to forty minutes. Granted, those errors are less than 0.05% of our population, but they are errors nonetheless. The inspection of the first 5 instances in comparison to the master file (using the common id) took another 10 minutes or so. That led me to the hypothesis that periods (‘.’) caused labels to be skipped and other issues related to individual character or symbol use. The actual inspection and correction of these items took perhaps another thirty to forty minutes; about 40 percent were due to periods, the others to specific symbols or characters.

The id file checked out well. That took just a few minutes to fast scroll through the listing looking for visual flags. Another ten to fifteen minutes showed the subClassOf to check out as well. This file took a bit longer because every reference concept has at least one parent.

However, when checking the superClassOf file I turned up more than 100 errors. More than the other files, it took about forty minutes to check and record the errors. I feared checking these instances to resolution would take tremendous time, but as I began inspecting my initial sample all were returning as extremely long fields. Indeed, the problem that I had been seeing that caused me to flag the entry in the initial review was a colon (‘:’) in the listing, a conventional indicator in Python for a truncated field or listing. The introduced colon was the apparent cause of the problem in all concepts I sampled. I determined the likely maximum length of SuperClass entries to be about 240 characters. Fortunately, we already have a script in Step 8 (Final text revisions) to shorten fields. We clearly overlooked those 100 instances where SuperClass entries are too long. We add this filter at Step 8 and then cycle through all of the routines from that point forward. It took about an hour to repeat all steps forward from there. This case validates why committing to scripts is sound practice.

13. An alternate TF/IDF encode

Our earlier experience with TF-IDF and CountVectorizer was not the easiest to set up. I wanted to see if there were perhaps better or easier ways to conduct a TF-IDF analysis. We still had the definition and altLabel columns to encode.

In my initial diligence I had come across a wrapper around many interesting text functions. It is called Texthero and it provides a consistent interface over NLTK, spaCy, Gensim, TextBlob and sklearn. It provides common text wrangling utilities, NLP tools, vector representations, and results visualization. If your problem area falls within the scope of this tool, you will be rewarded with a very straightforward interface. However, if you need to tweak parameters or modify what comes out of the box, Texthero is likely not the tool for you.

Since TF-IDF is one of its built-in capabilities, we can show the simple approach available:

import pandas as pd
import texthero as hero

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_definition_bigram.csv')
out_f = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_definition.csv'

#df['altLabel'] = df['altLabel'].fillna(' ') 
texts = df['def_bigram']
 
df['def_tfidf'] = hero.tfidf(texts, min_df=1, max_features=1000)

df.drop('Unnamed: 0', axis=1, inplace=True)
df = df[['id', 'def_tfidf']]
df.to_csv(out_f, encoding='utf-8')
print('TF/IDF calculations completed.')
print('File written and closed.')
TF/IDF calculations completed.
File written and closed.

We apply this code to both the altLabel and definition fields with different parameters: (min_df=1, 500 features) and (min_df=3, 1000 features) for these fields, respectively. Since these routines produce large files that cannot be easily viewed, we write them out with only the frequencies and the id field as the mapping key.
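To show what these two parameters do, here is a sketch using scikit-learn's own TfidfVectorizer, which takes parameters of the same names. I am assuming Texthero treats them the same way. The toy documents and the small parameter values are chosen only so the effect is visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['red fox den', 'red wolf den', 'grey wolf red', 'red fox den']

# min_df=2 drops tokens found in fewer than two documents ('grey');
# max_features=2 then keeps the two most frequent of the tokens left
vec = TfidfVectorizer(min_df=2, max_features=2)
matrix = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))
print(matrix.shape)
```

Raising min_df trims rare tokens, while max_features caps the width of the resulting vectors, which is why the two fields get different settings.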

14. Prepare doc2vec vectors from pages for master

We discussed word and document embedding in the previous installment. We’re attempting to capture a vector representation of the Wikipedia page descriptions for about 31 K of the 58 K reference concepts (RCs) in current KBpedia. We found doc2vec to be a promising approach. Our prior installment had us representing our documents with about 1000 features. We will retain these earlier settings and build on last results.

We have earlier built our model, so we need to load it, read in our source files, and calculate our doc2vec values per input line, that is, per RC with a Wikipedia article. To also output strings readable by the next step in the pipeline, we also need to do some formatting changes, including find and replaces for line feeds and carriage returns. As we process each line, we append it to an array (which becomes a list in Python) that will update our initial records with the new calculated vectors into a new column (‘doc2vec’):

from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument
import pandas as pd

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\kbpedia-pages.csv'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\kbpedia-d2v.csv'

df = pd.read_csv(in_f, header=0, usecols=['id', 'prefLabel','doc2vec']) 

src = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-d2v.model'
model = Doc2Vec.load(src)

doc2vec = []

# For each row in df.id
for i in df['id']:
    array = model.docvecs[i]
    array = str(array)
    array = array.replace('\r', '')
    array = array.replace('\n', '')
    array = array.replace('  ', ' ')
    array = array.replace('  ', ' ')
    count = i + 1
    doc2vec.append(array)

df['doc2vec'] = doc2vec

df.to_csv(out_f)
print('Processing done with ', count, 'records')
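The find-and-replace steps are needed because converting a NumPy array with `str()` wraps long arrays across several lines. As a hedged alternative, the values can be joined explicitly into one line, as in this sketch. The random-free stand-in vector below is not a real doc2vec result.

```python
import numpy as np

# Stand-in for a doc2vec vector (hypothetical values)
vec = np.linspace(-1.0, 1.0, 50, dtype=np.float32)

# str() inserts line breaks (and brackets) once the array exceeds the line width
wrapped = str(vec)

# Joining the values explicitly yields a single line with no brackets
one_line = ' '.join(f'{v:.6f}' for v in vec)

# Reading the line back restores the values to the written precision
restored = np.array([float(x) for x in one_line.split()], dtype=np.float32)
print(len(restored))
```

A single-line, bracket-free string is also easier for the next step in the pipeline to parse than the default array printout.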

One of the things we notice as we process this information is that I have been inconsistent in the use of ‘id’, especially since it has emerged to be the primary lookup key. Looking over the efforts to date I see that sometimes ‘id’ is the numeric sequence ID, sometimes it is the unique URI fragment (the standard within KBpedia), and sometimes it is a Wikipedia URI fragment with underscores for spaces. The latter is the basis for the existing doc2vec files.

Conformance with the original sense of ‘id’ means to use the URI fragment that uniquely identifies each reference concept (RC) in KBpedia. This is the common sense I want to enforce. It has to be unique within KBpedia, and therefore is a key that any contributing file should reference in order to bring its information into the system. I need to clean up the sloppiness.

This need for consistency forces me to use the merge steps noted under Step 5 above to map the canonical ‘id’ (the KBpedia URI fragment) to the Wikipedia IDs used in our mappings. Again, scripts are our friend and we are able to bring this pages file into compliance without problems.

After this replacing, we have now completed the representation of our input information into machine learner form. It is time for us to create the vector lookup file.

Consolidate a New Vector Master File

OK, we now have successfully converted all categorical and text and non-numeric information into numeric forms readable by our learners. We need to consolidate this information since it will be the starting basis for our next machine learning efforts.

In a production setting, all of these individual steps and scripts would be better organized into formal programs and routines, best embedded within the pipeline construct of your choice (though having diverse scopes, gensim, sklearn, and spaCy all have pipeline capabilities).

To best stage our machine learning tests to come, I decide to create a parallel master file, only this one using vectors rather than text or categories as its column contents. We want the column structure to roughly parallel the text structure, and we also decide to keep the page information separate (but readily incorporable) to keep general file sizes manageable. Again, we use id as our master lookup key, specifically referring to the unique URI fragment for each KBpedia RC.

15. Merge page vectors into master

Under Step 5 above we identified and wrote a code block for merging information from two different files based on the lookup concurrence of keys between the source and target. Care is appropriately required that key matches and cardinality be respected. It is important to be sure the matching keys from different files have the right reference and format to indeed match.

Since we like the idea of a master ‘accumulating’ file to which all contributed files map, we use the left join method of the inner merge we first described under Step 5, continually using the master ‘accumulating’ file as our left join basis. We name the vector version of our master file kbpedia_master_vec.csv, and we proceed to merge all prior files with vectors into it. (We will not immediately merge the doc2vec pages file since it is quite large; we will only do this merge as needed for the later analysis.)

As we’ve noted before, we want to proceed in column order from left to right based on our column order in the earlier master. Here is the general routine with some commented out lines used on occasion to clean up the columns to be kept or their ordering:

import pandas as pd

file_a = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_vec.csv'
file_b = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_cleanst.csv'
file_out = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_vec1.csv'

df_a = pd.read_csv(file_a)

df_b = pd.read_csv(file_b, engine='python', encoding='utf-8', on_bad_lines='skip')   # pandas >= 1.3; older versions used error_bad_lines=False

merged_inner = pd.merge(left=df_a, right=df_b, how='outer', left_on='id_x', right_on='id')
# Since both sources share the 'id' column name, we could skip the 'left_on' and 'right_on' parameters

# What's the size of the output data?
merged_inner.shape
merged_inner

merged_inner.to_csv(file_out)
print('Merge complete.')
Merge complete.
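As a sketch of how pandas itself can help guard the merge, the toy example below uses a left join on a shared `id` column with two optional parameters of `pd.merge`: `validate`, which raises an error if keys are duplicated, and `indicator`, which records where each row came from. The frames and column values are made up.

```python
import pandas as pd

master = pd.DataFrame({'id': ['Mammal', 'Oak', 'Zoo'], 'pref_token': [11, 22, 33]})
extra = pd.DataFrame({'id': ['Mammal', 'Oak'], 'st_Animals': [1, 0]})

merged = pd.merge(left=master, right=extra, how='left', on='id',
                  validate='one_to_one',    # error out on duplicate keys
                  indicator=True)           # adds a '_merge' provenance column

print(merged)
print(len(master), len(merged))             # row counts should match
```

Rows marked `left_only` in the `_merge` column are master records that found no match in the contributing file, which makes them easy to pull out for inspection.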

After each merge, we remove the extraneous columns by writing to file the columns we want to keep. We can also directly drop columns and do other activities such as rename. We may also need to change the datatype of a column because of default parameters in the pandas routines.

The code block below shows some of these actions, with others commented out. The basic workflow, then, is to invoke the next file, make sure our merge conditions are appropriate to the two files undergoing a merge, save to file with a different file name, inspect those results, and make further changes until the merge is clean. Then, we move on to the next item requiring incorporation and rinse and repeat.

import pandas as pd

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_vec1.csv')
file_out = r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_vec.csv'

df = df[['id_x', 'id_token', 'pref_token', 'sub_token', 'count', 'super_token', 'alt_tfidf', 'def_tfidf']]
#df.rename(columns={'id_x': 'id'})
#df.drop('Unnamed: 0', axis=1, inplace=True)
#df.drop('Unnamed: 0.1', axis=1, inplace=True)
df.info()
df.to_csv(file_out)
print('File written.')
df 

We can also readily inspect the before and after files to make sure we are getting the results we expect:

import pandas as pd

df = pd.read_csv(r'C:/1-PythonProjects/kbpedia/v300/models/inputs/kbpedia_master_vec.csv')

df.info()

df 

It is important to inspect results as this process unfolds and to check each run to make sure the number of records has not grown. If it does grow, that is due to problems of some kind in the input files that keep the merges from being clean, which can add more rows (records) to the merged entity. For example, one problem I found was duplicate reference concepts (RCs) that varied because of differences in capitalization (especially when merging on the basis of the KBpedia URI fragments). That caused me to reach back quite a few steps to correct the input problem. I have also flagged doing a more thorough check for nominal duplicates for the next version release of KBpedia.
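A quick way to surface this kind of capitalization duplicate, sketched here on made-up ids, is to lower-case the key before asking pandas for duplicates:

```python
import pandas as pd

toy = pd.DataFrame({'id': ['BaseballBat', 'Oak', 'Baseballbat', 'Zoo']})

# Flag every row whose id collides with another once case is ignored
dupes = toy[toy['id'].str.lower().duplicated(keep=False)]
print(dupes)
```

Using `keep=False` lists all members of each colliding group, so both spellings show up for review rather than only the second one.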

Other items causing processing problems may include punctuation or errors due to such in earlier processing steps. One of the reasons I kept the id field from both files in the first step of these incremental merges was to have a readable basis for checking proper registry and to identify possible problem concepts. Once the checks were complete, I could delete the extraneous id column.

The result of this incremental merging and assembly was to create the final kbpedia_master_vec.csv file, which we will have much occasion to discuss in the next installments. The column structure of this final vector file is:

id id_token pref_token sub_token count super_token alt_tfidf def_tfidf st_AVInfo st_ActionTypes plus 66 more STs

Now Ready for Use

We are now ready to move to the use of scikit-learn for real applications, which we address in the next CWPK installment.

Additional Documentation

Here is some additional documentation that provides background to today’s installment.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.

NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.

I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on November 23, 2020 at 9:38 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2418/cwpk-65-scikit-learn-basics-and-initial-encoding/
Posted:November 12, 2020

Some Machine Learning Applied to NLP Problems

In the last installment of Cooking with Python and KBpedia we collected roughly 45,000 articles from Wikipedia that match KBpedia reference concepts. We then did some pre-processing of the text using the gensim package to lower case the text, remove stop words, and identify bi-gram and tri-gram phrases. These types of functions and extractions are a strength of gensim, which should be part of your pre-processing arsenal.

It is now time for us to process the natural language of this text and to use it for creating word and document embedding models. For the latter, we will continue to use gensim. For the former, however, we will introduce and use a very powerful NLP package, spaCy. As noted earlier, spaCy has a clear function orientation to NLP and also has some powerful extension mechanisms.

Our plan of attack in this installment is to finish the word embeddings with gensim, and then move on to explore the spaCy package. We will not explore all aspects of this package, but will focus on text summarization and (named) entity recognition using both model-based and rule-based approaches.

Word and Document Embedding

As we have noted in CWPK #61, there exist many pre-calculated word and document embedding models. However, because of the clean scope of KBpedia and our interest in manipulating embeddings for various learning models, we want to create our own embeddings.

There are many scopes and methods for creating embeddings. Embeddings, you recall, are a vector representation of information in a reduced dimensional space. Embeddings are a proven way to represent sparse matrix information like language (meaning many dimensions of words and phrases matched to one another) in a more efficient coding format usable by a computer. Embedding scopes may range from words, phrases, sentences, paragraphs and sections of documents to whole documents, as well as senses, topics, sentiments, categories or other relations that might cut across a given corpus. Methods may range from sequences to counts to simple statistics or all the way up to deep learning with neural nets. Of late, a method combining encoders and decoders called ‘transformers’ has been the rage, with BERT a prominent instantiation (ELMo, a slightly earlier contextual embedding model, is built on bidirectional LSTMs rather than transformers).

Because we have already been exercising the gensim package, we decide to proceed with our own word embedding and document embedding models. Following the gensim documentation, we first set up a word2vec model:

NOTE: Due to GitHub’s file size limits, the various text file inputs referenced in this installment may be found on the KBpedia site as zipped files (for example, https://kbpedia.org/cwpk-text/wikipedia-trigram.zip for the input file mentioned next). Due to their very large sizes, you will need to create locally all of the models mentioned in this installment (with *.vec or *.model extensions).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from smart_open import smart_open

# Input corpus (one text per line) and output locations; adjust to your own setup
in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-trigram.txt'
out_model = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-w2v.model'
out_vec = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-w2v.vec'

# Stream the corpus line by line rather than loading it all into memory
corpus = smart_open(in_f, 'r', encoding='utf-8')
walks = LineSentence(corpus)
# Parameter names follow gensim 3.x; gensim 4.x renames size to vector_size and iter to epochs
model = Word2Vec(walks, size=300, window=5, min_count=3, hs=1, sg=1, workers=5, negative=5, iter=10)
model.save(out_model)                                 # full model
model.wv.save_word2vec_format(out_vec, binary=False)  # plain-text word vectors

This works pretty slick and only requires a few lines of code. The model run takes about 2:15 hrs on my laptop; to process the entire Wikipedia English corpus reportedly takes about 9 hrs. Note we begin the training process with our tri-gram input corpus created in the last installment.

A few of the model parameters deserve some commentary. The size parameter is one of the most important. It sets the number of dimensions over which you want to capture a correspondence statistic, which is the actual dimension reduction at the core of the entire exercise. Remember, a collocation matrix for natural language is very sparse. With the way I have set up the Wikipedia pages from KBpedia so far, with a stoplist, trigrams and such, our current corpus has 1.3 million tokens, which is really sparse when you extend the second dimension by this same amount. Setting size beyond a few hundred dimensions greatly increases the computation time in training and (perhaps paradoxically) can lower accuracy. The window parameter is the word count to either side of the current token for which adjacency is calculated, so that a window of five actually encompasses a string of up to eleven tokens: the subject token and five to either side. min_count is the minimum number of occurrences a given token (including phrases or ngrams as individual tokens) needs in order to be kept in the vocabulary. sg=1 invokes the ‘skip-gram’ method as opposed to the other commonly used method, ‘cbow’ (continuous bag of words).
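To make the window and min_count settings concrete, here is a minimal sketch in plain Python of how (center, context) pairs could be formed for skip-gram. The function name skipgram_pairs and the toy token list are hypothetical, and the sketch leaves out details of the real gensim implementation such as subsampling, so treat it as an illustration of the idea only.

```python
from collections import Counter


def skipgram_pairs(tokens, window=5, min_count=1):
    """Return (center, context) pairs; a simplified view of skip-gram inputs."""
    counts = Counter(tokens)
    # Drop tokens that occur fewer than min_count times before pairing.
    kept = [t for t in tokens if counts[t] >= min_count]
    pairs = []
    for i, center in enumerate(kept):
        lo = max(0, i - window)               # up to `window` tokens to the left
        hi = min(len(kept), i + window + 1)   # up to `window` tokens to the right
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, kept[j]))
    return pairs


tokens = "a b c d e f g h i j k l m".split()  # 13 toy tokens
pairs = skipgram_pairs(tokens, window=5)
contexts_of_g = [ctx for center, ctx in pairs if center == "g"]
print(len(contexts_of_g) + 1)  # 11: the center token plus five on each side
```

Tokens near the start or end of a line simply get fewer contexts, which is why the eleven-token span is an upper bound.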

Like any central Python function, you should study this one to learn more about some of the other settable parameters. What is most important, however, is to learn about these settings, test those you deem critical, and realize fine-tuning such parameters is likely the key to successful results with your machine learning efforts. It is an open secret that success with machine learning is dependent on setting up and then tweaking the parameters that go into any particular method.

We can take this same code block above and set up the doc2vec method:

from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument
from smart_open import smart_open

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-trigram.txt'
out_model = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-d2v.model'
out_vec = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-d2v.vec'
corpus = smart_open(in_f, 'r', encoding='utf-8')

# Each line of the input becomes one tagged document, tagged with its line number
documents = TaggedLineDocument(corpus)
training_src = list(documents)
print(training_src[:1])
model = Doc2Vec(vector_size=300, min_count=15, epochs=30)
model.build_vocab(training_src)
model.train(training_src, total_examples=model.corpus_count, epochs=model.epochs)
model.save(out_model)
# Save the word vectors via model.wv, mirroring the word2vec block
model.wv.save_word2vec_format(out_vec, binary=False)
# Infer a vector for a new, unseen token list
print(model.infer_vector(['violent', 'means', 'to', 'destroy', 'the','organization']))

The doc2vec method has a similar setup. The main difference is that the vector calculation is now based on whole text units (here, each line of the input file is treated as one document) versus individual words. We also increase the min_count parameter. We’ll see the results of this training in the next section.
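As a rough illustration of what TaggedLineDocument hands to the trainer, the sketch below mimics its behavior in plain Python: each line is split on whitespace and tagged with its line number. The TaggedDoc namedtuple and tag_lines function are hypothetical stand-ins written for this example, not gensim objects.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument, for illustration only.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def tag_lines(lines):
    """Yield one tagged document per line, tagged with its line number."""
    for line_no, line in enumerate(lines):
        yield TaggedDoc(words=line.split(), tags=[line_no])


sample = [
    "machine_learning uses data",
    "knowledge graphs link concepts",
]
docs = list(tag_lines(sample))
print(docs[0])
```

The tag is what lets doc2vec learn one vector per input line alongside the word vectors.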

gensim also has methods to train FastText. Please consult the documentation for this method as well as to understand better the various training parameters.

Similarity Analysis

A good way to see the effect of embedding vectors is through similarity analysis. The calculations are based on the adjacency of vectors in the embedding space.

Our first two examples use word2vec for our newly created KBpedia-Wikipedia corpus. The first example calculates the relatedness between two entered terms:

from gensim.models import Word2Vec

path = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-w2v.model'
model = Word2Vec.load(path)
model.wv.similarity('man', 'woman')
0.6882485
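
The score returned by similarity is a cosine similarity between the two word vectors. A minimal sketch of that calculation in plain Python follows; the three-dimensional vectors are made-up values for illustration, not entries from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors; real word2vec vectors here would have 300 dimensions.
man = [1.0, 2.0, 0.5]
woman = [1.0, 1.5, 1.0]
print(cosine_similarity(man, woman))
```

Values near 1 mean the vectors point in nearly the same direction, and values near 0 mean they are close to orthogonal.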

The second example retrieves the most closely related terms given an input term or phrase (in this case, machine_learning):

from gensim.models import Word2Vec

path = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-w2v.model'
model = Word2Vec.load(path)
w1 = ['machine_learning']
model.wv.most_similar(positive=w1, topn=6)
[('artificial_intelligence', 0.7738381624221802),
 ('data_mining', 0.7659739255905151),
 ('algorithms', 0.7430499792098999),
 ('natural_language_processing', 0.7429415583610535),
 ('computational', 0.7116029262542725),
 ('computational_linguistics', 0.6903550028800964)]
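
Conceptually, a most_similar lookup ranks every other term in the vocabulary by its cosine similarity to the query term and keeps the top results. The sketch below shows that ranking idea in plain Python over a tiny, made-up vocabulary; the function and the vectors are assumptions for illustration, and gensim's own implementation is vectorized and far faster.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar(vectors, term, topn=3):
    """Rank all other terms by cosine similarity to `term`, highest first."""
    query = vectors[term]
    scored = (
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != term  # the query term itself is excluded
    )
    return heapq.nlargest(topn, scored, key=lambda pair: pair[1])


# Made-up toy vectors, not values from the trained model.
toy_vectors = {
    "machine_learning": [0.9, 0.8, 0.1],
    "data_mining": [0.8, 0.9, 0.2],
    "algorithms": [0.7, 0.5, 0.3],
    "gardening": [0.1, 0.0, 0.9],
}
for term, score in most_similar(toy_vectors, "machine_learning", topn=2):
    print(term, round(score, 3))
```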

gensim offers a number of settings, including the ability to load vectors for analysis without further training (effectively a ‘read only’ option) and parameters such as the number of results returned.

We can also compare the doc2vec approach to word2vec:

from gensim.models import Doc2Vec

path = r'C:\1-PythonProjects\kbpedia\v300\models\results\wikipedia-d2v.model'
model = Doc2Vec.load(path)
w1 = ['machine_learning']
model.wv.most_similar(positive=w1, topn=6)
[('quantitative_methods', 0.4042782783508301),
 ('artificial_intelligence', 0.3983246088027954),
 ('evolutionary_computation', 0.39264559745788574),
 ('information_retrieval', 0.38776731491088867),
 ('natural_language_processing', 0.38531848788261414),
 ('deep_learning', 0.37803560495376587)]

Note we get a similar listing of results, though the similarity scores in this doc2vec case are much lower.

These efforts conclude our embedding tests for the moment. We will be adding additional embeddings based on knowledge graph structure and annotations in CWPK #67.

Text Summarization

Let’s now switch gears and introduce our basic natural language processing package, spaCy. Out-of-the-box spaCy includes the standard NLP utilities of part-of-speech tagging, lemmatization, dependency parsing, named entity recognition, entity linking, tokenization, merging and splitting, and sentence segmentation. Various vector embedding or rule-based processing methods may be layered on top of these utilities, and they may be combined into flexible NLP processing pipelines.

We are not doing anything special here, but I wanted to include text summarization because it nicely combines many functions and utilities provided by the spaCy package. Here is an example using an existing spaCy model, en_core_web_sm, which has pre-calculated POS and NER tags based on the English OntoNotes 5 corpus. (You will need to separately download and install these existing models.) The text to be evaluated was copied-and-pasted from CWPK #61:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
import en_core_web_sm

extra_words = list(STOP_WORDS) + list(punctuation) + ['\n']
nlp = en_core_web_sm.load()
doc = """With this installment of the Cooking with Python and KBpedia series we move into Part VI of seven parts, a part with the bulk of the analytical and machine learning (that is, "data science") discussion, and the last part where significant code is developed and documented. At the conclusion of this part, which itself has 11 installments, we have four installments to wrap up the series and provide a consistent roadmap to the entire project.
      Knowledge graphs are unique information artifacts, and KBpedia is further unique in terms of its consistent and logical construction as well as its incorporation of significant text content via Wikipedia pages. These characteristics provide unique value for KBpedia, but it is also a combination not duplicated anywhere else in the data science ecosystem. One of the objectives, therefore, of this part of our CWPK series is the creation of some baseline knowledge representations useful to data science aims that capture these unique characteristics.
      KBpedia's (or any knowledge graph constructed in a similar manner) combination of characteristics make it a powerful resource in three areas of data science and machine learning. First, the nearly universal scope and degree of topic coverage with about 56,000 concepts, logically organized into typologies with a high degree of disjointedness, means that accurate 'slices' or training sets may be extracted from KBpedia nearly instantaneously. Labeled training sets are one of the most time consuming and expensive activities in doing supervised machine learning. We can extract these nearly for free from KBpedia. Further, with its links to tens of millions of entities in its mapped knowledge bases such as Wikidata, literally tens of thousands of conceptual entities in KBpedia can be the retrieval points to nucleate training sets for fine-grained entity recognition.
      Second, 80% of KBpedia's concepts are mapped to Wikipedia articles. While many Wikipedia-based word embedding models exist, the ones in KBpedia are logically categorized and have rough equivalence in terms of scope and prominence, hopefully providing cleaner topic 'signals'. To probe these assertions, we will create a unique KBpedia-based word embedding corpus that also leverages labels for items of structural importance, such as typology membership. We will use this corpus in many of our tests and as a general focus in our training sets.
      And, third, perhaps the most important area, knowledge graphs offer unique structures and challenges for machine learning, especially innovations in geometric, heterogeneous methods for deep learning. The first generation of deep machine learning was designed for grid-patterned data and matrices through approaches such as deep neural networks, convolutional neural networks (CNN ), or recurrent neural networks (RNN). The 'deep' appelation comes from having multiple calculated, intermediate layers of transformations between the grid inputs and outputs for the model. Graphs, on the other hand, are heterogeneous between nodes and edges. They may be directed (subsumptive) in nature. And, for knowledge graphs, they have much labeling and annotation, including varying degrees of attribute completeness. Language embedding, itself often a product of deep learning, enables the efficient incorporation of text. It is only in the past five years that concerted attention has been devoted to better capturing this feature richness for knowledge graphs.
      The eleven installments in this part will look in more depth at networks and graphs, focus on how to create training sets and embeddings for the learners, discuss some natural language packages and uses, and then look in depth at 'standard' machine learners and deep learners. We will install the first generation of deep graph learners and then explore some on the cutting edge. We will test many use cases, but will also try to invoke classifiers across this spectrum so that we can draw some general conclusions.
      The material below introduces and tees up these topics. We describe leading Python packages for data science, and how we have architected our own approach, We have picked a particular Python machine learning framework, PyTorch, to which we will then tie four different NLP and deep learning libraries. We devote two installments each to these four libraries. The use cases we document across these installments are in addition to the existing ones we have in Clojure posted online.
      So, we think we have an interesting suite of benefits to cover in this part, some arising from being based on KBpedia and some arising from the nature of knowledge graphs. On the other hand, due to the relative immaturity of the field, we are still actively learning and innovating around the juncture of AI and knowledge graphs. Thus, one of the reasons we emphasize Python 'ecosystems' and 'frameworks' in this part is to be better prepared to incorporate those innovations and learnings to come.
      One of the first prototypes of machine learning comes from the statistician Ronald Fisher in the 1930s regarding how to classify Iris species based on the attributes of their flowers. It was a multivariate data example using the method we today call linear discriminant analysis. This classic example is still taught. But many dozens of new algorithms and combined approaches have joined the machine learning field since then.
 Figure 1 below is one way to characterize the field, with ML standing for machine learning and DL for deep learning, with this one oriented to sub-fields in which some Python package already exists:  There are many possible diagrams that one might prepare to show the machine learning landscape, including ones with a larger emphasis on text and knowledge graphs. Most all schematics of the field show a basic split between supervised learning and unsupervised learning, (sometimes with reinforcement learning as another main branch), with the main difference being that supervised approaches iterate to achieve statistical fit with pre-determined labels, whereas unsupervised is unlabeled. Accurate labeling can be costly and time consuming. Note that the idea of 'classification' is a supervised one, 'clustering' a notion of unsupervised.
 will include a 'standard' machine learning library in our proposed toolkit, the selection of which I discuss below. However, the most evaluation time I spent in researching these installments was directed to the idea of knowledge representation and embeddings applicable to graphs. Graphs pose a number of differences and challenges to standard machine learning. They have only been a recent (5 yr) focus in machine learning, which is also rapidly changing over time.
 All machine learners need to operate on their feature spaces in numerical representations. Text is a tricky form because language is difficult and complex, and how to represent the tokens within our language usable by a computer needs to consider, what? Parts-of-speech, the word itself, sentence construction, semantic meaning, context, adjacency, entity recognition or characterization? These may all figure into how one might represent text. Machine learning has brought us unsupervised methods for converting words to sentences to documents and, now, graphs, to a reduced, numeric representation known as "embeddings." The embedding method may capture one or more of these textual or structural aspects.
 Much of the first interest in machine learning based on graphs was driven by these interests in embeddings for language text. Standard machine classifiers with deep learning using neural networks have given us word2vec, and more recently BERT and its dozens of variants have reinforced the usefulness of deep learning to create pre-trained text representations.
 Indeed, embeddings do figure prominently in knowledge graph representation, but only as one among many useful features. Knowledge graphs with hierarchical (subsumption) relationships, as might be found in any taxonomy, become directed. Knowledge graphs are asymmetrical, and often multi-typed and sometimes multi-modal. There is heterogeneity among nodes and links or edges. Not all knowledge graphs are created equal and some of these aspects may not apply. Whether there is an accompanying richness of text description that accompanies the node or edges is another wrinkle. None of the early CNN or RNN or simple neural net approaches match well with these structures.
 The general category that appears to have emerged for this scope is geometric deep learning, which applies to all forms of graphs and manifolds. There are other nuances in this area, for example whether a static representation is the basis for analysis or one that is dynamic, essentially allowing learning parameters to be changed as the deep learning progresses through its layers. But GDL has the theoretical potential to address and incorporate all of the wrinkles associated with heterogeneous knowledge graphs.
      So, this discussion helps define our desired scope. We want to be able to embrace Python packages that range from simple statistics to simple machine learning, throwing in natural language processing and creating embedding representations, that can then range all the way through deep learning to the cutting-edge aspects of geometric or graph deep learning.
      This background provides the necessary context for our investigations of Python packages, frameworks, or libraries that may fulfill the data science objectives of this part. Our new components often build upon and need to play nice with some of the other requisite packages introduced in earlier installments, including pandas ([CWPK #55]), NetworkX ([CWPK #56]), and PyViz ([CWPK #55]). NumPy has been installed, but not discussed.
      It is not fair to say that natural language processing has become a 'commodity' in the data science space, but it is also true there is a wealth of capable, complete packages within Python. There are standard NLP requirements like text cleaning, tokenization, parts-of-speech identification, parsing, lemmatization, phrase identification, and so forth. We want these general text processing capabilities since they are often building blocks and sometimes needed in their own right. We also would like to add to this baseline such considerations as interoperability, creating embeddings, or other special functions.
 Another key area is language embedding. Language embeddings are means to translate language into a numerical representation for use in downstream analysis, with great variety in what aspects of language are captured and how to craft them. The simplest and still widely-used representation is tf-idf (term frequency–inverse document frequency) statistical measure. A common variant after that was the vector space model. We also have latent (unsupervised) models such as LDA. A more easily calculated option is explicit semantic analysis (ESA). At the word level, two of the prominent options are word2vec and gloVe, which is used directly in spaCy. These have arisen from deep learning models. We also have similar approaches to represent topics (topicvec), sentences (sentence2vec), categories and paragraphs (Category2Vec), documents (doc2vec), node2vec or entire languages (BERT and variants and GPT-3 and related methods). In all of these cases, the embedding consists of reducing the dimensionality of the input text, which is then represented in numeric form.
 There are internal methods for creating embeddings in multiple machine learning libraries. Some packages are more dedicated, such as fastText, which is a library for learning of word embeddings and text classification created by Facebook's AI Research (FAIR) lab. Another option is TextBrewer, which is an open-source knowledge distillation toolkit based on PyTorch and which uses (among others) BERT to provide text classification, reading comprehension, NER or sequence labeling.
 Closely related to how we represent text are corpora and datasets that may be used either for reference or training purposes. These need to be assembled and tested as well as software packages. The availability of corpora to different packages is a useful evaluation criterion. But, the picking of specific corpora depends on the ultimate Python packages used and the task at hand. We will return to this topic in CWPK #63.
 Of course, nearly all of the Python packages mentioned in this Part VI have some relation to machine learning in one form or another. I call out this category separately because, like for NLP, I think it makes sense to have a general machine learning library not devoted to deep learning but providing a repository of classic learning methods.
 There really is no general option that compares with scikit-learn. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN data clustering, and is designed to interoperate with NumPy and SciPy. The project is extremely active with good documentation and examples.
 Deep learning is characterized by many options, methods and philosophies, all in a fast-changing area of knowledge. New methods need to be compared on numerous grounds from feature and training set selection to testing, parameter tuning, and performance comparisons. These realities have put a premium on libraries and frameworks that wrap methods in repeatable interfaces and provide abstract functions for setting up and managing various deep (and other) learning algorithms.
 The space of deep learning thus embraces many individual methods and forms, often expressed through a governing ecosystem of other tools and packages. These demands lead to a confusing and overlapping and non-intersecting space of Python options that are hard to describe and comparatively evaluate. Here are some of the libraries and packages that fit within the deep and machine learning space, including abstraction frameworks:
 Theano is a Python library and optimizing compiler for manipulating and evaluating mathematical expressions, especially matrix-valued ones; it is tightly integrated with NumPy, and uses it at the lowest level.
 Keras is increasingly aligning with TensorFlow and some, like Chainer and CNTK, are being deprecated in favor of the two leading gorillas, PyTorch and TensorFlow. One approach to improve interoperability is the Open Neural Network Exchange (ONNX) with the repository available on GitHub. There are existing converters to ONNX for Keras, TensorFlow, PyTorch and scikit-learn.
 A key development from deep learning of the past three years has been the usefulness of Transformers, a technique that marries encoders and decoders converging on the same representation. The technique is particularly helpful to sequential data and NLP, with state-of-the-art performance to date for: next-sentence prediction, question answering, reading comprehension, sentiment analysis, and paraphrasing.
      Both BERT and GPT are pre-trained products that utilize this method. Both TensorFlow and PyTorch contain Transformer capabilities.
 As noted, most of my research for this Part VI has resided in the area of a subset of deep graph learning applicable to knowledge graphs. The leading deep learning libraries do not, in general, provide support for this area of representational learning, sometimes called knowledge representation learning (KRL) or knowledge graph embedding (KGE). Within this rather limited scope, most options also seem oriented to link prediction and knowledge graph completion (KGC), rather than the heterogeneous aspects with text and OWL2 orientation characteristic of KBpedia.
 We thus see this rough hierachy: machine learning → deep learning → geometric deep learning → graph (R) learning → KG learning
 Lastly, more broadly, there is the recently announced KGTK, which is a generalized toolkit with broader capabilities for large scale knowledge graph manipulation and analysis. KGTK also puts forward a standard KG file format, among other tools.
 A Generalized Python Data Science Architecture
 With what we already have in hand, plus the libraries and packages described above, we have a pretty good inventory of candidates to choose from in proceeding with our next installments. Like our investigations around graphics and visualization (see [CWPK #55]), the broad areas of data science, machine learning, and deep learning have been evolving to one of comprehensive ecosystems. Figure 2 below presents a representation of the Python components that make sense for the machine learning and application environment. As noted, our local Windows machines lack separate GPUs (graphical processing units), so the hardware is based on a standard CPU (which has an integrated GPU that can not be separately targeted). We have already introduced and discusses some of the major Python packages and libraries, including pandas, NetworkX, and PyViz. Here is that representative data science architecture:
 Representative Python Components
 The defining architectural question for this Part VI is what general deep and machine learning framework we want (if any). I think using a framework makes sense over scripting together individual packages, though for some tests that still might be necessary. If I was to adopt a framework, I would also want one that has a broad set of tools in its ecosystem and common and simpler ways to define projects and manage the overall pipelines from data to results. As noted, the two major candidates appear to be TensorFlow and PyTorch.
 TensorFlow has been around the longest, has, today, the strongest ecosystem, and reportedly is better for commercial deployments. Google, besides being the sponsor, uses TensorFlow in most of its ML projects and has shown a commitment to compete with the upstart PyTorch by significantly re-designing and enhancing TensorFlow 2.0.
 On the other hand, I very much like the more 'application' orientation of PyTorch. Innovation has been fast and market share has been rising.
 Though some of the intriguing packages for TensorFlow are not apparently available for PyTorch, including Graph Nets, Keras, Plaid ML, and StellarGraph, PyTorch does have these other packages not yet mentioned that look potentially valuable down the road:
      One disappointment is that neither of these two leading packages directly ingest [RDFLib] graph files, though with PyTorch and DGL you can import or export a NetworkX graph directly. pandas is also a possible data exchange format.
      Consideration of all of these points has led us to select PyTorch as the initial data science framework. It is good to know, however, that a fairly comparable alternative also exists with TensorFlow and Keras.
      Finally, with respect to Figure 2 above, we have no plans at present to use the Dask package for parallelizing analytic calculations.
      With the PyTorch decision made, at least for the present, we are now clear to deal with specific additional packages and libraries. I highlight four of these in this section. Each of these four is the focus of two separate installments as we work to complete this Part VI. One of these four is in natural language processing (spaCy), one in general machine learning (scikit-learn), and two in deep learning with an emphasis on graphs (DGL and DGL-KE, and PyG). These choices again tend to reinforce the idea of evaluating whole ecosystems, as opposed to single packages. Note, of course, that more specifics on these four packages will be presented in the forthcoming installments.
      I find spaCy to be very impressive, with many potentially useful extensions including sense2vec, spacy-stanza, spacy-wordnet, torchtext, and GenSim.
      The major competitor is NLTK. The reputation of this package is stellar and it has proven itself for decades. It is a more disaggregate approach by scholars and researchers to enable users to build complex NLP functionality. It is therefore harder to use and configure, and is also less performant. The real differentiator, however, is the more object or application orientation of spaCy.
      Though NLTK appears to have good NLP tools for processing data pipelines using text, most of these functions appear to be in spaCy and there are also the Flair and PyTorch-NLP packages available in the PyTorch environment if needed. GenSim looks to be a very useful enhancement to the environment because of the advanced text evaluation modes it offers, including sentiment. Not all of these will be tested during this CWPK series, but it will be good to have these general capabilities resident in cowpoke.
      We earlier signaled our intent to embrace scikit-learn, principally to provide basic machine learning support. scikit-learn provides a unified API to these basic tasks, including crafting pipelines and meta-functions to integrate the data flow. scikit-learn works on any numeric data stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays such as pandas DataFrame are also acceptable.
      Some of the general ML methods, and there are about 40 supervised ones in the package, may be useful and applicable to specific circumstances include: dimensionality reduction, model testing, preprocessing, scoring methods, and principal component analysis (PCA).
      A real test of this package will be ease of creating (and then being able to easily modify) data processing and analysis pipelines. Another test will be ingesting, using, and exporting data formats useful to the KBpedia knowledge graph. We know that scikit-learn doesn't talk directly to NetworkX, though there may be recipes for the transfer; graphs are represented in scikit-learn as connectivity matrices. pandas can interface via common formats including CSV, Excel, JSON and SQL, or, with some processing, DataFrames. scikit-learn supports data formats from NumPy and SciPy, and it supports a datasets.load_files format that may be suitable for transferring many and longer text fields. One option that is intriguing is how to leverage the CSV flat-file orientation of our KG build and extract routines in cowpoke for data transfer and transformation.
      I also want to keep an eye on the possible use of skorch to better integrate with the overall PyTorch environment, or to add perhaps needed and missing functionality or ease of development. There is much to explore with these various packages and environments.
      For our basic, 'vanilla', deep graph analysis package we have chosen the eponymous Deep Graph Library for basic graph neural network operations, which may run on CPU or GPU machines or clusters. The better interface relevant to KBpedia is through DGL-KE, a high performance, reportedly easy-to-use, and scalable package for learning large-scale knowledge graph embeddings that extends DGL. DGL-KE also comes configured with the popular models of TransE, TransR, RESCAL, DistMult, ComplEx, and RotatE.
      PyTorch Geometric is closely tied to PyTorch, and most impressively has uniform wrappers to about 40 state-of-art graph neural net methods. The idea of "message passing" in the approach means that heterogeneous features such as structure and text may be combined and made dynamic in their interactions with one another. Many of these intrigued me on paper, and now it will be exciting to test and have the capability to inspect these new methods as they arise. DeepSNAP may provide a direct bridge between NetworkX and PyTorch Geometric.
      During the research on this Part VI I encountered a few leads that are either not ready for prime time or are off scope to the present CWPK series. A potentially powerful, but experimental approach that makes sense is to use SPARQL as the request-and-retrieval mechanism against the graph to feed the machine learners. RDFFrames provides an imperative Python API that gets internally translated to SPARQL, and it is integrated with the PyData machine learning software stack; see GitHub. Some methods above also use SPARQL. One of the benefits of a SPARQL approach, besides its sheer query and inferencing power, is the ability to keep the knowledge graph intact without data transform pipelines. It is quite available to serve up results in very flexible formats. The relative immaturity of the approach and performance considerations may be difficult challenges to overcome.
      I earlier mentioned KarateClub, a Python framework combining about 40 state-of-the-art unsupervised graph mining algorithms in the areas of node embedding, whole-graph embedding, and community detection. It builds on the packages of NetworkX, PyGSP, Gensim, NumPy, and SciPy. Unfortunately, the package does not support directed graphs, though plans to do so have been stated. This project is worth monitoring.
      A third intriguing area involves the use of quaternions based on Clifford algebras in their machine learning codes. Charles Peirce, the intellectual guide for the design of KBpedia, was a mathematician of some renown in his own time, and studied and applauded William Kingdon Clifford and his emerging algebra as a contemporary in the 1870s, shortly before Clifford's untimely death. Peirce scholars have often pointed to this influence in the development of Peirce's own algebras. I am personally interested in probing this approach to learn a bit more of Peirce's manifest interests.
      """
docx = nlp(doc)
all_words = [word.text for word in docx]
Freq_word = {}
for w in all_words:
    w1 = w.lower()
    if w1 not in extra_words and w1.isalpha():
        if w1 in Freq_word.keys():
            Freq_word[w1] += 1
        else:
            Freq_word[w1] = 1

These spaCy models have to be separately loaded. In our case, we first needed to download and install the model package:

Then, we needed to import the model as shown. To use a different model one would need to import and load that model separately.

Let’s go ahead and run this routine to get the word frequencies in the input text:

Freq_word

We can also get an estimate of the overall topic for our input text:

val=sorted(Freq_word.values())
max_freq = val[-3:]
print('Topic of document given :-')
for word,freq in Freq_word.items():
    if freq in max_freq:
        print(word ,end = ' ')
    else:
        continue
Topic of document given :-
machine learning deep 
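
The same counting can be sketched with only the standard library, which makes the logic easy to check in isolation. This is a simplified stand-in and not the spaCy routine above: the names `word_frequencies` and `top_words`, the regex tokenizer, and the tiny stop list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop list; the routine above relies on its own `extra_words` set.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "with"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lower-cased alphabetic tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

def top_words(freqs, n=3):
    """Return the n most frequent words, most frequent first."""
    return [word for word, _ in freqs.most_common(n)]

sample = ("Machine learning and deep learning are related. "
          "Deep learning is a kind of machine learning.")
freqs = word_frequencies(sample)
print(top_words(freqs))
```

One difference worth noting: the listing above prints every word whose count equals one of the three largest count values, so ties can yield more than three words, whereas `most_common(3)` always returns exactly three.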

We can now proceed to begin our text summarization by scoring the relevance of the sentences in the input text:

for word in Freq_word.keys():
    Freq_word[word] = (Freq_word[word] / max_freq[-1])
sent_strength = {}
for sent in docx.sents:
    sen_len = len(sent)
    if sen_len >= 8:
        for word in sent :
            if word.text.lower() in Freq_word.keys():
                if sent in sent_strength.keys():
                    sent_strength[sent] += Freq_word[word.text.lower()]
                else:
                    sent_strength[sent] = Freq_word[word.text.lower()]
            else: 
                continue
    else:
        sent_strength[sent] = 0
top_sentences = (sorted(sent_strength.values())[::-1])
top5percent_sentence = int(0.05 * len(top_sentences))
top_sent = top_sentences[:top5percent_sentence]                

We then use those scores to generate a summary from the top 5 percent of sentences in the input text:

summary = []
for sent,strength in sent_strength.items():
    if strength in top_sent:
        summary.append(sent)
    else:
        continue
for i in summary:
    print(i, end='')
And, third, perhaps the most important area, knowledge graphs offer unique structures and challenges for machine learning, especially innovations in geometric, heterogeneous methods for deep learning.We describe leading Python packages for data science, and how we have architected our own approach, We have picked a particular Python machine learning framework, PyTorch, to which we will then tie four different NLP and deep learning libraries.Standard machine classifiers with deep learning using neural networks have given us word2vec, and more recently BERT and its dozens of variants have reinforced the usefulness of deep learning to create pre-trained text representations.
      We want to be able to embrace Python packages that range from simple statistics to simple machine learning, throwing in natural language processing and creating embedding representations, that can then range all the way through deep learning to the cutting-edge aspects of geometric or graph deep learning.
      I call out this category separately because, like for NLP, I think it makes sense to have a general machine learning library not devoted to deep learning but providing a repository of classic learning methods.
      The leading deep learning libraries do not, in general, provide support for this area of representational learning, sometimes called knowledge representation learning (KRL) or knowledge graph embedding (KGE).→ deep learning → geometric deep learning → graph (R) learningLike our investigations around graphics and visualization (see [CWPK #55]), the broad areas of data science, machine learning, and deep learning have been evolving to one of comprehensive ecosystems.One of these four is in natural language processing (spaCy), one in general machine learning (scikit-learn), and two in deep learning with an emphasis on graphs (DGL and DGL-KE, and PyG).

We can vary the size of the summarization by varying the percentage of sentences to be included. Note there are other methods available for summarizing text using spaCy. A standard search will turn up other methods than simply calculating top sentences based on word frequencies.
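
To make the effect of that percentage concrete, here is a minimal sketch of the same frequency-based scoring using only the standard library. It is not the spaCy routine above: the function name `summarize`, its parameters, and the naive regex sentence splitter are assumptions for illustration, and unlike the listing it always keeps at least one sentence.

```python
import re
from collections import Counter

def summarize(text, fraction=0.05, min_words=8, stop_words=()):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: ., ! or ? followed by whitespace (an assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]
    if not words:
        return []
    counts = Counter(words)
    max_count = max(counts.values())
    # Normalize each count by the largest count, as the listing above does.
    weights = {w: c / max_count for w, c in counts.items()}

    scores = []
    for sent in sentences:
        tokens = re.findall(r"[a-z]+", sent.lower())
        # Sentences shorter than min_words get a score of zero.
        if len(tokens) >= min_words:
            scores.append(sum(weights.get(t, 0.0) for t in tokens))
        else:
            scores.append(0.0)

    keep = max(1, int(fraction * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```

Calling `summarize(text, fraction=0.10)` would double the share of sentences retained relative to the 5 percent used above.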

(Named) Entity Recognition – Update Model

Depending on the input model, spaCy provides pre-trained entity recognition models (the area is most often referred to as NER, but a category of entities need not consist of capitalized names, which is why I prefer ‘entity recognition’). In the case of the pre-trained English OntoNotes 5 model, these entity tags are:

Type Description
PERSON People, including fictional.
NORP Nationalities or religious or political groups.
FAC Buildings, airports, highways, bridges, etc.
ORG Companies, agencies, institutions, etc.
GPE Countries, cities, states.
LOC Non-GPE locations, mountain ranges, bodies of water.
PRODUCT Objects, vehicles, foods, etc. (Not services.)
EVENT Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART Titles of books, songs, etc.
LAW Named documents made into laws.
LANGUAGE Any named language.
DATE Absolute or relative dates or periods.
TIME Times smaller than a day.
PERCENT Percentage, including ‘%’.
MONEY Monetary values, including unit.
QUANTITY Measurements, as of weight or distance.
ORDINAL “first”, “second”, etc.
CARDINAL Numerals that do not fall under another type.

We would like to add our own new entity tag to this list for items related to ‘machine learning’, which we will give the tag of ‘ML’. spaCy provides two methods for extending entity recognition: 1) training and updating an existing model (in this case, en_core_web_sm); or 2) a rule-based approach.

The first option we will try is the updated model. Another major section below investigates the rule-based approach.

Get Entity Pages

We already have a number of ‘machine learning’-related reference concepts in KBpedia. Given our success in using the Wikipedia online API to get articles (see prior installment), I decided to explore that API to see if there are ways to get comprehensive listings of ‘machine learning’ topics. Happily, it turns out, there are!

This particular API call, https://www.mediawiki.org/wiki/API:Categorymembers, enables us to enter a category term, in this case ‘machine learning’, and to get all of the article titles subsumed in that category. Better still, the documentation page offers code snippets for interacting with this API, including one in Python. Here is the code listing that we get from the API:

#!/usr/bin/python3

"""
    get_category_items.py

    MediaWiki API Demos
    Derived from demo of `Categorymembers` module : List twenty items in a category

    MIT License
"""

import requests

S = requests.Session()

URL = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "action": "query",
    "cmtitle": "Category:Unsupervised learning",
    "cmlimit": "300",
    "list": "categorymembers",
    "format": "json"
}

R = S.get(url=URL, params=PARAMS)
DATA = R.json()

PAGES = DATA['query']['categorymembers']

for page in PAGES:
    print(page['title'])

For each appropriate sub-category under ‘machine learning’ we issue the query above to get a listing of possible articles on the topic in Wikipedia. We assemble these listings and then manually inspect the result to remove things like Category:Machine learning researchers, since those are people related to machine learning but not ‘machine learning’ per se. The result of this retrieval, including all relevant sub-categories, yields 1027 pages, 853 of which are unique, and 846 of which actually process.
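
The assembly and de-duplication steps just described can be sketched roughly as follows. This is an illustration under stated assumptions, not the code I actually ran: `collect_titles` and `fake_fetch` are hypothetical names, and the loop assumes the API returns a `continue` object whose values are passed back on the next request when a category holds more members than `cmlimit`. The `fetch` argument stands in for the `S.get(url=URL, params=PARAMS).json()` call in the listing above, so the sketch can be exercised without network access.

```python
def collect_titles(fetch, category, limit="300"):
    """Gather unique article titles for one category, following continuations.

    `fetch` is any callable that takes a params dict and returns the decoded
    JSON response, e.g. lambda p: S.get(url=URL, params=p).json().
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    seen = set()
    titles = []
    while True:
        data = fetch(dict(params))
        for page in data["query"]["categorymembers"]:
            title = page["title"]
            # Skip sub-category entries (queried separately) and repeats.
            if title.startswith("Category:") or title in seen:
                continue
            seen.add(title)
            titles.append(title)
        cont = data.get("continue")
        if not cont:
            return titles
        # Pass the continuation values back on the next request.
        params.update(cont)

def fake_fetch(params):
    """Canned two-page response used only to exercise the loop offline."""
    if "cmcontinue" not in params:
        return {"continue": {"cmcontinue": "page2", "continue": "-||"},
                "query": {"categorymembers": [
                    {"title": "Autoencoder"},
                    {"title": "Category:Cluster analysis"}]}}
    return {"query": {"categorymembers": [
        {"title": "Autoencoder"},
        {"title": "Biclustering"}]}}

print(collect_titles(fake_fetch, "Category:Unsupervised learning"))
```

Removing entries that are about people rather than methods still needs the manual inspection noted above; the sketch only handles repeats and sub-category markers.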

This listing gives us two kinds of items. First, the page titles give us terms and phrases related to ‘machine learning’. Second, using the same procedures for Wikipedia page retrievals noted in CWPK #63, we retrieve the actual XML pages, clean them in the same way we did for the general KBpedia corpus, and then create bigrams and trigrams. We now have our specialty ‘machine learning’ corpus in the exact same format as that for KBpedia, which we keep and save as wp_machine_learning.txt.

Chunk Text into ‘Sentences’

Since the spaCy NER trainer relies on sentence-length snippets for its training (see below), our next step is to chop up this text corpus into sentence-length chunks. The code below, including the textwrap package import, enables us to iterate document-by-document through our wp_machine_learning.txt file and to break it into sentence-length chunks. The example below chunks into snippets that are at most 48 characters long:

import textwrap
from smart_open import smart_open

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wp_machine_learning.txt'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wp_machine_learning_sentences_48.txt'

documents = smart_open(in_f, 'r', encoding='utf-8')

with open(out_f, 'w', encoding='utf-8') as output:
    i = 0
    for line in documents:
        try:
            line = str(line)
            sentences = textwrap.wrap(line, 48)
            sentences = str(sentences)
            sentences = sentences + '\n'
            output.write(sentences)
        except Exception as e:
            print ('Exception error: ' + str(e))
        i += 1
    print('Split lines into sentences for ' + str(i) + ' articles;')
    print('Processing complete!')

In my various experiments, I set chunk sizes ranging from 48 characters to 180 characters, as I discuss in later results.
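
To see what this chunking does to a line of text, here is a small standalone check. The sample string is made up for illustration; only `textwrap.wrap` itself comes from the listing above.

```python
import textwrap

# Made-up sample line standing in for one cleaned article.
text = "a generative model it is now known as a restricted boltzmann machine"

chunks = textwrap.wrap(text, 48)
for chunk in chunks:
    # Each chunk breaks at whitespace, so it is at most 48 characters long.
    print(len(chunk), repr(chunk))
```

Because the breaks fall at whitespace, the snippets vary in length up to the stated width rather than being exactly that long, and a snippet can end in the middle of a phrase.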

Extract and Offset Entities

The titles we extracted from Wikipedia ‘machine learning’ articles give us the set of terms and phrases for setting up the labeled examples expected by the spaCy NER trainer. Here is one example:

"a generative model it is now known as a", {"entities": [(2, 18, "ML")]}

This labeled example shows a text pattern in which the entity (‘generative model’ in this case) is embedded, with the starting and ending character offsets specified for that entity, as well as its ‘ML’ label. (Note that the offset counter begins at zero given the Python convention.) One needs to provide hundreds of such labeled examples to properly train the entity recognizer.
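
A quick way to confirm that offsets like these point at the intended text is to slice the snippet with them. The helper name `check_example` below is hypothetical; the example tuple is the one shown above.

```python
def check_example(example):
    """Return the (surface text, label) pairs selected by the offsets."""
    text, annotations = example
    found = []
    for start, end, label in annotations["entities"]:
        # Offsets are zero-based and the end is exclusive, as in Python slicing.
        found.append((text[start:end], label))
    return found

example = ("a generative model it is now known as a",
           {"entities": [(2, 18, "ML")]})
print(check_example(example))
```

Running a check of this kind over a whole training file is a cheap way to catch offsets that drifted during string clean-up.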

Manually labeling these snippets is an intense and time-consuming task. To make it efficient, we use a spaCy class called PhraseMatcher that inspects each snippet, identifies whether a stipulated entity occurs there, and returns the position where it matches (as token indices, which the code below converts to character offsets). Thus, in the code below, we first list the 800 or so ‘machine learning’ titles we have already identified, and then parse those against the sentence snippets we generated from our article texts:

# adapted from https://stackoverflow.com/questions/47638877/using-phrasematcher-in-spacy-to-find-multiple-match-types
# https://thinkinfi.com/prepare-training-data-and-train-custom-ner-using-spacy-python/
# https://adagrad.ai/blog/ner-using-spacy.html

import os
import spacy
import random
from spacy.matcher import PhraseMatcher
from spacy.tokenizer import Tokenizer
from smart_open import smart_open
import en_core_web_sm

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wp_machine_learning_sentences_48.txt'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wp_machine_learning_training_data_48.json'
documents = smart_open(in_f, 'r', encoding='utf-8')
nlp = en_core_web_sm.load()

ml_patterns = [nlp(text) for text in ('(1+ε)-approximate nearest neighbor search', 
     '80 million tiny images', 'ablation', 'absorbing markov chain')]
# See the full listing under '(Named) Entity Recognition - Rule-Based' section below                                     


matcher = PhraseMatcher(nlp.vocab)
# spaCy 2.x call signature; spaCy 3.x expects matcher.add('ML', ml_patterns)
matcher.add('ML', None, *ml_patterns)
with open(out_f, 'w', encoding='utf-8') as output:
    x = 0
    for line in documents:
        line = str(line)
        sublist = line.split(', ')
        for sentence in sublist:
            sentence = str(sentence)
            sentence = sentence.replace(']','')
            sentence = sentence.replace("'", "")
            doc = nlp(sentence)
            sen_length = len(sentence)
            matches = matcher(doc)
            start = 0
            s_start = 0
            s_length = 0
            for match_id, start, end in matches:  # iterate over the entities
                label = nlp.vocab.strings[match_id]
                span = doc[start:end]
                label = str(label)
                span = str(span)
                length = len(span)
                start = sentence.index(span)
                end = start + length
                start = str(start)
                end = str(end)
                if s_start != start and s_length != length:
                    s_start = start
                    s_length = length
                    train_it = ('("' + sentence + '", {"entities": [(' + start + ', ' + end 
                           + ', "' + label + '")]}),')
                    output.write(train_it)
                else:
                    continue
    print('Got this far!')

The code block at the end of this routine calculates the starting and ending character offsets for the matched entity, and then constructs a new string that matches the form expected by spaCy. Note this training data needs to be in JSON format. There are online JSON syntax checkers to make sure you are constructing this training example in the correct form. It takes about 8 minutes to do the conversion above when parsed against our machine learning corpus.
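
For readers who want to experiment with the labeling step without loading a spaCy model, the sketch below swaps in plain regular expressions for PhraseMatcher and `json.dumps` for the hand-built string. It is a different technique from the routine above, offered only as an illustration: `label_snippet` is a hypothetical name, and the rule of dropping overlapping spans is my assumption, made because, as far as I know, overlapping entity spans are not accepted by the trainer.

```python
import json
import re

def label_snippet(snippet, phrases, label="ML"):
    """Return (snippet, {"entities": [...]}) or None when no phrase occurs."""
    entities = []
    # Try longer phrases first so they win over phrases nested inside them.
    for phrase in sorted(phrases, key=len, reverse=True):
        # Match whole words only, not fragments inside longer words.
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        for m in re.finditer(pattern, snippet):
            start, end = m.span()
            # Keep only spans that do not overlap one already accepted.
            if all(end <= s or start >= e for s, e, _ in entities):
                entities.append((start, end, label))
    if not entities:
        return None
    return (snippet, {"entities": sorted(entities)})

record = label_snippet("a boltzmann machine with a few weights labeled",
                       ["boltzmann machine", "machine"])
# json.dumps takes care of quoting and escaping inside the snippet text.
print(json.dumps(record))
```

Serializing with `json.dumps` avoids having to strip apostrophes from the text beforehand, since the quoting is handled by the library.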

Train the Recognizer

Re-training an existing spaCy NER model means importing the existing model and updating it with the training example snippets. However, if not done properly, one can experience what is called the ‘catastrophic forgetting problem’, which means that existing trained labels get forgotten as the new ones are learned. Two steps are recommended to limit this problem. First, the training should be limited to about 20 iterations, since repeated iterations risk more forgetting. Second, the training corpus should include snippets with existing labels when training begins. This way the existing labels are seen again, and the degree of ‘forgetting’ is lessened.
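
The second step amounts to blending two lists of examples before training. A minimal sketch, assuming both lists use the same `(text, {"entities": [...]})` form; the function name `mix_training_data`, the `revision_fraction` parameter, and the choice to size the revision sample relative to the number of new examples are all assumptions for illustration.

```python
import random

def mix_training_data(new_examples, revision_examples,
                      revision_fraction=0.1, seed=0):
    """Combine new-label examples with a sample of existing-label examples."""
    rng = random.Random(seed)
    # Size the revision sample as a fraction of the new examples (assumption).
    wanted = int(len(new_examples) * revision_fraction)
    wanted = min(wanted, len(revision_examples))
    mixed = list(new_examples) + rng.sample(list(revision_examples), wanted)
    # Shuffle so the two kinds of examples are interleaved.
    rng.shuffle(mixed)
    return mixed

new = [("snippet %d about concept drift" % i, {"entities": [(17, 30, "ML")]})
       for i in range(50)]
old = [("snippet %d about Google" % i, {"entities": [(17, 23, "ORG")]})
       for i in range(20)]
print(len(mix_training_data(new, old)))
```

Varying `revision_fraction` is one way to test how the share of existing-label snippets affects the forgetting behavior.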

I looked in vain for the existing training examples used by spaCy for its existing entity labels. Not having found them, I decided to create my own snippets with existing labels.

To do so, I repeated steps similar to those outlined above, only now using the existing labels rather than new ones. Like before, we also need to construct the training example in the proper JSON form:

import os
import spacy
import random
from spacy.gold import GoldParse    # spaCy 2.x only; spacy.gold was removed in spaCy 3.x

from smart_open import smart_open
import en_core_web_sm

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wp_machine_learning_sentences_48.txt'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\results\standard_training_data_48.json'
documents = smart_open(in_f, 'r', encoding='utf-8')
nlp = en_core_web_sm.load()
revision_data = []
with open(out_f, 'w', encoding='utf-8') as output:
#    for doc in nlp.pipe(revision_texts):
    for line in documents:
        line = str(line)
        sublist = line.split(', ')
        for sentence in sublist:
            sentence = str(sentence)
            sentence = sentence.replace(']','')
            sentence = sentence.replace("'", "")
            sentence = sentence.replace('[','')
            sentence = sentence.replace('\n', '')
            length = len(sentence)
            if length < 40:
                continue
            else:
                doc = nlp(sentence)
#        tags = [w.tag_ for w in doc]
#        heads = [w.head.i for w in doc]
#        deps = [w.dep_ for w in doc]
                entities = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
#        revision_data.append((doc, GoldParse(doc, tags=tags, heads=heads,
#                                            deps=deps, entities=entities)))
                revision_data.append((doc, GoldParse(doc, entities=entities)))
#        print(revision_data)
                doc = str(doc)
                entities = str(entities)
#            revision_str = (doc + ', ' + entities + '\n')
                revision_str = ('("' + doc + '", {"entities": ' + entities + '}),\n')
                output.write(revision_str)
    print('Got this far!')

This second pass using the existing NER labels took about 19 min.

The last step in our prep for updating the existing NER model is to remove short stubs from our training examples, as this code achieves:

from smart_open import smart_open

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\results\standard_training_data_48.json'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\results\standard_training_data.json'

documents = smart_open(in_f, 'r', encoding='utf-8')

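with open(out_f, 'w', encoding='utf-8') as output:
    prev_length = 0                      # length of the last line we kept
    for line in documents:
        try:
            line = str(line)
            length = len(line)
            # Skip short stubs, and skip any line whose length equals that of
            # the last kept line (presumably a crude guard against repeats)
            if length == prev_length or length < 40:
                continue
            prev_length = length
            output.write(line)
        except Exception as e:
            print('Exception error: ' + str(e))
    print('Processing complete!')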

Continue Training the Recognizer

We add about 10% of existing label training examples to those we have already generated for the ‘ML’ training set to limit the ‘catastrophic forgetting’ problem. With this new input deck, we are now ready to run and update the NER recognizer, making sure to keep our iterations to no more than 20 epochs. Here is the code, including some of the 14 K training examples:

# adapted from https://www.machinelearningplus.com/nlp/training-custom-ner-model-in-spacy/

# Import and load the spacy model
import spacy
import en_core_web_sm
import random
from spacy.util import minibatch, compounding

output_dir = r'C:\1-PythonProjects\kbpedia\v300\models\results\en_core_ml'
 
nlp = en_core_web_sm.load()

# Get the ner component
ner=nlp.get_pipe('ner')

# Add new label
LABEL = 'ML'

# Add the label to ner
ner.add_label(LABEL)

# Load training data 
TRAIN_DATA = [("ökonomik path dependence in spatial networks the", {"entities": [(9, 24, "ML")]}),
("β with respect to the loss function v if the", {"entities": [(22, 35, "ML")]}),
("ε approximate nearest neighbor search include kd", {"entities": [(14, 37, "ML")]}),
("ξ i e φ x is the feature vector produced for a", {"entities": [(17, 24, "ML")]}),
("a a survey on concept drift adaptation acm", {"entities": [(14, 27, "ML")]}),
("a bayesian gaussian mixture model is commonly", {"entities": [(20, 33, "ML")]}),
("a beginner s guide to factor analysis", {"entities": [(22, 37, "ML")]}),
("a boltzmann machine with a few weights labeled", {"entities": [(2, 19, "ML")]}),
# See 'C:\1-PythonProjects\kbpedia\v300\models\inputs\ml_training_data.json' for full listing
]

# Retrieve labels
for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])
            
# Resume training (since we are extending an existing model)
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List pipes needed for training
pipe_exceptions = ['ner', 'trf_wordpiecer', 'trf_tok2vec']

# List all other pipes
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]  

# Begin training on restricted pipeline components
with nlp.disable_pipes(*other_pipes):
    random.seed(0)
    sizes = compounding(1.0, 4.0, 1.001)
    # Iterate over the training set 20 times
    for itn in range(20):
        # Shuffle examples
        random.shuffle(TRAIN_DATA)
        # Divide the examples into batches for training
        batches = minibatch(TRAIN_DATA, size=sizes)
        # Set empty dictionary
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            # Call update() on each batch; see narrative
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
            print('Losses: ', losses)
    # Save model to disk
    nlp.meta['name'] = 'ml_sm'  # rename model
    nlp.to_disk(output_dir)
    print('Saved model to: ', output_dir)
    

When run, a loss calculation is made for each training input example, which gets repeated in full over the number of iterations specified (20 in this case). This takes about 4 hours 15 minutes to run on my laptop. Also note we are saving out the newly trained model, en_core_ml, at the completion of the training.
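
To get a feel for how the batch sizes evolve under `compounding(1.0, 4.0, 1.001)`, the sketch below re-creates the two helpers in plain Python. These are stand-ins written for illustration, not spaCy's own source, and this version of `compounding` only handles the increasing case used here.

```python
from itertools import islice

def compounding(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound

def minibatch(items, sizes):
    """Split items into consecutive batches whose lengths follow `sizes`."""
    it = iter(items)
    while True:
        # The fractional size is truncated to an integer batch length.
        batch = list(islice(it, int(next(sizes))))
        if not batch:
            return
        yield batch

sizes = compounding(1.0, 4.0, 1.001)
batches = list(minibatch(range(2000), sizes))
print(len(batches), len(batches[0]), len(batches[-1]))
```

With a growth factor this close to 1, the batch size stays at a single example for several hundred batches, which fits the per-example loss reporting seen early in a run.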

Unfortunately, though my testing showed we were picking up the ‘ML’ tag, it was not doing so for many of the input training examples. To see if I could improve on this performance, I tested the following options, all to no material benefit:

  • Reducing the number of training examples
  • Reducing the number of entity patterns
  • Varying the sentence snippet size, from 48 characters (more typical of online examples) to 180 characters (more typical of actual sentence length)
  • Dialing back the number of iterations (even a single iteration showed the forgetting behavior)
  • Testing the drop setting between 0.5 and 0.0
  • Changing the relative contribution of existing NER labels to the training set
  • Including or excluding negative (no entity match) training examples.

In all cases, I continued to see the ‘forgetting’ problem and observed that larger numbers of iterations adversely affected the number of ‘ML’ labels assigned. These results are not in accordance with the documentation I have found. Possible reasons for this continued poor performance might include:

  • Too many diverse patterns in the training set
  • Issues possibly arising from the synthetic sentence snippets I generated
  • An inadequate percentage of existing label training snippets, or
  • A coding error.
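
One way to put a number on how often the ‘ML’ tag is recovered from the training examples is a simple span-level recall. The sketch below is an assumption-laden illustration rather than the test I ran: `label_recall` and `fake_predict` are hypothetical names, and `predict` stands in for any callable that returns `(start_char, end_char, label)` tuples for a text, such as one built from `doc.ents` on the updated model.

```python
def label_recall(gold_examples, predict, label="ML"):
    """Fraction of gold spans with `label` that `predict` reproduces exactly."""
    total = 0
    hits = 0
    for text, annotations in gold_examples:
        predicted = set(predict(text))
        for span in annotations["entities"]:
            if span[2] != label:
                continue
            total += 1
            if tuple(span) in predicted:
                hits += 1
    return hits / total if total else 0.0

gold = [
    ("a boltzmann machine with a few weights labeled",
     {"entities": [(2, 19, "ML")]}),
    ("a beginner s guide to factor analysis",
     {"entities": [(22, 37, "ML")]}),
]

def fake_predict(text):
    """Stand-in predictor that only recognizes one phrase."""
    start = text.find("boltzmann machine")
    return [(start, start + 17, "ML")] if start >= 0 else []

print(label_recall(gold, fake_predict))
```

Tracking this figure across the options listed above would make it easier to compare runs than spot-checking individual snippets.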

(Named) Entity Recognition – Rule-Based

Unlike certain named entities like persons or organizations or locations, the number of ‘machine learning’ instances is more bounded and finite. Since I had already captured a nearly complete set of the entities in this space through the thousand or so Wikipedia titles, perhaps I did not need a trained model, but one based more on rules and set patterns.

Fortunately, spaCy has such a capability called EntityRuler. It provides rule-based matching for assigning label tags to text. Given that I had been unable to better tune the training model, I decided to test this option as well.
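
The pattern entries that EntityRuler consumes are plain dictionaries with a label and a pattern, so they can be generated from a list of titles. A minimal sketch, assuming lower-casing and apostrophe removal as the only clean-up; `titles_to_patterns` is a hypothetical helper, not the routine I used to build my own list.

```python
def titles_to_patterns(titles, label="ML"):
    """Turn article titles into de-duplicated label/pattern dictionaries."""
    patterns = []
    seen = set()
    for title in titles:
        # Lower-case and drop apostrophes so patterns match the cleaned text.
        phrase = title.lower().replace("'", "").strip()
        if phrase and phrase not in seen:
            seen.add(phrase)
            patterns.append({"label": label, "pattern": phrase})
    return patterns

print(titles_to_patterns(["Cover's theorem", "AdaBoost", "adaboost"]))
```

Generating the list this way keeps it in step with the titles retrieved from Wikipedia, at the cost of losing any hand edits.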

Like the training models, the rule-based approach takes a set of known input patterns, matches them against the text, and labels the matches. So, aside from the need to assemble the known patterns, which we had already done above, the actual code to invoke this option is rather simple and straightforward:

from spacy.pipeline import EntityRuler
import en_core_web_sm

output_dir = r'C:\1-PythonProjects\kbpedia\v300\models\results\en_core_ml'
 
nlp = en_core_web_sm.load()

ruler = EntityRuler(nlp, overwrite_ents=True)

patterns = [{'label': 'ML', 'pattern': 'nearest neighbor search'},
{'label': 'ML', 'pattern': '80 million tiny images'},
{'label': 'ML', 'pattern': 'ablation'},
{'label': 'ML', 'pattern': 'absorbing markov chain'},
{'label': 'ML', 'pattern': 'action model learning'},
{'label': 'ML', 'pattern': 'activation function'},
{'label': 'ML', 'pattern': 'active learning'},
{'label': 'ML', 'pattern': 'activity recognition'},
{'label': 'ML', 'pattern': 'adaboost'},
{'label': 'ML', 'pattern': 'adagrad'},
{'label': 'ML', 'pattern': 'adaline'},
{'label': 'ML', 'pattern': 'adaptive neuro fuzzy inference system'},
{'label': 'ML', 'pattern': 'adaptive resonance theory'},
{'label': 'ML', 'pattern': 'additive smoothing'},
{'label': 'ML', 'pattern': 'adversarial machine learning'},
{'label': 'ML', 'pattern': 'aixi'},
{'label': 'ML', 'pattern': 'alchemyapi'},
{'label': 'ML', 'pattern': 'alexnet'},
{'label': 'ML', 'pattern': 'algorithm selection'},
{'label': 'ML', 'pattern': 'algorithmic bias'},
{'label': 'ML', 'pattern': 'algorithmic composition'},
{'label': 'ML', 'pattern': 'algorithmic inference'},
{'label': 'ML', 'pattern': 'algorithmic learning theory'},
{'label': 'ML', 'pattern': 'algorithms of oppression'},
{'label': 'ML', 'pattern': 'almeida–pineda recurrent backpropagation'},
{'label': 'ML', 'pattern': 'alopex'},
{'label': 'ML', 'pattern': 'alphago'},
{'label': 'ML', 'pattern': 'alphago zero'},
{'label': 'ML', 'pattern': 'alphastar'},
{'label': 'ML', 'pattern': 'alphazero'},
{'label': 'ML', 'pattern': 'alterego'},
{'label': 'ML', 'pattern': 'alternating decision tree'},
{'label': 'ML', 'pattern': 'analogical modeling'},
{'label': 'ML', 'pattern': 'anomaly detection'},
{'label': 'ML', 'pattern': 'anti-unification'},
{'label': 'ML', 'pattern': 'apprenticeship learning'},
{'label': 'ML', 'pattern': 'archetypal analysis'},
{'label': 'ML', 'pattern': 'artificial development'},
{'label': 'ML', 'pattern': 'artificial intelligence system'},
{'label': 'ML', 'pattern': 'artificial neural network'},
{'label': 'ML', 'pattern': 'artificial neuron'},
{'label': 'ML', 'pattern': 'artisto'},
{'label': 'ML', 'pattern': 'associative classifier'},
{'label': 'ML', 'pattern': 'astrostatistics'},
{'label': 'ML', 'pattern': 'augmented analytics'},
{'label': 'ML', 'pattern': 'autoassociative memory'},
{'label': 'ML', 'pattern': 'autoencoder'},
{'label': 'ML', 'pattern': 'automated machine learning'},
{'label': 'ML', 'pattern': 'automated pain recognition'},
{'label': 'ML', 'pattern': 'averaged one-dependence estimators'},
{'label': 'ML', 'pattern': 'backpropagation'},
{'label': 'ML', 'pattern': 'bag-of-words'},
{'label': 'ML', 'pattern': 'ball tree'},
{'label': 'ML', 'pattern': 'base rate'},
{'label': 'ML', 'pattern': 'baum–welch algorithm'},
{'label': 'ML', 'pattern': 'bayesian hierarchical modeling'},
{'label': 'ML', 'pattern': 'bayesian interpretation of kernel regularization'},
{'label': 'ML', 'pattern': 'bayesian network'},
{'label': 'ML', 'pattern': 'bayesian optimization'},
{'label': 'ML', 'pattern': 'bayesian regret'},
{'label': 'ML', 'pattern': 'bayesian structural time series'},
{'label': 'ML', 'pattern': 'bcpnn'},
{'label': 'ML', 'pattern': 'behavioral clustering'},
{'label': 'ML', 'pattern': 'bernoulli scheme'},
{'label': 'ML', 'pattern': 'bias–variance tradeoff'},
{'label': 'ML', 'pattern': 'biclustering'},
{'label': 'ML', 'pattern': 'bidirectional associative memory'},
{'label': 'ML', 'pattern': 'bidirectional recurrent neural networks'},
{'label': 'ML', 'pattern': 'binary classification'},
{'label': 'ML', 'pattern': 'bioz'},
{'label': 'ML', 'pattern': 'boltzmann machine'},
{'label': 'ML', 'pattern': 'bondys theorem'},
{'label': 'ML', 'pattern': 'bongard problem'},
{'label': 'ML', 'pattern': 'boosting'},
{'label': 'ML', 'pattern': 'bootstrap aggregating'},
{'label': 'ML', 'pattern': 'bradley–terry model'},
{'label': 'ML', 'pattern': 'brown clustering'},
{'label': 'ML', 'pattern': 'brownboost'},
{'label': 'ML', 'pattern': 'burst error'},
{'label': 'ML', 'pattern': 'c4.5 algorithm'},
{'label': 'ML', 'pattern': 'calibration'},
{'label': 'ML', 'pattern': 'canonical correspondence analysis'},
{'label': 'ML', 'pattern': 'capsule neural network'},
{'label': 'ML', 'pattern': 'cartesian genetic programming'},
{'label': 'ML', 'pattern': 'cascading classifiers'},
{'label': 'ML', 'pattern': 'case-based reasoning'},
{'label': 'ML', 'pattern': 'catastrophic interference'},
{'label': 'ML', 'pattern': 'category utility'},
{'label': 'ML', 'pattern': 'causal markov condition'},
{'label': 'ML', 'pattern': 'cellular evolutionary algorithm'},
{'label': 'ML', 'pattern': 'cellular neural network'},
{'label': 'ML', 'pattern': 'cerebellar model articulation controller'},
{'label': 'ML', 'pattern': 'chainer'},
{'label': 'ML', 'pattern': 'chi-square automatic interaction detection'},
{'label': 'ML', 'pattern': 'classifier chains'},
{'label': 'ML', 'pattern': 'clever score'},
{'label': 'ML', 'pattern': 'cluster analysis'},
{'label': 'ML', 'pattern': 'clustering high-dimensional data'},
{'label': 'ML', 'pattern': 'clustering illusion'},
{'label': 'ML', 'pattern': 'cma-es'},
{'label': 'ML', 'pattern': 'cn2 algorithm'},
{'label': 'ML', 'pattern': 'co-training'},
{'label': 'ML', 'pattern': 'coboosting'},
{'label': 'ML', 'pattern': 'codi'},
{'label': 'ML', 'pattern': 'cognitive computer'},
{'label': 'ML', 'pattern': 'cognitive robotics'},
{'label': 'ML', 'pattern': 'collostructional analysis'},
{'label': 'ML', 'pattern': 'committee machine'},
{'label': 'ML', 'pattern': 'common-method variance'},
{'label': 'ML', 'pattern': 'competitive learning'},
{'label': 'ML', 'pattern': 'compositional pattern-producing network'},
{'label': 'ML', 'pattern': 'computational cybernetics'},
{'label': 'ML', 'pattern': 'computational learning theory'},
{'label': 'ML', 'pattern': 'computational neurogenetic modeling'},
{'label': 'ML', 'pattern': 'computer-automated design'},
{'label': 'ML', 'pattern': 'concept class'},
{'label': 'ML', 'pattern': 'concept drift'},
{'label': 'ML', 'pattern': 'concept learning'},
{'label': 'ML', 'pattern': 'conceptual clustering'},
{'label': 'ML', 'pattern': 'conditional random field'},
{'label': 'ML', 'pattern': 'confabulation'},
{'label': 'ML', 'pattern': 'confusion matrix'},
{'label': 'ML', 'pattern': 'connectionist temporal classification'},
{'label': 'ML', 'pattern': 'consensus clustering'},
{'label': 'ML', 'pattern': 'constellation model'},
{'label': 'ML', 'pattern': 'constrained clustering'},
{'label': 'ML', 'pattern': 'constrained conditional model'},
{'label': 'ML', 'pattern': 'constructing skill trees'},
{'label': 'ML', 'pattern': 'constructive cooperative coevolution'},
{'label': 'ML', 'pattern': 'conversica'},
{'label': 'ML', 'pattern': 'convolutional deep belief network'},
{'label': 'ML', 'pattern': 'convolutional neural network', 'id': 'cnn'},
{'label': 'ML', 'pattern': 'CNN', 'id': 'cnn'},
{'label': 'ML', 'pattern': 'correlation clustering'},
{'label': 'ML', 'pattern': 'correspondence analysis'},
{'label': 'ML', 'pattern': 'count sketch'},
{'label': 'ML', 'pattern': 'coupled pattern learner'},
{'label': 'ML', 'pattern': 'covers theorem'},
{'label': 'ML', 'pattern': 'cross entropy'},
{'label': 'ML', 'pattern': 'cross-validation'},
{'label': 'ML', 'pattern': 'cultural algorithm'},
{'label': 'ML', 'pattern': 'curse of dimensionality'},
{'label': 'ML', 'pattern': 'darkforest'},
{'label': 'ML', 'pattern': 'darpa lagr program'},
{'label': 'ML', 'pattern': 'darwintunes'},
{'label': 'ML', 'pattern': 'data augmentation'},
{'label': 'ML', 'pattern': 'data exploration'},
{'label': 'ML', 'pattern': 'data pre-processing'},
{'label': 'ML', 'pattern': 'datasets.load'},
{'label': 'ML', 'pattern': 'decision boundary'},
{'label': 'ML', 'pattern': 'decision list'},
{'label': 'ML', 'pattern': 'decision tree learning'},
{'label': 'ML', 'pattern': 'decision tree pruning'},
{'label': 'ML', 'pattern': 'deductive classifier'},
{'label': 'ML', 'pattern': 'deep belief network'},
{'label': 'ML', 'pattern': 'deep image prior'},
{'label': 'ML', 'pattern': 'deep instinct'},
{'label': 'ML', 'pattern': 'deep lambertian networks'},
{'label': 'ML', 'pattern': 'deep learning'},
{'label': 'ML', 'pattern': 'deep learning processor'},
{'label': 'ML', 'pattern': 'deep learning studio'},
{'label': 'ML', 'pattern': 'deep reinforcement learning'},
{'label': 'ML', 'pattern': 'deepfake'},
{'label': 'ML', 'pattern': 'deepfake pornography'},
{'label': 'ML', 'pattern': 'deeplearning4j'},
{'label': 'ML', 'pattern': 'deepmind'},
{'label': 'ML', 'pattern': 'deepnude'},
{'label': 'ML', 'pattern': 'deepspeed'},
{'label': 'ML', 'pattern': 'dehaene–changeux model'},
{'label': 'ML', 'pattern': 'delta rule'},
{'label': 'ML', 'pattern': 'dendrogram'},
{'label': 'ML', 'pattern': 'dependability state model'},
{'label': 'ML', 'pattern': 'detailed balance'},
{'label': 'ML', 'pattern': 'detrended correspondence analysis'},
{'label': 'ML', 'pattern': 'developmental robotics'},
{'label': 'ML', 'pattern': 'dexnet'},
{'label': 'ML', 'pattern': 'diffbot'},
{'label': 'ML', 'pattern': 'differentiable neural computer'},
{'label': 'ML', 'pattern': 'differential evolution'},
{'label': 'ML', 'pattern': 'diffusion map'},
{'label': 'ML', 'pattern': 'dimensionality reduction'},
{'label': 'ML', 'pattern': 'discrete phase-type distribution'},
{'label': 'ML', 'pattern': 'discriminative model'},
{'label': 'ML', 'pattern': 'dispersive flies optimisation'},
{'label': 'ML', 'pattern': 'dissociated press'},
{'label': 'ML', 'pattern': 'distribution learning theory'},
{'label': 'ML', 'pattern': 'document classification'},
{'label': 'ML', 'pattern': 'domain adaptation'},
{'label': 'ML', 'pattern': 'dominance-based rough set approach'},
{'label': 'ML', 'pattern': 'doubly stochastic model'},
{'label': 'ML', 'pattern': 'dynamic bayesian network'},
{'label': 'ML', 'pattern': 'dynamic markov compression'},
{'label': 'ML', 'pattern': 'dynamic time warping'},
{'label': 'ML', 'pattern': 'dynamic topic model'},
{'label': 'ML', 'pattern': 'dynamic unobserved effects model'},
{'label': 'ML', 'pattern': 'eager learning'},
{'label': 'ML', 'pattern': 'early stopping'},
{'label': 'ML', 'pattern': 'echo state network'},
{'label': 'ML', 'pattern': 'effective fitness'},
{'label': 'ML', 'pattern': 'elastic map'},
{'label': 'ML', 'pattern': 'elastic matching'},
{'label': 'ML', 'pattern': 'elastic net regularization'},
{'label': 'ML', 'pattern': 'electricity price forecasting'},
{'label': 'ML', 'pattern': 'elmo'},
{'label': 'ML', 'pattern': 'em algorithm and gmm model'},
{'label': 'ML', 'pattern': 'empirical risk minimization'},
{'label': 'ML', 'pattern': 'end-to-end reinforcement learning'},
{'label': 'ML', 'pattern': 'ensemble learning'},
{'label': 'ML', 'pattern': 'entropy rate'},
{'label': 'ML', 'pattern': 'error tolerance'},
{'label': 'ML', 'pattern': 'error-driven learning'},
{'label': 'ML', 'pattern': 'eurisko'},
{'label': 'ML', 'pattern': 'european neural network society'},
{'label': 'ML', 'pattern': 'evaluation of binary classifiers'},
{'label': 'ML', 'pattern': 'evolution strategy'},
{'label': 'ML', 'pattern': 'evolution window'},
{'label': 'ML', 'pattern': 'evolutionary algorithm'},
{'label': 'ML', 'pattern': 'evolutionary art'},
{'label': 'ML', 'pattern': 'evolutionary multimodal optimization'},
{'label': 'ML', 'pattern': 'evolutionary music'},
{'label': 'ML', 'pattern': 'evolutionary programming'},
{'label': 'ML', 'pattern': 'evolvability'},
{'label': 'ML', 'pattern': 'evolved antenna'},
{'label': 'ML', 'pattern': 'evolving classification function'},
{'label': 'ML', 'pattern': 'examples of markov chains'},
{'label': 'ML', 'pattern': 'expectation propagation'},
{'label': 'ML', 'pattern': 'expectation–maximization algorithm'},
{'label': 'ML', 'pattern': 'explanation-based learning'},
{'label': 'ML', 'pattern': 'extension neural network'},
{'label': 'ML', 'pattern': 'extremal ensemble learning'},
{'label': 'ML', 'pattern': 'extreme learning machine'},
{'label': 'ML', 'pattern': 'f-score'},
{'label': 'ML', 'pattern': 'faceapp'},
{'label': 'ML', 'pattern': 'facial recognition system'},
{'label': 'ML', 'pattern': 'factor analysis'},
{'label': 'ML', 'pattern': 'factor regression model'},
{'label': 'ML', 'pattern': 'factored language model'},
{'label': 'ML', 'pattern': 'farthest-first traversal'},
{'label': 'ML', 'pattern': 'feature engineering'},
{'label': 'ML', 'pattern': 'feature extraction'},
{'label': 'ML', 'pattern': 'feature hashing'},
{'label': 'ML', 'pattern': 'feature learning'},
{'label': 'ML', 'pattern': 'feature scaling'},
{'label': 'ML', 'pattern': 'feature selection'},
{'label': 'ML', 'pattern': 'feature selection toolbox'},
{'label': 'ML', 'pattern': 'federated learning'},
{'label': 'ML', 'pattern': 'feed forward'},
{'label': 'ML', 'pattern': 'feedforward neural network'},
{'label': 'ML', 'pattern': 'feret database'},
{'label': 'ML', 'pattern': 'findface'},
{'label': 'ML', 'pattern': 'first-difference estimator'},
{'label': 'ML', 'pattern': 'first-order inductive learner'},
{'label': 'ML', 'pattern': 'fisher kernel'},
{'label': 'ML', 'pattern': 'fitness approximation'},
{'label': 'ML', 'pattern': 'fly algorithm'},
{'label': 'ML', 'pattern': 'formal concept analysis'},
{'label': 'ML', 'pattern': 'forward algorithm'},
{'label': 'ML', 'pattern': 'forward–backward algorithm'},
{'label': 'ML', 'pattern': 'frequent pattern discovery'},
{'label': 'ML', 'pattern': 'gated recurrent unit'},
{'label': 'ML', 'pattern': 'gaussian adaptation'},
{'label': 'ML', 'pattern': 'gaussian process'},
{'label': 'ML', 'pattern': 'gaussian process emulator'},
{'label': 'ML', 'pattern': 'gene expression programming'},
{'label': 'ML', 'pattern': 'gene prediction'},
{'label': 'ML', 'pattern': 'general regression neural network'},
{'label': 'ML', 'pattern': 'generalization error'},
{'label': 'ML', 'pattern': 'generalized canonical correlation'},
{'label': 'ML', 'pattern': 'generalized filtering'},
{'label': 'ML', 'pattern': 'generalized hebbian algorithm'},
{'label': 'ML', 'pattern': 'generalized iterative scaling'},
{'label': 'ML', 'pattern': 'generalized multidimensional scaling'},
{'label': 'ML', 'pattern': 'generative adversarial network'},
{'label': 'ML', 'pattern': 'generative model'},
{'label': 'ML', 'pattern': 'generative topographic map'},
{'label': 'ML', 'pattern': 'generec'},
{'label': 'ML', 'pattern': 'genetic algorithm'},
{'label': 'ML', 'pattern': 'genetic programming'},
{'label': 'ML', 'pattern': 'genetic representation'},
{'label': 'ML', 'pattern': 'geographical cluster'},
{'label': 'ML', 'pattern': 'gesture description language'},
{'label': 'ML', 'pattern': 'geworkbench'},
{'label': 'ML', 'pattern': 'glimmer'},
{'label': 'ML', 'pattern': 'glottochronology'},
{'label': 'ML', 'pattern': 'google brain'},
{'label': 'ML', 'pattern': 'google matrix'},
{'label': 'ML', 'pattern': 'google nest'},
{'label': 'ML', 'pattern': 'google neural machine translation'},
{'label': 'ML', 'pattern': 'gpt'},
{'label': 'ML', 'pattern': 'gpt-2'},
{'label': 'ML', 'pattern': 'gpt-3'},
{'label': 'ML', 'pattern': 'gradient boosting'},
{'label': 'ML', 'pattern': 'gramian matrix'},
{'label': 'ML', 'pattern': 'grammar induction'},
{'label': 'ML', 'pattern': 'grammatical evolution'},
{'label': 'ML', 'pattern': 'granular computing'},
{'label': 'ML', 'pattern': 'graph kernel'},
{'label': 'ML', 'pattern': 'grossberg network'},
{'label': 'ML', 'pattern': 'group method of data handling'},
{'label': 'ML', 'pattern': 'growing self-organizing map'},
{'label': 'ML', 'pattern': 'growth function'},
{'label': 'ML', 'pattern': 'handwriting recognition'},
{'label': 'ML', 'pattern': 'hard sigmoid'},
{'label': 'ML', 'pattern': 'hebbian theory'},
{'label': 'ML', 'pattern': 'helmholtz machine'},
{'label': 'ML', 'pattern': 'hidden markov model'},
{'label': 'ML', 'pattern': 'hierarchical classification'},
{'label': 'ML', 'pattern': 'hierarchical temporal memory'},
{'label': 'ML', 'pattern': 'hinge loss'},
{'label': 'ML', 'pattern': 'hopfield network'},
{'label': 'ML', 'pattern': 'horovod'},
{'label': 'ML', 'pattern': 'huber loss'},
{'label': 'ML', 'pattern': 'hybrid kohonen self-organizing map'},
{'label': 'ML', 'pattern': 'hybrid neural network'},
{'label': 'ML', 'pattern': 'hyper basis function network'},
{'label': 'ML', 'pattern': 'hyperneat'},
{'label': 'ML', 'pattern': 'hyperparameter'},
{'label': 'ML', 'pattern': 'hyperparameter optimization'},
{'label': 'ML', 'pattern': 'id3 algorithm'},
{'label': 'ML', 'pattern': 'idistance'},
{'label': 'ML', 'pattern': 'imagenets'},
{'label': 'ML', 'pattern': 'inauthentic text'},
{'label': 'ML', 'pattern': 'incremental learning'},
{'label': 'ML', 'pattern': 'independent component analysis'},
{'label': 'ML', 'pattern': 'induction of regular languages'},
{'label': 'ML', 'pattern': 'inductive bias'},
{'label': 'ML', 'pattern': 'inductive logic programming'},
{'label': 'ML', 'pattern': 'inductive probability'},
{'label': 'ML', 'pattern': 'inductive programming'},
{'label': 'ML', 'pattern': 'infer.net'},
{'label': 'ML', 'pattern': 'inferential theory of learning'},
{'label': 'ML', 'pattern': 'influence diagram'},
{'label': 'ML', 'pattern': 'infomax'},
{'label': 'ML', 'pattern': 'information fuzzy networks'},
{'label': 'ML', 'pattern': 'information gain in decision trees'},
{'label': 'ML', 'pattern': 'information gain ratio'},
{'label': 'ML', 'pattern': 'instance selection'},
{'label': 'ML', 'pattern': 'instance-based learning'},
{'label': 'ML', 'pattern': 'instantaneously trained neural networks'},
{'label': 'ML', 'pattern': 'intel realsense'},
{'label': 'ML', 'pattern': 'interacting particle system'},
{'label': 'ML', 'pattern': 'interactive activation and competition networks'},
{'label': 'ML', 'pattern': 'interactive machine translation'},
{'label': 'ML', 'pattern': 'inverted pendulum'},
{'label': 'ML', 'pattern': 'ipo underpricing algorithm'},
{'label': 'ML', 'pattern': 'ircf360'},
{'label': 'ML', 'pattern': 'isolation forest'},
{'label': 'ML', 'pattern': 'isotropic position'},
{'label': 'ML', 'pattern': 'item response theory'},
{'label': 'ML', 'pattern': 'iterative viterbi decoding'},
{'label': 'ML', 'pattern': 'java grammatical evolution'},
{'label': 'ML', 'pattern': 'jpred'},
{'label': 'ML', 'pattern': 'junction tree algorithm'},
{'label': 'ML', 'pattern': 'k-nearest neighbors'},
{'label': 'ML', 'pattern': 'kalman filter'},
{'label': 'ML', 'pattern': 'katzs back-off model'},
{'label': 'ML', 'pattern': 'KBpedia series'},
{'label': 'ML', 'pattern': 'keras'},
{'label': 'ML', 'pattern': 'kernel adaptive filter'},
{'label': 'ML', 'pattern': 'kernel density estimation'},
{'label': 'ML', 'pattern': 'kernel eigenvoice'},
{'label': 'ML', 'pattern': 'kernel embedding of distributions'},
{'label': 'ML', 'pattern': 'kernel method'},
{'label': 'ML', 'pattern': 'kernel perceptron'},
{'label': 'ML', 'pattern': 'kernel principal component analysis'},
{'label': 'ML', 'pattern': 'kinect'},
{'label': 'ML', 'pattern': 'knowledge distillation'},
{'label': 'ML', 'pattern': 'knowledge integration'},
{'label': 'ML', 'pattern': 'label propagation algorithm'},
{'label': 'ML', 'pattern': 'labeled data'},
{'label': 'ML', 'pattern': 'language acquisition device'},
{'label': 'ML', 'pattern': 'language model'},
{'label': 'ML', 'pattern': 'large margin nearest neighbor'},
{'label': 'ML', 'pattern': 'large memory storage and retrieval neural network'},
{'label': 'ML', 'pattern': 'latent class model'},
{'label': 'ML', 'pattern': 'latent dirichlet allocation'},
{'label': 'ML', 'pattern': 'latent semantic analysis'},
{'label': 'ML', 'pattern': 'latent variable'},
{'label': 'ML', 'pattern': 'latent variable model'},
{'label': 'ML', 'pattern': 'lazy learning'},
{'label': 'ML', 'pattern': 'leabra'},
{'label': 'ML', 'pattern': 'leakage'},
{'label': 'ML', 'pattern': 'learnable function class'},
{'label': 'ML', 'pattern': 'learning automaton'},
{'label': 'ML', 'pattern': 'learning classifier system'},
{'label': 'ML', 'pattern': 'learning curve'},
{'label': 'ML', 'pattern': 'learning rate'},
{'label': 'ML', 'pattern': 'learning rule'},
{'label': 'ML', 'pattern': 'learning to rank'},
{'label': 'ML', 'pattern': 'learning vector quantization'},
{'label': 'ML', 'pattern': 'learning with errors'},
{'label': 'ML', 'pattern': 'least-squares support-vector machine'},
{'label': 'ML', 'pattern': 'leave-one-out error'},
{'label': 'ML', 'pattern': 'leela chess zero'},
{'label': 'ML', 'pattern': 'leela zero'},
{'label': 'ML', 'pattern': 'lenet'},
{'label': 'ML', 'pattern': 'lernmatrix'},
{'label': 'ML', 'pattern': 'life-time of correlation'},
{'label': 'ML', 'pattern': 'lightgbm'},
{'label': 'ML', 'pattern': 'linde–buzo–gray algorithm'},
{'label': 'ML', 'pattern': 'linear classifier'},
{'label': 'ML', 'pattern': 'linear discriminant analysis'},
{'label': 'ML', 'pattern': 'linear genetic programming'},
{'label': 'ML', 'pattern': 'linear predictor function'},
{'label': 'ML', 'pattern': 'linear separability'},
{'label': 'ML', 'pattern': 'liquid state machine'},
{'label': 'ML', 'pattern': 'list of datasets for machine-learning research'},
{'label': 'ML', 'pattern': 'local case-control sampling'},
{'label': 'ML', 'pattern': 'local independence'},
{'label': 'ML', 'pattern': 'local outlier factor'},
{'label': 'ML', 'pattern': 'local tangent space alignment'},
{'label': 'ML', 'pattern': 'locality-sensitive hashing'},
{'label': 'ML', 'pattern': 'log-linear model'},
{'label': 'ML', 'pattern': 'logic learning machine'},
{'label': 'ML', 'pattern': 'logitboost'},
{'label': 'ML', 'pattern': 'long short-term memory'},
{'label': 'ML', 'pattern': 'loss function'},
{'label': 'ML', 'pattern': 'loss functions for classification'},
{'label': 'ML', 'pattern': 'low-rank approximation'},
{'label': 'ML', 'pattern': 'low-rank matrix approximations'},
{'label': 'ML', 'pattern': 'lpboost'},
{'label': 'ML', 'pattern': 'm-theory'},
{'label': 'ML', 'pattern': 'machine learning'},
{'label': 'ML', 'pattern': 'machine_learning'},
{'label': 'ML', 'pattern': 'manifold alignment'},
{'label': 'ML', 'pattern': 'manifold regularization'},
{'label': 'ML', 'pattern': 'margin classifier'},
{'label': 'ML', 'pattern': 'margin-infused relaxed algorithm'},
{'label': 'ML', 'pattern': 'markov blanket'},
{'label': 'ML', 'pattern': 'markov chain'},
{'label': 'ML', 'pattern': 'markov chain central limit theorem'},
{'label': 'ML', 'pattern': 'markov chain geostatistics'},
{'label': 'ML', 'pattern': 'markov chain monte carlo'},
{'label': 'ML', 'pattern': 'markov information source'},
{'label': 'ML', 'pattern': 'markov model'},
{'label': 'ML', 'pattern': 'markov partition'},
{'label': 'ML', 'pattern': 'markov property'},
{'label': 'ML', 'pattern': 'markov switching multifractal'},
{'label': 'ML', 'pattern': 'markovian discrimination'},
{'label': 'ML', 'pattern': 'matchbox educable noughts and crosses engine'},
{'label': 'ML', 'pattern': 'matrix regularization'},
{'label': 'ML', 'pattern': 'matthews correlation coefficient'},
{'label': 'ML', 'pattern': 'maximum-entropy markov model'},
{'label': 'ML', 'pattern': 'mean squared error'},
{'label': 'ML', 'pattern': 'mean squared prediction error'},
{'label': 'ML', 'pattern': 'measurement invariance'},
{'label': 'ML', 'pattern': 'medoid'},
{'label': 'ML', 'pattern': 'megahal'},
{'label': 'ML', 'pattern': 'melomics'},
{'label': 'ML', 'pattern': 'memetic algorithm'},
{'label': 'ML', 'pattern': 'memtransistor'},
{'label': 'ML', 'pattern': 'meta learning'},
{'label': 'ML', 'pattern': 'meta-optimization'},
{'label': 'ML', 'pattern': 'microsoft cognitive toolkit'},
{'label': 'ML', 'pattern': 'minimum population search'},
{'label': 'ML', 'pattern': 'minimum redundancy feature selection'},
{'label': 'ML', 'pattern': 'mixture model'},
{'label': 'ML', 'pattern': 'mixture of experts'},
{'label': 'ML', 'pattern': 'ml.net'},
{'label': 'ML', 'pattern': 'mlops'},
{'label': 'ML', 'pattern': 'model-free'},
{'label': 'ML', 'pattern': 'models of dna evolution'},
{'label': 'ML', 'pattern': 'modes of variation'},
{'label': 'ML', 'pattern': 'modular neural network'},
{'label': 'ML', 'pattern': 'moea framework'},
{'label': 'ML', 'pattern': 'mokken scale'},
{'label': 'ML', 'pattern': 'moneybee'},
{'label': 'ML', 'pattern': 'moral graph'},
{'label': 'ML', 'pattern': 'mountain car problem'},
{'label': 'ML', 'pattern': 'multi expression programming'},
{'label': 'ML', 'pattern': 'multi-agent learning'},
{'label': 'ML', 'pattern': 'multi-armed bandit'},
{'label': 'ML', 'pattern': 'multi-label classification'},
{'label': 'ML', 'pattern': 'multi-objective reinforcement learning'},
{'label': 'ML', 'pattern': 'multi-surface method'},
{'label': 'ML', 'pattern': 'multi-task learning'},
{'label': 'ML', 'pattern': 'multiclass classification'},
{'label': 'ML', 'pattern': 'multidimensional analysis'},
{'label': 'ML', 'pattern': 'multidimensional scaling'},
{'label': 'ML', 'pattern': 'multifactor dimensionality reduction'},
{'label': 'ML', 'pattern': 'multilayer perceptron'},
{'label': 'ML', 'pattern': 'multilinear principal component analysis'},
{'label': 'ML', 'pattern': 'multilinear subspace learning'},
{'label': 'ML', 'pattern': 'multimodal learning'},
{'label': 'ML', 'pattern': 'multimodal sentiment analysis'},
{'label': 'ML', 'pattern': 'multinomial logistic regression'},
{'label': 'ML', 'pattern': 'multiple correspondence analysis'},
{'label': 'ML', 'pattern': 'multiple discriminant analysis'},
{'label': 'ML', 'pattern': 'multiple instance learning'},
{'label': 'ML', 'pattern': 'multiple kernel learning'},
{'label': 'ML', 'pattern': 'multiple sequence alignment'},
{'label': 'ML', 'pattern': 'multiple-instance learning'},
{'label': 'ML', 'pattern': 'multiplicative weight update method'},
{'label': 'ML', 'pattern': 'multispectral pattern recognition'},
{'label': 'ML', 'pattern': 'multitask optimization'},
{'label': 'ML', 'pattern': 'multivariate adaptive regression spline'},
{'label': 'ML', 'pattern': 'naive bayes classifier'},
{'label': 'ML', 'pattern': 'native-language identification'},
{'label': 'ML', 'pattern': 'natural evolution strategy'},
{'label': 'ML', 'pattern': 'natural language toolkit'},
{'label': 'ML', 'pattern': 'nature machine intelligence'},
{'label': 'ML', 'pattern': 'nearest centroid classifier'},
{'label': 'ML', 'pattern': 'nearest neighbor search'},
{'label': 'ML', 'pattern': 'neocognitron'},
{'label': 'ML', 'pattern': 'netomi'},
{'label': 'ML', 'pattern': 'nettalk'},
{'label': 'ML', 'pattern': 'neural cryptography'},
{'label': 'ML', 'pattern': 'neural designer'},
{'label': 'ML', 'pattern': 'neural gas'},
{'label': 'ML', 'pattern': 'neural modeling fields'},
{'label': 'ML', 'pattern': 'neural network gaussian process'},
{'label': 'ML', 'pattern': 'neural network intelligence'},
{'label': 'ML', 'pattern': 'neural network software'},
{'label': 'ML', 'pattern': 'neural network synchronization protocol'},
{'label': 'ML', 'pattern': 'neural networks'},
{'label': 'ML', 'pattern': 'neural style transfer'},
{'label': 'ML', 'pattern': 'neural tangent kernel'},
{'label': 'ML', 'pattern': 'neural turing machine'},
{'label': 'ML', 'pattern': 'neuroevolution'},
{'label': 'ML', 'pattern': 'neuroevolution of augmenting topologies'},
{'label': 'ML', 'pattern': 'ni1000'},
{'label': 'ML', 'pattern': 'niki.ai'},
{'label': 'ML', 'pattern': 'node2vec'},
{'label': 'ML', 'pattern': 'noisy channel model'},
{'label': 'ML', 'pattern': 'noisy text analytics'},
{'label': 'ML', 'pattern': 'non-negative matrix factorization'},
{'label': 'ML', 'pattern': 'nonlinear dimensionality reduction'},
{'label': 'ML', 'pattern': 'normal discriminant analysis'},
{'label': 'ML', 'pattern': 'novelty detection'},
{'label': 'ML', 'pattern': 'nuisance variable'},
{'label': 'ML', 'pattern': 'nvdla'},
{'label': 'ML', 'pattern': 'object co-segmentation'},
{'label': 'ML', 'pattern': 'occam learning'},
{'label': 'ML', 'pattern': 'offline learning'},
{'label': 'ML', 'pattern': 'ojas rule'},
{'label': 'ML', 'pattern': 'one-class classification'},
{'label': 'ML', 'pattern': 'one-shot learning'},
{'label': 'ML', 'pattern': 'online machine learning'},
{'label': 'ML', 'pattern': 'onnx'},
{'label': 'ML', 'pattern': 'ontology learning'},
{'label': 'ML', 'pattern': 'openai api'},
{'label': 'ML', 'pattern': 'openai five'},
{'label': 'ML', 'pattern': 'opennn'},
{'label': 'ML', 'pattern': 'openvino'},
{'label': 'ML', 'pattern': 'operational taxonomic unit'},
{'label': 'ML', 'pattern': 'optical character recognition'},
{'label': 'ML', 'pattern': 'optical neural network'},
{'label': 'ML', 'pattern': 'optimal discriminant analysis and classification tree analysis'},
{'label': 'ML', 'pattern': 'oscillatory neural network'},
{'label': 'ML', 'pattern': 'out-of-bag error'},
{'label': 'ML', 'pattern': 'outline of machine learning'},
{'label': 'ML', 'pattern': 'overfitting'},
{'label': 'ML', 'pattern': 'pachinko allocation'},
{'label': 'ML', 'pattern': 'pagerank'},
{'label': 'ML', 'pattern': 'paraphrasing'},
{'label': 'ML', 'pattern': 'parity benchmark'},
{'label': 'ML', 'pattern': 'parity learning'},
{'label': 'ML', 'pattern': 'part-of-speech tagging'},
{'label': 'ML', 'pattern': 'partial least squares regression'},
{'label': 'ML', 'pattern': 'particle swarm optimization'},
{'label': 'ML', 'pattern': 'path dependence'},
{'label': 'ML', 'pattern': 'pattern language'},
{'label': 'ML', 'pattern': 'pattern recognition'},
{'label': 'ML', 'pattern': 'perceptron'},
{'label': 'ML', 'pattern': 'physical neural network'},
{'label': 'ML', 'pattern': 'plate notation'},
{'label': 'ML', 'pattern': 'polynomial kernel'},
{'label': 'ML', 'pattern': 'pop music automation'},
{'label': 'ML', 'pattern': 'population process'},
{'label': 'ML', 'pattern': 'portable format for analytics'},
{'label': 'ML', 'pattern': 'predictive learning'},
{'label': 'ML', 'pattern': 'predictive model markup language'},
{'label': 'ML', 'pattern': 'predictive state representation'},
{'label': 'ML', 'pattern': 'preference learning'},
{'label': 'ML', 'pattern': 'preference regression'},
{'label': 'ML', 'pattern': 'prefrontal cortex basal ganglia working memory'},
{'label': 'ML', 'pattern': 'principal component analysis'},
{'label': 'ML', 'pattern': 'prior knowledge for pattern recognition'},
{'label': 'ML', 'pattern': 'proactive learning'},
{'label': 'ML', 'pattern': 'proaftn'},
{'label': 'ML', 'pattern': 'probabilistic context-free grammar'},
{'label': 'ML', 'pattern': 'probabilistic latent semantic analysis'},
{'label': 'ML', 'pattern': 'probabilistic neural network'},
{'label': 'ML', 'pattern': 'probability matching'},
{'label': 'ML', 'pattern': 'probably approximately correct learning'},
{'label': 'ML', 'pattern': 'probit model'},
{'label': 'ML', 'pattern': 'product of experts'},
{'label': 'ML', 'pattern': 'progol'},
{'label': 'ML', 'pattern': 'programming by example'},
{'label': 'ML', 'pattern': 'promoter based genetic algorithm'},
{'label': 'ML', 'pattern': 'proper generalized decomposition'},
{'label': 'ML', 'pattern': 'prototype methods'},
{'label': 'ML', 'pattern': 'proximal gradient method'},
{'label': 'ML', 'pattern': 'pulse-coupled networks'},
{'label': 'ML', 'pattern': 'pvlv'},
{'label': 'ML', 'pattern': 'q-learning'},
{'label': 'ML', 'pattern': 'quadratic classifier'},
{'label': 'ML', 'pattern': 'quadratic unconstrained binary optimization'},
{'label': 'ML', 'pattern': 'quantum machine learning'},
{'label': 'ML', 'pattern': 'quantum markov chain'},
{'label': 'ML', 'pattern': 'quantum neural network'},
{'label': 'ML', 'pattern': 'query-level feature'},
{'label': 'ML', 'pattern': 'question answering'},
{'label': 'ML', 'pattern': 'queueing theory'},
{'label': 'ML', 'pattern': 'quickprop'},
{'label': 'ML', 'pattern': 'rademacher complexity'},
{'label': 'ML', 'pattern': 'radial basis function'},
{'label': 'ML', 'pattern': 'radial basis function kernel'},
{'label': 'ML', 'pattern': 'radial basis function network'},
{'label': 'ML', 'pattern': 'ramnets'},
{'label': 'ML', 'pattern': 'random forest'},
{'label': 'ML', 'pattern': 'random indexing'},
{'label': 'ML', 'pattern': 'random neural network'},
{'label': 'ML', 'pattern': 'random projection'},
{'label': 'ML', 'pattern': 'random subspace method'},
{'label': 'ML', 'pattern': 'randomized weighted majority algorithm'},
{'label': 'ML', 'pattern': 'ranking svm'},
{'label': 'ML', 'pattern': 'reasoning system'},
{'label': 'ML', 'pattern': 'rectifier'},
{'label': 'ML', 'pattern': 'recurrent neural network'},
{'label': 'ML', 'pattern': 'recursive neural network'},
{'label': 'ML', 'pattern': 'region based convolutional neural networks'},
{'label': 'ML', 'pattern': 'reinforcement learning'},
{'label': 'ML', 'pattern': 'relation network'},
{'label': 'ML', 'pattern': 'relational data mining'},
{'label': 'ML', 'pattern': 'relationship square'},
{'label': 'ML', 'pattern': 'relevance vector machine'},
{'label': 'ML', 'pattern': 'representer theorem'},
{'label': 'ML', 'pattern': 'reservoir computing'},
{'label': 'ML', 'pattern': 'residual neural network'},
{'label': 'ML', 'pattern': 'restricted boltzmann machine'},
{'label': 'ML', 'pattern': 'revoscalepy'},
{'label': 'ML', 'pattern': 'revoscaler'},
{'label': 'ML', 'pattern': 'reward-based selection'},
{'label': 'ML', 'pattern': 'right to explanation'},
{'label': 'ML', 'pattern': 'rnn'},
{'label': 'ML', 'pattern': 'robot learning'},
{'label': 'ML', 'pattern': 'robotic process automation'},
{'label': 'ML', 'pattern': 'robust principal component analysis'},
{'label': 'ML', 'pattern': 'rprop'},
{'label': 'ML', 'pattern': 'rule induction'},
{'label': 'ML', 'pattern': 'rule-based machine learning'},
{'label': 'ML', 'pattern': 'rules extraction system family'},
{'label': 'ML', 'pattern': 'sammon mapping'},
{'label': 'ML', 'pattern': 'sample complexity'},
{'label': 'ML', 'pattern': 'sample exclusion dimension'},
{'label': 'ML', 'pattern': 'santa fe trail problem'},
{'label': 'ML', 'pattern': 'scale-invariant feature operator'},
{'label': 'ML', 'pattern': 'scikit-multiflow'},
{'label': 'ML', 'pattern': 'self-organizing map'},
{'label': 'ML', 'pattern': 'semantic analysis'},
{'label': 'ML', 'pattern': 'semantic folding'},
{'label': 'ML', 'pattern': 'semantic mapping'},
{'label': 'ML', 'pattern': 'semantic neural network'},
{'label': 'ML', 'pattern': 'semi-supervised learning'},
{'label': 'ML', 'pattern': 'semidefinite embedding'},
{'label': 'ML', 'pattern': 'sense networks'},
{'label': 'ML', 'pattern': 'sentence embedding'},
{'label': 'ML', 'pattern': 'seq2seq'},
{'label': 'ML', 'pattern': 'sequence labeling'},
{'label': 'ML', 'pattern': 'sequential minimal optimization'},
{'label': 'ML', 'pattern': 'shattered set'},
{'label': 'ML', 'pattern': 'siamese neural network'},
{'label': 'ML', 'pattern': 'sigmoid function'},
{'label': 'ML', 'pattern': 'similarity learning'},
{'label': 'ML', 'pattern': 'simultaneous localization and mapping'},
{'label': 'ML', 'pattern': 'sinkov statistic'},
{'label': 'ML', 'pattern': 'skill chaining'},
{'label': 'ML', 'pattern': 'sliced inverse regression'},
{'label': 'ML', 'pattern': 'soboleva modified hyperbolic tangent'},
{'label': 'ML', 'pattern': 'soft output viterbi algorithm'},
{'label': 'ML', 'pattern': 'softmax function'},
{'label': 'ML', 'pattern': 'solomonoffs theory of inductive inference'},
{'label': 'ML', 'pattern': 'sparse dictionary learning'},
{'label': 'ML', 'pattern': 'sparse pca'},
{'label': 'ML', 'pattern': 'speech recognition'},
{'label': 'ML', 'pattern': 'spike-and-slab regression'},
{'label': 'ML', 'pattern': 'spiking neural network'},
{'label': 'ML', 'pattern': 'spiral optimization algorithm'},
{'label': 'ML', 'pattern': 'squeezenet'},
{'label': 'ML', 'pattern': 'state–action–reward–state–action'},
{'label': 'ML', 'pattern': 'statistical classification'},
{'label': 'ML', 'pattern': 'statistical learning theory'},
{'label': 'ML', 'pattern': 'statistical machine translation'},
{'label': 'ML', 'pattern': 'statistical parsing'},
{'label': 'ML', 'pattern': 'statistical relational learning'},
{'label': 'ML', 'pattern': 'statistical semantics'},
{'label': 'ML', 'pattern': 'stochastic block model'},
{'label': 'ML', 'pattern': 'stochastic cellular automaton'},
{'label': 'ML', 'pattern': 'stochastic gradient descent'},
{'label': 'ML', 'pattern': 'stochastic grammar'},
{'label': 'ML', 'pattern': 'stochastic matrix'},
{'label': 'ML', 'pattern': 'stochastic neural analog reinforcement calculator'},
{'label': 'ML', 'pattern': 'stochastic neural network'},
{'label': 'ML', 'pattern': 'stress majorization'},
{'label': 'ML', 'pattern': 'string kernel'},
{'label': 'ML', 'pattern': 'structural equation modeling'},
{'label': 'ML', 'pattern': 'structural risk minimization'},
{'label': 'ML', 'pattern': 'structured knn'},
{'label': 'ML', 'pattern': 'structured prediction'},
{'label': 'ML', 'pattern': 'structured sparsity regularization'},
{'label': 'ML', 'pattern': 'structured support vector machine'},
{'label': 'ML', 'pattern': 'stylegan'},
{'label': 'ML', 'pattern': 'subclass reachability'},
{'label': 'ML', 'pattern': 'sufficient dimension reduction'},
{'label': 'ML', 'pattern': 'sukhotins algorithm'},
{'label': 'ML', 'pattern': 'sum of absolute differences'},
{'label': 'ML', 'pattern': 'sum of absolute transformed differences'},
{'label': 'ML', 'pattern': 'supervised learning'},
{'label': 'ML', 'pattern': 'support vector machine'},
{'label': 'ML', 'pattern': 'swish function'},
{'label': 'ML', 'pattern': 'switching kalman filter'},
{'label': 'ML', 'pattern': 'symbolic regression'},
{'label': 'ML', 'pattern': 'synaptic transistor'},
{'label': 'ML', 'pattern': 'synaptic weight'},
{'label': 'ML', 'pattern': 'synchronous context-free grammar'},
{'label': 'ML', 'pattern': 'syntactic pattern recognition'},
{'label': 'ML', 'pattern': 't-distributed stochastic neighbor embedding'},
{'label': 'ML', 'pattern': 'taguchi loss function'},
{'label': 'ML', 'pattern': 'tastedive'},
{'label': 'ML', 'pattern': 'td-gammon'},
{'label': 'ML', 'pattern': 'teaching dimension'},
{'label': 'ML', 'pattern': 'temporal difference learning'},
{'label': 'ML', 'pattern': 'tensor product network'},
{'label': 'ML', 'pattern': 'tensor sketch'},
{'label': 'ML', 'pattern': 'tensorflow'},
{'label': 'ML', 'pattern': 'text mining'},
{'label': 'ML', 'pattern': 'textual case-based reasoning'},
{'label': 'ML', 'pattern': 'tf–idf'},
{'label': 'ML', 'pattern': 'the emotion machine'},
{'label': 'ML', 'pattern': 'the master algorithm'},
{'label': 'ML', 'pattern': 'theano'},
{'label': 'ML', 'pattern': 'theory of conjoint measurement'},
{'label': 'ML', 'pattern': 'thurstonian model'},
{'label': 'ML', 'pattern': 'time aware long short-term memory'},
{'label': 'ML', 'pattern': 'time delay neural network'},
{'label': 'ML', 'pattern': 'time series'},
{'label': 'ML', 'pattern': 'timeline of machine learning'},
{'label': 'ML', 'pattern': 'topic model'},
{'label': 'ML', 'pattern': 'training, validation, and test sets'},
{'label': 'ML', 'pattern': 'transduction'},
{'label': 'ML', 'pattern': 'transfer learning'},
{'label': 'ML', 'pattern': 'transformer'},
{'label': 'ML', 'pattern': 'trigram tagger'},
{'label': 'ML', 'pattern': 'triplet loss'},
{'label': 'ML', 'pattern': 'tsetlin machine'},
{'label': 'ML', 'pattern': 'tucker decomposition'},
{'label': 'ML', 'pattern': 'types of artificial neural networks'},
{'label': 'ML', 'pattern': 'u-matrix'},
{'label': 'ML', 'pattern': 'u-net'},
{'label': 'ML', 'pattern': 'ugly duckling theorem'},
{'label': 'ML', 'pattern': 'uncertain data'},
{'label': 'ML', 'pattern': 'under-fitting'},
{'label': 'ML', 'pattern': 'underfitting'},
{'label': 'ML', 'pattern': 'uniform convergence in probability'},
{'label': 'ML', 'pattern': 'unique negative dimension'},
{'label': 'ML', 'pattern': 'universal approximation theorem'},
{'label': 'ML', 'pattern': 'universal portfolio algorithm'},
{'label': 'ML', 'pattern': 'unsupervised learning'},
{'label': 'ML', 'pattern': 'user behavior analytics'},
{'label': 'ML', 'pattern': 'validation set'},
{'label': 'ML', 'pattern': 'vanishing gradient problem'},
{'label': 'ML', 'pattern': 'vapnik–chervonenkis dimension'},
{'label': 'ML', 'pattern': 'vapnik–chervonenkis theory'},
{'label': 'ML', 'pattern': 'variable kernel density estimation'},
{'label': 'ML', 'pattern': 'variable-order bayesian network'},
{'label': 'ML', 'pattern': 'variable-order markov model'},
{'label': 'ML', 'pattern': 'variational message passing'},
{'label': 'ML', 'pattern': 'vector quantization'},
{'label': 'ML', 'pattern': 'version space learning'},
{'label': 'ML', 'pattern': 'visual temporal attention'},
{'label': 'ML', 'pattern': 'viterbi algorithm'},
{'label': 'ML', 'pattern': 'waca clustering algorithm'},
{'label': 'ML', 'pattern': 'waifu2x'},
{'label': 'ML', 'pattern': 'wake-sleep algorithm'},
{'label': 'ML', 'pattern': 'wavenet'},
{'label': 'ML', 'pattern': 'weak supervision'},
{'label': 'ML', 'pattern': 'weighted majority algorithm'},
{'label': 'ML', 'pattern': 'whitening transformation'},
{'label': 'ML', 'pattern': 'witness set'},
{'label': 'ML', 'pattern': 'word embedding'},
{'label': 'ML', 'pattern': 'word2vec'},
{'label': 'ML', 'pattern': 'writer invariant'},
{'label': 'ML', 'pattern': 'zero-shot learning'}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before='ner')

nlp.to_disk(output_dir)
print('Saved model to: ', output_dir) 
Saved model to:  C:\1-PythonProjects\kbpedia\v300\models\results\en_core_ml
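
A pattern list this long need not be maintained by hand. As a minimal sketch, assuming the terms are already available as a plain Python list, the same list-of-dictionaries structure could be generated as follows. The `terms` sample and the `build_patterns` helper are hypothetical names used only for illustration; they are not part of the routine above.

```python
def build_patterns(terms, label='ML'):
    """Return ruler-style pattern dicts for a list of phrase strings."""
    seen = set()
    patterns = []
    for term in terms:
        phrase = term.strip().lower()      # the listing above uses lowercase phrases
        if phrase and phrase not in seen:  # skip blanks and duplicates
            seen.add(phrase)
            patterns.append({'label': label, 'pattern': phrase})
    # Keep the list alphabetical, as in the listing above
    return sorted(patterns, key=lambda p: p['pattern'])


# Hypothetical sample; the real terms come from the Wikipedia-derived list
terms = ['Word2vec', 'support vector machine', 'Zero-shot learning', 'word2vec ']
patterns = build_patterns(terms)
for p in patterns:
    print(p)
```

The resulting `patterns` list has the same shape as the one passed to `ruler.add_patterns(patterns)` above, so it could be swapped in without other changes.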

The rule-based matcher is simply another step in the processing pipeline, so it combines readily with the existing NER recognizer. After completing the routine above, we can invoke our new model, en_core_ml, which joins the existing en_core_web_sm model with the new ruler pipeline step. The following code loads the new model and uses it to list the tags found in our input text:

import spacy

# Path to the model saved by the routine above
model = r'C:\1-PythonProjects\kbpedia\v300\models\results\en_core_ml'

# Load the combined model (en_core_web_sm plus the new ruler step)
nlp = spacy.load(model)

# Grab the NER component so we can check its moves below
ner = nlp.get_pipe('ner')

text = """With this installment of the Cooking with Python and KBpedia series we move into Part VI of seven parts, a part with the bulk of the analytical and machine learning (that is, "data science") discussion, and the last part where significant code is developed and documented. At the conclusion of this part, which itself has 11 installments, we have four installments to wrap up the series and provide a consistent roadmap to the entire project.
      Knowledge graphs are unique information artifacts, and KBpedia is further unique in terms of its consistent and logical construction as well as its incorporation of significant text content via Wikipedia pages. These characteristics provide unique value for KBpedia, but it is also a combination not duplicated anywhere else in the data science ecosystem. One of the objectives, therefore, of this part of our CWPK series is the creation of some baseline knowledge representations useful to data science aims that capture these unique characteristics.
      KBpedia's (or any knowledge graph constructed in a similar manner) combination of characteristics make it a powerful resource in three areas of data science and machine learning. First, the nearly universal scope and degree of topic coverage with about 56,000 concepts, logically organized into typologies with a high degree of disjointedness, means that accurate 'slices' or training sets may be extracted from KBpedia nearly instantaneously. Labeled training sets are one of the most time consuming and expensive activities in doing supervised machine learning. We can extract these nearly for free from KBpedia. Further, with its links to tens of millions of entities in its mapped knowledge bases such as Wikidata, literally tens of thousands of conceptual entities in KBpedia can be the retrieval points to nucleate training sets for fine-grained entity recognition.
      Second, 80% of KBpedia's concepts are mapped to Wikipedia articles. While many Wikipedia-based word embedding models exist, the ones in KBpedia are logically categorized and have rough equivalence in terms of scope and prominence, hopefully providing cleaner topic 'signals'. To probe these assertions, we will create a unique KBpedia-based word embedding corpus that also leverages labels for items of structural importance, such as typology membership. We will use this corpus in many of our tests and as a general focus in our training sets.
      And, third, perhaps the most important area, knowledge graphs offer unique structures and challenges for machine learning, especially innovations in geometric, heterogeneous methods for deep learning. The first generation of deep machine learning was designed for grid-patterned data and matrices through approaches such as deep neural networks, convolutional neural networks (CNN ), or recurrent neural networks (RNN). The 'deep' appelation comes from having multiple calculated, intermediate layers of transformations between the grid inputs and outputs for the model. Graphs, on the other hand, are heterogeneous between nodes and edges. They may be directed (subsumptive) in nature. And, for knowledge graphs, they have much labeling and annotation, including varying degrees of attribute completeness. Language embedding, itself often a product of deep learning, enables the efficient incorporation of text. It is only in the past five years that concerted attention has been devoted to better capturing this feature richness for knowledge graphs.
      The eleven installments in this part will look in more depth at networks and graphs, focus on how to create training sets and embeddings for the learners, discuss some natural language packages and uses, and then look in depth at 'standard' machine learners and deep learners. We will install the first generation of deep graph learners and then explore some on the cutting edge. We will test many use cases, but will also try to invoke classifiers across this spectrum so that we can draw some general conclusions.
      The material below introduces and tees up these topics. We describe leading Python packages for data science, and how we have architected our own approach, We have picked a particular Python machine learning framework, PyTorch, to which we will then tie four different NLP and deep learning libraries. We devote two installments each to these four libraries. The use cases we document across these installments are in addition to the existing ones we have in Clojure posted online.
      So, we think we have an interesting suite of benefits to cover in this part, some arising from being based on KBpedia and some arising from the nature of knowledge graphs. On the other hand, due to the relative immaturity of the field, we are still actively learning and innovating around the juncture of AI and knowledge graphs. Thus, one of the reasons we emphasize Python 'ecosystems' and 'frameworks' in this part is to be better prepared to incorporate those innovations and learnings to come.
      Background
      One of the first prototypes of machine learning comes from the statistician Ronald Fisher in the 1930s regarding how to classify Iris species based on the attributes of their flowers. It was a multivariate data example using the method we today call linear discriminant analysis. This classic example is still taught. But many dozens of new algorithms and combined approaches have joined the machine learning field since then.
      Figure 1 below is one way to characterize the field, with ML standing for machine learning and DL for deep learning, with this one oriented to sub-fields in which some Python package already exists:
      Machine Learning Landscape
      Figure 1: Machine Learning Landscape (from S. Chen, "Machine Learning Algorithms For Beginners with Code Examples in Python", June 2020)
      There are many possible diagrams that one deep learning might prepare to show the machine learning landscape, including ones with a larger emphasis on text and knowledge graphs. Most all schematics of the field show a basic split between supervised learning and unsupervised learning, (sometimes with reinforcement learning as another main branch), with the main difference being that supervised approaches iterate to achieve statistical fit with pre-determined labels, whereas unsupervised is unlabeled. Accurate labeling can be costly and time consuming. Note that the idea of 'classification' is a supervised one, 'clustering' a notion of unsupervised.
      We will include a 'standard' machine learning library in our proposed toolkit, the selection of which I discuss below. However, the most evaluation time I spent in researching these installments was directed to the idea of knowledge representation and embeddings applicable to graphs. Graphs pose a number of differences and challenges to standard machine learning. They have only been a recent (5 yr) focus in machine learning, which is also rapidly changing over time.
      """

move_names = list(ner.move_names)
assert nlp.get_pipe("ner").move_names == move_names
doc = nlp(text)
for ent in doc.ents:
    print(ent.label_, ent.text)
ORG the Cooking with Python
ML KBpedia series
LOC Part VI
CARDINAL seven
ML machine learning
CARDINAL 11
CARDINAL four
GPE roadmap
GPE KBpedia
GPE KBpedia
CARDINAL One
ORG KBpedia's
CARDINAL three
ML machine learning
ORDINAL First
CARDINAL about 56,000
GPE KBpedia
ML machine learning
GPE KBpedia
CARDINAL tens of millions
CARDINAL tens of thousands
GPE KBpedia
ORDINAL Second
PERCENT 80%
GPE KBpedia
ML word embedding
GPE KBpedia
GPE KBpedia
ML word embedding
ORDINAL third
ML machine learning
ML deep learning
ORDINAL first
ML machine learning
ML neural networks
ML neural networks
ML CNN
ML neural networks
ORG RNN
PERSON Graphs
ML deep learning
DATE the past five years
CARDINAL eleven
ORDINAL first
ORG Python
ML machine learning
PERSON PyTorch
CARDINAL four
ORG NLP
ML deep learning
CARDINAL two
CARDINAL four
GPE Clojure
GPE KBpedia
GPE AI
CARDINAL one
ORG Python
ORDINAL first
ML machine learning
PERSON Ronald Fisher
DATE the 1930s
GPE Iris
DATE today
ML linear discriminant analysis
CARDINAL dozens
ML machine learning
CARDINAL 1
CARDINAL one
PERSON ML
ML machine learning
NORP DL
ML deep learning
ORG Python
PERSON S. Chen
WORK_OF_ART Machine Learning Algorithms For Beginners with Code Examples in Python
DATE June 2020
CARDINAL one
ML deep learning
ML machine learning
ML supervised learning
ML unsupervised learning
ML reinforcement learning
ML machine learning
PRODUCT Graphs
ML machine learning
QUANTITY 5 yr
ML machine learning
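
A listing this long is easier to assess once it is tallied. The sketch below is one way to do that, assuming the (label, text) pairs have been collected into a list. The `pairs` sample is a small hand-copied subset of the listing above, not the full output; with a live `doc`, one might build it instead as `[(ent.label_, ent.text) for ent in doc.ents]`.

```python
from collections import Counter

# Hand-copied subset of the (label, text) listing above, for illustration only
pairs = [
    ('ML', 'machine learning'),
    ('GPE', 'KBpedia'),
    ('ML', 'deep learning'),
    ('ORG', "KBpedia's"),
    ('ML', 'machine learning'),
    ('PERSON', 'PyTorch'),
    ('GPE', 'KBpedia'),
]

# How often does each label occur?
label_counts = Counter(label for label, _ in pairs)
print(label_counts.most_common())

# Group the distinct strings under each label so misassignments stand out
texts_by_label = {}
for label, text in pairs:
    texts_by_label.setdefault(label, set()).add(text)

for label in sorted(texts_by_label):
    print(label, sorted(texts_by_label[label]))
```

Grouping by label makes the questionable assignments easy to see, such as 'PyTorch' under PERSON or 'KBpedia' under GPE.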

If we want to see these tags in the context of the original text, we can also invoke the visual annotator available in spaCy:

import spacy
from spacy import displacy

# Path to the saved model that includes the rule-based 'ML' step
model = r'C:\1-PythonProjects\kbpedia\v300\models\results\en_core_ml'

nlp = spacy.load(model)

text = """With this installment of the Cooking with Python and KBpedia series we move into Part VI of seven parts, a part with the bulk of the analytical and machine learning (that is, "data science") discussion, and the last part where significant code is developed and documented. At the conclusion of this part, which itself has 11 installments, we have four installments to wrap up the series and provide a consistent roadmap to the entire project.
      Knowledge graphs are unique information artifacts, and KBpedia is further unique in terms of its consistent and logical construction as well as its incorporation of significant text content via Wikipedia pages. These characteristics provide unique value for KBpedia, but it is also a combination not duplicated anywhere else in the data science ecosystem. One of the objectives, therefore, of this part of our CWPK series is the creation of some baseline knowledge representations useful to data science aims that capture these unique characteristics.
      KBpedia's (or any knowledge graph constructed in a similar manner) combination of characteristics make it a powerful resource in three areas of data science and machine learning. First, the nearly universal scope and degree of topic coverage with about 56,000 concepts, logically organized into typologies with a high degree of disjointedness, means that accurate 'slices' or training sets may be extracted from KBpedia nearly instantaneously. Labeled training sets are one of the most time consuming and expensive activities in doing supervised machine learning. We can extract these nearly for free from KBpedia. Further, with its links to tens of millions of entities in its mapped knowledge bases such as Wikidata, literally tens of thousands of conceptual entities in KBpedia can be the retrieval points to nucleate training sets for fine-grained entity recognition.
      Second, 80% of KBpedia's concepts are mapped to Wikipedia articles. While many Wikipedia-based word embedding models exist, the ones in KBpedia are logically categorized and have rough equivalence in terms of scope and prominence, hopefully providing cleaner topic 'signals'. To probe these assertions, we will create a unique KBpedia-based word embedding corpus that also leverages labels for items of structural importance, such as typology membership. We will use this corpus in many of our tests and as a general focus in our training sets.
      And, third, perhaps the most important area, knowledge graphs offer unique structures and challenges for machine learning, especially innovations in geometric, heterogeneous methods for deep learning. The first generation of deep machine learning was designed for grid-patterned data and matrices through approaches such as deep neural networks, convolutional neural networks (CNN ), or recurrent neural networks (RNN). The 'deep' appelation comes from having multiple calculated, intermediate layers of transformations between the grid inputs and outputs for the model. Graphs, on the other hand, are heterogeneous between nodes and edges. They may be directed (subsumptive) in nature. And, for knowledge graphs, they have much labeling and annotation, including varying degrees of attribute completeness. Language embedding, itself often a product of deep learning, enables the efficient incorporation of text. It is only in the past five years that concerted attention has been devoted to better capturing this feature richness for knowledge graphs.
      The eleven installments in this part will look in more depth at networks and graphs, focus on how to create training sets and embeddings for the learners, discuss some natural language packages and uses, and then look in depth at 'standard' machine learners and deep learners. We will install the first generation of deep graph learners and then explore some on the cutting edge. We will test many use cases, but will also try to invoke classifiers across this spectrum so that we can draw some general conclusions.
      The material below introduces and tees up these topics. We describe leading Python packages for data science, and how we have architected our own approach, We have picked a particular Python machine learning framework, PyTorch, to which we will then tie four different NLP and deep learning libraries. We devote two installments each to these four libraries. The use cases we document across these installments are in addition to the existing ones we have in Clojure posted online.
      So, we think we have an interesting suite of benefits to cover in this part, some arising from being based on KBpedia and some arising from the nature of knowledge graphs. On the other hand, due to the relative immaturity of the field, we are still actively learning and innovating around the juncture of AI and knowledge graphs. Thus, one of the reasons we emphasize Python 'ecosystems' and 'frameworks' in this part is to be better prepared to incorporate those innovations and learnings to come.
      Background
      One of the first prototypes of machine learning comes from the statistician Ronald Fisher in the 1930s regarding how to classify Iris species based on the attributes of their flowers. It was a multivariate data example using the method we today call linear discriminant analysis. This classic example is still taught. But many dozens of new algorithms and combined approaches have joined the machine learning field since then.
      Figure 1 below is one way to characterize the field, with ML standing for machine learning and DL for deep learning, with this one oriented to sub-fields in which some Python package already exists:
      Machine Learning Landscape
      Figure 1: Machine Learning Landscape (from S. Chen, "Machine Learning Algorithms For Beginners with Code Examples in Python", June 2020)
      There are many possible diagrams that one deep learning might prepare to show the machine learning landscape, including ones with a larger emphasis on text and knowledge graphs. Most all schematics of the field show a basic split between supervised learning and unsupervised learning, (sometimes with reinforcement learning as another main branch), with the main difference being that supervised approaches iterate to achieve statistical fit with pre-determined labels, whereas unsupervised is unlabeled. Accurate labeling can be costly and time consuming. Note that the idea of 'classification' is a supervised one, 'clustering' a notion of unsupervised.
      We will include a 'standard' machine learning library in our proposed toolkit, the selection of which I discuss below. However, the most evaluation time I spent in researching these installments was directed to the idea of knowledge representation and embeddings applicable to graphs. Graphs pose a number of differences and challenges to standard machine learning. They have only been a recent (5 yr) focus in machine learning, which is also rapidly changing over time.
      """

doc = nlp(text)
displacy.render(doc, style='ent', jupyter=True)
With this installment of the Cooking with Python ORG and KBpedia series ML we move into Part VI LOC of seven CARDINAL parts, a part with the bulk of the analytical and machine learning ML (that is, “data science”) discussion, and the last part where significant code is developed and documented. At the conclusion of this part, which itself has 11 CARDINAL installments, we have four CARDINAL installments to wrap up the series and provide a consistent roadmap GPE to the entire project. Knowledge graphs are unique information artifacts, and KBpedia GPE is further unique in terms of its consistent and logical construction as well as its incorporation of significant text content via Wikipedia pages. These characteristics provide unique value for KBpedia GPE , but it is also a combination not duplicated anywhere else in the data science ecosystem. One CARDINAL of the objectives, therefore, of this part of our CWPK series is the creation of some baseline knowledge representations useful to data science aims that capture these unique characteristics. KBpedia’s ORG (or any knowledge graph constructed in a similar manner) combination of characteristics make it a powerful resource in three CARDINAL areas of data science and machine learning ML . First ORDINAL , the nearly universal scope and degree of topic coverage with about 56,000 CARDINAL concepts, logically organized into typologies with a high degree of disjointedness, means that accurate ‘slices’ or training sets may be extracted from KBpedia GPE nearly instantaneously. Labeled training sets are one of the most time consuming and expensive activities in doing supervised machine learning ML . We can extract these nearly for free from KBpedia GPE . Further, with its links to tens of millions CARDINAL of entities in its mapped knowledge bases such as Wikidata, literally tens of thousands CARDINAL of conceptual entities in KBpedia GPE can be the retrieval points to nucleate training sets for fine-grained entity recognition.
Second ORDINAL , 80% PERCENT of KBpedia GPE ‘s concepts are mapped to Wikipedia articles. While many Wikipedia-based word embedding ML models exist, the ones in KBpedia GPE are logically categorized and have rough equivalence in terms of scope and prominence, hopefully providing cleaner topic ‘signals’. To probe these assertions, we will create a unique KBpedia GPE -based word embedding ML corpus that also leverages labels for items of structural importance, such as typology membership. We will use this corpus in many of our tests and as a general focus in our training sets. And, third ORDINAL , perhaps the most important area, knowledge graphs offer unique structures and challenges for machine learning ML , especially innovations in geometric, heterogeneous methods for deep learning ML . The first ORDINAL generation of deep machine learning ML was designed for grid-patterned data and matrices through approaches such as deep neural networks ML , convolutional neural networks ML ( CNN ML ), or recurrent neural networks ML ( RNN ORG ). The ‘deep’ appelation comes from having multiple calculated, intermediate layers of transformations between the grid inputs and outputs for the model. Graphs PERSON , on the other hand, are heterogeneous between nodes and edges. They may be directed (subsumptive) in nature. And, for knowledge graphs, they have much labeling and annotation, including varying degrees of attribute completeness. Language embedding, itself often a product of deep learning ML , enables the efficient incorporation of text. It is only in the past five years DATE that concerted attention has been devoted to better capturing this feature richness for knowledge graphs. The eleven CARDINAL installments in this part will look in more depth at networks and graphs, focus on how to create training sets and embeddings for the learners, discuss some natural language packages and uses, and then look in depth at ‘standard’ machine learners and deep learners. 
We will install the first ORDINAL generation of deep graph learners and then explore some on the cutting edge. We will test many use cases, but will also try to invoke classifiers across this spectrum so that we can draw some general conclusions. The material below introduces and tees up these topics. We describe leading Python ORG packages for data science, and how we have architected our own approach, We have picked a particular Python machine learning ML framework, PyTorch PERSON , to which we will then tie four CARDINAL different NLP ORG and deep learning ML libraries. We devote two CARDINAL installments each to these four CARDINAL libraries. The use cases we document across these installments are in addition to the existing ones we have in Clojure GPE posted online. So, we think we have an interesting suite of benefits to cover in this part, some arising from being based on KBpedia GPE and some arising from the nature of knowledge graphs. On the other hand, due to the relative immaturity of the field, we are still actively learning and innovating around the juncture of AI GPE and knowledge graphs. Thus, one CARDINAL of the reasons we emphasize Python ORG ‘ecosystems’ and ‘frameworks’ in this part is to be better prepared to incorporate those innovations and learnings to come. Background One of the first ORDINAL prototypes of machine learning ML comes from the statistician Ronald Fisher PERSON in the 1930s DATE regarding how to classify Iris GPE species based on the attributes of their flowers. It was a multivariate data example using the method we today DATE call linear discriminant analysis ML . This classic example is still taught. But many dozens CARDINAL of new algorithms and combined approaches have joined the machine learning ML field since then. 
Figure 1 CARDINAL below is one CARDINAL way to characterize the field, with ML PERSON standing for machine learning ML and DL NORP for deep learning ML , with this one oriented to sub-fields in which some Python ORG package already exists: Machine Learning Landscape Figure 1: Machine Learning Landscape (from S. Chen PERSON , “ Machine Learning Algorithms For Beginners with Code Examples in Python WORK_OF_ART ”, June 2020 DATE ) There are many possible diagrams that one CARDINAL deep learning ML might prepare to show the machine learning ML landscape, including ones with a larger emphasis on text and knowledge graphs. Most all schematics of the field show a basic split between supervised learning ML and unsupervised learning ML , (sometimes with reinforcement learning ML as another main branch), with the main difference being that supervised approaches iterate to achieve statistical fit with pre-determined labels, whereas unsupervised is unlabeled. Accurate labeling can be costly and time consuming. Note that the idea of ‘classification’ is a supervised one, ‘clustering’ a notion of unsupervised. We will include a ‘standard’ machine learning ML library in our proposed toolkit, the selection of which I discuss below. However, the most evaluation time I spent in researching these installments was directed to the idea of knowledge representation and embeddings applicable to graphs. Graphs PRODUCT pose a number of differences and challenges to standard machine learning ML . They have only been a recent ( 5 yr QUANTITY ) focus in machine learning ML , which is also rapidly changing over time.

We now see that our ‘ML’ tag has been added to the roster and appears in context alongside the standard tags.

Were this to be a production version, I would spend more time updating the training examples to remove some of the misassignments and would likely add some additional ML tags specific to our work with KBpedia (as opposed to the ones strictly from Wikipedia). Nonetheless, it seems like the rule-based approach is the better one for a topic area like ‘machine learning’ when we have a rather complete enumeration of important instances.

Other spaCy Functions

There is a wealth of additional functions that might be applied to KBpedia and its uses with the spaCy package. For example, this simple routine shows the variety of tags and characterizations that might be retrieved from text:

# spaCy 2.x import path; in spaCy 3.x this helper is found in spacy.training
from spacy.gold import docs_to_json

doc = nlp("Machine learning is fun in Iowa.")
json_data = docs_to_json([doc])
print(json_data)
{'id': 0, 'paragraphs': [{'raw': 'Machine learning is fun in Iowa.', 'sentences': [{'tokens': [{'id': 0, 'orth': 'Machine', 'tag': 'NN', 'head': 1, 'dep': 'compound', 'ner': 'O'}, {'id': 1, 'orth': 'learning', 'tag': 'NN', 'head': 1, 'dep': 'nsubj', 'ner': 'O'}, {'id': 2, 'orth': 'is', 'tag': 'VBZ', 'head': 0, 'dep': 'ROOT', 'ner': 'O'}, {'id': 3, 'orth': 'fun', 'tag': 'JJ', 'head': -1, 'dep': 'attr', 'ner': 'O'}, {'id': 4, 'orth': 'in', 'tag': 'IN', 'head': -2, 'dep': 'prep', 'ner': 'O'}, {'id': 5, 'orth': 'Iowa', 'tag': 'NNP', 'head': -1, 'dep': 'pobj', 'ner': 'U-GPE'}, {'id': 6, 'orth': '.', 'tag': '.', 'head': -4, 'dep': 'punct', 'ner': 'O'}], 'brackets': []}], 'cats': []}]}
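
The nested output is easier to read once flattened. The sketch below walks the paragraphs, sentences, and tokens of a hand-trimmed copy of the output above (two of the seven tokens) and prints one row per token. The `iter_tokens` helper is a hypothetical name introduced for illustration, not a spaCy function.

```python
# Hand-trimmed copy of the docs_to_json output above (two tokens kept)
json_data = {
    'id': 0,
    'paragraphs': [{
        'raw': 'Machine learning is fun in Iowa.',
        'sentences': [{
            'tokens': [
                {'id': 0, 'orth': 'Machine', 'tag': 'NN', 'head': 1,
                 'dep': 'compound', 'ner': 'O'},
                {'id': 5, 'orth': 'Iowa', 'tag': 'NNP', 'head': -1,
                 'dep': 'pobj', 'ner': 'U-GPE'},
            ],
            'brackets': [],
        }],
        'cats': [],
    }],
}


def iter_tokens(data):
    """Yield every token dict, walking paragraphs and then sentences."""
    for para in data['paragraphs']:
        for sent in para['sentences']:
            for tok in sent['tokens']:
                yield tok


# One (text, tag, dependency, NER) row per token
rows = [(t['orth'], t['tag'], t['dep'], t['ner']) for t in iter_tokens(json_data)]
for row in rows:
    print('{:<10}{:<6}{:<10}{}'.format(*row))

# Tokens carrying any entity annotation ('O' means outside an entity)
entities = [t['orth'] for t in iter_tokens(json_data) if t['ner'] != 'O']
print(entities)
```

Judging from the output above, the `head` values are offsets relative to each token (0 for the root) rather than absolute token ids, which is worth keeping in mind if the dependency links are to be followed.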

The combination of its rich functionality and pipeline abilities ensures spaCy is an NLP package of great capability. We could devote more write-ups to applications like topic modeling, word sense disambiguation, or relation extraction, but we need to move on in the next installment to classic machine learning.

Additional Documentation

Here is additional documentation in support of this installment.

Embeddings and Transformers
Text Summarization
NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.

NOTE: This CWPK installment is available both as an online interactive file and as a direct download to use locally. Make sure to pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.

I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman, on November 12, 2020 at 11:47 am in CWPK, KBpedia, Semantic Web Tools
The URI link reference to this post is: https://www.mkbergman.com/2416/cwpk-64-embeddings-summarization-and-entity-recognition/
Posted:November 9, 2020

Clean Corpora and Datasets are a Major Part of the Effort

With our discussions of network analysis and knowledge extractions from our knowledge graph now behind us, we are ready to tackle the questions of analytic applications and machine learning in earnest for our Cooking with Python and KBpedia series. We will be devoting our next nine installments to this area. We devote two installments to data sources and input preparations, largely based on NLP (natural language processing) applications. Then we devote two installments to ‘standard’ machine learning, largely using the scikit-learn package. We next devote four installments to deep learning, split equally between the Deep Graph Library (DGL) and PyTorch Geometric (PyG) frameworks. We conclude this Part VI with a summary and comparison of results across these installments based on the task of node classification.

In this particular installment we flesh out the plan for completing these installments and discuss the data sources and data prep the plan requires. We pay particular attention to the architecture and data flows within the PyTorch framework. We describe the additional Python packages we need for this work, and install and configure the first ones. We discuss general sources of data and corpora useful for machine learning purposes. Our coding efforts in this installment will obtain and clean the Wikipedia pages that supplement the two structural and annotation sources based on KBpedia that were covered in the prior installment. These three sources of structure, annotations and pages are the input basis for creating our own embeddings to be used in many of the machine learning tests.

Plan for Completion of Part VI

The broad ecosystem of Python packages I was considering looked, generally, to be good choices to work together, as first outlined in CWPK #61. I had done an adequate initial diligence. But, how all of this was to unfold, what my plan of attack should be, became driving factors I had to solve to shorten my development and coding efforts. So, with an understanding of how we could extract general information from KBpedia useful to analysis and machine learning, I needed to project out over the entire anticipated scope to see if, indeed, these initial sources looked to be the right ones for our purposes. And, if so, how shall the efforts be sequenced and what is the flow of data?

Much reading and research went into this effort. It is true, for example, that we had already prepared a pretty robust series of analytic and machine learning case studies in Clojure, available from the KBpedia Web site. I revisited each of these use cases and got some ideas of what made sense for us to attempt with Python. But I needed to understand the capabilities now available to us with Python, so I also studied each of the candidate keystone packages in some detail.

I will weave the results of this research as the next installments unfold, providing background discussion in context and as appropriate. But, in total, I formulated about 30 tasks going forward that appeared necessary to cover the defined scope. The listing below summarizes these steps, and keys the transition point (as indicated by CWPK installment number) for proceeding to each next new installment:

  1. Formulate Part VI plan
  2. Extract two source files from KBpedia
    • structure
    • annotations
  3. Set environment up (not doing virtual)
  4. Obtain Wikipedia articles for matching RCs
  5. Set up gensim
  6. Clean Wikipedia articles, all KB annotations
  7. Set up spaCy
  8. ID, extract phrases
  9. Finish embeddings prep #64
    • remove stoplist
    • create numeric??
  10. Create embedding models:
    • word2vec and doc2vec
  11. Text summarization for short articles (gensim)
  12. Named entity recognition
  13. Set up scikit-learn #65
  14. Create master pandas file
  15. Do event/action extraction
  16. Do scikit-learn classifier #66
    • SVM
    • k-nearest neighbors
    • random forests
  17. Introduce the sklearn.metrics module and confusion matrix, etc. The standard for reporting
  18. Discuss basic test parameters/‘gold standards’
  19. Knowledge graph embeddings #67
  20. Create embedding models -2
    • KB-struct
    • KB-annot
    • KB-annot-full: what is above + below
    • KB-annot-page
  21. Set up PyTorch/DGL-KE #68
  22. Set up PyTorch/PyG
  23. Formulate learning pathway/code
  24. Do standard DL classifiers: #69
    • TransE
    • TransR
    • RESCAL
    • DistMult
    • ComplEx
    • RotatE
  25. Do research DL classifiers: #70
    • VAE
    • GGSNN
    • MPNN
    • ChebyNet
    • GCN
    • SAGE
    • GAT
  26. Choose a model evaluator: #71
    • scikit-learn
    • PyTorch
    • other?
  27. Collate prior results
  28. Evaluate prior results
  29. Present comparative results

Some of these steps also needed some preliminary research before proceeding. For example, knowing I wanted to compare results across algorithms meant I needed to have a good understanding of testing and analysis requirements before starting any of the tests.

PyTorch Architecture

A critical question in contemplating this plan was how exactly data needed to be produced, staged, and then fed into the analysis portions. From the earlier investigations I had identified the three categories of knowledge grounded in KBpedia that could act as bases or features to machine learning; namely, structure, annotations and pages. I also had identified PyTorch as a shared abstraction layer for deep and machine learning.

I was particularly focused on the question of data formats and representations such that information could be readily passed from one step to the next in the analysis pipeline. Figure 1 is the resulting data flow chart and architecture that arose from these investigations.

First, the green block labeled ‘owlready2’ represents that Python package, but also the location where the intact knowledge graph of KBpedia is stored and accessed. As early installments covered, we can use either owlready2 or Protégé to manage this knowledge graph, though owlready2 is the point at which the KBpedia information is exported or extracted for downstream uses, importantly machine learning. As our owlready2 discussions also indicated, there is a close relationship between it and RDFLib (which is also the SPARQL access point). RDFLib can provide direct input into NetworkX, but that is limited to structure only.

The clearest common denominator format for entry into the machine learning pipeline is pandas via CSV files. This centrality is fortunate given that all of our prior KBpedia extract-and-build routines have been designed around this format. This format is also one of the direct feeds possible into the PyTorch datasets format, as the figure shows:

Figure 1: Data Flows in Machine Learning and Knowledge Graph Analysis

An important block on the figure is for ‘embeddings’. If you recall, all text needs to first be encoded to a numeric form to be understood by the computer. This process can also undertake dimensionality reduction, important for a sparse matrix data form like language. This same ability can be applied to graph structure and interactions. Thus, the ‘embedding’ block is a pivotal point at which we can represent words, sentences, paragraphs, documents, nodes, or entire graphs. We will focus much on embeddings throughout this Part VI.
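As a toy illustration of what "encoding text to a numeric form" means, the sketch below builds a vocabulary and a count vector for each document. This is a plain bag-of-words encoding, not the word2vec or doc2vec embeddings used later in this series; the function names and the two sample sentences are made up for illustration. It also hints at why raw encodings are sparse: every document gets one slot for every term in the vocabulary, most of which are zero for real corpora.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct lowercase token to a column index (sorted for a stable order)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def encode(doc, vocab):
    """Return a count vector with one slot per vocabulary term."""
    counts = Counter(doc.lower().split())
    # dicts preserve insertion order, so this walks the vocabulary in index order
    return [counts.get(tok, 0) for tok in vocab]

# Hypothetical sample documents
docs = ["mammals are animals", "birds are animals too"]
vocab = build_vocab(docs)
vectors = [encode(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

An embedding method replaces these long, mostly empty vectors with short dense ones, which is the dimensionality reduction mentioned above.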

For training purposes we can also feed pre-trained corpora or embeddings into the system. We address this topic in the next main section.

Figure 1 is not meant to be a comprehensive view of PyTorch, but it is one useful to understand data flows with respect to our use of the KBpedia knowledge graph. Over the course of this research, I also encountered many PyTorch-related extensions that, when warranted, I include in the discussion.

Possible Extensions

There are some extensions to the PyTorch ecosystem that we will not be using or testing in this CWPK series. Here are some of the ones that seem closest in capabilities to what we are doing with KBpedia:

  • PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, Microsoft LightGBM, spaCy, and many more
  • PiePline is a neural networks training pipeline based on PyTorch. Designed to standardize training process and accelerate experiments
  • Catalyst helps to write full-featured deep learning pipelines in a few lines of code
  • Poutyne is a Keras-like framework for PyTorch and handles much of the boilerplating code needed to train neural networks
  • torchtext has some capabilities in language modeling, sentiment analysis, text classification, question classification, entailment, machine translation, sequence tagging, question answering, and unsupervised learning
  • Spotlight uses PyTorch to build both deep and shallow recommender models.

Corpora and Datasets

There are many off-the-shelf resources that can be of use when doing machine learning involving text and language. (There are as well for images, but that is out of scope to our current interests.) These resources fall into three main areas:

  • corpora – are language resources of either a general or domain nature, with vetted relationships or annotations between terms and concepts or other pre-processing useful to computational linguistics
  • pre-trained models – are pre-calculated language models, often expressing probability distributions over words or text. Some embeddings can act in this manner. Transformers use deep learning to train their representations, with BERT being a notable example
  • embeddings – are vector representations of chunks of text, ranging from individual words up to entire documents or languages. The numeric representation either represents a pooled statistical representation across all tokens (the so-called CBOW approach) or context and adjacency using the skip-gram or similar method. GloVe, word2vec and fastText are example methodologies for producing word embeddings.
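To make the CBOW versus skip-gram distinction a bit more concrete, here is a minimal sketch of how training examples could be drawn from a token list under each scheme. It only generates the (input, label) pairs; it does not train anything, and the function names, window size and sample tokens are assumptions for illustration rather than any library's API.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for each position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window=2):
    """CBOW: the pooled context words are the input, the target word is the label."""
    return [(context, target) for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the target word is the input, each context word is a separate label."""
    return [(target, c)
            for target, context in context_windows(tokens, window)
            for c in context]

# Hypothetical sample tokens
tokens = ["knowledge", "graphs", "organize", "concepts"]
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
print(cbow)
print(skipgram)
```

CBOW produces one example per position, while skip-gram produces one per (target, neighbor) pair, which is part of why skip-gram tends to be slower to train on the same text.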

Example corpora include Wikipedia (in multiple languages), news articles, Web crawls, and many others. Such corpora can be used as the language input basis for training various models, or may be a reference vocabulary for scoring and ranking input text. Various pre-trained language models are available, and embedding methods are available in a number of Python packages, including scikit-learn, gensim and spaCy used in cowpoke.

Pre-trained Resources

There are a number of free or open-source resources for these corpora or datasets. Some include:

Setting Up the Environment

In doing this research, I also assembled the list of Python packages needed to add these capabilities to cowpoke. Had I not just updated the conda packages, I would do so now:

conda update --all

Next, the general recommendation when installing multiple new packages in Python is to do them in one batch, which allows the package manager (conda in our circumstance) to check on version conflicts and compatibility during the install process. However, with some of the packages involved in the current expansion, there are other necessary settings that obviate this standard ‘batch’ install recommendation.

Another note is important here. In an enterprise environment with many Python projects, it is also best to install these machine learning extensions into their own virtual environment. (I covered this topic a bit in CWPK #58.) However, since we are keeping this entire series in its own environment, we will skip that step here. You may prefer the virtual option.

So, we will begin with those Python packages and frameworks that pose their own unique set-up and install challenges. We begin with PyTorch. We need to first appreciate that the rationale for PyTorch was to abstract machine learning constructs while taking advantage of graphics processing units (GPUs) (specifically, Nvidia via the CUDA interface). The CUDA architecture provides one or two orders of magnitude speed up on a local machine. Unfortunately, my local Windows machine does not have the separate Nvidia GPU, so I want to install the no CUDA option. For the PyTorch install options, visit https://pytorch.org/get-started/locally/. This figure shows my selections prior to download (yours may vary):

Figure 2: PyTorch Download Screen

In my circumstance, my local machine does not have a separate graphics processor, so I set the CUDA requirement to ‘None’ (1). I also removed the ‘torchvision’ command line specification (2) since that is an image-related package. (We may later need some libraries from this package, in which case we will then install it.) The PyTorch package is rather large, so install takes a few minutes. Here is the actual install command:

conda install pytorch cpuonly -c pytorch

Since we were not able to batch all new packages, I decided to continue with some of the other major additions in a sequential manner, with spaCy and its installation next:

conda install -c conda-forge spacy

and then gensim and its installation:

conda install -c conda-forge gensim

and then DGL, which has an installation screen similar to PyTorch in Figure 2 with the same picked options:

conda install -c dglteam dgl

The DGL-KE extension needs to be built from source for Windows, so we will hold off on that now until we need it. We next install PyTorch Geometric, which needs to be installed from a series of binaries, with CPU or GPU individually specified:

pip install torch-scatter==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-sparse==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-cluster==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-spline-conv==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-geometric

These new packages join those that are already a part of my local conda packages, and which will arise in the coming installments:

scikit-learn and tqdm.
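After a sequence of separate installs like this, a quick check that everything is importable can save some head-scratching later. The following is a minimal sketch using only the standard library; the helper name is made up, and the list uses import names, which differ from some install names (scikit-learn is imported as sklearn, PyTorch as torch, PyTorch Geometric as torch_geometric).

```python
from importlib.util import find_spec

def check_packages(names):
    """Return {import_name: True/False} depending on whether each top-level module is found."""
    return {name: find_spec(name) is not None for name in names}

# Import names for the packages discussed in this installment
wanted = ["torch", "spacy", "gensim", "dgl", "torch_geometric", "sklearn", "tqdm"]
for name, found in check_packages(wanted).items():
    print(f"{name:16} {'ok' if found else 'MISSING'}")
```

This only confirms that a module can be located, not that its version is compatible with the others.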

Getting Wikipedia Pages

With these preliminaries complete, we are now ready to resume our data preparation tasks for our embedding and machine learning experiments. In the prior installment, we discussed two of the three source files we had identified for these efforts, the KBpedia structure (kbpedia/v300/extractions/data/graph_specs.csv) and the KBpedia annotations (kbpedia/v300/extractions/classes/Generals_annot_out.csv) files. In this specific section we obtain the third source file of pages from Wikipedia.

Of the 58,000 reference concepts presently contained in KBpedia, about 45,000 have a directly corresponding Wikipedia article or listing of category articles. These provide a potentially rich source of content for language models and embeddings. The challenge is how to obtain this content in a way that can be readily processed for our purposes.

We have been working with Wikipedia since its inception, so we knew that there are data sources for downloads or dumps. For example, the periodic language dumps such as https://dumps.wikimedia.org/enwiki/20200920/ may be accessed to obtain full-text versions of articles. Such dumps have been used scores of times to produce Wikipedia corpora in many different languages and for many different purposes. But, our own mappings are a mere subset, about 1% of the nearly 6 million articles in the English Wikipedia alone. So, even if we grabbed the current dump or one of the corpora so derived, we would need to process much content to obtain the subset of interest.

Unfortunately, Wikipedia does not have a direct query or SPARQL form as exists for Wikidata (which also does not have full-text articles). We could obtain the so-called ‘long abstracts’ of Wikipedia pages from DBpedia (see, for example, https://wiki.dbpedia.org/downloads-2016-10), but this source is dated and each abstract is limited to about 220 words; further, a full download of the specific file in English is about 15 GB!

The basic approach, then, appeared that I would need to download the full Wikipedia article file, figure out how to split it into parts, and then match identifiers between KBpedia mappings and the full dataset to obtain the articles of interest. This approach is not technically difficult, but it is a real pain in the ass.

So, shortly before I committed to this work effort, I challenged myself to find another way that was perhaps less onerous. Fortunately, I found the online Wikipedia service, https://en.wikipedia.org/wiki/Special:Export, that allows one to submit article names to a text box and then get the full page article back in XML format. I tested this online service with a few articles, then 100, and then ramped up to a listing of 5 K at a time. (Similar services often have governors that limit the frequency or amounts of individual requests.) This approach worked, and within 30 min I had full articles in nine separate batches for all 45 K items in KBpedia.
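Preparing the batches for pasting into the export text box is easy to script. Here is a minimal sketch, assuming the article names are already in a Python list and that the form takes one title per line; the function name and the placeholder titles are made up for illustration.

```python
def batch_titles(titles, batch_size=5000):
    """Split a list of titles into newline-separated text blocks of at most batch_size titles."""
    return ["\n".join(titles[i:i + batch_size])
            for i in range(0, len(titles), batch_size)]

# Hypothetical stand-in for the roughly 45 K mapped article names
titles = [f"Article_{n}" for n in range(45000)]
batches = batch_titles(titles)
print(len(batches), "batches")
```

Each element of batches is a single string that can be pasted into the form or written to its own text file.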

Clean All Input Text

This file is a single article from the Wikipedia English dump for 1-(2-Nitrophenoxy)octane:

<page>
    <title>1-(2-Nitrophenoxy)octane</title>
    <ns>0</ns>
    <id>11793192</id>
    <revision>
      <id>891140188</id>
      <parentid>802024542</parentid>
      <timestamp>2019-04-05T23:04:47Z</timestamp>
      <contributor>
        <username>Koavf</username>
        <id>205121</id>
      </contributor>
      <minor/>
      <comment>/* top */Replace HTML with MediaWiki markup or templates, replaced: &lt;sub&gt; → {{sub| (3), &lt;/sub&gt; → }} (3)</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="2029" xml:space="preserve">{{chembox
| Watchedfields = changed
| verifiedrevid = 477206849
| ImageFile =Nitrophenoxyoctane.png
| ImageSize =240px
| ImageFile1 = 1-(2-Nitrophenoxy)octane-3D-spacefill.png
| ImageSize1 = 220
| ImageAlt1 = NPOE molecule
| PIN = 1-Nitro-2-(octyloxy)benzene
| OtherNames = 1-(2-Nitrophenoxy)octane&lt;br /&gt;2-Nitrophenyl octyl ether&lt;br /&gt;1-Nitro-2-octoxy-benzene&lt;br /&gt;2-(Octyloxy)nitrobenzene&lt;br /&gt;Octyl o-nitrophenyl ether
|Section1={{Chembox Identifiers
| Abbreviations =NPOE
| ChemSpiderID_Ref = {{chemspidercite|correct|chemspider}}
| ChemSpiderID = 148623
| InChI = 1/C14H21NO3/c1-2-3-4-5-6-9-12-18-14-11-8-7-10-13(14)15(16)17/h7-8,10-11H,2-6,9,12H2,1H3
| InChIKey = CXVOIIMJZFREMM-UHFFFAOYAD
| StdInChI_Ref = {{stdinchicite|correct|chemspider}}
| StdInChI = 1S/C14H21NO3/c1-2-3-4-5-6-9-12-18-14-11-8-7-10-13(14)15(16)17/h7-8,10-11H,2-6,9,12H2,1H3
| StdInChIKey_Ref = {{stdinchicite|correct|chemspider}}
| StdInChIKey = CXVOIIMJZFREMM-UHFFFAOYSA-N
| CASNo_Ref = {{cascite|correct|CAS}}
| CASNo =37682-29-4
| PubChem =169952
| SMILES = [O-][N+](=O)c1ccccc1OCCCCCCCC
}}
|Section2={{Chembox Properties
| Formula =C{{sub|14}}H{{sub|21}}NO{{sub|3}}
| MolarMass =251.321
| Appearance =
| Density =1.04 g/mL
| MeltingPt =
| BoilingPtC = 197 to 198
| BoilingPt_notes = (11 mm Hg)
| Solubility =
  }}
|Section3={{Chembox Hazards
| MainHazards =
| FlashPt =
| AutoignitionPt = 
 }}
}}

'''1-(2-Nitrophenoxy)octane''', also known as '''nitrophenyl octyl ether''' and abbreviated '''NPOE''', is a 
[[chemical compound]] that is used as a matrix in [[fast atom bombardment]] [[mass spectrometry]], liquid 
[[secondary ion mass spectrometry]], and as a highly [[lipophilic]] [[plasticizer]] in [[polymer]] 
[[Polymeric membrane|membranes]] used in [[ion selective electrode]]s.

== See also ==

* [[Glycerol]]
* [[3-Mercaptopropane-1,2-diol]]
* [[3-Nitrobenzyl alcohol]]
* [[18-Crown-6]]
* [[Sulfolane]]
* [[Diethanolamine]]
* [[Triethanolamine]]

{{DEFAULTSORT:Nitrophenoxy)octane, 1-(2-}}
[[Category:Nitrobenzenes]]
[[Category:Phenol ethers]]</text>
      <sha1>0n15t2w0sp7a50fjptoytuyus0vsrww</sha1>
    </revision>
  </page>

We want to extract out the specific article text (denoted by the <text> field), perhaps capture some other specific fields, remove internal tags, and then create a clean text representation that we can further process. This additional processing includes removing stoplist words, finding and identifying phrases (multiple token chunks), and then tokenizing the text suitable for processing as computer input.
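To show what the first of these steps amounts to, here is a minimal sketch that pulls the title and raw wikitext out of page records with the standard library's ElementTree. The embedded sample is a trimmed, made-up record; real export files wrap the pages in a namespaced mediawiki root element, so the tag names would need the namespace prefix in that case. This is not the gensim approach adopted below, just an illustration of the extraction step.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample; real export files use a namespaced <mediawiki> root
sample = """<pages>
  <page>
    <title>Example article</title>
    <ns>0</ns>
    <revision>
      <text>'''Example article''' is a [[placeholder]].</text>
    </revision>
  </page>
</pages>"""

def extract_articles(xml_string):
    """Return a list of (title, raw wikitext) tuples, one per <page> element."""
    root = ET.fromstring(xml_string)
    articles = []
    for page in root.iter("page"):
        title = page.findtext("title")
        text = page.findtext("revision/text") or ""
        articles.append((title, text))
    return articles

articles = extract_articles(sample)
print(articles)
```

The wikitext returned here still carries all of its markup (templates, links, bold quotes), which is the part the cleaning steps need to remove.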

There are multiple methods available for this kind of processing. One approach, for example, uses XML parsing and specific code geared to the Wikipedia dump. Another approach uses a dedicated Wikipedia extractor. There are actually a few variants of dedicated extractors.

However, one particular Python package, gensim, provides multiple utilities and Wikipedia services. Since I had already identified gensim to provide services like sentiment analysis and some other NLP capabilities, I chose to focus on using this package for the needed Wikipedia cleaning tasks.

Gensim has a gensim.corpora.wikicorpus.WikiCorpus class designed specifically for processing the Wikipedia article dump file. Fortunately, I was able to find some example code on KDnuggets that showed how to process this file.

However, prior to using gensim, I needed to combine the batch outputs from my Wikipedia page retrievals into a single xml file, which I could then bzip for direct ingest by gensim. (Most gensim models and capabilities can read either bzip or text files.)
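The bzip step can also be done from Python rather than with a separate utility. A minimal sketch using the standard library's bz2 and shutil modules follows; the function name and the paths in the usage comment are placeholders, not the ones used in this series.

```python
import bz2
import shutil

def bzip_file(src_path, dst_path):
    """Stream src_path into a bz2-compressed copy at dst_path without loading it all into memory."""
    with open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# Example call (hypothetical paths):
# bzip_file("wikipedia-pages-full.xml", "wikipedia-pages-full.xml.bz2")
```

Streaming with copyfileobj matters here because the combined XML file is several hundred megabytes.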

Each 5 K xml page retrieval from Wikipedia comes with its own header and closing tags. These need to be manually snipped out of the group retrieval files before combining. We prepared these into nine blocks that corresponded to each of the batch Wikipedia retrievals, and retained the header and closing tags in the first and last files respectively:

NOTE: Due to GitHub’s file size limits (of 100 MB max), the nine text files listed in the next routine have been zipped and uploaded to kbpedia.org/cwpk-text/Wikipedia-pages-1.zip. To use these files, you will need to download to your local system and unzip. You will need to increment the zip files up to #9. Then, all following routines below must be repeated locally in order to progress through the various cleaning and preparation steps.
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-pages-full.xml'
filenames = [r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-1.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-2.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-3.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-4.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-5.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-6.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-7.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-8.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-9.txt']
with open(out_f, 'w', encoding='utf-8') as outfile:
    for fname in filenames:
        with open(fname, encoding='utf-8', errors='ignore') as infile:
            i = 0
            for line in infile:
                i = i + 1
                try:
                    outfile.write(line)
                except Exception as e:
                    # i is an int, so convert it before concatenating
                    print('Error at line:' + str(i) + ' ' + str(e))
            print('Now combined:' + fname)
    # no explicit close needed; the 'with' block closes outfile on exit
    print('Now all files combined!')

The output of this routine is then bzipped offline, and then used as the submission to the gensim WikiCorpus function that processes the standard xml output:

"""
Creates a corpus from Wikipedia dump file.
Inspired by:
https://github.com/panyang/Wikipedia_Word2vec/blob/master/v1/process_wiki.py
"""

import sys
from gensim.corpora import WikiCorpus

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-pages-full.xml.bz2'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-output-full.txt'

def make_corpus(in_f, out_f):

    """Convert Wikipedia xml dump file to text corpus"""

    wiki = WikiCorpus(in_f)
    i = 0
    # 'with' closes the output file even if processing is interrupted
    with open(out_f, 'w', encoding='utf-8') as output:
        for text in wiki.get_texts():
            try:
                # tokens may be bytes or str depending on the gensim version
                tokens = [t.decode('utf-8') if isinstance(t, bytes) else t for t in text]
                output.write(' '.join(tokens) + '\n')
            except Exception as e:
                print('Exception error: ' + str(e))
            i = i + 1
            if (i % 10000 == 0):
                print('Processed ' + str(i) + ' articles')
    print('Processed ' + str(i) + ' articles;')
    print('Processing complete!')

make_corpus(in_f, out_f)

We further make a smaller input file, enwiki-test-corpus.xml.bz2, with only a few records from the Wikipedia XML dump in order to speed testing of the above code.

Initial Results

Here is what the sample program produced for the entry for 1-(2-Nitrophenoxy)octane listed above:

nitrophenoxy octane also known as nitrophenyl octyl ether and abbreviated npoe is chemical compound that is used as matrix in fast atom bombardment mass spectrometry liquid secondary ion mass spectrometry and as highly lipophilic plasticizer in polymer membranes used in ion selective electrodes see also glycerol mercaptopropane diol nitrobenzyl alcohol crown sulfolane diethanolamine triethanolamine

We see several things that are perhaps not in keeping with the extracted information we desire:

  1. No title
  2. No sentence boundaries
  3. No internal category links
  4. No infobox specifications

On the other hand, we do get the content from the ‘See Also’ section.
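Some of what is missing from that output (the leading digits, the word 'a', all punctuation) follows from how the text is tokenized. The sketch below is a rough approximation of that behavior, not gensim's actual tokenizer: it lowercases the text, keeps only alphabetic runs, and drops tokens shorter than 2 or longer than 15 characters, which are the default thresholds in gensim's wikicorpus.py. The function name and sample sentence are made up for illustration.

```python
import re

TOKEN_MIN_LEN = 2   # default thresholds taken from gensim's wikicorpus.py
TOKEN_MAX_LEN = 15

def simple_tokenize(text):
    """Lowercase, keep alphabetic runs only, and drop tokens outside the length limits."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if TOKEN_MIN_LEN <= len(w) <= TOKEN_MAX_LEN]

sample = "1-(2-Nitrophenoxy)octane, abbreviated NPOE, is a chemical compound."
tokens = simple_tokenize(sample)
print(" ".join(tokens))
```

Because periods are discarded along with everything else that is not a letter, sentence boundaries cannot be recovered after this step, which is why they have to be handled earlier in the processing.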

We want sentence boundaries for cleaner training purposes for word embedding models like word2vec. We want the other items so as to improve the lexical richness and context for the given concept. Further, we want two versions: one with titles as a separate field and one for learning purposes that includes the title in the lexicon (titles, after all, are preferred labels and deserve an additional frequency boost).

OK, so how does one make these modifications? My first hope was that arguments to these functions (args) might provide the specification latitude to deal with these changes. Unfortunately, none of the specified items fell into this category, though there is much latitude to modify underlying procedures. The second option was to find some third-party modification or override. Indeed, I did find one, that I found quite intriguing as a way to at least deal with sentence boundaries and possibly other areas. I spent nearly a full day trying to adapt this script, never succeeding. One fix would lead to another need for a fix, research on that problem, and then a fix and more problems. I’m sure most all of this is due to my amateur programming skills.

Still, the effort was frustrating. The good thing, however, is that in trying to work out a third-party fix, I was learning the underlying module. Eventually, it became clear if I was to address all desired areas it was smartest to modify the source directly. The three key functions that emerged as needing attention were tokenize, process_article and the class WikiCorpus(TextCorpus) code. In fact, it was the text processing heart of the last class that was the focus for changes, but the other two functions got involved because of their supporting roles. As I attempted to sub-class this basis with my own parallel approach (class KBWikiCorpus(WikiCorpus)), I kept finding the need to bring into the picture more supporting functions. Some of this may have been due to nuances in how to specify imported functions and modules, which I am still learning about (see concluding installments). But it is also difficult to sub-set or modify any code.

The real impact of these investigations was to help me understand the underlying module. What at first blush looked too intimidating, now was becoming understandable. I could also see other portions of the underlying module that addressed ALL aspects of my earlier desires. Third-party modifications choose their own scope; direct modification of the underlying module provides more aspects to tweak. So, I switched emphasis from modifying a third-party overlay to directly changing the core underlying module.

Modifying WikiCorpus

We already knew the key functions needing focus. All changes to be made occur in the wikicorpus.py file that resides in your gensim package directory under Python packages. So, I made a copy of the original file, named as such, and then proceeded to modify the base file. Though we will substitute the modified file, wikicorpus_kb.py, I also keep a backup of it so that we have copies of both the original and the modified file.
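The copy-before-editing step can be scripted so that it is not forgotten. Here is a minimal sketch with a made-up helper name and suffix; the commented usage assumes gensim is installed and relies on the module's __file__ attribute to locate wikicorpus.py.

```python
import shutil
from pathlib import Path

def backup_file(src, suffix="_original"):
    """Copy src to a sibling file with suffix added before the extension; return the backup path."""
    src = Path(src)
    backup = src.with_name(src.stem + suffix + src.suffix)
    shutil.copy2(src, backup)   # copy2 also preserves timestamps
    return backup

# Hypothetical usage: locate the installed module, then back it up before editing.
# import gensim.corpora.wikicorpus as wc
# backup_file(wc.__file__)   # creates wikicorpus_original.py alongside wikicorpus.py
```

Note that edits made inside the installed package directory are overwritten whenever gensim is updated, which is another reason to keep the modified copy somewhere safe.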

Here is the resulting modified code, with notes about key changes following the listing:

with open('files/wikicorpus_kb.py', 'r') as f:
    print(f.read())
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2010 Radim Rehurek <radimrehurek@seznam.cz>
# Copyright (C) 2012 Lars Buitinck <larsmans@gmail.com>
# Copyright (C) 2018 Emmanouil Stergiadis <em.stergiadis@gmail.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.

Uses multiprocessing internally to parallelize the work and process the dump more quickly.

Notes
-----
If you have the `pattern <https://github.com/clips/pattern>`_ package installed,
this module will use a fancy lemmatization to get a lemma of each token (instead of plain alphabetic tokenizer).

See :mod:`gensim.scripts.make_wiki` for a canned (example) command-line script based on this module.

"""

import bz2
import logging
import multiprocessing
import re
import signal
from pickle import PicklingError
# LXML isn't faster, so let's go with the built-in solution
try:
    from xml.etree.cElementTree import iterparse
except ImportError:
    from xml.etree.ElementTree import iterparse


from gensim import utils
# cannot import whole gensim.corpora, because that imports wikicorpus...
from gensim.corpora.dictionary import Dictionary
from gensim.corpora.textcorpus import TextCorpus

from six import raise_from


logger = logging.getLogger(__name__)

ARTICLE_MIN_WORDS = 50
"""Ignore shorter articles (after full preprocessing)."""

# default thresholds for lengths of individual tokens
TOKEN_MIN_LEN = 2
TOKEN_MAX_LEN = 15

RE_P0 = re.compile(r'<!--.*?-->', re.DOTALL | re.UNICODE)
"""Comments."""
RE_P1 = re.compile(r'<ref([> ].*?)(</ref>|/>)', re.DOTALL | re.UNICODE)
"""Footnotes."""
RE_P2 = re.compile(r'(\n\[\[[a-z][a-z][\w-]*:[^:\]]+\]\])+$', re.UNICODE)
"""Links to languages."""
RE_P3 = re.compile(r'{{([^}{]*)}}', re.DOTALL | re.UNICODE)
"""Template."""
RE_P4 = re.compile(r'{{([^}]*)}}', re.DOTALL | re.UNICODE)
"""Template."""
RE_P5 = re.compile(r'\[(\w+):\/\/(.*?)(( (.*?))|())\]', re.UNICODE)
"""Remove URL, keep description."""
RE_P6 = re.compile(r'\[([^][]*)\|([^][]*)\]', re.DOTALL | re.UNICODE)
"""Simplify links, keep description."""
RE_P7 = re.compile(r'\n\[\[[iI]mage(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)
"""Keep description of images."""
RE_P8 = re.compile(r'\n\[\[[fF]ile(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)
"""Keep description of files."""
RE_P9 = re.compile(r'<nowiki([> ].*?)(</nowiki>|/>)', re.DOTALL | re.UNICODE)
"""External links."""
RE_P10 = re.compile(r'<math([> ].*?)(</math>|/>)', re.DOTALL | re.UNICODE)
"""Math content."""
RE_P11 = re.compile(r'<(.*?)>', re.DOTALL | re.UNICODE)
"""All other tags."""
RE_P12 = re.compile(r'(({\|)|(\|-(?!\d))|(\|}))(.*?)(?=\n)', re.UNICODE)
"""Table formatting."""
RE_P13 = re.compile(r'(?<=(\n[ ])|(\n\n)|([ ]{2})|(.\n)|(.\t))(\||\!)([^[\]\n]*?\|)*', re.UNICODE)
"""Table cell formatting."""
RE_P14 = re.compile(r'\[\[Category:[^][]*\]\]', re.UNICODE)
"""Categories."""
RE_P15 = re.compile(r'\[\[([fF]ile:|[iI]mage)[^]]*(\]\])', re.UNICODE)
"""Remove File and Image templates."""
RE_P16 = re.compile(r'\[{2}(.*?)\]{2}', re.UNICODE)
"""Capture interlinks text and article linked"""
RE_P17 = re.compile(
    r'(\n.{0,4}((bgcolor)|(\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=)|(scope=))(.*))|'
    r'(^.{0,2}((bgcolor)|(\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=))(.*))',
    re.UNICODE
)
"""Table markup"""
IGNORED_NAMESPACES = [
    'Wikipedia', 'Category', 'File', 'Portal', 'Template',
    'MediaWiki', 'User', 'Help', 'Book', 'Draft', 'WikiProject',
    'Special', 'Talk'
]
"""MediaWiki namespaces that ought to be ignored."""


def filter_example(elem, text, *args, **kwargs):
    """Example function for filtering arbitrary documents from wikipedia dump.


    The custom filter function is called _before_ tokenisation and should work on
    the raw text and/or XML element information.

    The filter function gets the entire context of the XML element passed into it,
    but you can of course choose not the use some or all parts of the context. Please
    refer to :func:`gensim.corpora.wikicorpus.extract_pages` for the exact details
    of the page context.

    Parameters
    ----------
    elem : etree.Element
        XML etree element
    text : str
        The text of the XML node
    namespace : str
        XML namespace of the XML element
    title : str
        Page title
    page_tag : str
        XPath expression for page.
    text_path : str
        XPath expression for text.
    title_path : str
        XPath expression for title.
    ns_path : str
        XPath expression for namespace.
    pageid_path : str
        XPath expression for page id.

    Example
    -------
    .. sourcecode:: pycon

        >>> import gensim.corpora
        >>> filter_func = gensim.corpora.wikicorpus.filter_example
        >>> dewiki = gensim.corpora.WikiCorpus(
        ...     './dewiki-20180520-pages-articles-multistream.xml.bz2',
        ...     filter_articles=filter_func)

    """
    # Filter German wikipedia dump for articles that are marked either as
    # Lesenswert (featured) or Exzellent (excellent) by wikipedia editors.
    # *********************
    # regex is in the function call so that we do not pollute the wikicorpus
    # namespace do not do this in production as this function is called for
    # every element in the wiki dump
    _regex_de_excellent = re.compile(r'.*\{\{(Exzellent.*?)\}\}[\s]*', flags=re.DOTALL)
    _regex_de_featured = re.compile(r'.*\{\{(Lesenswert.*?)\}\}[\s]*', flags=re.DOTALL)

    if text is None:
        return False
    if _regex_de_excellent.match(text) or _regex_de_featured.match(text):
        return True
    else:
        return False


def find_interlinks(raw):
"""Find all interlinks to other articles in the dump.

Parameters
----------
raw : str
Unicode or utf-8 encoded string.

Returns
-------
list
List of tuples in format [(linked article, the actual text found), ...].

"""
filtered = filter_wiki(raw, promote_remaining=False, simplify_links=False)
interlinks_raw = re.findall(RE_P16, filtered)

interlinks = []
for parts in [i.split('|') for i in interlinks_raw]:
actual_title = parts[0]
try:
interlink_text = parts[1]
except IndexError:
interlink_text = actual_title
interlink_tuple = (actual_title, interlink_text)
interlinks.append(interlink_tuple)

legit_interlinks = [(i, j) for i, j in interlinks if '[' not in i and ']' not in i]
return legit_interlinks


def filter_wiki(raw, promote_remaining=True, simplify_links=True):
"""Filter out wiki markup from `raw`, leaving only text.

Parameters
----------
raw : str
Unicode or utf-8 encoded string.
promote_remaining : bool
Whether uncaught markup should be promoted to plain text.
simplify_links : bool
Whether links should be simplified keeping only their description text.

Returns
-------
str
`raw` without markup.

"""
# parsing of the wiki markup is not perfect, but sufficient for our purposes
# contributions to improving this code are welcome :)
text = utils.to_unicode(raw, 'utf8', errors='ignore')
text = utils.decode_htmlentities(text) # '&amp;nbsp;' --> '\xa0'
return remove_markup(text, promote_remaining, simplify_links)


def remove_markup(text, promote_remaining=True, simplify_links=True):
"""Filter out wiki markup from `text`, leaving only text.

Parameters
----------
text : str
String containing markup.
promote_remaining : bool
Whether uncaught markup should be promoted to plain text.
simplify_links : bool
Whether links should be simplified keeping only their description text.

Returns
-------
str
`text` without markup.

"""
text = re.sub(RE_P2, '', text) # remove the last list (=languages)
# the wiki markup is recursive (markup inside markup etc)
# instead of writing a recursive grammar, here we deal with that by removing
# markup in a loop, starting with inner-most expressions and working outwards,
# for as long as something changes.
# text = remove_template(text) # Note
text = remove_file(text)
iters = 0
while True:
old, iters = text, iters + 1
text = re.sub(RE_P0, '', text) # remove comments
text = re.sub(RE_P1, '', text) # remove footnotes
text = re.sub(RE_P9, '', text) # remove outside links
text = re.sub(RE_P10, '', text) # remove math content
text = re.sub(RE_P11, '', text) # remove all remaining tags
# text = re.sub(RE_P14, '', text) # remove categories # Note
text = re.sub(RE_P5, '\\3', text) # remove urls, keep description

if simplify_links:
text = re.sub(RE_P6, '\\2', text) # simplify links, keep description only
# remove table markup
text = text.replace("!!", "\n|") # each table head cell on a separate line
text = text.replace("|-||", "\n|") # for cases where a cell is filled with '-'
text = re.sub(RE_P12, '\n', text) # remove formatting lines
text = text.replace('|||', '|\n|') # each table cell on a separate line(where |{{a|b}}||cell-content)
text = text.replace('||', '\n|') # each table cell on a separate line
text = re.sub(RE_P13, '\n', text) # leave only cell content
text = re.sub(RE_P17, '\n', text) # remove formatting lines

# remove empty mark-up
text = text.replace('[]', '')
# stop if nothing changed between two iterations or after a fixed number of iterations
if old == text or iters > 2:
break

if promote_remaining:
text = text.replace('[', '').replace(']', '') # promote all remaining markup to plain text

return text


def remove_template(s):
"""Remove template wikimedia markup.

Parameters
----------
s : str
String containing markup template.

Returns
-------
str
Copy of `s` with all the `wikimedia markup template <http://meta.wikimedia.org/wiki/Help:Template>`_ removed.

Notes
-----
Since templates can be nested, it is difficult to remove them using regular expressions.

"""
# Find the start and end position of each template by finding the opening
# '{{' and closing '}}'
n_open, n_close = 0, 0
starts, ends = [], [-1]
in_template = False
prev_c = None
for i, c in enumerate(s):
if not in_template:
if c == '{' and c == prev_c:
starts.append(i - 1)
in_template = True
n_open = 1
if in_template:
if c == '{':
n_open += 1
elif c == '}':
n_close += 1
if n_open == n_close:
ends.append(i)
in_template = False
n_open, n_close = 0, 0
prev_c = c

# Remove all the templates
starts.append(None)
return ''.join(s[end + 1:start] for end, start in zip(ends, starts))


def remove_file(s):
"""Remove the 'File:' and 'Image:' markup, keeping the file caption.

Parameters
----------
s : str
String containing 'File:' and 'Image:' markup.

Returns
-------
str
Copy of `s` with all the 'File:' and 'Image:' markup replaced by their `corresponding captions
<http://www.mediawiki.org/wiki/Help:Images>`_.

"""
# The regex RE_P15 matches a File: or Image: markup
for match in re.finditer(RE_P15, s):
m = match.group(0)
caption = m[:-2].split('|')[-1]
s = s.replace(m, caption, 1)
return s

def tokenize(content):
# ORIGINAL VERSION
#def tokenize(content, token_min_len=TOKEN_MIN_LEN, token_max_len=TOKEN_MAX_LEN, lower=True):
"""Tokenize a piece of text from Wikipedia.

Set `token_min_len`, `token_max_len` as character length (not bytes!) thresholds for individual tokens.

Parameters
----------
content : str
String without markup (see :func:`~gensim.corpora.wikicorpus.filter_wiki`).
token_min_len : int
Minimal token length.
token_max_len : int
Maximal token length.
lower : bool
Convert `content` to lower case?

Returns
-------
list of str
List of tokens from `content`.

"""
# ORIGINAL VERSION
# TODO maybe ignore tokens with non-latin characters? (no chinese, arabic, russian etc.)
# return [
# utils.to_unicode(token) for token in utils.tokenize(content, lower=lower, errors='ignore')
# if token_min_len <= len(token) <= token_max_len and not token.startswith('_')
# ]
# NEW VERSION
return [token.encode('utf8') for token in utils.tokenize(content, lower=True, errors='ignore')
if len(token) <= 15 and not token.startswith('_')]
# TO RESTORE MOST PUNCTUATION
# return [token.encode('utf8') for token in content.split()
# if len(token) <= 15 and not token.startswith('_')]

def get_namespace(tag):
"""Get the namespace of tag.

Parameters
----------
tag : str
Namespace or tag.

Returns
-------
str
Matched namespace or tag.

"""
m = re.match("^{(.*?)}", tag)
namespace = m.group(1) if m else ""
if not namespace.startswith("http://www.mediawiki.org/xml/export-"):
raise ValueError("%s not recognized as MediaWiki dump namespace" % namespace)
return namespace


_get_namespace = get_namespace


def extract_pages(f, filter_namespaces=False, filter_articles=None):
"""Extract pages from a MediaWiki database dump.

Parameters
----------
f : file
File-like object.
filter_namespaces : list of str or bool
Namespaces that will be extracted.

Yields
------
tuple of (str or None, str, str)
Title, text and page id.

"""
elems = (elem for _, elem in iterparse(f, events=("end",)))

# We can't rely on the namespace for database dumps, since it's changed
# it every time a small modification to the format is made. So, determine
# those from the first element we find, which will be part of the metadata,
# and construct element paths.
elem = next(elems)
namespace = get_namespace(elem.tag)
ns_mapping = {"ns": namespace}
page_tag = "{%(ns)s}page" % ns_mapping
text_path = "./{%(ns)s}revision/{%(ns)s}text" % ns_mapping
title_path = "./{%(ns)s}title" % ns_mapping
ns_path = "./{%(ns)s}ns" % ns_mapping
pageid_path = "./{%(ns)s}id" % ns_mapping

for elem in elems:
if elem.tag == page_tag:
title = elem.find(title_path).text
text = elem.find(text_path).text

if filter_namespaces:
ns = elem.find(ns_path).text
if ns not in filter_namespaces:
text = None

if filter_articles is not None:
if not filter_articles(
elem, namespace=namespace, title=title,
text=text, page_tag=page_tag,
text_path=text_path, title_path=title_path,
ns_path=ns_path, pageid_path=pageid_path):
text = None

pageid = elem.find(pageid_path).text
yield title, text or "", pageid # empty page will yield None

# Prune the element tree, as per
# http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
# except that we don't need to prune backlinks from the parent
# because we don't use LXML.
# We do this only for <page>s, since we need to inspect the
# ./revision/text element. The pages comprise the bulk of the
# file, so in practice we prune away enough.
elem.clear()

_extract_pages = extract_pages # for backward compatibility


def process_article(args):
# ORIGINAL VERSION
#def process_article(args, tokenizer_func=tokenize, token_min_len=TOKEN_MIN_LEN,
# token_max_len=TOKEN_MAX_LEN, lower=True):
"""Parse a Wikipedia article, extract all tokens.

Notes
-----
Set `tokenizer_func` (defaults is :func:`~gensim.corpora.wikicorpus.tokenize`) parameter for languages
like Japanese or Thai to perform better tokenization.
The `tokenizer_func` needs to take 4 parameters: (text: str, token_min_len: int, token_max_len: int, lower: bool).

Parameters
----------
args : (str, bool, str, int)
Article text, lemmatize flag (if True, :func:`~gensim.utils.lemmatize` will be used), article title,
page identificator.
tokenizer_func : function
Function for tokenization (defaults is :func:`~gensim.corpora.wikicorpus.tokenize`).
Needs to have interface:
tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str.
token_min_len : int
Minimal token length.
token_max_len : int
Maximal token length.
lower : bool
Convert article text to lower case?

Returns
-------
(list of str, str, int)
List of tokens from article, title and page id.

"""

text, lemmatize, title, pageid = args
text = filter_wiki(text)
if lemmatize:
result = utils.lemmatize(text)
else:
# ORIGINAL VERSION
# result = tokenizer_func(text, token_min_len, token_max_len, lower)
# NEW VERSION
result = tokenize(text)
# result = title + text
return result, title, pageid


def init_to_ignore_interrupt():
"""Enables interruption ignoring.

Warnings
--------
Should only be used when master is prepared to handle termination of
child processes.

"""
signal.signal(signal.SIGINT, signal.SIG_IGN)


def _process_article(args):
"""Same as :func:`~gensim.corpora.wikicorpus.process_article`, but with args in list format.

Parameters
----------
args : [(str, bool, str, int), (function, int, int, bool)]
First element - same as `args` from :func:`~gensim.corpora.wikicorpus.process_article`,
second element is tokenizer function, token minimal length, token maximal length, lowercase flag.

Returns
-------
(list of str, str, int)
List of tokens from article, title and page id.

Warnings
--------
Should not be called explicitly. Use :func:`~gensim.corpora.wikicorpus.process_article` instead.

"""
tokenizer_func, token_min_len, token_max_len, lower = args[-1]
args = args[:-1]

return process_article(
args, tokenizer_func=tokenizer_func, token_min_len=token_min_len,
token_max_len=token_max_len, lower=lower
)


class WikiCorpus(TextCorpus):
"""Treat a Wikipedia articles dump as a read-only, streamed, memory-efficient corpus.

Supported dump formats:

* <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2
* <LANG>wiki-latest-pages-articles.xml.bz2

The documents are extracted on-the-fly, so that the whole (massive) dump can stay compressed on disk.

Notes
-----
Dumps for the English Wikipedia can be found at https://dumps.wikimedia.org/enwiki/.

Attributes
----------
metadata : bool
Whether to write articles titles to serialized corpus.

Warnings
--------
"Multistream" archives are *not* supported in Python 2 due to `limitations in the core bz2 library
<https://docs.python.org/2/library/bz2.html#de-compression-of-files>`_.

Examples
--------
.. sourcecode:: pycon

>>> from gensim.test.utils import datapath, get_tmpfile
>>> from gensim.corpora import WikiCorpus, MmCorpus
>>>
>>> path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
>>> corpus_path = get_tmpfile("wiki-corpus.mm")
>>>
>>> wiki = WikiCorpus(path_to_wiki_dump) # create word->word_id mapping, ~8h on full wiki
>>> MmCorpus.serialize(corpus_path, wiki) # another 8h, creates a file in MatrixMarket format and mapping

"""
def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), dictionary=None,
filter_namespaces=('0',)):
# ORIGINAL VERSION
# filter_namespaces=('0',), tokenizer_func=tokenize, article_min_tokens=ARTICLE_MIN_WORDS,
# token_min_len=TOKEN_MIN_LEN, token_max_len=TOKEN_MAX_LEN, lower=True, filter_articles=None):
"""Initialize the corpus.

Unless a dictionary is provided, this scans the corpus once,
to determine its vocabulary.

Parameters
----------
fname : str
Path to the Wikipedia dump file.
processes : int, optional
Number of processes to run, defaults to `max(1, number of cpu - 1)`.
lemmatize : bool
Use lemmatization instead of simple regexp tokenization.
Defaults to `True` if you have the `pattern <https://github.com/clips/pattern>`_ package installed.
dictionary : :class:`~gensim.corpora.dictionary.Dictionary`, optional
Dictionary, if not provided, this scans the corpus once, to determine its vocabulary
**IMPORTANT: this needs a really long time**.
filter_namespaces : tuple of str, optional
Namespaces to consider.
tokenizer_func : function, optional
Function that will be used for tokenization. By default, use :func:`~gensim.corpora.wikicorpus.tokenize`.
If you inject your own tokenizer, it must conform to this interface:
`tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str`
article_min_tokens : int, optional
Minimum tokens in article. Article will be ignored if number of tokens is less.
token_min_len : int, optional
Minimal token length.
token_max_len : int, optional
Maximal token length.
lower : bool, optional
If True - convert all text to lower case.
filter_articles: callable or None, optional
If set, each XML article element will be passed to this callable before being processed. Only articles
where the callable returns an XML element are processed, returning None allows filtering out
some articles based on customised rules.

Warnings
--------
Unless a dictionary is provided, this scans the corpus once, to determine its vocabulary.

"""
self.fname = fname
self.filter_namespaces = filter_namespaces
# self.filter_articles = filter_articles
self.metadata = True
if processes is None:
processes = max(1, multiprocessing.cpu_count() - 1)
self.processes = processes
self.lemmatize = lemmatize
# self.tokenizer_func = tokenizer_func
# self.article_min_tokens = article_min_tokens
# self.token_min_len = token_min_len
# self.token_max_len = token_max_len
# self.lower = lower
# get_title = cur_title

if dictionary is None:
self.dictionary = Dictionary(self.get_texts())
else:
self.dictionary = dictionary

def get_texts(self):
"""Iterate over the dump, yielding a list of tokens for each article that passed
the length and namespace filtering.

Uses multiprocessing internally to parallelize the work and process the dump more quickly.

Notes
-----
This iterates over the **texts**. If you want vectors, just use the standard corpus interface
instead of this method:

Examples
--------
.. sourcecode:: pycon

>>> from gensim.test.utils import datapath
>>> from gensim.corpora import WikiCorpus
>>>
>>> path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
>>>
>>> for vec in WikiCorpus(path_to_wiki_dump):
... pass

Yields
------
list of str
If `metadata` is False, yield only list of token extracted from the article.
(list of str, (int, str))
List of tokens (extracted from the article), page id and article title otherwise.

"""
articles, articles_all = 0, 0
positions, positions_all = 0, 0
# ORIGINAL VERSION
# tokenization_params = (self.tokenizer_func, self.token_min_len, self.token_max_len, self.lower)
# texts = \
# ((text, self.lemmatize, title, pageid, tokenization_params)
# for title, text, pageid
# in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces, self.filter_articles))
# pool = multiprocessing.Pool(self.processes, init_to_ignore_interrupt)
# NEW VERSION
texts = ((text, self.lemmatize, title, pageid) for title, text, pageid
in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
pool = multiprocessing.Pool(self.processes)
try:
# process the corpus in smaller chunks of docs, because multiprocessing.Pool
# is dumb and would load the entire input into RAM at once...

# ORIGINAL VERSION
# for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
# NEW VERSION
for group in utils.chunkize_serial(texts, chunksize=10 * self.processes):
# ORIGINAL VERSION
# for tokens, title, pageid in pool.imap(_process_article, group):
# NEW VERSION
for tokens, title, pageid in pool.imap(process_article, group): # chunksize=10):
articles_all += 1
positions_all += len(tokens)
# article redirects and short stubs are pruned here
# ORIGINAL VERSION
# if len(tokens) < self.article_min_tokens or \
# any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
# NEW VERSION FOR ENTIRE BLOCK
if len(tokens) < ARTICLE_MIN_WORDS or \
any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
continue
articles += 1
positions += len(tokens)
try:
if self.metadata:
title = title.replace(' ', '_')
title = (title + ',')
title = bytes(title, 'utf-8')
tokens.insert(0,title)
yield tokens
else:
yield tokens
except Exception as e:
print('Wikicorpus exception error: ' + str(e))
except KeyboardInterrupt:
logger.warning(
"user terminated iteration over Wikipedia corpus after %i documents with %i positions "
"(total %i articles, %i positions before pruning articles shorter than %i words)",
# ORIGINAL VERSION
# articles, positions, articles_all, positions_all, self.article_min_tokens
# NEW VERSION (the format string has five placeholders, so pass the word minimum too)
articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS
)
except PicklingError as exc:
raise_from(PicklingError('Can not send filtering function {} to multiprocessing, '
'make sure the function can be pickled.'.format(self.filter_articles)), exc)
else:
logger.info(
"finished iterating over Wikipedia corpus of %i documents with %i positions "
"(total %i articles, %i positions before pruning articles shorter than %i words)",
# ORIGINAL VERSION
# articles, positions, articles_all, positions_all, self.article_min_tokens
# NEW VERSION (the format string has five placeholders, so pass the word minimum too)
articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS
)
self.length = articles # cache corpus length
finally:
pool.terminate()

Gensim provides well documented code that is written in an understandable way.

Most of the modifications I made occurred at the bottom of the code listing. However, the text routine at the top of the file allows us to tailor which page ‘sections’ are kept in each Wikipedia article. Because of their substantive lexical content, I retain the page templates and category names along with the text body (see the two lines flagged with # Note in remove_markup, where the removal steps are commented out).
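To make the effect of that choice concrete, here is a minimal sketch (not the gensim code itself) of what happens to a category link when the category-removal step is skipped. It reuses the RE_P14 pattern from the listing; the sample sentence and the helper name promote_remaining are my own illustrative assumptions.

```python
import re

# Same pattern as RE_P14 in the listing above
RE_CATEGORY = re.compile(r'\[\[Category:[^][]*\]\]', re.UNICODE)


def promote_remaining(text):
    """Drop leftover square brackets, as remove_markup does in its last step."""
    return text.replace('[', '').replace(']', '')


sample = 'The lion is a large cat. [[Category:Felids]]'  # hypothetical wiki text

# Original behavior: category markup is removed before brackets are promoted
without_categories = promote_remaining(RE_CATEGORY.sub('', sample))

# Modified behavior: the removal step is commented out, so the category text survives
with_categories = promote_remaining(sample)

print(without_categories)
print(with_categories)
```

Note that the retained text still carries the literal word 'Category', which is worth keeping in mind when building a stoplist.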

Assuming I will want to retain these modifications and understand them at a later date, I block off all modified sections with ORIGINAL VERSION and NEW VERSION tags. One change was to remove punctuation. Another was to grab and capture the article title.
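The title capture is the easiest of these changes to isolate. As a rough sketch of the steps inside get_texts (the function name prepend_title and the sample tokens are hypothetical, introduced only for illustration), the title has its spaces replaced by underscores, gets a trailing comma, and is placed in front of the byte-encoded tokens:

```python
def prepend_title(title, tokens):
    """Return the tokens with a 'Title_With_Underscores,' label in front."""
    label = title.replace(' ', '_') + ','
    return [label.encode('utf-8')] + list(tokens)


tokens = [b'river', b'flows', b'west']  # hypothetical tokenized article
print(prepend_title('Ohio River', tokens))
```

The trailing comma makes the title easy to split off from the body text in later processing steps.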

This file, then, becomes a replacement for the original wikicorpus.py code. I am cognizant that changing underlying source code for local purposes is generally considered to be a BAD idea. It very well may be so in this case. However, with backups in place, and by being attentive to version updates and keeping working code in sync, I do not see where keeping track of a modification is any less sustainable than needing to update my own code that wraps the module. Both require inspection and effort. If I diff on the changed underlying module, I suspect the effort is equivalent to, or less than, that of changing a modification made through the third-party interface.

The net result is that I am now capturing the substantive content of these articles in a form I want to process.

Remove Stoplist

In my initial workflow, I had the step of stoplist removal later in the process because I thought it might be helpful to have all text prior to phrase identification. A stoplist (also known as ‘stop words’), by the way, is a listing of very common words (mostly conjunctions, common verb tenses, articles and prepositions) that can be removed from a block of text without adversely affecting its meaning or readability.

Since it proved superior not to retain these stop words when forming n-grams (see next section), I moved the routine up to be the next step in processing the Wikipedia pages. Here is the relevant code:

from gensim.parsing.preprocessing import remove_stopwords  # Key line for stoplist
from smart_open import smart_open

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-output-full.txt'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-output-full-stopped.txt'

# Extra stop words applied on top of gensim's built-in list
more_stops = ['b', 'c', 'category', 'com', 'd', 'f', 'formatnum', 'g', 'gave', 'gov', 'h', 
              'htm', 'html', 'http', 'https', 'id', 'isbn', 'j', 'k', 'l', 'loc', 'm', 'n', 
              'need', 'needed', 'org', 'p', 'properties', 'q', 'r', 's', 'took', 'url', 'use', 
              'v', 'w', 'www', 'y', 'z']  
documents = smart_open(in_f, 'r', encoding='utf-8')
content = [doc.split(' ') for doc in documents]       # one token list per article
with open(out_f, 'w', encoding='utf-8') as output:
    i = 0
    for line in content:
        try:
            # Rejoin the tokens into a single space-delimited string
            line = ', '.join(line)
            line = line.replace(', ', ' ')
            line = remove_stopwords(line)              # gensim's own stoplist
            querywords = line.split()
            # Then drop our additional stop words
            resultwords = [word for word in querywords if word.lower() not in more_stops]
            line = ' '.join(resultwords)
            line = line + '\n'
            output.write(line)
        except Exception as e:
            print('Exception error: ' + str(e))
        i = i + 1
        if (i % 10000 == 0):
            print('Stopwords applied to ' + str(i) + ' articles')
    # No explicit close is needed; the with block closes the output file
    print('Stopwords applied to ' + str(i) + ' articles;')
    print('Processing complete!')  
Stopwords applied to 10000 articles
Stopwords applied to 20000 articles
Stopwords applied to 30000 articles
Stopwords applied to 31157 articles;
Processing complete!

Gensim comes with its own stoplist, to which I added a few of my own, including removal of the category keyword that arose from adding that grouping. The output of this routine is the next file in the pipeline, wikipedia-output-full-stopped.txt.
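For readers without the full Wikipedia file at hand, the two-stage filtering can be tried on a single line. This is only a sketch: BASE_STOPS is a tiny stand-in I made up in place of gensim's much longer built-in list, MORE_STOPS is a small subset of the additions above, and the sample line is hypothetical.

```python
# Tiny stand-in for gensim's built-in stoplist (an assumption for illustration)
BASE_STOPS = frozenset({'the', 'of', 'and', 'a', 'in', 'is', 'to'})
# A few of the extra stop words from the routine above
MORE_STOPS = frozenset({'category', 'http', 'www', 'isbn'})


def remove_stops(line, stops=BASE_STOPS | MORE_STOPS):
    """Keep only the words whose lower-cased form is not in the stoplist."""
    return ' '.join(w for w in line.split() if w.lower() not in stops)


sample = 'Ohio_River, the river is a tributary of the Mississippi Category rivers'
print(remove_stops(sample))
```

The title label and the content words survive, while the common words and the 'Category' keyword drop out.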

Phrase Identification and Extraction

Phrases are n-grams, generally composed of two or three paired words, which are known as ‘bigrams’ and ‘trigrams’, respectively. Phrases are one of the most powerful ways to capture domain or technical language, since these compounded terms arise through the use and consensus of their users. Some phrases help disambiguate specific entities or places, as when, for example, ‘river’, ‘state’, ‘university’ or ‘buckeyes’ is combined with the term ‘ohio’.

Generally, most embeddings or corpora do not include n-grams in their initial preparation. But, for the reasons above, and given our experience with the usefulness of n-grams in text retrieval, we decided to include phrase identification and extraction as part of our preprocessing.

Again, gensim comes with its own phrase identifier, which learns its phrases from the corpus it is given (like all gensim models, you can re-train and tune these models as you gain experience and want them to perform differently). The main work of this routine is the ngram call, wherein term adjacency is used to construct paired term identifications. Here is the code and settings for our first pass with this function to create our initial bigrams from the stopped input text:

from gensim.models.phrases import Phraser, Phrases
from smart_open import smart_open

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-output-full-stopped.txt'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-bigram.txt'

documents = smart_open(in_f, 'r', encoding='utf-8')
sentence_stream = [doc.split(' ') for doc in documents]   # one token list per article
common_terms = ['aka']
# Note: these are gensim 3.x arguments; gensim 4 expects a str delimiter
# and renames common_terms to connector_words
ngram = Phrases(sentence_stream, min_count=3,threshold=10, max_vocab_size=80000000, 
                delimiter=b'_', common_terms=common_terms)
ngram = Phraser(ngram)                                     # lighter, faster version of the model
content = list(ngram[sentence_stream])                     # apply the model to every article
with open(out_f, 'w', encoding='utf-8') as output:
    i = 0
    for line in content:
        try:
            # Rejoin the tokens; the last token still carries the line's newline
            line = ', '.join(line)
            line = line.replace(', ', ' ')
            line = line.replace(' s ', '')
            output.write(line)
        except Exception as e:
            print('Exception error: ' + str(e))
        i = i + 1
        if (i % 10000 == 0):
            print('ngrams calculated for ' + str(i) + ' articles')
    # No explicit close is needed; the with block closes the output file
    print('Calculated ngrams for ' + str(i) + ' articles;')
    print('Processing complete!')   
ngrams calculated for 10000 articles
ngrams calculated for 20000 articles
ngrams calculated for 30000 articles
Calculated ngrams for 31157 articles;
Processing complete!

This routine takes about 14 minutes to run on my laptop, with the settings as shown. Note in the routine where we set the delimiter to be the underscore character; this is how we recognize a bigram in the output.

Once this routine finishes, we can take its output and re-use it as input to a subsequent run. Now, we will be producing trigrams wherever a further term can be matched to an existing bigram. Generally, we set our thresholds and minimum counts higher for this pass. In our case, the new settings are min_count=8, threshold=50. The trigram analysis takes 19 min to run.
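To give a feel for what min_count and threshold do, here is a simplified, pure-Python sketch of adjacency-based phrase detection. It assumes the default scoring, as I understand it, of (pair count - min_count) / (count of first word × count of second word) × vocabulary size; the real gensim implementation differs in its details, and the function names and toy corpus here are hypothetical.

```python
from collections import Counter


def learn_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent word pairs; return the set of pairs that pass the threshold."""
    unigrams, pairs = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        pairs.update(zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    accepted = set()
    for (a, b), n_ab in pairs.items():
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if n_ab >= min_count and score > threshold:
            accepted.add((a, b))
    return accepted


def join_bigrams(sentence, accepted, delimiter='_'):
    """Greedily merge accepted adjacent pairs, working left to right."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in accepted:
            out.append(sentence[i] + delimiter + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out


corpus = [  # toy corpus, already stopped
    ['ohio', 'state', 'plays', 'football'],
    ['ohio', 'state', 'wins', 'again'],
    ['ohio', 'state', 'campus', 'tour'],
    ['football', 'season', 'starts', 'again'],
]
bigrams = learn_bigrams(corpus, min_count=1, threshold=1.0)
print(join_bigrams(corpus[0], bigrams))
```

Only the pair that recurs often relative to its parts is joined; raising min_count or threshold makes the detector more selective.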

We have now completed our preprocessing steps for the embedding models we introduce in the next installment.

Additional Documentation

Here are many supplementary resources useful to the environment and natural language processing capabilities introduced in this installment.

PyTorch and pandas

PyTorch Resources and Tutorials

spaCy and gensim

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on November 9, 2020 at 11:23 pm in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2414/cwpk-63-staging-data-sci-resources-and-preprocessing/
Posted:November 5, 2020

Knowledge Graphs Deserve Attention in Their Own Right

We first introduced NetworkX in installment CWPK #56 of our Cooking with Python and KBpedia series. The purpose of NetworkX in that installment was to stage data for graph visualizations. In today’s installment, we look at the other side of the NetworkX coin; that is, as a graph analytics capability. We will also discuss NetworkX in relation to staging data for machine learning.

The idea of graphs or networks is at the center of the concept of knowledge graphs. Graphs are unique information artifacts that can be analyzed in their own right as well as being foundations for many unique analytical techniques, including for machine learning and its deep learning subset. Still, graphs as conceptual and mathematical structures are of relatively recent vintage. For example, the field known as graph theory is less than 300 years old. I outlined much of the intellectual history of graphs and their role in analysis in a 2012 article, The Age of the Graph.

Graph or network analysis has three principal aspects. The first aspect is to analyze the nature of the graph itself, with its connections, topologies and paths. The second is to use the structural aspects of the graph representation in order to conduct unique analyses. Some of these analyses relate to community or influence or relatedness. The third aspect is to use various or all aspects of the graph representation of the domain to provide, through dimensionality reduction, tractable and feature-rich methods for analyzing or conducting data science work useful to the domain. We’ll briefly cover the first two aspects in this installment. The remaining installments in this Part VI relate more to the third aspect of graph and deep graph representation.

Initial Setup

We will pick up with our NetworkX work from CWPK #56 to start this installment. (See the concluding sections below if you have not already generated the graph_specs.csv file.)

Since I have been away from the code for a bit, I first decide to make sure my Python packages are up-to-date by running this standard command:

>>>conda update --all

Then, we invoke our standard start-up routine:

from cowpoke.__main__ import *
from cowpoke.config import *
from owlready2 import *

We then want to bring NetworkX into our workspace, along with pandas for data management. The routine we are going to write will read our earlier graph_specs.csv file using pandas. We will use this specification to create a networkx representation of the KBpedia structure, and then begin reporting on some basic graph stats (which will take a few seconds to run):

import networkx as nx
import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
Graphtype = nx.DiGraph()
G = nx.from_pandas_edgelist(df, edge_attr='weight', create_using=Graphtype)
print('Graph construction complete.')

# Print the number of nodes in the graph
print('Number of Nodes:', len(G.nodes()))

# Print the edges leading out of the 'Mammal' node
print('Edges:', G.edges('Mammal'))

# Get the subgraph of all nodes around the 'Mammal' node
sub = [ n[1] for n in G.edges('Mammal') ]
# Actually, need to add the 'Mammal' node too
sub.append('Mammal')

# Now create a new graph, which is the subgraph
sg = nx.Graph(G.subgraph(sub))

# Print the nodes of the new subgraph and edges
print('Subgraph nodes:', sg.nodes())
print('Subgraph edges:', sg.edges())

# Print basic graph info
# (nx.info was removed in NetworkX 3.0; on newer versions, print(G) gives a similar summary)
print('Basic graph info:', nx.info(G))

We have picked ‘Mammal’ to generate some subgraphs and we also call up basic graph info based on networkx. As a directed graph, KBpedia can be characterized by both ‘in degree’ and ‘out degree’. ‘in degree’ is the number of edges pointing to a given node (or vertex); ‘out degree’ is the opposite. The average across all nodes in KBpedia exceeds 1.3. Both averages are the same because every directed edge adds exactly one to the out degree of its source node and one to the in degree of its target node. Our only edge type in this structure is subClassOf, which is transitive.
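A toy example may help show why the two averages match. This sketch uses only the standard library; the edge list is a hypothetical child-to-parent fragment, not taken from KBpedia.

```python
from collections import Counter

# Hypothetical (child, parent) subClassOf edges
edges = [('Dog', 'Mammal'), ('Cat', 'Mammal'), ('Mammal', 'Animal'), ('Bird', 'Animal')]

out_degree = Counter(src for src, _ in edges)   # edges leaving each node
in_degree = Counter(dst for _, dst in edges)    # edges arriving at each node
nodes = {n for edge in edges for n in edge}

avg_out = sum(out_degree[n] for n in nodes) / len(nodes)
avg_in = sum(in_degree[n] for n in nodes) / len(nodes)
print('Average in degree:', avg_in)
print('Average out degree:', avg_out)
```

Each average is simply the number of edges divided by the number of nodes, which is why they can never differ for the same graph.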

Network Metrics and Operations

So we see that our KBpedia graph has loaded properly, and now we are ready to do some basic network analysis. Most of the analysis deals with the relations structure of the graph. NetworkX has a very clean interface to common measures and metrics of graphs, as our examples below demonstrate.

‘Density’ is the ratio of actual edges in the network to all possible edges in the network, and ranges from 0 to 1. A ‘dense’ graph is one where the number of edges is close to the maximal number of edges; a ‘sparse’ graph is the opposite. For a directed graph, the maximal number of edges is the number of ordered node pairs, or nodes × (nodes - 1), since A → B and B → A are both possible. An undirected graph has half that potential. The density is thus the actual number of connections divided by the potential number. KBpedia is quite sparse.

print('Density:', nx.density(G))
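
As a minimal sketch of the arithmetic, assuming a small toy graph with made-up nodes rather than KBpedia itself, the density can be computed by hand and compared with what NetworkX reports for the directed and undirected cases:

```python
import networkx as nx

# Toy directed graph (hypothetical nodes, not the KBpedia structure)
toy = nx.DiGraph([('Animal', 'Mammal'), ('Mammal', 'Dog'), ('Mammal', 'Cat')])

n = toy.number_of_nodes()
m = toy.number_of_edges()

# Directed: every ordered pair of distinct nodes is a potential edge
directed_density = m / (n * (n - 1))

# Undirected: only unordered pairs count, so the potential is halved
undirected = toy.to_undirected()
undirected_density = undirected.number_of_edges() / (n * (n - 1) / 2)

print('Directed density:', directed_density, 'NetworkX:', nx.density(toy))
print('Undirected density:', undirected_density, 'NetworkX:', nx.density(undirected))
```

The same edges yield half the density once direction is taken into account, which is one reason a large directed graph looks so sparse.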

‘Degree’ is a measure for finding the most important nodes in a graph, since a node’s degree is the number of edges attached to it. You can find the degree for an individual node, or for the nodes with the maximum degree; this call reports the degree of a single node:

print('Degree:', nx.degree(G,'Mammal'))
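
To find the highest-degree nodes rather than a single one, one possible approach is to sort the (node, degree) pairs that NetworkX returns. This is only a sketch on a toy graph; the node names and the cutoff of two are illustrative assumptions:

```python
import networkx as nx

# Toy directed graph (hypothetical nodes, not the KBpedia structure)
toy = nx.DiGraph([('Animal', 'Mammal'), ('Animal', 'Bird'),
                  ('Mammal', 'Dog'), ('Mammal', 'Cat'), ('Mammal', 'Whale')])

# degree() yields (node, degree) pairs; for a DiGraph the degree is in + out
ranked = sorted(toy.degree(), key=lambda pair: pair[1], reverse=True)

# Keep the two most connected nodes
top_nodes = ranked[:2]
print('Highest-degree nodes:', top_nodes)
```

For the full KBpedia graph, the same pattern should apply with G in place of the toy graph and a larger cutoff.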

‘Average clustering’ is the mean of all node clusterings. A node is clustered if it has a relatively high number of actual links among its neighbors in relation to the potential links among those neighbors. A small-world network is one where the distance between random nodes grows in proportion to the natural log of the number of nodes in the graph. High average clustering combined with short path lengths is an indicator of a small-world network.

print('Average clustering:', nx.average_clustering(G))

G_node = 'Mammal'
print('Clustering for node', G_node, ':', nx.clustering(G, G_node))
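
The relationship between the per-node and average values can be checked on a small example. The sketch below assumes a made-up undirected toy graph; the letters are placeholders, not KBpedia concepts:

```python
import networkx as nx

# Toy undirected graph: a triangle (A, B, C) with one pendant node D
toy = nx.Graph([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])

# Per-node clustering: links among a node's neighbors / possible such links
per_node = nx.clustering(toy)

# The average is the mean of the per-node values
manual_average = sum(per_node.values()) / len(per_node)

print('Per-node clustering:', per_node)
print('Manual average:', manual_average)
print('NetworkX average:', nx.average_clustering(toy))
```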

‘Path length’ is calculated as the number of hops needed to traverse between two end nodes in a network. An ‘average path length’ measures the shortest paths between all pairs of nodes in a graph and then averages them. A small number indicates a shorter, more easily navigated graph on average, but there can be much variance.

print('Average shortest path length:', nx.average_shortest_path_length(G))

The next three measures throw an error, since KBpedia ‘is not strongly connected.’ ‘Eccentricity’ is the maximum shortest-path distance between a node and any other node in a graph, with the ‘diameter’ being the maximum eccentricity across all nodes and the ‘radius’ being the minimum.

print('Eccentricity:', nx.eccentricity(G))
print('Diameter:', nx.diameter(G))
print('Radius:', nx.radius(G))
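
One possible workaround, shown here only as a sketch on a toy graph, is to catch the error and then compute the measures on an undirected copy. This ignores edge direction, so the results describe a different structure than the directed graph; whether that is acceptable depends on the analysis:

```python
import networkx as nx

# Toy directed graph that is not strongly connected (hypothetical nodes)
toy = nx.DiGraph([('Animal', 'Mammal'), ('Animal', 'Bird')])

try:
    nx.eccentricity(toy)
    failed = False
except nx.NetworkXError as err:
    failed = True
    print('Directed graph failed:', err)

# Assumption: ignoring direction is acceptable for these distance measures
undirected = toy.to_undirected()
ecc = nx.eccentricity(undirected)
diameter = nx.diameter(undirected)
radius = nx.radius(undirected)

print('Eccentricity:', ecc)
print('Diameter:', diameter)
print('Radius:', radius)
```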

The algorithms that follow take longer to calculate or produce long listings. The first such measure we see is ‘centrality’, which in its simplest (degree) form in NetworkX is based on the number of connections to a given node, with higher connectivity a proxy for importance. Centrality can be measured in many different ways; there are multiple options in NetworkX.

# Calculate different centrality measures
print('Centrality:', nx.degree_centrality(G))
print('Centrality (eigenvector):', nx.eigenvector_centrality(G))
print('In-degree centrality:', nx.in_degree_centrality(G))
print('Out-degree centrality:', nx.out_degree_centrality(G))

Here are some longer analysis routines (unfortunately, betweenness takes hours to calculate):

# Calculate betweenness centrality (slow on a graph of this size)
print('Betweenness:', nx.betweenness_centrality(G))
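
If the exact calculation is too slow, NetworkX can estimate betweenness from a sample of source nodes through the k parameter of betweenness_centrality. The sketch below uses a toy graph; the values k=3 and seed=42 are arbitrary assumptions, and I have not timed this option on the full KBpedia graph:

```python
import networkx as nx

# Toy directed graph (hypothetical nodes, not the KBpedia structure)
toy = nx.DiGraph([('Animal', 'Mammal'), ('Animal', 'Bird'),
                  ('Mammal', 'Dog'), ('Mammal', 'Cat')])

# Exact betweenness uses every node as a source
exact = nx.betweenness_centrality(toy)

# Approximate betweenness samples k source nodes; seed makes the sample repeatable
approx = nx.betweenness_centrality(toy, k=3, seed=42)

print('Exact:', exact)
print('Approximate:', approx)
```

A smaller k runs faster but gives a noisier estimate, so some experimentation with the sample size is likely needed.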

Because KBpedia is a directed graph, some NetworkX measures are not applicable. Here are some of them:

  • nx.is_connected(G)
  • nx.connected_components(G).
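
NetworkX does provide directed counterparts for these connectivity checks. As a sketch on a toy graph with made-up nodes, the weakly connected variants ignore edge direction, while the strongly connected test requires a directed path in both directions:

```python
import networkx as nx

# Toy directed graph with two separate pieces (hypothetical nodes)
toy = nx.DiGraph([('Animal', 'Mammal'), ('Mammal', 'Dog'), ('Plant', 'Tree')])

# The undirected-only function refuses a directed graph
try:
    nx.is_connected(toy)
    refused = False
except nx.NetworkXNotImplemented as err:
    refused = True
    print('Not applicable:', err)

# Directed alternatives
weak = nx.is_weakly_connected(toy)
strong = nx.is_strongly_connected(toy)
components = [sorted(c) for c in nx.weakly_connected_components(toy)]

print('Weakly connected:', weak)
print('Strongly connected:', strong)
print('Weak components:', components)
```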

Subgraphs

We earlier showed code for extracting a subgraph. Here is a generalized version of that function. Replace the ‘Bird’ reference concept with any other valid RC from KBpedia:

# Provide label for current KBpedia reference concept
rc = 'Bird'
# Get the subgraph of all nodes around node
sub = [ n[1] for n in G.edges(rc) ]
# Actually, need to add the 'rc' node too
sub.append(rc)
#
# Now create a new graph, which is the subgraph
sg = nx.Graph(G.subgraph(sub))
#
# Print the nodes of the new subgraph and edges
print('Subgraph nodes:', sg.nodes())
print('Subgraph edges:', sg.edges())
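
The same steps can be wrapped in a reusable function. This is a sketch only: the function name neighborhood_subgraph and the toy graph are hypothetical, and which neighbors appear depends on the direction of the edges in the graph supplied:

```python
import networkx as nx

def neighborhood_subgraph(graph, rc):
    """Return an undirected subgraph of rc plus the nodes on its outgoing edges."""
    if rc not in graph:
        raise KeyError(f'{rc} is not a node in the graph')
    # edges(rc) on a DiGraph lists only the edges leaving rc
    nodes = [target for _, target in graph.edges(rc)]
    nodes.append(rc)
    return nx.Graph(graph.subgraph(nodes))

# Toy directed graph (hypothetical nodes, not the KBpedia structure)
toy = nx.DiGraph([('Animal', 'Bird'), ('Animal', 'Mammal'),
                  ('Bird', 'Eagle'), ('Bird', 'Penguin')])

sg = neighborhood_subgraph(toy, 'Bird')
print('Subgraph nodes:', sg.nodes())
print('Subgraph edges:', sg.edges())
```

Note that ‘Animal’ is not included in the result, because its edge points into ‘Bird’ rather than out of it.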

DeepGraphs

There is a notable utility package called DeepGraphs (and its documentation) that appears to offer some nice partitioning and quick visualization options. I have not installed or tested it.

Full Network Exchange

So far, we have seen the use of networks in driving visualizations (CWPK #56) and, per above, as knowledge artifacts with their own unique characteristics and metrics. The next role we need to highlight for networks is as information providers and graph-based representations of structure and features to analytical applications and machine learners.

NetworkX can convert to and from other data formats, including NumPy arrays, SciPy sparse matrices, and pandas DataFrames.

All of these are attractive because PyTorch has direct routines for them.

NetworkX can also read and write graphs in multiple formats, some of which include edge lists, adjacency lists, GML, GraphML, GEXF, JSON, and Pajek.

There are also standard NetworkX functions to convert node and edge labels to integers (such as networkx.relabel.convert_node_labels_to_integers), relabel nodes (networkx.relabel.relabel_nodes), set node attributes (networkx.classes.function.set_node_attributes), or make deep copies (networkx.Graph.to_directed).
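
As a brief sketch of these utility functions on a toy graph (the node names and the ‘SuperType’ values are illustrative assumptions, not taken from the KBpedia files):

```python
import networkx as nx

# Toy directed graph (hypothetical nodes, not the KBpedia structure)
toy = nx.DiGraph([('Animal', 'Mammal'), ('Mammal', 'Dog')])

# Attach an attribute to each node from a dictionary keyed by node
supertypes = {'Animal': 'Animals', 'Mammal': 'Animals', 'Dog': 'Animals'}
nx.set_node_attributes(toy, supertypes, name='SuperType')

# Replace string labels with integers, keeping the old label as an attribute
indexed = nx.convert_node_labels_to_integers(toy, label_attribute='label')
labels = nx.get_node_attributes(indexed, 'label')

# Rename selected nodes; by default this returns a relabeled copy
renamed = nx.relabel_nodes(toy, {'Dog': 'DomesticDog'})

print('Integer edges:', list(indexed.edges()))
print('Integer-to-label map:', labels)
print('Renamed nodes:', list(renamed.nodes()))
```

Integer node IDs with a separate label lookup are a common input shape for machine learners, which is why the label_attribute mapping is worth keeping.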

Certain packages also integrate well with NetworkX, PyTorch, and related packages. The Deep Graph Library (DGL) offers direct imports and exports (see CWPK #68 and #69), and built-in converters or the DeepSNAP package may provide a direct bridge between NetworkX and PyTorch Geometric (PyG) (see CWPK #68 and #70).

However, these representations do NOT include the labeled information or annotations. Knowledge graphs, like KBpedia, have some unique aspects that are not fully captured by an existing package like NetworkX.

Fortunately, the previous extract-and-build routines at the heart of this Cooking with Python and KBpedia series are based around CSV files, the same basis as the pandas package. Via pandas we can capture the structure of KBpedia, plus its labels and annotations. Further, as we will see in the next installment, we can also capture full pages for most of these RCs in KBpedia from Wikipedia. This addition will greatly expand our context and feature basis for using KBpedia for machine learning.

For now, I present below two of these three inputs, extracted directly from the KBpedia knowledge graph.

KBpedia Structure

The first of two extraction files useful to all further installments in this Part VI provides the structure of KBpedia. This structure consists of the hierarchical relation between reference concepts using the subClassOf subsumption relation and the assignment of that RC to a typology (SuperType). I first presented this routine in CWPK #56 and it, indeed, captures the requisite structure of the graph:

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###             
# 'kb_src'        : 'standard'                                        # Set in master_deck
# 'loop_list'     : kko_order_dict.values(),                          # Note 1   
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/mappings/',              
# 'ext'           : '.csv',                                         
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv',

import csv

def graph_extractor(**extract_deck):
    print('Beginning graph structure extraction . . .')
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    class_loop = extract_deck.get('class_loop')
    base = extract_deck.get('base')
    ext = extract_deck.get('ext')
    
    # Note 2
    parent_set = ['kko.SocialSystems','kko.Products','kko.Methodeutic','kko.Eukaryotes',
              'kko.ConceptualSystems','kko.AVInfo','kko.Systems','kko.Places',
              'kko.OrganicChemistry','kko.MediativeRelations','kko.LivingThings',
              'kko.Information','kko.CopulativeRelations','kko.Artifacts','kko.Agents',
              'kko.TimeTypes','kko.Symbolic','kko.SpaceTypes','kko.RepresentationTypes',
              'kko.RelationTypes','kko.OrganicMatter','kko.NaturalMatter',
              'kko.AttributeTypes','kko.Predications','kko.Manifestations',
              'kko.Constituents']

    # Compare string values with !=, not identity ('is not')
    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning.")
        return
    header = ['target', 'source', 'weight', 'SuperType']
    out_file = extract_deck.get('out_file')
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:                                           
        csv_out = csv.writer(output)
        csv_out.writerow(header)    
        for value in loop_list:
            print('   . . . processing', value)
            s_set = []
            root = eval(value)
            s_set = root.descendants()
            frag = value.replace('kko.','')
            for s_item in s_set:
                child_set = list(s_item.subclasses())
                count = len(list(child_set))
                
                # Note 3
                if value not in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        cur_list.append(new_pair)
                        s_rc = s_rc.replace('rc.','')
                        child = child.replace('rc.','')
                        row_out = (s_rc,child,count,frag)
                        csv_out.writerow(row_out)
                elif value in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        if new_pair not in cur_list:
                            cur_list.append(new_pair)
                            s_rc = s_rc.replace('rc.','')
                            child = child.replace('rc.','')
                            row_out = (s_rc,child,count,frag)
                            csv_out.writerow(row_out)
                        elif new_pair in cur_list:
                            continue
        print('Processing is complete . . .')
graph_extractor(**extract_deck)

Note, again, the parent_set ordering of typology processing at the top of this function. This ordering processes the more distal (leaf) typologies first, and then ignores subsequent processing of identical structural relationships. This means that the graph structure is cleaner and all subsumption relations are “pushed down” to their most specific mention.

You can inspect the actual structure file produced using this routine, which is also the general basis for reading into various machine learners:

import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')

df

KBpedia Annotations

And, we also need to bring in the annotation values. The annotation extraction routine was first presented and described in CWPK #33, and was subsequently generalized and brought into conformance with our configuration routines in CWPK #33. Note, for example, in the header definition, how we are able to handle either classes or properties. In this instance, plus all subsequent machine learning discussion, we concentrate on the labels and annotations for classes:

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###                
# 'kb_src'        : 'extract'                                          # Set in master_deck
# 'descent_type'  : 'descent',
# 'loop'          : 'class_loop',
# 'loop_list'     : custom_dict.values(),                              # Single 'Generals' specified 
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv',
# 'render'        : 'r_label',

import csv

def annot_extractor(**extract_deck):
    print('Beginning annotation extraction . . .') 
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return    
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    descent_type = extract_deck.get('descent_type')
    """ These are internal counters used in this module's methods """
    p_set = []
    a_ser = []
    x = 0                                  # rows written so far
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)                                       
        if loop == 'class_loop':                                             
            header = ['id', 'prefLabel', 'subClassOf', 'altLabel', 
                      'definition', 'editorialNote', 'isDefinedBy', 'superClassOf']
        else:
            header = ['id', 'prefLabel', 'subPropertyOf', 'domain', 'range', 
                      'functional', 'altLabel', 'definition', 'editorialNote']
        csv_out.writerow(header)    
        for value in loop_list:                                            
            print('   . . . processing', value)                                           
            root = eval(value) 
            if descent_type == 'descent':
                p_set = root.descendants()
            elif descent_type == 'single':
                a_set = root
                p_set.append(a_set)
            else:
                print('You have assigned an incorrect descent method--execution stopping.')
                return    
            for p_item in p_set:
                if p_item not in cur_list:                                 
                    a_pref = p_item.prefLabel
                    a_pref = str(a_pref)[1:-1].strip('"\'')                
                    a_sub = p_item.is_a
                    for a_id, a in enumerate(a_sub):                        
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_sub + '||' + str(a)
                        a_sub  = a_item
                    if loop == 'property_loop':   
                        a_item = ''
                        a_dom = p_item.domain
                        for a_id, a in enumerate(a_dom):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_dom + '||' + str(a)
                            a_dom  = a_item    
                        a_dom = a_item
                        a_rng = p_item.range
                        a_rng = str(a_rng)[1:-1]
                        a_func = ''
                    a_item = ''
                    a_alt = p_item.altLabel
                    for a_id, a in enumerate(a_alt):
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_alt + '||' + str(a)
                        a_alt  = a_item    
                    a_alt = a_item
                    a_def = p_item.definition
                    a_def = str(a_def)[2:-2]
                    a_note = p_item.editorialNote
                    a_note = str(a_note)[1:-1]
                    if loop == 'class_loop':                                  
                        a_isby = p_item.isDefinedBy
                        a_isby = str(a_isby)[2:-2]
                        a_isby = a_isby + '/'
                        a_item = ''
                        a_super = p_item.superClassOf
                        for a_id, a in enumerate(a_super):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_super + '||' + str(a)
                            a_super = a_item    
                        a_super  = a_item
                    if loop == 'class_loop':                                  
                        row_out = (p_item,a_pref,a_sub,a_alt,a_def,a_note,a_isby,a_super)
                    else:
                        row_out = (p_item,a_pref,a_sub,a_dom,a_rng,a_func,
                                   a_alt,a_def,a_note)
                    csv_out.writerow(row_out)                               
                    cur_list.append(p_item)
                    x = x + 1
    print('Total unique IDs written to file:', x)  
    print('The annotation extraction for the', loop, 'is completed.') 

You can inspect this actual file of labels and annotations using this routine:

import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv')

df

We will add Wikipedia pages as a third source for informing our machine learning tests and experiments in our next installment.

Untested Potentials

One area in extended NetworkX capabilities that we do not test here is community structure using the Louvain Community Detection package.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on November 5, 2020 at 11:07 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2413/cwpk-62-network-and-graph-analysis/
The URI to trackback this post is: https://www.mkbergman.com/2413/cwpk-62-network-and-graph-analysis/trackback/
Posted:November 2, 2020

A Wealth of Applications Sets the Stage for Pay Offs from KBpedia

With this installment of the Cooking with Python and KBpedia series we move into Part VI of seven parts, a part with the bulk of the analytical and machine learning (that is, “data science”) discussion, and the last part where significant code is developed and documented. Because of the complexity of these installments, we will also be reducing the number released per week for the next month or so. We also will not be able to post fully operational electronic notebooks to MyBinder since the supporting libraries strain the limits of that service. At the conclusion of this part, which itself has 11 installments, we have four installments to wrap up the series and provide a consistent roadmap to the entire project.

Knowledge graphs are unique information artifacts, and KBpedia is further unique in terms of its consistent and logical construction as well as its incorporation of significant text content via Wikipedia pages. These characteristics provide unique value for KBpedia, but it is also a combination not duplicated anywhere else in the data science ecosystem. One of the objectives, therefore, of this part of our CWPK series is the creation of some baseline knowledge representations useful to data science aims that capture these unique characteristics.

The combination of characteristics in KBpedia (or in any knowledge graph constructed in a similar manner) makes it a powerful resource in three areas of data science and machine learning. First, the nearly universal scope and degree of topic coverage with about 56,000 concepts, logically organized into typologies with a high degree of disjointedness, means that accurate ‘slices’ or training sets may be extracted from KBpedia nearly instantaneously. Creating labeled training sets is one of the most time consuming and expensive activities in doing supervised machine learning. We can extract these nearly for free from KBpedia. Further, with its links to tens of millions of entities in its mapped knowledge bases such as Wikidata, literally tens of thousands of conceptual entities in KBpedia can be the retrieval points to nucleate training sets for fine-grained entity recognition.

Second, 80% of KBpedia’s concepts are mapped to Wikipedia articles. While many Wikipedia-based word embedding models exist, the ones in KBpedia are logically categorized and have rough equivalence in terms of scope and prominence, hopefully providing cleaner topic ‘signals’. To probe these assertions, we will create a unique KBpedia-based word embedding corpus that also leverages labels for items of structural importance, such as typology membership. We will use this corpus in many of our tests and as a general focus in our training sets.

And, third, perhaps the most important area, knowledge graphs offer unique structures and challenges for machine learning, especially innovations in geometric, heterogeneous methods for deep learning. The first generation of deep machine learning was designed for grid-patterned data and matrices through approaches such as deep neural networks, convolutional neural networks (CNN), or recurrent neural networks (RNN). The ‘deep’ appellation comes from having multiple calculated, intermediate layers of transformations between the grid inputs and outputs for the model. Graphs, on the other hand, are heterogeneous between nodes and edges. They may be directed (subsumptive) in nature. And, for knowledge graphs, they have much labeling and annotation, including varying degrees of attribute completeness. Language embedding, itself often a product of deep learning, enables the efficient incorporation of text. It is only in the past five years that concerted attention has been devoted to better capturing this feature richness for knowledge graphs.

The eleven installments in this part will look in more depth at networks and graphs, focus on how to create training sets and embeddings for the learners, discuss some natural language packages and uses, and then look in depth at ‘standard’ machine learners and deep learners. We will install the first generation of deep graph learners and then explore some on the cutting edge. We will test many use cases, but will also try to invoke classifiers across this spectrum so that we can draw some general conclusions.

The material below introduces and tees up these topics. We describe leading Python packages for data science, and how we have architected our own approach. We have picked a particular Python machine learning framework, PyTorch, to which we will then tie four different NLP and deep learning libraries. We devote two installments each to these four libraries. The use cases we document across these installments are in addition to the existing ones we have in Clojure posted online.

So, we think we have an interesting suite of benefits to cover in this part, some arising from being based on KBpedia and some arising from the nature of knowledge graphs. On the other hand, due to the relative immaturity of the field, we are still actively learning and innovating around the juncture of AI and knowledge graphs. Thus, one of the reasons we emphasize Python ‘ecosystems’ and ‘frameworks’ in this part is to be better prepared to incorporate those innovations and learnings to come.

Background

One of the first prototypes of machine learning comes from the statistician Ronald Fisher in the 1930s regarding how to classify Iris species based on the attributes of their flowers. It was a multivariate data example using the method we today call linear discriminant analysis. This classic example is still taught. But many dozens of new algorithms and combined approaches have joined the machine learning field since then.

Figure 1 below is one way to characterize the field, with ML standing for machine learning and DL for deep learning, with this one oriented to sub-fields in which some Python package already exists:

Machine Learning Landscape
Figure 1: Machine Learning Landscape (from S. Chen, “Machine Learning Algorithms For Beginners with Code Examples in Python”, June 2020)

There are many possible diagrams that one might prepare to show the machine learning landscape, including ones with a larger emphasis on text and knowledge graphs. Nearly all schematics of the field show a basic split between supervised learning and unsupervised learning (sometimes with reinforcement learning as another main branch), with the main difference being that supervised approaches iterate to achieve statistical fit with pre-determined labels, whereas unsupervised approaches work with unlabeled data. Accurate labeling can be costly and time consuming. Note that the idea of ‘classification’ is a supervised one, ‘clustering’ a notion of unsupervised.

We will include a ‘standard’ machine learning library in our proposed toolkit, the selection of which I discuss below. However, most of the evaluation time I spent in researching these installments was directed to the idea of knowledge representation and embeddings applicable to graphs. Graphs pose a number of differences and challenges to standard machine learning. They have only been a recent (five-year) focus in machine learning, a field that is itself changing rapidly.

All machine learners need to operate on their feature spaces in numerical representations. Text is a tricky form because language is difficult and complex. In representing the tokens of our language in a form usable by a computer, what do we need to consider? Parts-of-speech, the word itself, sentence construction, semantic meaning, context, adjacency, entity recognition or characterization? These may all figure into how one might represent text. Machine learning has brought us unsupervised methods for converting words, sentences, documents and, now, graphs to a reduced, numeric representation known as “embeddings.” The embedding method may capture one or more of these textual or structural aspects.

Much of the first interest in machine learning based on graphs was driven by these interests in embeddings for language text. Standard machine classifiers with deep learning using neural networks have given us word2vec, and more recently BERT and its dozens of variants have reinforced the usefulness of deep learning to create pre-trained text representations.

Indeed, embeddings do figure prominently in knowledge graph representation, but only as one among many useful features. Knowledge graphs with hierarchical (subsumption) relationships, as might be found in any taxonomy, become directed. Knowledge graphs are asymmetrical, and often multi-typed and sometimes multi-modal. There is heterogeneity among nodes and links or edges. Not all knowledge graphs are created equal and some of these aspects may not apply. Whether a richness of text description accompanies the nodes or edges is another wrinkle. None of the early CNN or RNN or simple neural net approaches match well with these structures.

The general category that appears to have emerged for this scope is geometric deep learning, which applies to all forms of graphs and manifolds. There are other nuances in this area, for example whether a static representation is the basis for analysis or one that is dynamic, essentially allowing learning parameters to be changed as the deep learning progresses through its layers. But GDL has the theoretical potential to address and incorporate all of the wrinkles associated with heterogeneous knowledge graphs.

So, this discussion helps define our desired scope. We want to be able to embrace Python packages that range from simple statistics to simple machine learning, throwing in natural language processing and creating embedding representations, that can then range all the way through deep learning to the cutting-edge aspects of geometric or graph deep learning.

Leading Python Data Science Packages

This background provides the necessary context for our investigations of Python packages, frameworks, or libraries that may fulfill the data science objectives of this part. Our new components often build upon and need to play nice with some of the other requisite packages introduced in earlier installments, including pandas (CWPK #55), NetworkX (CWPK #56), and PyViz (CWPK #55). NumPy has been installed, but not discussed.

We want to focus our evaluation of Python options in these areas:

  • Natural Language Processing, including embeddings
  • ‘Standard’ Machine Learning
  • Deep Learning and Abstraction Frameworks, and
  • Knowledge Graph Representation Learning.

The latter area may help us tie these various components together.

Natural Language Processing

It is not fair to say that natural language processing has become a ‘commodity’ in the data science space, but it is also true there is a wealth of capable, complete packages within Python. There are standard NLP requirements like text cleaning, tokenization, parts-of-speech identification, parsing, lemmatization, phrase identification, and so forth. We want these general text processing capabilities since they are often building blocks and sometimes needed in their own right. We also would like to add to this baseline such considerations as interoperability, creating embeddings, or other special functions.

The two leading NLP packages in Python appear to be:

  • NLTK – the natural language toolkit that is proven and has been a leader for twenty years
  • spaCy – a newer, but very impressive package oriented more to tasks, not function calls.

Other leading packages, with varying NLP scope, include:

  • flair – a very simple framework for state-of-the-art NLP that is based on PyTorch and works based on context
  • gensim – a semantic and topic modeling library; not general purpose, but with valuable capabilities
  • OpenNMT-py – an open source library for neural machine translation and neural sequence learning; provided for both the PyTorch and TensorFlow environments
  • Polyglot – a natural language pipeline that supports massive multilingual applications
  • Stanza – a neural network pipeline for text analytics; beyond standard functions, has multi-word token (MWT) expansion, morphological features, and dependency parsing; uses the Java CoreNLP from Stanford
  • TextBlob – a simplified text processor, which is an extension to NLTK.

Another key area is language embedding. Language embeddings are means to translate language into a numerical representation for use in downstream analysis, with great variety in what aspects of language are captured and how to craft them. The simplest and still widely-used representation is the tf-idf (term frequency–inverse document frequency) statistical measure. A common variant after that was the vector space model. We also have latent (unsupervised) models such as LDA. A more easily calculated option is explicit semantic analysis (ESA). At the word level, two of the prominent options are word2vec and GloVe, which are used directly in spaCy. These have arisen from deep learning models. We also have similar approaches to represent topics (topicvec), sentences (sentence2vec), categories and paragraphs (Category2Vec), documents (doc2vec), graph nodes (node2vec), or entire languages (BERT and variants and GPT-3 and related methods). In all of these cases, the embedding consists of reducing the dimensionality of the input text, which is then represented in numeric form.
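
To make the simplest of these representations concrete, here is a minimal sketch of tf-idf in plain Python. It assumes one common unsmoothed variant (term count divided by document length, times the natural log of the number of documents over the document frequency); libraries such as scikit-learn use smoothed formulas that give different numbers. The three tiny ‘documents’ are made up for illustration:

```python
import math
from collections import Counter

# Three toy documents, already cleaned and lower-cased (illustrative only)
docs = [
    'mammal animal fur',
    'bird animal feather',
    'mammal whale ocean',
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(doc):
    """Weight each term by its frequency in doc and its rarity across docs."""
    counts = Counter(doc)
    return {term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()}

vectors = [tf_idf(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, '->', vec)
```

Terms that occur in only one document, such as ‘fur’, receive a higher weight than terms shared across documents, which is the basic intuition behind the measure.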

There are internal methods for creating embeddings in multiple machine learning libraries. Some packages are more dedicated, such as fastText, which is a library for learning of word embeddings and text classification created by Facebook’s AI Research (FAIR) lab. Another option is TextBrewer, which is an open-source knowledge distillation toolkit based on PyTorch and which uses (among others) BERT to provide text classification, reading comprehension, NER or sequence labeling.

Closely related to how we represent text are corpora and datasets that may be used either for reference or training purposes. These need to be assembled and tested, just as software packages do. The availability of corpora to different packages is a useful evaluation criterion. But, the picking of specific corpora depends on the ultimate Python packages used and the task at hand. We will return to this topic in CWPK #63.

‘Standard’ Machine Learning

Of course, nearly all of the Python packages mentioned in this Part VI have some relation to machine learning in one form or another. I call out the ‘standard’ machine learning category separately because, as with NLP, I think it makes sense to have a general learning library not devoted to deep learning but providing a repository of classic learning methods.

There really is no general option that compares with scikit-learn. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN data clustering, and is designed to interoperate with NumPy and SciPy. The project is extremely active with good documentation and examples.
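As a small taste of that interoperability, the following sketch clusters a NumPy array with scikit-learn's k-means implementation. The data points are made up for illustration and the parameter values are arbitrary choices, not recommendations; the sketch assumes scikit-learn and NumPy are installed, as they are with Anaconda.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (invented for this sketch): two well-separated groups of 2-D points.
X = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.7],
])

# n_clusters, n_init and random_state are illustrative settings only.
model = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict learns the cluster centers and returns a cluster label per row.
labels = model.fit_predict(X)
```

The same fit and predict pattern applies across most scikit-learn estimators, which is part of what makes the package a convenient general-purpose learning library.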

We’ll return to scikit-learn below.

Deep Learning and Abstraction Frameworks

Deep learning is characterized by many options, methods and philosophies, all in a fast-changing area of knowledge. New methods need to be compared on numerous grounds from feature and training set selection to testing, parameter tuning, and performance comparisons. These realities have put a premium on libraries and frameworks that wrap methods in repeatable interfaces and provide abstract functions for setting up and managing various deep (and other) learning algorithms.

The space of deep learning thus embraces many individual methods and forms, often expressed through a governing ecosystem of other tools and packages. These demands lead to a confusing space of Python options, some overlapping and some not intersecting at all, that is hard to describe and to evaluate comparatively. Here are some of the libraries and packages that fit within the deep and machine learning space, including abstraction frameworks:

  • Chainer is an open source deep learning framework written purely in Python on top of NumPy and CuPy Python libraries
  • Microsoft Cognitive Toolkit (CNTK) is an open-source toolkit for commercial-grade distributed deep learning; however, it has seen its last main release in favor of the interoperable approach, ONNX (see below)
  • Keras is an open-source library that provides a Python interface for artificial neural networks. Keras now acts as an interface for the TensorFlow library, though earlier versions could also run on top of other backends such as Theano and CNTK; it has a high-level library for working with datasets
  • PlaidML is a portable tensor compiler; it runs as a component under Keras
  • PyTorch is an open source machine learning library based on the Torch library with a very rich ecosystem of interfacing or contributing projects
  • TensorFlow is a well-known open source machine learning library developed by Google
  • Theano is a Python library and optimizing compiler for manipulating and evaluating mathematical expressions, especially matrix-valued ones; it is tightly integrated with NumPy, and uses it at the lowest level

Keras is increasingly aligning with TensorFlow and some, like Chainer and CNTK, are being deprecated in favor of the two leading gorillas, PyTorch and TensorFlow. One approach to improve interoperability is the Open Neural Network Exchange (ONNX) with the repository available on GitHub. There are existing converters to ONNX for Keras, TensorFlow, PyTorch and scikit-learn.

A key development in deep learning over the past three years has been the usefulness of Transformers, a technique that marries encoders and decoders converging on the same representation. The technique is particularly helpful for sequential data and NLP, with state-of-the-art performance to date for:

  • next-sentence prediction
  • question answering
  • reading comprehension
  • sentiment analysis, and
  • paraphrasing.

Both BERT and GPT are pre-trained products that utilize this method. Both TensorFlow and PyTorch contain Transformer capabilities.

Knowledge Graph Representation Learning

As noted, most of my research for this Part VI has resided in the area of a subset of deep graph learning applicable to knowledge graphs. The leading deep learning libraries do not, in general, provide support for this area of representational learning, sometimes called knowledge representation learning (KRL) or knowledge graph embedding (KGE). Within this rather limited scope, most options also seem oriented to link predicti