Posted:November 5, 2020

CWPK #62: Network and Graph Analysis

Knowledge Graphs Deserve Attention in Their Own Right

We first introduced NetworkX in installment CWPK #56 of our Cooking with Python and KBpedia series. The purpose of NetworkX in that installment was to stage data for graph visualizations. In today’s installment, we look at the other side of the NetworkX coin; that is, as a graph analytics capability. We will also discuss NetworkX in relation to staging data for machine learning.

The idea of graphs or networks is at the center of the concept of knowledge graphs. Graphs are unique information artifacts that can be analyzed in their own right as well as being foundations for many unique analytical techniques, including for machine learning and its deep learning subset. Still, graphs as conceptual and mathematical structures are of relatively recent vintage. For example, the field known as graph theory is less than 300 years old. I outlined much of the intellectual history of graphs and their role in analysis in a 2012 article, The Age of the Graph.

Graph or network analysis has three principal aspects. The first aspect is to analyze the nature of the graph itself, with its connections, topologies and paths. The second is to use the structural aspects of the graph representation in order to conduct unique analyses. Some of these analyses relate to community or influence or relatedness. The third aspect is to use various or all aspects of the graph representation of the domain to provide, through dimensionality reduction, tractable and feature-rich methods for analyzing or conducting data science work useful to the domain. We’ll briefly cover the first two aspects in this installment. The remaining installments in this Part VI relate more to the third aspect of graph and deep graph representation.

Initial Setup

We will pick up with our NetworkX work from CWPK #56 to start this installment. (See the concluding sections below if you have not already generated the graph_specs.csv file.)

Since I have been away from the code for a bit, I first decide to make sure my Python packages are up-to-date by running this standard command:

>>>conda update --all

Then, we invoke our standard start-up routine:

from cowpoke.__main__ import *
from cowpoke.config import *
from owlready2 import *

We then want to bring NetworkX into our workspace, along with pandas for data management. The routine we are going to write will read our earlier graph_specs.csv file using pandas. We will use this specification to create a networkx representation of the KBpedia structure, and then begin reporting on some basic graph stats (which will take a few seconds to run):

import networkx as nx
import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
Graphtype = nx.DiGraph()
G = nx.from_pandas_edgelist(df, edge_attr='weight', create_using=Graphtype)
print('Graph construction complete.')

# Print the number of nodes in the graph
print('Number of Nodes:', len(G.nodes()))
#

print('Edges:', G.edges('Mammal'))
#
# Get the subgraph of all nodes around node
sub = [ n[1] for n in G.edges('Mammal') ]
# Actually, need to add the 'marge' node too
sub.append('Mammal')
#
# Now create a new graph, which is the subgraph
sg = nx.Graph(G.subgraph(sub))
#
# Print the nodes of the new subgraph and edges
print('Subgraph nodes:', sg.nodes())
print('Subgraph edges:', sg.edges())
#
#
# Print basic graph info
#info=nx.info(G)
print('Basic graph info:', nx.info(G))

We have picked ‘mammal’ to generate some subgraphs and we also call up basic graph info based on networkx. As a directed graph, KBpedia can be characterized by both ‘in degree’ and ‘out degree’. ‘in degree’ is the number of edges pointing to a given node (or vertex); ‘out degree’ is the opposite. The average across all nodes in KBpedia exceeds 1.3. Both measures are the same because our only edge type in this structure is subClassOf, which is transitive.

Network Metrics and Operations

So we see that our KBpedia graph has loaded properly, and now we are ready to do some basic network analysis. Most of the analysis deals with the relations structure of the graph. NetworkX has a very clean interface to common measures and metrics of graphs, as our examples below demonstrate.

Density‘ is the ratio of actual edges in the network to all possible edges in the network, and ranges from 0 to 1. A ‘dense’ graph is one where the number of edges is close to the maximal number of edges; a ‘sparse’ graph is the opposite. The maximal number of edges is calculated as the potential connections, or nodes X (nodes -1). This potential is multiplied by two for a directed graph, since A → B and B → A are both possible. The density is thus the actual number of connections divided by the potential number. The density of KBpedia is quite sparse.

print('Density:', nx.density(G))

Degree‘ is a measure to find the most important nodes in graph, since a node’s degree is the sum of its edges. You can find the degree for an individual node, or the max ones as these two algorithms indicate:

print('Degree:', nx.degree(G,'Mammal'))

Average clustering‘ is the sum of all node clusterings. A node is clustered if it has a relatively high number of actual links to neighbors in relation to potential links to neighbors. A small-world network is one where the distance between random nodes grows in proportion to the natural log of the number of nodes in the graph. Low average clustering is an indicator of a small-world network.

print('Average clustering:', nx.average_clustering(G))

G_node = 'Mammal'
print('Clustering for node', G_node, ':', nx.clustering(G, G_node))

Path length‘ is calculated as the number of hop jumps traversing two end nodes is a network. An ‘average path length‘ measures shortest paths over a graph and then averages them. A small number indicates a shorter, more easily navigated graph on average, but there can be much variance.

print('Average shortest path length:', nx.average_shortest_path_length(G))

The next three measures throw an error, since KBpedia ‘is not strongly connected.’ ‘Eccentricity‘ is the maximum length between a node and its connecting nodes in a graph, with the ‘diameter‘ being the maximum eccentricity across all nodes and the ‘radius‘ being the minimum.

print('Eccentricity:', nx.eccentricity(G))
print('Diameter:', nx.diameter(G))
print('Radius:', nx.radius(G))

The algorithms that follow take longer to calculate or produce long listings. The first such measure we see is ‘centrality‘, which in NetworkX is the number of connections to a given node, with higher connectivity a proxy for importance. Centrality can be measured in many different ways; there are multiple options in NetworkX.

# Calculate different centrality measures
print('Centrality:', nx.degree_centrality(G))
print('Centrality (eigenvector):', nx.eigenvector_centrality(G))
print('In-degree centrality:', nx.in_degree_centrality(G))
print('Out-degree centrality:', nx.out_degree_centrality(G))

Here are some longer analysis routines (unfortunately, betweenness takes hours to calculate):

# Calculate different centrality measures
print('Betweenness:', nx.betweenness_centrality(G))

As a directed graph, some NetworkX measures are not applicable. Here are some of them:

  • nx.is_connected(G)
  • nx.connected_components(G).

Subgraphs

We earlier showed code for extracting a subgraph. Here is a generalized version of that function. Replace the ‘Bird’ reference concept with any other valid RC from KBpedia:

# Provide label for current KBpedia reference concept
rc = 'Bird'
# Get the subgraph of all nodes around node
sub = [ n[1] for n in G.edges(rc) ]
# Actually, need to add the 'rc' node too
sub.append(rc)
#
# Now create a new graph, which is the subgraph
sg = nx.Graph(G.subgraph(sub))
#
# Print the nodes of the new subgraph and edges
print('Subgraph nodes:', sg.nodes())
print('Subgraph edges:', sg.edges())

DeepGraphs

There is a notable utility package called DeepGraphs (and its documentation) that appears to offer some nice partitioning and quick visualization options. I have not installed or tested it.

Full Network Exchange

So far, we have seen the use of networks in driving visualizations (CWPK #56) and, per above, as knowledge artifacts with their own unique characteristics and metrics. The next role we need to highlight for networks is as information providers and graph-based representations of structure and features to analytical applications and machine learners.

NetworkX can convert to and from other data formats:

All of these are attractive because PyTorch has direct routines for them.

NetworkX can also read and write graphs in multiple formats, some of which include:

There are also standard NetworkX functions to convert node and edge labels to integers (such as networkx.relabel.convert_node_labels_to_integers), relabel nodes (networkx.relabel.relabel_nodes), set node attributes (networkx.classes.function.set_node_attributes), or make deep copies (networkx.Graph.to_directed).

There are also certain packages that integrate well with NetworkX and PyTorch and related packages such as direct imports or exports to the Deep Graph Library (DGL) (see CWPK #68 and #69), or built-in converters or the DeepSNAP package may provide a direct bridge between NetworkX and PyTorch Geometric (PyG) (see CWPK #68 and #70).

However, these representations do NOT include the labeled information or annotations. Knowledge graphs, like KBpedia, have some unique aspects that are not fully captured by an existing package like NetworkX.

Fortunately, the previous extract-and-build routines at the heart of this Cooking with Python and KBpedia series are based around CSV files, the same basis as the pandas package. Via pandas we can capture the structure of KBpedia, plus its labels and annotations. Further, as we will see in the next installment, we can also capture full pages for most of these RCs in KBpedia from Wikipedia. This addition will greatly expand our context and feature basis for using KBpedia for machine learning.

For now, I present below two of these three inputs, extracted directly from the KBpedia knowledge graph.

KBpedia Structure

The first of two extraction files useful to all further installments in this Part VI provides the structure of KBpedia. This structure consists of the hierarchical relation between reference concepts using the subClassOf subsumption relation and the assignment of that RC to a typology (SuperType). I first presented this routine in CWPK #56 and it, indeed, captures the requisite structure of the graph:

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###             
# 'kb_src'        : 'standard'                                        # Set in master_deck
# 'loop_list'     : kko_order_dict.values(),                          # Note 1   
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/mappings/',              
# 'ext'           : '.csv',                                         
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv',

def graph_extractor(**extract_deck):
    print('Beginning graph structure extraction . . .')
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    class_loop = extract_deck.get('class_loop')
    base = extract_deck.get('base')
    ext = extract_deck.get('ext')
    
    # Note 2
    parent_set = ['kko.SocialSystems','kko.Products','kko.Methodeutic','kko.Eukaryotes',
              'kko.ConceptualSystems','kko.AVInfo','kko.Systems','kko.Places',
              'kko.OrganicChemistry','kko.MediativeRelations','kko.LivingThings',
              'kko.Information','kko.CopulativeRelations','kko.Artifacts','kko.Agents',
              'kko.TimeTypes','kko.Symbolic','kko.SpaceTypes','kko.RepresentationTypes',
              'kko.RelationTypes','kko.OrganicMatter','kko.NaturalMatter',
              'kko.AttributeTypes','kko.Predications','kko.Manifestations',
              'kko.Constituents']

    if loop is not 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    header = ['target', 'source', 'weight', 'SuperType']
    out_file = extract_deck.get('out_file')
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:                                           
        csv_out = csv.writer(output)
        csv_out.writerow(header)    
        for value in loop_list:
            print('   . . . processing', value)
            s_set = []
            root = eval(value)
            s_set = root.descendants()
            frag = value.replace('kko.','')
            for s_item in s_set:
                child_set = list(s_item.subclasses())
                count = len(list(child_set))
                
# Note 3                
                if value not in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        cur_list.append(new_pair)
                        s_rc = s_rc.replace('rc.','')
                        child = child.replace('rc.','')
                        row_out = (s_rc,child,count,frag)
                        csv_out.writerow(row_out)
                elif value in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        if new_pair not in cur_list:
                            cur_list.append(new_pair)
                            s_rc = s_rc.replace('rc.','')
                            child = child.replace('rc.','')
                            row_out = (s_rc,child,count,frag)
                            csv_out.writerow(row_out)
                        elif new_pair in cur_list:
                            continue
        output.close()         
        print('Processing is complete . . .')
graph_extractor(**extract_deck)

Note, again, the parent_set ordering of typology processing at the top of this function. This ordering processes the more distal (leaf) typologies first, and then ignores subsequent processing of identical structural relationships. This means that the graph structure is cleaner and all subsumption relations are “pushed down” to their most specific mention.

You can inspect the actual structure file produced using this routine, which is also the general basis for reading into various machine learners:

import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')

df

KBpedia Annotations

And, we also need to bring in the annotation values. The annotation extraction routine was first presented and described in CWPK #33, and was subsequently generalized and brought into conformance with our configuration routines in CWPK #33. Note, for example, in the header definition, how we are able to handle either classes or properties. In this instance, plus all subsequent machine learning discussion, we concentrate on the labels and annotations for classes:

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###                
# 'krb_src'       : 'extract'                                          # Set in master_deck
# 'descent_type'  : 'descent',
# 'loop'          : 'class_loop',
# 'loop_list'     : custom_dict.values(),                              # Single 'Generals' specified 
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv',
# 'render'        : 'r_label',

def annot_extractor(**extract_deck):
    print('Beginning annotation extraction . . .') 
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return    
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    descent_type = extract_deck.get('descent_type')
    """ These are internal counters used in this module's methods """
    p_set = []
    a_ser = []
    x = 1
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)                                       
        if loop == 'class_loop':                                             
            header = ['id', 'prefLabel', 'subClassOf', 'altLabel', 
                      'definition', 'editorialNote', 'isDefinedBy', 'superClassOf']
        else:
            header = ['id', 'prefLabel', 'subPropertyOf', 'domain', 'range', 
                      'functional', 'altLabel', 'definition', 'editorialNote']
        csv_out.writerow(header)    
        for value in loop_list:                                            
            print('   . . . processing', value)                                           
            root = eval(value) 
            if descent_type == 'descent':
                p_set = root.descendants()
            elif descent_type == 'single':
                a_set = root
                p_set.append(a_set)
            else:
                print('You have assigned an incorrect descent method--execution stopping.')
                return    
            for p_item in p_set:
                if p_item not in cur_list:                                 
                    a_pref = p_item.prefLabel
                    a_pref = str(a_pref)[1:-1].strip('"\'')                
                    a_sub = p_item.is_a
                    for a_id, a in enumerate(a_sub):                        
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_sub + '||' + str(a)
                        a_sub  = a_item
                    if loop == 'property_loop':   
                        a_item = ''
                        a_dom = p_item.domain
                        for a_id, a in enumerate(a_dom):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_dom + '||' + str(a)
                            a_dom  = a_item    
                        a_dom = a_item
                        a_rng = p_item.range
                        a_rng = str(a_rng)[1:-1]
                        a_func = ''
                    a_item = ''
                    a_alt = p_item.altLabel
                    for a_id, a in enumerate(a_alt):
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_alt + '||' + str(a)
                        a_alt  = a_item    
                    a_alt = a_item
                    a_def = p_item.definition
                    a_def = str(a_def)[2:-2]
                    a_note = p_item.editorialNote
                    a_note = str(a_note)[1:-1]
                    if loop == 'class_loop':                                  
                        a_isby = p_item.isDefinedBy
                        a_isby = str(a_isby)[2:-2]
                        a_isby = a_isby + '/'
                        a_item = ''
                        a_super = p_item.superClassOf
                        for a_id, a in enumerate(a_super):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_super + '||' + str(a)
                            a_super = a_item    
                        a_super  = a_item
                    if loop == 'class_loop':                                  
                        row_out = (p_item,a_pref,a_sub,a_alt,a_def,a_note,a_isby,a_super)
                    else:
                        row_out = (p_item,a_pref,a_sub,a_dom,a_rng,a_func,
                                   a_alt,a_def,a_note)
                    csv_out.writerow(row_out)                               
                    cur_list.append(p_item)
                    x = x + 1
    print('Total unique IDs written to file:', x)  
    print('The annotation extraction for the', loop, 'is completed.') 

You can inspect this actual file of labels and annotations using this routine:

import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv')

df

We will add Wikipedia pages as a third source for informing our machine learning tests and experiments in our next installment.

Untested Potentials

One area in extended NetworkX capabilities that we do not test here is community structure using the Louvain Community Detection package.

Additional Documentation

Here are additional resources on network analysis and NetworkX:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Schema.org Markup

headline:
CWPK #62: Network and Graph Analysis

alternativeHeadline:
Knowledge Graphs Deserve Attention in Their Own Right

author:

image:
https://www.mkbergman.com/wp-content/uploads/2020/07/cooking-with-kbpedia-785.png

description:
In today's CWPK installment, we look at the Python NetworkX package as a graph analytics capability and calculate many standard network (graph) metrics. We also discuss NetworkX as a basis for staging data for machine learning.

articleBody:
see above

datePublished:

Leave a Reply

Your email address will not be published.