Posted: September 14, 2020

CWPK #35: A Python Module, Part III: Custom Extractions and Finish Packaging

Completing the Extraction Methods and Formal Packaging

This last installment in our mini-series on packaging our Cooking with Python and KBpedia project adds one more flexible extraction routine and completes the packaging of the cowpoke extraction module. This completion sets the template for how we will add further clusters of functionality as we progress through the CWPK series.

The new extraction routine is geared more to later analysis and use of KBpedia than our current extraction routines. The routines we have developed so far are intended for nearly complete bulk extractions of KBpedia. We would like to add more fine-grained ways to extract various ‘slices’ from KBpedia, especially those that may arise from individual typology changes or SPARQL queries. We could benefit from having an intermediate specification for extractions that could be used for directing results from other functions or analyses.

Steps to Package a Project so Far

To summarize, here are the major steps we have discovered so far to transition from prototype code in the notebook to a Python package:

  1. Generalize the prototype routines in the notebook
  2. Find the appropriate .. Lib/site-packages directory and create a new directory there with your package name (short, lowercase names are best)
  3. Create an __init__.py file in that same directory (and any subsequent sub-package directories should you define the project deeper); add basic statements similar to above
  4. Create a __main__.py file for your shared functions; it is a toss-up whether shared variables should also go here or in __init__.py
  5. If you want a single point of editing for changing inputs for a given run via a ‘record’ metaphor, follow something akin to what I began with the config.py
  6. Create a my_new_functions.py module where the bulk of your new work resides (extraction in our current instance). The approach I found helpful was to NOT wrap my prototype functions in a function definition at first. Only in the next step, once the interpreter is able to step through the new functions without errors, do I then wrap the routines in definitions
  7. In an interactive environment (such as Jupyter Notebook), start with a clean kernel and try to import myproject (where ‘myproject’ is the current bundle of functionality you are working on). Remember, an import will run the scripts as encountered, and if you have points of failure due to undefined variables or whatever, the traceback on the interpreter will tell you what the problem is and offer some brief diagnostics. Repeat until the new function code processes without error, then wrap in a definition, and move on to the next code block
  8. As defined functions are built, try to look for generalities of function and specification and desired inputs and outputs. These provide the grist for continued re-factoring of your code
  9. Document your code, which is only just beginning for this KBpedia project. I’ve been documenting much in the notebook pages, but not yet enough in the code itself.
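As a rough illustration of steps 2 through 7, the sketch below builds a throwaway package in a temporary directory (rather than in site-packages) and then imports it. The package name myproject follows the placeholder used above; the helper build_skeleton, the run_settings dictionary, and the hello function are hypothetical stand-ins, not part of cowpoke.

```python
import importlib
import sys
import tempfile
from pathlib import Path

def build_skeleton(parent, name='myproject'):
    """Create a minimal package directory (hypothetical layout) and return its path."""
    pkg = Path(parent) / name
    pkg.mkdir()
    # __init__.py marks the directory as a package and re-exports the pieces
    (pkg / '__init__.py').write_text(
        'from .config import *\nfrom .my_new_functions import *\n')
    # __main__.py is where shared start-up code would go
    (pkg / '__main__.py').write_text('print("shared start-up code goes here")\n')
    # config.py holds the single 'record' of settings for a run
    (pkg / 'config.py').write_text("run_settings = {'render': 'r_label'}\n")
    # the module where the bulk of the new work resides
    (pkg / 'my_new_functions.py').write_text(
        'def hello(**deck):\n    return deck.get("render")\n')
    return pkg

tmp = tempfile.mkdtemp()
build_skeleton(tmp)
sys.path.insert(0, tmp)                      # make the temporary package importable
myproject = importlib.import_module('myproject')
result = myproject.hello(**myproject.run_settings)
print(result)
```

If the import fails, the traceback points at the offending file, which is the same feedback loop described in step 7.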

Objectives for This Installment

Here is what I want to accomplish and wrap up in this installment focusing on custom extractions. I want to:

  • Arbitrarily select one to many classes or properties for driving an extraction
  • Define one or multiple starting points for descendants() or one or multiple individual starting points. This entry point is provided by the variable root in our existing extraction routines
  • Allow the extract_deck specification to also define the rendering method (see next)
  • For iterations, specify input and output file forms, which requires an iterator, base + extension logic for file names, and a way to relate to existing annotations, as necessary
  • Reuse and build from the existing extraction routines, and
  • Improve the file-based and -named orientation of the routines.

The Render Function

You may recall from the tip in CWPK #29 that owlready2 comes with three rendering methods for its results: 1) a default method that has a short namespace prefix prepended to all classes and properties; 2) a label method where no prefixes are provided; and 3) a full iri method where all three components of the subject-predicate-object (s-p-o) semantic triple are given their complete IRI. Here are those three function calls:

set_render_func(default_render_func)

set_render_func(render_using_label)

set_render_func(render_using_iri)

To provide these choices, we will add a render method specification in the extract_deck and provide the switch at the top of our extraction routines. We will use keywords based on those same names ('r_default', 'r_label', and 'r_iri') to specify which of the three rendering options has been chosen.
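One way to picture this switch is as a small lookup table keyed by the render keyword. The sketch below is only an illustration under stated assumptions: the three render functions are simplified stand-ins (not the owlready2 functions), the example.org IRI is made up, and pick_renderer is a hypothetical helper. Only the 'render' key and the three keywords come from the extraction routines in this installment.

```python
# Stand-in render functions; the real ones operate on owlready2 entities
def default_render(name):
    return 'kko.' + name                      # short prefix plus name

def label_render(name):
    return name                               # no prefix

def iri_render(name):
    return 'http://example.org/kko#' + name   # made-up full IRI

RENDERERS = {
    'r_default': default_render,
    'r_label': label_render,
    'r_iri': iri_render,
}

def pick_renderer(**extract_deck):
    """Return the render function named in the deck, or None if the keyword is wrong."""
    render = extract_deck.get('render')
    func = RENDERERS.get(render)
    if func is None:
        print('You have assigned an incorrect render method--execution stopping.')
    return func

chosen = pick_renderer(render='r_iri')
print(chosen('Animals'))
```

A dictionary lookup and an if/elif chain behave the same here; the chain used in the extraction routines is arguably easier to read with only three options.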

A Custom Extractor

The basic realization is that, with just a few additions, we can customize our existing extraction routines. Initially, I thought I would need to write entirely new routines. Fortunately, our existing routines are already sufficiently general to enable this customization.

Since it is a bit simpler, we will use the struct_extractor function to show where the customization enhancements need to go. We will also provide the code snippet where the insertion is noted. I provide comments on these additions below the code listing.

Note these same changes are applied to the annot_extractor function as well (not shown). You can inspect the updated extraction module at the conclusion of this installment.

OK, here is the updated struct_extractor, with numbered comments marking where each customization goes:

def struct_extractor(**extract_deck):
    print('Beginning structure extraction . . .')
# 1 - render method goes here    
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return
# 2 - note about custom extractions
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    descent_type = extract_deck.get('descent_type')
    descent = extract_deck.get('descent') 
    single = extract_deck.get('single') 
    x = 1
    cur_list = []
    a_set = []
    s_set = []
    new_class = 'owl:Thing'
# 5 - what gets passed to 'output'
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)
        if loop == 'class_loop':                                             
            header = ['id', 'subClassOf', 'parent']
            p_item = 'rdfs:subClassOf'
        else:
            header = ['id', 'subPropertyOf', 'parent']
            p_item = 'rdfs:subPropertyOf'
        csv_out.writerow(header)       
# 3 - what gets passed to 'loop_list' 
        for value in loop_list:
            print('   . . . processing', value)                                           
            root = eval(value)
# 4 - descendant or single here
            if descent_type == 'descent':
                a_set = root.descendants()
                a_set = set(a_set)
                s_set = a_set.union(s_set)
            elif descent_type == 'single':
                a_set = root
                s_set.append(a_set)
            else:
                print('You have assigned an incorrect descent method--execution stopping.')
                return                         
        print('   . . . processing consolidated set.')
        for s_item in s_set:
            o_set = s_item.is_a
            for o_item in o_set:
                row_out = (s_item,p_item,o_item)
                csv_out.writerow(row_out)
                if loop == 'class_loop':
                    if s_item not in cur_list:                
                        row_out = (s_item,p_item,new_class)
                        csv_out.writerow(row_out)
                cur_list.append(s_item)
                x = x + 1
    print('Total unique IDs written to file:', len(set(cur_list)))

The notes that follow pertain to the code listing above.

The render method (#1) is just a simple switch set in the configuration file. Only three keyword options are allowed; if a wrong keyword is entered, the error is flagged and the routine ends. We also added a ‘render’ assignment at the top of the code block.

What now makes this routine (#2) a custom one is the use of the configurable custom_dict and its configuration settings. The custom_dict dictionary is specified by assigning to the loop_list (#3). The custom_dict dictionary can take one or many key:value pairs. The first item, the key, should take the name that you wish to use as the internal variable name. The second item, the value, should correspond to the property or class with its namespace prefix. Here are the general rules and options available for a custom extraction:

  • You may enter properties OR classes into the custom_dict dictionary, but not both, in your pre-run configurations
  • The ‘iri’ switch for the renderer is best suited for the struct_extractor function. It should probably not be used for annotations given the large number of output columns and loss of subsequent readability when using the full IRI. The choice of actual prefix is likely not that important since it is easy to do global search-and-replaces when in bulk mode
  • You may retrieve items in the custom_dict dictionary either singly or all of its descendants, depending on the use of the ‘single’ and ‘descent’ keyword options (see #4 next).
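To make these rules concrete, here is a sketch of what a custom run configuration might look like, together with a small pre-run check. The key names ('render', 'loop', 'loop_list', 'descent_type', 'out_file') are the ones read by struct_extractor above; the dictionary entries, the output file name, the assumption that 'property_loop' is the alternative value for 'loop', and the check_deck helper are all hypothetical.

```python
# Hypothetical custom_dict: internal name -> class with its namespace prefix
custom_dict = {
    'animals': 'kko.Animals',
    'plants': 'kko.Plants',
}

# Hypothetical run settings for a custom structure extraction
extract_deck = {
    'render': 'r_label',
    'loop': 'class_loop',
    'loop_list': custom_dict.values(),
    'descent_type': 'descent',
    'out_file': 'custom_extract.csv',   # made-up file name
}

def check_deck(deck):
    """Return a list of problems found in the deck (an empty list means it looks usable)."""
    problems = []
    if deck.get('render') not in ('r_default', 'r_label', 'r_iri'):
        problems.append('render must be r_default, r_label, or r_iri')
    if deck.get('loop') not in ('class_loop', 'property_loop'):
        problems.append('loop must be class_loop or property_loop')
    if deck.get('descent_type') not in ('descent', 'single'):
        problems.append('descent_type must be descent or single')
    if not list(deck.get('loop_list') or []):
        problems.append('loop_list must hold at least one entry')
    return problems

print(check_deck(extract_deck))
```

A check like this only catches misspelled keywords; it cannot tell whether classes and properties have been mixed in custom_dict, which still needs care when configuring a run.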

Item #4 is another switch to either run the entries in the custom_dict dictionary as single ‘roots’ (thus no sub-classes or sub-properties) or with all descendants. The descent_type has been added to the extract_deck settings, plus we added the related assignments to the beginning of this code block.
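The difference between the two settings is easy to see on a toy hierarchy. In the sketch below, Node is a hypothetical stand-in for an ontology class (it is not the owlready2 API), its descendants() method is assumed to include the starting node itself, and collect is an illustrative helper that mirrors the switch at #4.

```python
class Node:
    """Toy stand-in for an ontology class with parent/child links."""
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def descendants(self):
        """Return this node plus everything below it, as a set."""
        found = {self}
        for child in self.children:
            found |= child.descendants()
        return found

def collect(roots, descent_type):
    """Gather the items to extract, either roots only or roots with all descendants."""
    s_set = set()
    for root in roots:
        if descent_type == 'descent':
            s_set |= root.descendants()
        elif descent_type == 'single':
            s_set.add(root)
        else:
            raise ValueError('incorrect descent method: %r' % descent_type)
    return s_set

animals = Node('Animals')
mammals = Node('Mammals', animals)
dogs = Node('Dogs', mammals)
plants = Node('Plants')

with_descent = sorted(n.name for n in collect([animals, plants], 'descent'))
singles = sorted(n.name for n in collect([animals, plants], 'single'))
print(with_descent)
print(singles)
```

Using a set for both branches also removes duplicates when two starting points overlap, which is the purpose of the consolidated set in the listing above.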

The last generalized capability we wanted to capture was the ability to print out all of the structural aspects of KBpedia’s typologies, which suggested some code changes at roughly #5 above. While I am sure I could have figured out a way to do this, because of interactions with the other customizations this addition proved to be more complicated than warranted. So, rather than spend undue time trying to cram everything into a single, generic function (struct_extractor), I decided the easier and quicker choice was to create its own function, picking up on many of the processing constructs developed for the other extractor routines.

Basically, what we want in a typology extract is:

  • Separate extractions of individual typologies to their own named files
  • Removal of the need to find unique resources across multiple typologies. Rather, the intent is to capture the full scope of structural (subClassOf) aspects in each typology
  • A design that enables us to load a typology as an individual ontology or knowledge graph into a tool such as Protégé.

By focusing on a special extractor limited to classes, typologies, structure, and single output files per typology, we were able to make the function rather quickly and simply. Here is the result, the typol_extractor:

def typol_extractor(**extract_deck):
    print('Beginning structure extraction . . .')
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    class_loop = extract_deck.get('class_loop')
    base = extract_deck.get('base')
    ext = extract_deck.get('ext')
    new_class = 'owl:Thing'
    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    header = ['id', 'subClassOf', 'parent']
    p_item = 'rdfs:subClassOf'
    for value in loop_list:
        print('   . . . processing', value)
        x = 1
        s_set = []
        cur_list = []
        root = eval(value)
        s_set = root.descendants()
        frag = value.replace('kko.','')
        out_file = (base + frag + ext)
        with open(out_file, mode='w', encoding='utf8', newline='') as output:                                           
            csv_out = csv.writer(output)
            csv_out.writerow(header)       
            for s_item in s_set:
                o_set = s_item.is_a
                for o_item in o_set:
                    row_out = (s_item,p_item,o_item)
                    csv_out.writerow(row_out)
                    if s_item not in cur_list:                
                        row_out = (s_item,p_item,new_class)
                        csv_out.writerow(row_out)
                cur_list.append(s_item)
                x = x + 1
        print('Total unique IDs written to file:', len(set(cur_list)))

Two absolute essentials for this routine are to set the 'loop' key to 'class_loop' and to set the 'loop_list' key to typol_dict.values().

Note the code in the middle of the routine that creates the file name after replacing (removing) the ‘kko.’ prefix from the value name in the dictionary. We also needed to add two further entries to the extract_deck dictionary.
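The file naming step can be tried on its own. In this sketch the function name typology_file_name, the 'typol_' base, and the '.csv' extension are illustrative assumptions; only the replace-and-concatenate logic and the 'base' and 'ext' keys come from the routine above.

```python
def typology_file_name(value, base, ext):
    """Build an output file name from a prefixed typology name such as 'kko.Animals'."""
    frag = value.replace('kko.', '')    # drop the namespace prefix
    return base + frag + ext

# Hypothetical settings that would live in the extract_deck
deck = {'base': 'typol_', 'ext': '.csv'}

typol_values = ['kko.Animals', 'kko.Plants']
names = [typology_file_name(v, deck['base'], deck['ext']) for v in typol_values]
print(names)    # ['typol_Animals.csv', 'typol_Plants.csv']
```

In practice the base would normally include the directory path where the typology files should be written.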

Your local file structure is likely different from what we set up for this project but, should it be similar, the following commands can be used to run these routines. Should you test different possibilities, make sure your input specifications in the extract_deck are modified appropriately. Remember to always work from copies so that you may restore critical files in the case of an inadvertent overwrite.

Here are the commands:

from cowpoke.__main__ import *
from cowpoke.config import *
import cowpoke
import owlready2

cowpoke.typol_extractor(**cowpoke.extract_deck)

The extract.py File

Again, assuming you have set up your files and directories similar to what we have suggested, you can inspect the resulting extractor code in this new module (modify the path as necessary):

with open(r'C:\1-PythonProjects\Python\Lib\site-packages\cowpoke\extract.py', 'r') as f:
    print(f.read())

Summary of the Module

OK, so we are now done with the development and packaging of the extractor module for cowpoke. Our efforts resulted in the addition of four files under the ‘cowpoke’ directory. These files are:

  • The __init__.py file that indicates the cowpoke package
  • The __main__.py file where shared start-up functions reside
  • The config.py file where we store our dictionaries and where we specify new run settings in the special extract_deck dictionary, and
  • The extract.py module where all of our extraction routines are housed.

This module is supported by three dictionaries (and the fourth special one for the run configurations):

  • The typol_dict dictionary of typologies
  • The prop_dict dictionary of top-level property roots
  • The custom_dict dictionary for tailored starting point extractions, and
  • The extract_deck special dictionary for extraction run settings.

In turn, most of these dictionaries can also be matched with three different extractor routines or functions:

  • The annot_extractor function for extracting annotations
  • The struct_extractor function for extracting the is-a relations in KBpedia, and
  • The typol_extractor dedicated function for extracting out the individual typologies into individual files.

In our next CWPK installment we will discuss how we might manipulate this extracted information in a bulk manner using spreadsheets and other tools. These same extracted files, perhaps after bulk manipulations or other edits and changes, will then form the basis for the input files that we will use to build new versions of KBpedia (or your own extensions and changes to it) from scratch. We are now half-way around our roundtrip.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file and as a direct download to use locally. Make sure to pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

2 thoughts on “CWPK #35: A Python Module, Part III: Custom Extractions and Finish Packaging”

  1. I think there was a mistake in the cowpoke typology extractor code. In particular, should run_deck.get(‘render’) be actually extractor_deck.get(‘render’)?

  2. Hi Varun,

    Yes, good catch, you are correct. My first version lumped both build and extract configurations under the ‘run_deck’ function, but I found it made sense to split them. I have updated the post and the *.ipynb file.

    Thanks, Mike