Posted: September 11, 2020

CWPK #34: A Python Module, Part II: Packaging and The Structure Extractor

Moving from Notebook to Package Proved Perplexing

This installment of the Cooking with Python and KBpedia series is the second of a three-part mini-series on writing and packaging a formal Python project. The previous installment described a DRY (don’t repeat yourself) approach to how to generalize our annotation extraction routine. This installment describes how to transition that code from Jupyter Notebook interactive code to a formally organized Python package. We also extend our generalized approach to the structure extractor.

In this installment I am working with the notebook and the Spyder IDE in tandem. The notebook is the source of the initial prototype code. It is also the testbed for seeing if the package may be imported and is working properly. We use Spyder for all of the final code development, including moving into functions and classes and organizing by files. We also start to learn some of its IDE features, such as auto-complete, which is a nice way to test questions about namespaces and local and global variables.

As noted in earlier installments, a Python ‘module’ is a single script file (in the form of my_file.py) that itself may contain multiple functions, variable declarations, class (object) definitions, and the like, kept in this single file because of their related functionality. A ‘package’ in Python is a directory with at least one module and (generally) a standard __init__.py file that informs Python a package is available and its name. Python packages and modules are named with lower case. A package name is best when short and without underscores. A module may use underscores to better convey its purpose, such as do_something.py.
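To make the module and package distinction concrete, here is a minimal sketch that builds a throwaway package in a temporary directory and imports a module from it. The names demopkg, do_something and greet are hypothetical and used only for illustration; they are not part of cowpoke.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk (all names here are hypothetical).
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demopkg"                     # a package is a directory
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")       # marks the directory as a package
(pkg_dir / "do_something.py").write_text(      # a module is a single .py file
    "def greet():\n    return 'hello from a module'\n"
)

sys.path.insert(0, str(root))                  # make the parent directory importable
importlib.invalidate_caches()
module = importlib.import_module("demopkg.do_something")
message = module.greet()
print(message)
```

The parent directory of the package, not the package directory itself, is what needs to be on the Python path.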

For our project based on Cooking with Python and KBpedia (CWPK), we will pick up on this acronym and name our project ‘cowpoke’. The functional module we are starting the project with is extract.py, the module for the extraction routines we have been developing over the past few installments.

Perplexing Questions

While it is true the Python organization has some thorough tutorials, referenced in the concluding Additional Documentation, I found it surprisingly difficult to figure out how to move my Jupyter Notebook prototypes to a packaged Python program. I could see that logical modules (single Python scripts, *.py) made sense, and that there were going to be shared functions across those modules. I could also see that I wanted to use a standard set of variable descriptions in order to specify ‘record-like’ inputs to the routines. My hope was to segregate all of the input information required for a new major exercise of cowpoke into the editing of a single file. That would make configuring a new run a simple process.

I read and tried many tutorials trying to figure out an architecture and design for this packaging. I found the tutorials helpful at a broad, structural level of what goes into a package and how to refer and import other parts, but the nuances of where and how to use classes and functions and how to best share some variables and specifications across modules remained opaque to me. Here are some of the questions and answers I needed to discover before I could make progress:

1. Where do I put the files to be seen by the notebook and the project?

After installing Python and setting up the environment noted in installments CWPK #9 to #11, you should have many packages already on your system, including those for Spyder and Jupyter Notebook. There are at least two listings of full packages in different locations. To re-discover what your Python paths are, run this cell:

import sys
print(sys.path)

You want to find the site packages directory under your Python library (mine is C:\1-PythonProjects\Python\lib\site-packages). We will define the ‘cowpoke‘ directory under this parent and also point our Spyder project to it. (NB: Of course, you can locate your package directory anywhere you want, but you would need to add that location to your path as well, and later configuration steps may also require customization.)
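If scanning the sys.path listing by eye is tedious, the standard library can report the site-packages location directly. This is a small sketch using sysconfig; the cowpoke subdirectory is simply appended to show where the package would live, and nothing is created on disk.

```python
import sysconfig
from pathlib import Path

# 'purelib' is the install location for pure-Python packages (site-packages).
site_packages = Path(sysconfig.get_paths()["purelib"])
package_dir = site_packages / "cowpoke"    # where the new package would live

print("site-packages:", site_packages)
print("cowpoke would go in:", package_dir)
```

The path printed will differ by installation and by whether a virtual or conda environment is active.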

2. What is the role of class and defined variables?

I know the major functions I have been prototyping, such as the annotation extractor from the last CWPK #33 installment, need to be formalized as a defined function (the def function_name statement). Going into this packaging, however, it is not clear to me whether I should package multiple function definitions under one class (some tutorials seem to so suggest) or where and how I need to declare variables such as loop that are part of a run configuration.

One advantage of putting both variables and functions under a single class is that they can be handled as a unit. On the other hand, having a separate class of only input variables seems to be the best arrangement for a record orientation (see next question #4). In practice, I chose to embrace both types.

3. What is the role of self and where to introduce or use?

The question of the role of self perplexed me for some time. On the one hand, self is not a reserved keyword in Python, but it is used frequently by convention. Variables defined in a class come in two flavors. One flavor is the class variable, whose value is universal to all instances of the class. Every instance of this class will share the same value for this variable. It is declared simply after first defining the class and outside of any methods:

variable = my_variable

In contrast, instance variables, which is where self is used, are variables with values specific to each instance of the class. The values of one instance typically vary from the values of another instance. Instance variables should be declared within a method, often with this kind of form, as this example from the Additional Documentation shows:

class SomeClass:
    variable_1 = "This is a class variable"
    variable_2 = 100                      # this is also a class variable

    def __init__(self, param1, param2):
        self.instance_var1 = param1       # instance_var1 is an instance variable
        self.instance_var2 = param2       # instance_var2 is an instance variable

In this recipe, we are assigning self by convention to the first parameter of the function (method). We can then access the values of the instance variable as declared in the definition via the self convention, also without the need to pass additional arguments or parameters, making for simpler use and declarations. (NB: You may name this first parameter something other than self, but that is likely confusing since it goes against the convention.)

Importantly, we may use this same approach of assigning self as the first parameter for instance methods, in addition to instance variables. When an instance method is called, Python automatically passes the current instance (self) as the first argument, ahead of any arguments supplied in the call.
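A short sketch may help show both ideas at once: the difference between class and instance variables, and the fact that self is just the instance passed in automatically. The class RunSpec and its attributes are hypothetical names chosen for illustration.

```python
class RunSpec:
    kb_src = "standard"                 # class variable, shared by all instances

    def __init__(self, out_file):
        self.out_file = out_file        # instance variable, one per instance

    def describe(self):                 # instance method; self is the instance
        return f"{self.kb_src} -> {self.out_file}"


a = RunSpec("a.csv")
b = RunSpec("b.csv")

# These two spellings are equivalent: Python fills in self for a.describe().
via_instance = a.describe()
via_class = RunSpec.describe(a)

RunSpec.kb_src = "sandbox"              # changing the class variable affects both
print(a.describe(), "|", b.describe())
```

Each instance keeps its own out_file, while a change to the class variable kb_src is seen by every instance that has not set its own value.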

At any rate, for our interest of being able to pass variable assignments from a separate config.py file to a local extraction routine, the approach using the universal class variable is the right form. But, is it the best form?

4. What is the best practice for initializing a record?

If one stands back and thinks about what we are trying to do with our annotation extraction routine (as with other build or extraction steps), we see that we are trying to set a number of key parameters for what data we use and what branches we take during the routine. These parameters are, in effect, keywords used in the routines, the specific values of which (sources of data, what to loop over, etc.) vary by the specific instance of the extraction or build run we are currently invoking. This set-up sounds very much like a kind of ‘record’ format where we have certain method fields (such as output file or source of the looping data) that vary by run. This is equivalent to a key:value pair. In other words, we can treat our configuration specification as the input to a given run of the annotation extractor as a dictionary (dict) as we discussed in the last installment. The dict form looks to be the best form for our objective. We’ll see this use below.
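As a rough illustration of the 'record' idea, a dictionary can hold default settings and a new run can override only the fields that change. The key names below echo those used in this series, but the values and the default_deck and class_run names are assumptions made for the example.

```python
# Hypothetical defaults; keys mirror the kind of fields described above.
default_deck = {
    "loop": "property_loop",
    "out_file": "prop_annot_out.csv",
}

# A new run overrides only what changes; the defaults stay untouched.
class_run = {**default_deck, "loop": "class_loop", "out_file": "class_annot_out.csv"}

for key, value in class_run.items():
    print(f"{key:10} {value}")
```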

5. What are the special privileges about __main__.py?

Another thing I saw while reading the background tutorials was reference to a more-or-less standard __main__.py file. However, in looking at many of the packages installed in my current Python installation I saw that this construct is by no means universally used, though some packages do use it. Should I be using this format or not?

For two reasons my general desire is to remove this file. The first reason is because this file can be confused with the __main__ module. The second reason is because I could find no real clear guidance about best practices for the file except to keep it simple. That seemed to me thin gruel for keeping something I did not fully understand and found confusing. So, I initially decided not to use this form.

However, I found things broke when I tried to remove it. I assume with greater knowledge or more experience I might find the compelling recipe for simplifying this file away. But, it is easier to keep it and move on rather than get stuck on a question not central to our project.
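For what it is worth, one documented role of __main__.py is that Python executes it when a package is run with the -m switch. The sketch below builds a throwaway package (the name demopkg2 is hypothetical) and runs it that way in a subprocess; it does not reflect how cowpoke itself uses the file.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Throwaway package with a __main__.py (names are hypothetical).
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demopkg2"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "__main__.py").write_text("print('running as', __name__)\n")

# Equivalent to typing: python -m demopkg2   (from inside the root directory)
result = subprocess.run(
    [sys.executable, "-m", "demopkg2"],
    cwd=root, capture_output=True, text=True, check=True,
)
output = result.stdout.strip()
print(output)
```

When the same file is imported from another module instead, its __name__ is the dotted module path rather than '__main__', which is part of why the two are easy to confuse.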

6. What is the best practice for arranging internal imports across a project?

I think one of the reasons I did not see a simple answer to the above question is the fact I have not yet fully understood the relationships between global and local variables and module functions and inheritance, all of which require a sort of grokking, I suppose, of namespaces.

I plan to continue to return to these questions as I learn more with subsequent installments and code development. If I encounter new insights or better ways to do things, my current intent is to return to any prior installments, leave the existing text as is, and then add annotations as to what I learned. If you have not seen any of these notices by now, I guess I have not later discovered better approaches. (Note: I think I began to get a better understanding about namespaces on the return leg of our build ’roundtrip’, roughly about CWPK #40 from now, but I still have questions, even from that later vantage point.)

New File Definitions

As one may imagine, the transition from notebook to module package has resulted in some changes to the code. The first change, of course, was to split the code into the starting pieces, including adding the __init__.py that signals the available cowpoke package. Here is the new file structure:

|-- PythonProject
    |-- Python
        |-- [Anaconda3 distribution]
        |-- Lib
            |-- site-packages               # location to store files
                |-- alot
                |-- cowpoke                 # new project directory
                    |-- __init__.py         # four new files here
                    |-- __main__.py
                    |-- config.py
                    |-- extract.py
                |-- TBA
                |-- TBA

At the top of each file we place our import statements, including references to other modules within the cowpoke project. Here is the statement at the top of __init__.py (which also includes some package identification boilerplate):

from cowpoke.__main__ import *
from cowpoke.config import *
from cowpoke.extract import *

I should note that the asterisk (*) character above tells the system to import all objects within the file, a practice that is generally not encouraged, though it is common. It is discouraged because of the number of objects brought into a current working space, which may pose name conflicts or a burdened system for larger projects. However, since our system is quite small and I do not foresee unmanageable namespace complexity, I use this simpler shorthand.
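One way to keep star imports tidy, should the namespace ever grow, is for a module to declare __all__, which limits what the asterisk brings in. The sketch below fakes a small module in memory so it can run anywhere; demo_config and scratch are hypothetical names, and cowpoke as described here does not necessarily define __all__.

```python
import sys
import types

# Build a small in-memory module (hypothetical name) instead of a file on disk.
demo = types.ModuleType("demo_config")
exec(
    "__all__ = ['extract_deck']\n"
    "extract_deck = {'loop': 'property_loop'}\n"
    "scratch = 'not exported'\n",
    demo.__dict__,
)
sys.modules["demo_config"] = demo

# Run the star import inside a separate namespace so we can inspect the result.
namespace = {}
exec("from demo_config import *", namespace)

print("extract_deck imported:", "extract_deck" in namespace)
print("scratch imported:", "scratch" in namespace)
```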

Our __main__.py contains the standard start-up script that we have recently been using for many installments. You can see this code and the entire file by Running the next cell (assuming you have been following this entire CWPK series and have stored earlier distribution files):

Which environment? The specific load routine you should choose below depends on whether you are using the online MyBinder service (the ‘raw’ version) or local files. The example below is based on using local files (though replace with your own local directory specification). If loading from MyBinder, replace with the lines that are commented (#) out.
with open(r'C:\1-PythonProjects\Python\Lib\site-packages\cowpoke\__main__.py', 'r') as f:
    print(f.read())

(NB: Remember the ‘r’ prefix on the file name string tells Python to treat the string as ‘raw’, so that backslashes are not read as escape characters.)

We move our dictionary definitions to config.py. Go ahead and inspect it in the next cell, but realize much has been added to this file due to subsequent coding steps in our project installments:

with open(r'C:\1-PythonProjects\Python\Lib\site-packages\cowpoke\config.py', 'r') as f:
    print(f.read())

We already had the class and property dictionaries as presented in the CWPK #33 installment. The key change notable for the config.py, which remember is intended for where we enter run specifications for a new run (build or extract) of the code, was to pull out our specifications for the annotation extractor. This new dictionary, the extract_deck, is expanded later to embrace other run parameters for additional functions. At the time of this initial set-up, however, the dictionary contained these relatively few entries:

# This is the dictionary for the specifications of each
# extraction run; what is its run deck.
extract_deck = {
    'property_loop' : '',
    'class_loop'    : '',
    'loop'          : 'property_loop',
    'loop_list'     : prop_dict.values(),
    'out_file'      : 'C:/1-PythonProjects/kbpedia/sandbox/prop_annot_out.csv',
}

These are the values passed to the new annotation extraction function, def annot_extractor, now migrated to the extract.py module. Here is the commented code block (which will not run on its own as a cell):

def annot_extractor(**extract_deck):                       # define the method here, see note
    print('Beginning annotation extraction . . .')
    loop_list = extract_deck.get('loop_list')              # notice we are passing extract_deck values to current vars
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    a_dom = ''
    a_rng = ''
    a_func = ''
    # These are internal counters used in this module's methods
    p_set = ''
    x = 1
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)
        ...                                                # remainder of code as prior installment . . .

Note: Normally, a function definition is followed by its arguments in parentheses. The special notation of the double asterisks (**) signals to expect a variable list of keywords (more often in tutorials shown as ‘**kwargs‘), which is how we make the connection to the values of the keys in the extract_deck dictionary. We retrieve these values based on the .get() method shown in the next assignments. Note, as well, that positional arguments can also be treated in a similar way using the single asterisk (*) notation (‘*args‘).
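The unpacking described in the note can be seen in a stripped-down form. In this sketch, show_run is a hypothetical stand-in for the extractor, and the default_out.csv fallback is an assumption added to show how .get() can supply a default when a key is missing.

```python
def show_run(*args, **extract_deck):
    # Positional arguments arrive as a tuple, keyword arguments as a dict.
    loop = extract_deck.get("loop")
    out_file = extract_deck.get("out_file", "default_out.csv")   # fallback if key is absent
    return args, loop, out_file


deck = {"loop": "property_loop"}

# **deck expands to loop='property_loop' at the call site.
result = show_run("first", "second", **deck)
print(result)
```

Because .get() returns None (or the supplied default) for a missing key, a sparse run deck does not raise a KeyError, though it can hide a misspelled key.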

At the command line or in an interactive notebook, we can run this function with the following call:

import cowpoke
cowpoke.annot_extractor(**cowpoke.extract_deck)

We are not calling it here given that your local config.py is not set up with the proper configuration parameters for this specific example.

These efforts complete our initial set-up on the Python cowpoke package.

Generalizing and Moving the Structure Extractor

You may want to relate the modified code in this section to the last state of our structure extraction routine, shown as the last code cell in CWPK #32.

We took that code, applied the generalization approaches earlier discussed, and added a set.union method for getting the unique list from a very large list of large sets. This approach using sets (that can be hashed) sped up what had been a linear lookup by about 10x. We also moved the general parameters to share the same extract_deck dictionary.
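The difference between the two approaches can be sketched with synthetic data. The overlapping groups of integers below are stand-ins for the descendant sets of several root nodes; they are not KBpedia data, and the timings will vary by machine, so no particular speed-up figure should be read into the output.

```python
import time

# Synthetic stand-ins for the descendant sets of several root nodes.
groups = [set(range(start, start + 1000)) for start in range(0, 3000, 500)]

# Linear approach: membership test against a growing list.
t0 = time.perf_counter()
unique_list = []
for group in groups:
    for item in group:
        if item not in unique_list:        # scans the list each time
            unique_list.append(item)
list_seconds = time.perf_counter() - t0

# Set approach: hashed membership, one union over all groups.
t0 = time.perf_counter()
unique_set = set().union(*groups)
set_seconds = time.perf_counter() - t0

print(f"list: {list_seconds:.4f}s  set: {set_seconds:.4f}s  items: {len(unique_set)}")
```

Both approaches arrive at the same unique members; only the cost of each membership check differs.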

We made the same accommodations for processing properties versus classes (and typologies). We wrapped the resulting code block into a defined function wrapper, similar to what we did for annotations, only now for (is-a) structure:

from owlready2 import * 
from cowpoke.config import *
from cowpoke.__main__ import *
import csv                                                
import types

world = World()

kko = []
kb = []
rc = []
core = []
skos = []
kb_src = master_deck.get('kb_src')                         # we get the build setting from config.py

if kb_src is None:
    kb_src = 'standard'
if kb_src == 'sandbox':
    kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
elif kb_src == 'standard':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
elif kb_src == 'extract':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/build_ins/ontologies/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/ontologies/kko.owl'    
elif kb_src == 'full':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kbpedia_rc_stub.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
else:
    print('You have entered an inaccurate source parameter for the build.')
skos_file = 'http://www.w3.org/2004/02/skos/core' 

kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')               

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

def struct_extractor(**extract_deck):
    print('Beginning structure extraction . . .')
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    x = 1
    cur_list = []
    a_set = []
    s_set = []
#    r_default = ''                                                     # Series of variables needed later
#    r_label = ''                                                       #
#    r_iri = ''                                                         #
#    render = ''                                                        #
    new_class = 'owl:Thing'
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)
        if loop == class_loop:                                             
            header = ['id', 'subClassOf', 'parent']
            p_item = 'rdfs:subClassOf'
        else:
            header = ['id', 'subPropertyOf', 'parent']
            p_item = 'rdfs:subPropertyOf'
        csv_out.writerow(header)       
        for value in loop_list:
            print('   . . . processing', value)                                           
            root = eval(value)
            a_set = root.descendants()                         
            a_set = set(a_set)
            s_set = a_set.union(s_set)
        print('   . . . processing consolidated set.')
        for s_item in s_set:
            o_set = s_item.is_a
            for o_item in o_set:
                row_out = (s_item,p_item,o_item)
                csv_out.writerow(row_out)
                if loop == class_loop:
                    if s_item not in cur_list:                
                        row_out = (s_item,p_item,new_class)
                        csv_out.writerow(row_out)
                        cur_list.append(s_item)
                x = x + 1
    print('Total rows written to file:', x) 

struct_extractor(**extract_deck)

Beginning structure extraction . . .
   . . . processing kko.predicateProperties
   . . . processing kko.predicateDataProperties
   . . . processing kko.representations
   . . . processing consolidated set.
Total rows written to file: 9670

Again, since we cannot guarantee your operating circumstances, you can try this on your own instance with the command:

cowpoke.struct_extractor(**cowpoke.extract_deck)

Note we’re using a prefixed cowpoke function to make the generic dictionary request. All we need to do before the run is to go to the config.py file, and make the value (right-hand side) changes to the extract_deck dictionary. Save the file, make sure your current notebook instance has been cleared, and enter the command above.

There aren’t any commercial-grade checks here to make sure you are not inadvertently overwriting a desired file. Loose code and routines such as what we are developing in this CWPK series warrant making frequent backups, and scrutinizing your config.py assignments before kicking off a run.
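If you want a simple guard against overwriting, one option is to open the output file in exclusive-creation mode, which fails if the file already exists. This is only a sketch: open_new_csv is a hypothetical helper, not part of cowpoke, and the example writes to a temporary directory.

```python
import csv
import tempfile
from pathlib import Path


def open_new_csv(out_file):
    """Open out_file for writing, refusing to replace an existing file."""
    # Mode 'x' raises FileExistsError if the file already exists.
    return open(out_file, mode="x", encoding="utf8", newline="")


target = Path(tempfile.mkdtemp()) / "struct_out.csv"

# First run: the file does not exist yet, so it is created.
with open_new_csv(target) as output:
    csv.writer(output).writerow(["id", "subClassOf", "parent"])

# Second run: the same path is refused instead of being silently replaced.
try:
    open_new_csv(target)
    overwritten = True
except FileExistsError:
    overwritten = False
    print("Refusing to overwrite", target.name)
```

The extractors in this installment use mode='w', which replaces an existing file without warning, so a guard like this would be a change in behavior rather than a drop-in detail.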

Additional Documentation

Here are additional guides resulting from the research in today’s installment:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.
