Completing the Extraction Methods and Formal Packaging
This last installment in our mini-series on packaging our Cooking with Python and KBpedia project adds one more flexible extraction routine and completes the packaging of the cowpoke extraction module. This completion will set the template for how we will add additional clusters of functionality as we progress through the CWPK series.
The new extraction routine is geared more to later analysis and use of KBpedia than our current extraction routines. The routines we have developed so far are intended for nearly complete bulk extractions of KBpedia. We would like to add more fine-grained ways to extract various ‘slices’ from KBpedia, especially those that may arise from individual typology changes or SPARQL queries. We could benefit from having an intermediate specification for extractions that could be used for directing results from other functions or analyses.
Steps to Package a Project so Far
To summarize, here are the major steps we have discovered so far to transition from prototype code in the notebook to a Python package:
- Generalize the prototype routines in the notebook
- Find the appropriate .. Lib/site-packages directory and define a new directory with your package name, short lowercase best
- Create an __init__.py file in that same directory (and any subsequent sub-package directories should you define the project deeper); add basic statements similar to above
- Create a __main__.py file for your shared functions; it is a toss-up whether shared variables should also go here or in
- If you want a single point of editing for changing inputs for a given run via a ‘record’ metaphor, follow something akin to what I began with the
- Create a my_new_functions.py module where the bulk of your new work resides (extraction in our current instance). The approach I found helpful was to NOT wrap my prototype functions in a function definition at first. Only in the next step, once the interpreter is able to step through the new functions without errors, do I then wrap the routines in definitions
- In an interactive environment (such as Jupyter Notebook), start with a clean kernel and try to import myproject (where ‘myproject’ is the current bundle of functionality you are working on). Remember, an import will run the scripts as encountered, and if you have points of failure due to undefined variables or whatever, the traceback on the interpreter will tell you what the problem is and offer some brief diagnostics. Repeat until the new function code processes without error, then wrap in a definition, and move on to the next code block
- As defined functions are built, try to look for generalities of function and specification and desired inputs and outputs. These provide the grist for continued re-factoring of your code
- Document your code, which is only just beginning for this KBpedia project. I’ve been documenting much in the notebook pages, but not yet enough in the code itself.
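As a rough illustration of the steps above, here is a minimal sketch that builds a throwaway package in a temporary directory and imports it from a clean state. The package name myproject_demo, the run_deck dictionary, and the describe function are all hypothetical stand-ins, and the temporary directory merely plays the role of your Lib/site-packages directory; none of this is the actual cowpoke code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

PKG = "myproject_demo"  # hypothetical package name: short and lowercase

# File contents for the skeleton; every name inside them is illustrative only
FILES = {
    "__init__.py": "# marks this directory as a package\n",
    "__main__.py": "def shared_greeting():\n    return 'shared start-up code'\n",
    "config.py": "run_deck = {'render': 'r_label'}\n",
    "my_new_functions.py": (
        f"from {PKG}.config import run_deck\n\n"
        "def describe(**deck):\n"
        "    return 'render=' + deck.get('render', '')\n"
    ),
}


def make_skeleton(parent):
    """Create the package directory and write the four starter files."""
    pkg_dir = Path(parent) / PKG
    pkg_dir.mkdir()
    for file_name, text in FILES.items():
        (pkg_dir / file_name).write_text(text, encoding="utf8")
    return pkg_dir


# The temporary directory stands in for Lib/site-packages
site_dir = tempfile.mkdtemp()
make_skeleton(site_dir)
sys.path.insert(0, site_dir)
importlib.invalidate_caches()

# Importing runs the module top to bottom, so any undefined name fails here
module = importlib.import_module(f"{PKG}.my_new_functions")
summary = module.describe(**module.run_deck)
print(summary)
```

If the import raises, the traceback points at the offending line in the new module, which is the repeat-until-clean loop described in the list.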
Objectives for This Installment
Here is what I want to accomplish and wrap up in this installment focusing on custom extractions. I want to:
- Arbitrarily select one to many classes or properties for driving an extraction
- Define one or multiple starting points for descendants() or one or multiple individual starting points. This entry point is provided by the variable root in our existing extraction routines
- Allow the extract_deck specification to also define the rendering method (see next)
- For iterations, specify the input and output file forms, which means we need an iterator, base + extension logic for file names, a way to relate to the existing annotations, etc., as necessary
- This suggests reusing and building on the existing extraction routines, and
- Improve the file-based and -named orientation of the routines.
The Render Function
You may recall from the tip in CWPK #29 that owlready2 comes with three rendering methods for its results: 1) a default method that prepends a short namespace prefix to all classes and properties; 2) a label method where no prefixes are provided; and 3) a full iri method where all three components of the subject-predicate-object (s-p-o) semantic triple are given their complete IRI. If you recall, here are those three function calls:
set_render_func(default_render_func)
set_render_func(render_using_label)
set_render_func(render_using_iri)
To provide these choices, we will add a render method specification in the extract_deck and provide the switch at the top of our extraction routines. We will use those same three names, as the keywords r_default, r_label, and r_iri, to specify which of the three rendering options has been chosen.
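The switch can also be sketched as a dictionary lookup rather than an if/elif chain. The version below is only an illustration under stated assumptions: the three render functions are toy stand-ins written here so the sketch runs without owlready2 (they are not the library's implementations), the example.org IRI is made up, and choose_renderer is a hypothetical helper name.

```python
# Toy stand-ins for the three owlready2 render functions (NOT the real ones)
def default_render_func(name):
    return 'kko.' + name                      # short namespace prefix


def render_using_label(name):
    return name                               # no prefix


def render_using_iri(name):
    return 'http://example.org/kko#' + name   # made-up IRI for illustration


# Keyword in the extract_deck -> render function
RENDERERS = {
    'r_default': default_render_func,
    'r_label': render_using_label,
    'r_iri': render_using_iri,
}


def choose_renderer(**extract_deck):
    """Return the render function for the deck, or None for a bad keyword."""
    render = extract_deck.get('render')
    func = RENDERERS.get(render)
    if func is None:
        print('You have assigned an incorrect render method--execution stopping.')
    return func


renderer = choose_renderer(render='r_label')
print(renderer('Mammal'))
```

In the real routines the chosen function would be handed to set_render_func rather than called directly; the lookup table only replaces the branching.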
A Custom Extractor
The basic realization is that, with just a few additions, we are able to customize our existing extraction routines. Initially, I thought I would need to write entirely new routines. Fortunately, our existing routines turned out to be general enough already to enable this customization.
Since it is a bit simpler, we will use the struct_extractor function to show where the customization enhancements need to go. We will also provide the code snippet where the insertion is noted. I provide comments on these additions below the code listing.
Note these same changes are applied to the annot_extractor function as well (not shown). You can inspect the updated extraction module at the conclusion of this installment.
OK, so let’s explain these customizations:
# Assumes the owlready2 render functions and the kko namespace are already in scope
import csv

def struct_extractor(**extract_deck):
    print('Beginning structure extraction . . .')
# 1 - render method goes here
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return
# 2 - note about custom extractions
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    descent_type = extract_deck.get('descent_type')
    descent = extract_deck.get('descent')
    single = extract_deck.get('single')
    x = 1
    cur_list = []
    a_set = []
    s_set = []
    new_class = 'owl:Thing'
# 5 - what gets passed to 'output'
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)
        if loop == 'class_loop':
            header = ['id', 'subClassOf', 'parent']
            p_item = 'rdfs:subClassOf'
        else:
            header = ['id', 'subPropertyOf', 'parent']
            p_item = 'rdfs:subPropertyOf'
        csv_out.writerow(header)
# 3 - what gets passed to 'loop_list'
        for value in loop_list:
            print('   . . . processing', value)
            root = eval(value)
# 4 - descendant or single here
            if descent_type == 'descent':
                a_set = root.descendants()
                a_set = set(a_set)
                s_set = a_set.union(s_set)
            elif descent_type == 'single':
                a_set = root
                s_set.append(a_set)
            else:
                print('You have assigned an incorrect descent method--execution stopping.')
                return
        print('   . . . processing consolidated set.')
        for s_item in s_set:
            o_set = s_item.is_a
            for o_item in o_set:
                row_out = (s_item,p_item,o_item)
                csv_out.writerow(row_out)
            if loop == 'class_loop':
                if s_item not in cur_list:
                    row_out = (s_item,p_item,new_class)
                    csv_out.writerow(row_out)
                    cur_list.append(s_item)
            x = x + 1
    print('Total unique IDs written to file:', x)
The notes that follow pertain to the code listing above.
The render method (#1) is just a simple switch set in the configuration file. Only three keyword options are allowed; if a wrong keyword is entered, the error is flagged and the routine ends. We also added a ‘render’ assignment at the top of the code block.
What now makes this routine (#2) a custom one is the use of the configurable custom_dict and its configuration settings. The custom_dict dictionary is specified by assigning it to the loop_list (#3). The custom_dict dictionary can take one or many key:value pairs. The first item, the key, should take the name that you wish to use as the internal variable name. The second item, the value, should correspond to the property or class with its namespace prefix. Here are the general rules and options available for a custom extraction:
- You may enter properties OR classes into the custom_dict dictionary, but not both, in your pre-run configurations
- The ‘iri’ switch for the renderer is best suited for the struct_extractor function. It should probably not be used for annotations given the large number of output columns and loss of subsequent readability when using the full IRI. The choice of actual prefix is likely not that important since it is easy to do global search-and-replaces when in bulk mode
- You may retrieve items in the custom_dict dictionary either singly or all of its descendants, depending on the use of the ‘single’ and ‘descent’ keyword options (see #4 next).
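To make these rules concrete, here is a hedged sketch of what a pre-run configuration might look like. The key names in extract_deck ('render', 'loop', 'loop_list', 'descent_type', 'out_file') are the ones the listing above reads; every value, both custom_dict entries, and the check_deck helper are hypothetical illustrations, and the allowed 'loop' values are assumed from the names used in the listing.

```python
# Hypothetical custom_dict: key = internal variable name, value = prefixed class
custom_dict = {
    'first_root': 'kko.FirstExampleClass',
    'second_root': 'kko.SecondExampleClass',
}

# Hypothetical run settings; only the key names come from the listing above
extract_deck = {
    'render': 'r_label',
    'loop': 'class_loop',
    'loop_list': custom_dict.values(),
    'descent_type': 'single',
    'out_file': 'custom_extract.csv',
}


def check_deck(deck):
    """Return a list of human-readable problems found in a run configuration."""
    problems = []
    if deck.get('render') not in ('r_default', 'r_label', 'r_iri'):
        problems.append('render must be r_default, r_label, or r_iri')
    if deck.get('loop') not in ('class_loop', 'property_loop'):
        problems.append('loop must be class_loop or property_loop')
    if deck.get('descent_type') not in ('descent', 'single'):
        problems.append('descent_type must be descent or single')
    return problems


print(check_deck(extract_deck))
```

Checking the deck before a run is a design choice of this sketch only; the extractor itself simply prints a message and returns when it meets a wrong keyword.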
Item #4 is another switch to either run the entries in the custom_dict dictionary as single ‘roots’ (thus no sub-classes or sub-properties) or with all descendants. The descent_type has been added to the extract_deck settings, plus we added the related assignments to the beginning of this code block.
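The difference between the two settings can be seen on a toy hierarchy. The sketch below assumes nothing about KBpedia: Node is a hypothetical stand-in class whose descendants() includes the starting node itself (which, as I understand it, is also owlready2's default behavior), and collect is an illustrative helper, not part of cowpoke.

```python
class Node:
    """Toy stand-in for an ontology class with a descendants() method."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def descendants(self):
        # Includes the node itself, then everything below it
        found = {self}
        for child in self.children:
            found |= child.descendants()
        return found


def collect(roots, descent_type):
    """Gather the items to extract for the given roots and descent setting."""
    s_set = set()
    for root in roots:
        if descent_type == 'descent':
            s_set |= root.descendants()
        elif descent_type == 'single':
            s_set.add(root)
        else:
            raise ValueError('descent_type must be descent or single')
    return s_set


animal = Node('animal')
mammal = Node('mammal', parent=animal)
dog = Node('dog', parent=mammal)

with_descent = sorted(n.name for n in collect([mammal], 'descent'))
as_single = sorted(n.name for n in collect([mammal], 'single'))
print(with_descent, as_single)
```

Using one set for both branches sidesteps mixing a list and a set in the accumulator; that is a simplification made for this sketch.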
The last generalized capability we wanted to capture was the ability to print out all of the structural aspects of KBpedia’s typologies, which suggested some code changes at roughly #5 above. While I am sure I could have figured out a way to do this, because of interactions with the other customizations this addition proved to be more complicated than warranted. So, rather than spend undue time trying to cram everything into a single, generic function (struct_extractor), I decided the easier and quicker choice was to create a separate function, picking up on many of the processing constructs developed for the other extractor routines.
Basically, what we want in a typology extract is:
- Separate extractions of individual typologies to their own named files
- Removal of the need to find unique resources across multiple typologies. Rather, the intent is to capture the full scope of structural (subClassOf) aspects in each typology
- A design that enables us to load a typology as an individual ontology or knowledge graph into a tool such as Protégé.
By focusing on a special extractor limited to classes, typologies, structure, and single output files per typology, we were able to make the function rather quickly and simply. Here is the result, the typol_extractor function:
# Assumes the owlready2 render functions and the kko namespace are already in scope
import csv

def typol_extractor(**extract_deck):
    print('Beginning structure extraction . . .')
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    class_loop = extract_deck.get('class_loop')
    base = extract_deck.get('base')
    ext = extract_deck.get('ext')
    new_class = 'owl:Thing'
    # Compare string values with !=, not identity ('is not')
    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    header = ['id', 'subClassOf', 'parent']
    p_item = 'rdfs:subClassOf'
    for value in loop_list:
        print('   . . . processing', value)
        x = 1
        s_set = []
        cur_list = []
        root = eval(value)
        s_set = root.descendants()
        # Build the output file name from the typology name without its prefix
        frag = value.replace('kko.','')
        out_file = (base + frag + ext)
        # The with block closes the file, so no explicit close() is needed
        with open(out_file, mode='w', encoding='utf8', newline='') as output:
            csv_out = csv.writer(output)
            csv_out.writerow(header)
            for s_item in s_set:
                o_set = s_item.is_a
                for o_item in o_set:
                    row_out = (s_item,p_item,o_item)
                    csv_out.writerow(row_out)
                if s_item not in cur_list:
                    row_out = (s_item,p_item,new_class)
                    csv_out.writerow(row_out)
                    cur_list.append(s_item)
                x = x + 1
        print('Total unique IDs written to file:', x)
Two absolute essentials for this routine are to set the 'loop' key to 'class_loop' and to set the 'loop_list' key to
Note the code in the middle of the routine that creates the file name after replacing (removing) the ‘kko.’ prefix from the value name in the dictionary. We also needed to add two further entries to the extract_deck, 'base' and 'ext', which the routine reads to build each output file name.
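The file-naming step is small enough to try in isolation. In this sketch the helper name typology_file_name, the two typology names, and the base and extension values are all hypothetical; only the replace-then-concatenate logic mirrors the routine.

```python
def typology_file_name(value, base, ext):
    """Drop the 'kko.' prefix and wrap the remainder in base and extension."""
    frag = value.replace('kko.', '')
    return base + frag + ext


# Hypothetical values for illustration only
base = 'C:/tmp/typol_'
ext = '.csv'
example_typologies = ['kko.FirstTypology', 'kko.SecondTypology']

file_names = [typology_file_name(value, base, ext) for value in example_typologies]
print(file_names)
```

Because the name is plain string concatenation, the base must carry its own trailing separator or file-name stem, as the hypothetical 'C:/tmp/typol_' does here.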
Your local file structure is likely different from what we set up for this project, but if it is similar, the following commands can be used to run these routines. Should you test different possibilities, make sure your input specifications in the extract_deck are modified appropriately. Remember to always work from copies so that you may restore critical files in the case of an inadvertent overwrite.
Here are the commands:
from cowpoke.__main__ import *
from cowpoke.config import *
The extract.py File
Again, assuming you have set up your files and directories similar to what we have suggested, you can inspect the resulting extractor code in this new module (modify the path as necessary):
with open(r'C:\1-PythonProjects\Python\Lib\site-packages\cowpoke\extract.py', 'r') as f: print(f.read())
Summary of the Module
OK, so we are now done with the development and packaging of the extractor module for cowpoke. Our efforts resulted in the addition of four files under the ‘cowpoke’ directory. These files are:
- __init__.py file that indicates the cowpoke package
- __main__.py file where shared start-up functions reside
- config.py file where we store our dictionaries and where we specify new run settings in the special extract_deck dictionary
- extract.py module where all of our extraction routines are housed.
This module is supported by three dictionaries (and the fourth special one for the run configurations):
- typol_dict dictionary of typologies
- prop_dict dictionary of top-level property roots
- custom_dict dictionary for tailored starting point extractions, and
- extract_deck special dictionary for extraction run settings.
In turn, most of these dictionaries can also be matched with three different extractor routines or functions:
- annot_extractor function for extracting annotations
- struct_extractor function for extracting the is-a relations in KBpedia, and
- typol_extractor dedicated function for extracting out the individual typologies into individual files.
In our next CWPK installment we will discuss how we might manipulate this extracted information in a bulk manner using spreadsheets and other tools. These same extracted files, perhaps after bulk manipulations or other edits and changes, will then form the basis for the input files that we will use to build new versions of KBpedia (or your own extensions and changes to it) from scratch. We are now half-way around our roundtrip.