Posted: September 10, 2020

CWPK #33: A Python Package, Part I: The Annotation Extractor

Generalization, Packaging, and Complexity Compel More Powerful Tools

Over the past installments of this Cooking with Python and KBpedia series, we have been building up larger and more complex routines from our coding blocks. This approach has been great for learning and prototyping, but does not readily support building maintainable applications. This is a natural evolution in any code development that is moving towards real use and deployment. It is a step that Project Jupyter is also taking in its efforts to transition from Notebook to JupyterLab (see here). Their intent is to provide a complete code development environment as well as one suitable for interactive notebooks.

Recent announcements aside, we picked the Spyder IDE and installed it in CWPK #11 for these same functional reasons, and will stay with it throughout this series because of its maturity and degree of acceptance within the data science community. But, JupyterLab looks to be a promising development.

Whatever the tool, there comes a time when code proliferation and the need to manage it to a release condition warrants moving beyond prototyping. Now is that time with our project.

We will use the packaging of our extraction routines begun in the last installment as our example case for how to proceed. We will continue to use Jupyter Notebook to discuss and present code snippets, but that material is now to be backed up with methods, code files, and modules, hopefully in an acceptable Python way. We will be using Spyder for these development purposes and referring to it in our documentation with screen captures and discussion as appropriate. We will also be releasing Python files as our installments proceed. But the transition to working code is more complicated than changing tool emphasis alone.

Note: Though we begin formal packaging of our routines in this installment, it is not until CWPK #46 that a sufficient number of modules are developed to warrant the actual release of the package.

An obsession for many programmers, and not a bad one by the way, is to embrace a DRY (don’t repeat yourself) mindset that seeks to reduce duplicative patterns and to find generalities within code. Properly done, DRY is said to lead to code that is easier to maintain and understand. It also increases inter-dependencies and places a premium on the architecture and modularization (the packaging) of the code base. Definitions of functions and methods and their organization are part of this. By no means do I have the experience and background to offer any advice in these areas, other than to try myself to identify and generalize repeatable patterns. With these caveats in mind, let’s proceed to package some code.

The Objective of These Three Parts

In this installment and the two subsequent ones, we will complete an extraction ‘module’ for KBpedia, and organize and package its functions and definitions. We will set up four program files: 1) an __init__.py standard file that begins a package; 2) a __main__.py file that handles the standard module setup and starting assignments; 3) a config.py file where we set initial parameters for new runs and define our shared dictionaries; and 4) an extract.py set of methods governing our specific KBpedia extraction routines. The first two files are a sort of boilerplate. The third file is where all initialization specifications are entered prior to any new runs. I am hoping to set this project up in such a way that changes need to be made only to the config.py file prior to any given run. The fourth file, extract.py, is the meat of the extraction logic and routines and represents the first of multiple clusters of related functionality. As we formulate these clusters, we will also need to look at our overall code and directory organization a few installments from now. For the time being, we will focus on these four starting program files.

As we discussed in CWPK #18, a module is an individual Python file (*.py) that may set assignments, load resources, define classes, conduct I/O, or define or execute functions. A package in Python is a directory structure that combines one or more Python modules into a coherent library or set of related functions. We are ultimately aiming to produce an entire package of Python functions for extracting, building, testing, or using KBpedia.
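To make the module versus package distinction concrete, here is a minimal sketch that builds a throwaway package in a temporary directory and imports one of its modules. The package name demo_pkg, the out_file setting, and the target() function are all hypothetical stand-ins for illustration; they are not the actual KBpedia package contents.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk. 'demo_pkg' and everything in it
# are hypothetical names used only for this illustration.
base = Path(tempfile.mkdtemp())
pkg = base / 'demo_pkg'
pkg.mkdir()

# An (empty) __init__.py marks the directory as a package.
(pkg / '__init__.py').write_text('')

# config.py is a module that only holds settings.
(pkg / 'config.py').write_text("out_file = 'annot_out.csv'\n")

# extract.py is a module that imports those settings and defines a function.
(pkg / 'extract.py').write_text(
    'from demo_pkg import config\n'
    '\n'
    'def target():\n'
    '    return config.out_file\n'
)

# Make the parent directory importable, then import a module from the package.
sys.path.insert(0, str(base))
importlib.invalidate_caches()
extract = importlib.import_module('demo_pkg.extract')

print(extract.target())
```

Each *.py file is a module in its own right; the directory with its __init__.py is what lets us address them together as demo_pkg.config and demo_pkg.extract.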

In the first part of this three-part mini-series we will complete a generic method for extracting annotations to file for any of our objects in the KBpedia system. We will be pushing the DRY concept a little harder in this installment. In the second part, we will transition that generalized annotation extraction code from the notebook to a Python package, and extend our general approach to structure extraction. And, in the third part, we will modify the structure extraction to support individual typology files and complete the steps to a complete KBpedia extraction package. It is this baseline package to which we will add further modules as the remaining CWPK series proceeds.

Starting Routine

We again start with our standard opening routine. This set of statements, by the way, will be moved to the __main__.py module, with the file declarations going to the config.py module.

Which environment? The specific load routine you should choose below depends on whether you are using the online MyBinder service (the ‘raw’ version) or local files. The example below is based on using local files (though replace with your own local directory specification). If loading from MyBinder, replace with the lines that are commented (#) out.
kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
# kbpedia = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core' 
kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
# kko_file = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kko.owl'

from owlready2 import *
world = World()
kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')               

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

More Initial Configuration

As noted in the objective, we also will codify the starting dictionaries as we defined in CWPK #32. As we begin packaging, these next two dictionary components will be moved to the config.py module.

typol_dict = {
             'ActionTypes'           : 'kko.ActionTypes',
             'AdjunctualAttributes'  : 'kko.AdjunctualAttributes',
             'Agents'                : 'kko.Agents',
             'Animals'               : 'kko.Animals',
             'AreaRegion'            : 'kko.AreaRegion',
             'Artifacts'             : 'kko.Artifacts',
             'Associatives'          : 'kko.Associatives',
             'AtomsElements'         : 'kko.AtomsElements',
             'AttributeTypes'        : 'kko.AttributeTypes',
             'AudioInfo'             : 'kko.AudioInfo',
             'AVInfo'                : 'kko.AVInfo',
             'BiologicalProcesses'   : 'kko.BiologicalProcesses',
             'Chemistry'             : 'kko.Chemistry',
             'Concepts'              : 'kko.Concepts',
             'ConceptualSystems'     : 'kko.ConceptualSystems',
             'Constituents'          : 'kko.Constituents',
             'ContextualAttributes'  : 'kko.ContextualAttributes',
             'CopulativeRelations'   : 'kko.CopulativeRelations',
             'Denotatives'           : 'kko.Denotatives',
             'DirectRelations'       : 'kko.DirectRelations',
             'Diseases'              : 'kko.Diseases',
             'Drugs'                 : 'kko.Drugs',
             'EconomicSystems'       : 'kko.EconomicSystems',
             'EmergentKnowledge'     : 'kko.EmergentKnowledge',
             'Eukaryotes'            : 'kko.Eukaryotes',
             'EventTypes'            : 'kko.EventTypes',
             'Facilities'            : 'kko.Facilities',
             'FoodDrink'             : 'kko.FoodDrink',
             'Forms'                 : 'kko.Forms',
             'Generals'              : 'kko.Generals',
             'Geopolitical'          : 'kko.Geopolitical',
             'Indexes'               : 'kko.Indexes',
             'Information'           : 'kko.Information',
             'InquiryMethods'        : 'kko.InquiryMethods',
             'IntrinsicAttributes'   : 'kko.IntrinsicAttributes',
             'KnowledgeDomains'      : 'kko.KnowledgeDomains',
             'LearningProcesses'     : 'kko.LearningProcesses',
             'LivingThings'          : 'kko.LivingThings',
             'LocationPlace'         : 'kko.LocationPlace',
             'Manifestations'        : 'kko.Manifestations',
             'MediativeRelations'    : 'kko.MediativeRelations',
             'Methodeutic'           : 'kko.Methodeutic',
             'NaturalMatter'         : 'kko.NaturalMatter',
             'NaturalPhenomena'      : 'kko.NaturalPhenomena',
             'NaturalSubstances'     : 'kko.NaturalSubstances',
             'OrganicChemistry'      : 'kko.OrganicChemistry',
             'OrganicMatter'         : 'kko.OrganicMatter',
             'Organizations'         : 'kko.Organizations',
             'Persons'               : 'kko.Persons',
             'Places'                : 'kko.Places',
             'Plants'                : 'kko.Plants',
             'Predications'          : 'kko.Predications',
             'PrimarySectorProduct'  : 'kko.PrimarySectorProduct',
             'Products'              : 'kko.Products',
             'Prokaryotes'           : 'kko.Prokaryotes',
             'ProtistsFungus'        : 'kko.ProtistsFungus',
             'RelationTypes'         : 'kko.RelationTypes',
             'RepresentationTypes'   : 'kko.RepresentationTypes',
             'SecondarySectorProduct': 'kko.SecondarySectorProduct',
             'Shapes'                : 'kko.Shapes',
             'SituationTypes'        : 'kko.SituationTypes',
             'SocialSystems'         : 'kko.SocialSystems',
             'Society'               : 'kko.Society',
             'SpaceTypes'            : 'kko.SpaceTypes',
             'StructuredInfo'        : 'kko.StructuredInfo',
             'Symbolic'              : 'kko.Symbolic',
             'Systems'               : 'kko.Systems',
             'TertiarySectorService' : 'kko.TertiarySectorService',
             'Times'                 : 'kko.Times',
             'TimeTypes'             : 'kko.TimeTypes',
             'TopicsCategories'      : 'kko.TopicsCategories',
             'VisualInfo'            : 'kko.VisualInfo',  
             'WrittenInfo'           : 'kko.WrittenInfo'
             }
prop_dict = {
            'objectProperties' : 'kko.predicateProperties',
            'dataProperties'   : 'kko.predicateDataProperties',
            'representations'  : 'kko.representations',
            }
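In the DRY spirit, note that every value in typol_dict is simply its key with a 'kko.' prefix, so the dictionary could be generated from a list of names. The sketch below shows that idea with an abbreviated list, plus one possible way to resolve such 'kko.Name' strings with getattr() rather than the eval() call used in the routine later in this installment. The names kko_demo, typol_names, typol_dict_demo, and resolve are hypothetical, and a SimpleNamespace stands in for the real owlready2 namespace so the example runs without KBpedia loaded.

```python
from types import SimpleNamespace

# Stand-in for the owlready2 'kko' namespace (an assumption, so that the
# sketch runs without loading KBpedia); the values are placeholder strings.
kko_demo = SimpleNamespace(Animals='<Animals class>', Plants='<Plants class>')

# Abbreviated list of typology names; the full list would hold every key
# shown in typol_dict above.
typol_names = ['Animals', 'Plants']

# Same shape as typol_dict: each key maps to the 'kko.'-prefixed string.
typol_dict_demo = {name: 'kko.' + name for name in typol_names}


def resolve(dotted, namespaces):
    """Look up a 'prefix.Name' string without calling eval()."""
    prefix, _, name = dotted.partition('.')
    return getattr(namespaces[prefix], name)


root = resolve(typol_dict_demo['Animals'], {'kko': kko_demo})
print(typol_dict_demo)
print(root)
```

Because getattr(kko, 'Animals') is the same attribute lookup as writing kko.Animals, this approach should give the same object that eval('kko.Animals') returns, while accepting only names that actually exist on a known namespace.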

The Generic Annotation Routine

So, now we come to the heart of the generic annotation extraction routine. For grins as much as anything else, I have wanted to take the DRY perspective and create a generic annotation extractor that could apply to any object or any aggregations of objects within KBpedia. I first tested it with the structure dictionary (typol_dict), and then generalized the arguments and added some additional extractors to handle properties (using prop_dict) as well. The routine as shown below accomplishes our desired extraction objectives.

You can Run this routine, and you can also change some of the switches to test class versus property extractions. Going through the entire set of typologies (typol_dict) takes about 8 minutes on a conventional desktop. All other combos, including those for properties, run much more quickly.

I provide line-by-line comments as appropriate to capture the changes needed to generalize this routine. I also add some comments about how we will then break this code block apart in order to conform with the setup and configuration approach. Here is the routine, with the comments detailed below it:

import csv                                                              # #1

def render_using_label(entity):                                         # #14
    return entity.label.first() or entity.name
set_render_func(render_using_label)

x = 1                                                                   # #2
cur_list = []
class_loop = 0
property_loop = 1                                                       # #3
loop = property_loop                                                    # #15
loop_list = prop_dict.values()                                          # #4
print('Beginning annotation extraction . . .') 
out_file = 'C:/1-PythonProjects/kbpedia/sandbox/prop_annot_out.csv'     # #15
p_set = ''
with open(out_file, mode='w', encoding='utf8', newline='') as output:
    csv_out = csv.writer(output)                                        # #5
    if loop == class_loop:                                              # #6, #15
        header = ['id', 'prefLabel', 'subClassOf', 'altLabel', 'definition', 'editorialNote']
    else:
        header = ['id', 'prefLabel', 'subClassOf', 'domain', 'range', 'functional', 'altLabel', 
                  'definition', 'editorialNote']
    csv_out.writerow(header)    
    for value in loop_list:                                             # #7
        print('   . . . processing', value)                                           
        root = eval(value)                                              # #8                 
        p_set = root.descendants()                                      # #9, #15
        if root == kko.representations:                                 # #10
            p_set.remove(backwardCompatibleWith)
            p_set.remove(deprecated)
            p_set.remove(incompatibleWith)
            p_set.remove(priorVersion)
            p_set.remove(versionInfo)
            p_set.remove(isDefinedBy)
            p_set.remove(label)
            p_set.remove(seeAlso)
        for p_item in p_set:
            if p_item not in cur_list:                                  # #11
                a_pref = p_item.prefLabel
                a_pref = str(a_pref)[1:-1].strip('"\'')                 # #12
                a_sup = p_item.is_a
                for a_id, a in enumerate(a_sup):                        # #13
                    a_item = str(a)
                    if a_id > 0:
                        a_item = a_sup + '||' + str(a)
                    a_sup  = a_item
                if loop == property_loop:                               # #3     
                    a_dom  = p_item.domain
                    a_dom  = str(a_dom)[1:-1]
                    a_rng  = p_item.range
                    a_rng  = str(a_rng)[1:-1]
                    a_func = ''
                a_item = ''
                a_alt  = p_item.altLabel
                for a_id, a in enumerate(a_alt):
                    a_item = str(a)
                    if a_id > 0:
                        a_item = a_alt + '||' + str(a)
                    a_alt  = a_item    
                a_alt  = a_item
                a_def  = p_item.definition
                a_def = str(a_def).strip('[]')
                a_note = p_item.editorialNote
                a_note = str(a_note)[1:-1]
                if loop == class_loop:                                  # #6
                    row_out = (p_item,a_pref,a_sup,a_alt,a_def,a_note)
                else:
                    row_out = (p_item,a_pref,a_sup,a_dom,a_rng,a_func,a_alt,a_def,a_note)
                csv_out.writerow(row_out)                               # #1
                cur_list.append(p_item)
                x = x + 1
print('Total rows written to file:', x)                                 # #16
Beginning annotation extraction . . .
. . . processing kko.predicateProperties
. . . processing kko.predicateDataProperties
. . . processing kko.representations
Total rows written to file: 4843
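
The two enumerate() loops in the routine above each build a single '||'-separated string out of a multi-valued field. As a possible simplification, the same flattening can be sketched with str.join(). The helper name flatten and the sample values are hypothetical, and this is an alternative illustration rather than the code used in the package.

```python
def flatten(values, sep='||'):
    """Join a multi-valued field into one string for a single CSV cell."""
    return sep.join(str(v) for v in values)


# Made-up sample values standing in for something like p_item.altLabel.
print(flatten(['car', 'automobile', 'motorcar']))

# A field with no entries becomes an empty string.
print(repr(flatten([])))
```

One design point in its favor is that the helper can be reused for is_a, altLabel, or any other list-like field, which is the kind of repeated pattern the DRY approach tries to factor out.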

Here are some of the specific changes to the routine above, keyed by number, to accommodate our current generic and DRY needs versus the first prototype presented in the earlier CWPK #30:

  1. We need to import the csv module at this point to make sure we can format longer text (definitions, especially) with the proper escaping of delimiting characters such as commas, etc.
  2. We’re putting some temporary counters in to keep track of the number of items we process
  3. Our generic annotation extraction method allows us to specify whether we are processing classes or properties
  4. Our big, or outer, loop is to cycle over the entries in our starting dictionary. Each one of these is a root with a set of child elements
  5. Here is where we switch out the writer to enable proper escaping of large text strings, etc., for CSV
  6. We’re checking on whether it is classes or properties we are looping over, and switching the number of columns thus needed for the outputs. The next code enables us to put a single-row header in our CSV files to label the output fields
  7. We take the big chunks of the combined roots in our starting dictionaries
  8. And we evaluate each of these string values into its actual ontology object for later manipulation (also see the prior installment for cautions about the eval() method)
  9. The heart of this routine is to grab all of the descendant sub-items from our starting root
  10. This is a temporary kludge: possibly because of namespace or assignment errors, we need to remove these annotation properties from our standard set; these properties are all part of the starting core KKO ontology ‘stub’
  11. Since there are many duplicates across our groupings, this check ensures we are only adding new assignments to our results. It effectively is a duplicate-removal routine
  12. We need to make some one-off string changes in order for our actual output to conform to an expected CSV file
  13. As discussed in prior CWPK installments, some record fields allow for more than one entry. This general routine loops over those sub-set members, making the format changes and commitments as indicated
  14. This part of the code block will be moved to the setup.py module, since how we want to render our extractions will be shared across modules
  15. Will move all of these items to the config.py module
  16. A little feedback for grins.
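
To see why notes 1 and 5 matter, here is a small self-contained sketch of how csv.writer escapes a field containing commas and quotes, and how csv.reader recovers the original values. The row contents are made up for illustration, and an in-memory buffer is used in place of the output file.

```python
import csv
import io

# A made-up row; the third field contains both a comma and quote marks,
# much as a long definition might.
row = ['rc_Demo', 'Demo label', 'A definition with a comma, a "quoted" word, and more.']

# Write the row to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line)

# Reading the line back recovers the three original fields unchanged.
recovered = next(csv.reader(io.StringIO(line)))
print(recovered == row)
```

Had we simply joined the fields with commas ourselves, the definition would have been split into extra columns when the file was read back.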

If you inspect the code base, for example, you will see that many of the parts above have been broken out into different files.

BTW, if you want to see the members of the outer loop set, you can do so with this code snippet (set your own root):

root = kko.representations                 
p_set = root.descendants()
print(p_set)
length = len(p_set)
print(length)

Based on the changes described in the comment notes, and after embedding this generic annotation routine into its own method, annot_extractor, we will end up with this deployed code structure:

__main__.py material
config.py material

def annot_extractor (arg1, arg2)
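
As a rough illustration only, here is one way the notebook's switches (loop, loop_list, and the output target) might become function arguments. Everything in this sketch is an assumption: the function name annot_extractor_sketch, the 'class' and 'property' switch values, the get_members callable, and the stand-in data are hypothetical, and the real annotation lookups are omitted. The actual method and its arguments may well differ.

```python
import csv
import io

# Header rows copied from the routine above, keyed by a hypothetical switch.
HEADERS = {
    'class': ['id', 'prefLabel', 'subClassOf', 'altLabel', 'definition',
              'editorialNote'],
    'property': ['id', 'prefLabel', 'subClassOf', 'domain', 'range',
                 'functional', 'altLabel', 'definition', 'editorialNote'],
}


def annot_extractor_sketch(loop_list, loop, output, get_members):
    """Write a header plus one row per unique member; return the row count.

    loop_list   -- iterable of root names
    loop        -- 'class' or 'property' (hypothetical switch values)
    output      -- an open text file or file-like object
    get_members -- hypothetical callable mapping a root name to its members
    """
    csv_out = csv.writer(output)
    header = HEADERS[loop]
    csv_out.writerow(header)
    seen = set()                 # set-based version of the cur_list check
    rows = 0
    for root_name in loop_list:
        for member in get_members(root_name):
            if member in seen:
                continue
            seen.add(member)
            # Only the id column is filled; annotation lookups are omitted.
            csv_out.writerow([member] + [''] * (len(header) - 1))
            rows += 1
    return rows


# Made-up stand-in data: two roots that share one member.
demo_members = {
    'root_one': ['prop_a', 'prop_b'],
    'root_two': ['prop_b', 'prop_c'],
}
buffer = io.StringIO()
count = annot_extractor_sketch(demo_members.keys(), 'property', buffer,
                               demo_members.get)
print(count)
print(buffer.getvalue())
```

Passing the switches in as arguments means the same function can serve both class and property runs, with the values themselves supplied from the configuration material.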

We’re now ready to migrate this notebook code to a formal Python package and to extend the method to the structure extractor, the topics of our next installment.

Additional Documentation

Style guidelines and coding standards should be near at hand whenever you are writing code. That is because code is meant to be shared and understood, and conventions and lessons regarding readability are a key part of that. Here are some references useful for whatever work you choose to do with Python:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.
