Posted: September 22, 2020

We Build Up Our Ingest Routine to All Structure

Now that we have a template for structure builds in this Cooking with Python and KBpedia series, we continue to refine that template to generalize the routine and expand it to looping over multiple input files and to apply it to property structure as well. These are the topics we cover in this current installment, with a detour as I explain below.

In order to prep for today’s material, I encourage you to go back and look at the large routine we developed in the last installment. We can see three areas we need to address in order to generalize this routine:

  • First, last installment’s structure build routine (as designed) requires three passes to complete file ingest. Each of those passes has a duplicate code section to convert our file input forms to the required shorter versions. We would like to extract these duplicates into a helper function in order to lessen code complexity and improve readability
  • Second, we need a more generic way of specifying the input file or files to be processed by the routine, preferably including being able to loop over and process all of the files in a given input dictionary (as housed in config.py), and
  • Third, we would like to generalize the approach to dealing with class hierarchical structure to also deal with property ingest and hierarchical structure.

So, with these objectives in mind, let’s begin.

Adding a Helper Function

For reference, here is the code block in the prior installment that we repeat three times, and for which we would like to develop a helper function (BTW, this code block will not run here in isolation):

id = row['id']                                                 
parent = row['parent']                                         
id = id.replace('http://kbpedia.org/kko/rc/', 'rc.')          
id = id.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
id_frag = id.replace('rc.', '')
id_frag = id_frag.replace('kko.', '')
parent = parent.replace('http://kbpedia.org/kko/rc/', 'rc.') 
parent = parent.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
parent = parent.replace('owl:', 'owl.')
parent_frag = parent.replace('rc.', '')
parent_frag = parent_frag.replace('kko.', '')
parent_frag = parent_frag.replace('owl.', '')

We will call our helper function row_clean since its purpose is to convert the full IRIs of the CSV input rows to shorter forms required by owlready2 (sometimes object names with a namespace prefix, other times just with the shortened object name). We also need these to work on either the subject of the row (‘id’) or the object of the row (‘parent’ in this case). That leads to four combinations of 2 row objects by 2 shortened forms.

Note that the second argument (‘iss’) passed to the function below is a keyword argument, always shown with the equal sign in the function definition. Also note that, rather than the empty string shown here, you may assign the keyword argument a legitimate value in the definition; that value then becomes the default for the keyword and does not have to be supplied when the function is called. (NB: Indeed, many built-in Python functions have multiple arguments that are infrequently exposed. I have found it frequently helpful to do a dir() on functions to discover their broader capabilities.)
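Purely as a side illustration of that default-value behavior (the function and values here are invented for the example, not part of cowpoke), note how the second call need not supply the keyword at all:

def shorten(value, prefix='rc.'):                            # 'prefix' has a default, so it is optional
    return prefix + value

print(shorten('Mammal', prefix='kko.'))                      # override -> 'kko.Mammal'
print(shorten('Mammal'))                                     # default  -> 'rc.Mammal'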

### Here is the helper function

def row_clean(value, iss=''):                                # arg values come from calling code
    if iss == 'i_id':                                        # check to see which replacement method
        value = value.replace('http://kbpedia.org/kko/rc/', 'rc.')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        return value                                         # returns the calculated value to calling code
    if iss == 'i_id_frag':
        value = value.replace('http://kbpedia.org/kko/rc/', '')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', '')
        return value
    if iss == 'i_parent':
        value = value.replace('http://kbpedia.org/kko/rc/', 'rc.')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        value = value.replace('owl:', 'owl.')
        return value
    if iss == 'i_parent_frag':
        value = value.replace('http://kbpedia.org/kko/rc/', '')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', '')
        value = value.replace('owl:', '')
        return value
        
### Here is the code we will put in the main calling routine:
        
# r_id = row['id']                                           # this is the version we will actually keep
r_id = 'http://kbpedia.org/kko/rc/AlarmSignal'               # temporary assignment just to test code
# r_parent = row['parent']
r_parent = 'http://kbpedia.org/kko/rc/SoundsByType'
id = row_clean(r_id, iss='i_id')                             # send the two arguments to helper function
id_frag = row_clean(r_id, iss='i_id_frag')
parent = row_clean(r_parent, iss='i_parent')
parent_frag = row_clean(r_parent, iss='i_parent_frag')

print('id:', id)                                             # temporary print to check if results OK
print('id_frag', id_frag)
print('parent:', parent)
print('parent_frag:', parent_frag)

Because we have entered some direct assignments, the code block above does run (via Run or shift+enter).

Note in the main calling routine code that to get our routine values we are calling the row_clean function and passing the required two arguments: the value for either the ‘id’ or ‘parent’ in that row, and whether we want prefixed or shortened fragments.

I strongly suspect there are better and shorter ways to remove this duplicate code, but this approach with a helper function, even in a less optimal form, still has cut the original code length in half (36 lines to 18 lines due to three duplicates). Expect to see a similar form to this in our code going forward. (NB: I am finding that looking for these duplicate code blocks is forcing me to learn function definitions and seek shorter but more expressive forms.)
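As one possible shorter variant (just a sketch of an alternative, not the form used in the cowpoke code), the four cases could be driven by a small table of replacement pairs:

replace_table = {                                            # hypothetical table-driven alternative
    'i_id':          [('http://kbpedia.org/kko/rc/', 'rc.'),
                      ('http://kbpedia.org/ontologies/kko#', 'kko.')],
    'i_id_frag':     [('http://kbpedia.org/kko/rc/', ''),
                      ('http://kbpedia.org/ontologies/kko#', '')],
    'i_parent':      [('http://kbpedia.org/kko/rc/', 'rc.'),
                      ('http://kbpedia.org/ontologies/kko#', 'kko.'),
                      ('owl:', 'owl.')],
    'i_parent_frag': [('http://kbpedia.org/kko/rc/', ''),
                      ('http://kbpedia.org/ontologies/kko#', ''),
                      ('owl:', '')],
    }

def row_clean_alt(value, iss=''):
    for old, new in replace_table.get(iss, []):              # apply each replacement pair in turn
        value = value.replace(old, new)
    return value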

Looping Over Files

If you recall our extraction steps of getting flat CSV files out of KBpedia in CWPK #28 to CWPK #35, we can end up with close to 100 extraction files. These splits encourage modularity and are easier to work on or substitute. Still, when it comes time to building KBpedia back up again after we complete a roundtrip, a complete build requires we process many files. We thus need looping routines across our build files to automate this process.

The first thought is to simply put groupings of files in individual directories and then point the routine at a directory and instruct it to loop over all files. If we have concerns that the directories may have more file types than we want to process with our current routine, we could also introduce some file name string checks to filter by name, fragment, or extension. These options would enable us to generalize a file looping routine to apply to many conditions.
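For reference, a quick sketch of what that directory-scan alternative might look like (illustrative only; this is not the approach adopted below, and the directory path is a placeholder):

import glob
import os

in_dir = 'C:/1-PythonProjects/kbpedia/v300/build_ins/typologies/'    # placeholder directory
for in_file in glob.glob(os.path.join(in_dir, 'typol_*.csv')):       # filter by name fragment and extension
    print('processing:', in_file)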

But, I’ve decided to take a different approach. Since our extractions are driven by Python dictionaries, and we can direct those extractions to any directory prefix, we can re-use these same specifications for build processes. Should we later discover that a general file harvester makes sense, we can generalize at that time from this dictionary design. Also, by applying the same dictionary approach to extraction or building, we help reinforce our roundtripping mindset in how we name and process files.

So, we already have the unique names that distinguish our input classes (in the typol_dict dictionary in config.py) and our properties (in the prop_dict dictionary), and foresee using additional dictionaries going forward in this CWPK series. We only need to enter a directory root and the appropriate dictionary to loop over the unique terms associated with our various building blocks. For classes, the typology listing is a great lookup.

We will take our generic class build template from the last installment, and put it into a function that loops over opening our file set, running the routine, and then saving to our desired output location. For now, to get the logic right, I will just set this up as a wrapper before actually plopping in the full build loop routine. (Note: we have to import a couple of modules because we have not yet fully set the environment for today’s installment):

from cowpoke.config import *
import csv                                                

def class_builder(**build_deck):
    print('Beginning KBpedia class structure build . . .')
    r_default = ''
    r_label = ''
    r_iri = ''
# probably want the run specification here (see CWPK #35 for render in struct_extractor)
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    class_loop = build_deck.get('class_loop')
    base = build_deck.get('base')
    ext = build_deck.get('ext')
    if loop != 'class_loop':                                 # compare values with !=, not 'is not'
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)
        frag = loopval.replace('kko.','')
        in_file = (base + frag + ext)
        x = 1
        with open(in_file, mode='r', encoding='utf8') as input:                                           
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])                 
            for row in reader:
## Here is where we place the real class build routine                
                if x <= 2:
                    r_id = row['id']
                    r_parent = row['parent']
                    print(r_id, r_parent)
                    x = x + 1
        input.close()
        
class_builder(**build_deck)        

OK. We now know how to loop over our class build input files. Now, we can do Kernel → Restart & Clear Output → and then Restart and Clear All Outputs (which should be a familiar red button to you if using Jupyter Notebook) to get ourselves to a clean starting place, and begin setting up our structure build environment.

Setting Up the Build Environment

As before with our extract routines, we now have a build_deck dictionary of build configuration settings in config.py. If you see some unfamiliar switches as we proceed through this build process, you may want to inspect that file. The settings are pretty close analogs to the same types of settings for our extractions, as specified in the run_deck dictionary. Most all of this code will migrate to the new build module.
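To give a rough sense of the shape of these settings (the values shown here are placeholders for illustration, not the actual contents of config.py), the keys the routines below rely on look something like this:

build_deck = {                                            # illustrative placeholder values only
    'kb_src'    : 'standard',                             # which ontology source to load ('start', 'standard', 'sandbox')
    'loop_list' : typol_dict.values(),                    # the dictionary values to iterate over
    'loop'      : 'class_loop',                           # which kind of build loop this run performs
    'base'      : 'C:/1-PythonProjects/kbpedia/v300/build_ins/typologies/typol_',   # input file prefix
    'ext'       : '.csv',                                 # input file extension
    }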

We begin by importing our necessary modules and setting our file settings for the build:

Which environment? The specific load routine you should choose below depends on whether you are using the online MyBinder service or local files. The example below is based on using local files, which given the complexity of the routines that are emerging, is probably your better choice. Make sure to modify the URIs for your local directories.
from owlready2 import * 
from cowpoke.config import *
# from cowpoke.__main__ import *
import csv                                                
import types

world = World()

kb_src = every_deck.get('kb_src')                         # we get the build setting from config.py
#kb_src = 'standard'                                      # we can also do quick tests with an override

if kb_src is None:
    kb_src = 'standard'
if kb_src == 'sandbox':
    kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
elif kb_src == 'standard':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
elif kb_src == 'start':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kbpedia_rc_stub.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
else:
    print('You have entered an inaccurate source parameter for the build.')
skos_file = 'http://www.w3.org/2004/02/skos/core' 
    

We load our ontologies into owlready2 and set our namespaces:

kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')               

#skos = world.get_ontology(skos_file).load()
#kb.imported_ontologies.append(skos)
#core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

Since we’ve cleared memory and our workspace, we again add back in our new row_clean helper function:

def row_clean(value, iss=''):                                # arg values come from calling code
    if iss == 'i_id':                                        # check to see which replacement method
        value = value.replace('http://kbpedia.org/kko/rc/', 'rc.')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        return value                                         # returns the calculated value to calling code
    if iss == 'i_id_frag':
        value = value.replace('http://kbpedia.org/kko/rc/', '')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', '')
        return value
    if iss == 'i_parent':
        value = value.replace('http://kbpedia.org/kko/rc/', 'rc.')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        value = value.replace('owl:', 'owl.')
        return value
    if iss == 'i_parent_frag':
        value = value.replace('http://kbpedia.org/kko/rc/', '')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', '')
        value = value.replace('owl:', '')
        return value

Running the Complete Class Build

And then add our class build template to our new routine for iterating over all of our class input build files. CAUTION: processing all inputs to KBpedia (best done with the single assignment of the Generals typology, since all other typologies not already included in KKO are children of it) takes about 70 min on a conventional desktop.

You may notice that we made some slight changes to named variables in the draft template developed in the last installment:

  • src_file → in_file
  • csv_file → input

And, we have placed it into a defined function, class_struct_builder:

def class_struct_builder(**build_deck):                                    # Note 1
    print('Beginning KBpedia class structure build . . .')                 # Note 5
    kko_list = typol_dict.values()                                         # Note 2
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    class_loop = build_deck.get('class_loop')
    base = build_deck.get('base')
    ext = build_deck.get('ext')
    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)                              # Note 5
        frag = loopval.replace('kko.','')
        in_file = (base + frag + ext)
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])                 
            for row in reader:
                r_id = row['id'] 
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')                           # Note 3
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:                                       
                    is_first_row = False
                    continue      
                with rc:                                                
                    kko_id = None
                    kko_frag = None
                    if parent_frag == 'Thing':                                                        
                        if id in kko_list:                                
                            kko_id = id
                            kko_frag = id_frag
                        else:    
                            id = types.new_class(id_frag, (Thing,))       
                if kko_id != None:                                         
                    with kko:                                                
                        kko_id = types.new_class(kko_frag, (Thing,))  
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:                                                
                r_id = row['id'] 
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:
                    is_first_row = False
                    continue          
                with rc:
                    kko_id = None                                   
                    kko_frag = None
                    kko_parent = None
                    kko_parent_frag = None
                    if parent_frag != 'Thing':
                        if id in kko_list:
                            continue
                        elif parent in kko_list:
                            kko_id = id
                            kko_frag = id_frag
                            kko_parent = parent
                            kko_parent_frag = parent_frag
                        else:   
                            var1 = getattr(rc, id_frag)               
                            var2 = getattr(rc, parent_frag)
                            if var2 == None:                            
                                continue
                            else:                                
                                var1.is_a.append(var2)
                if kko_parent != None:                                         
                    with kko:                
                        if kko_id in kko_list:                               
                            continue
                        else:
                            var1 = getattr(rc, kko_frag)
                            var2 = getattr(kko, kko_parent_frag)                     
                            var1.is_a.append(var2)
        with open(in_file, 'r', encoding='utf8') as input:                 # Note 4
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:                                              
                r_id = row['id'] 
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:
                    is_first_row = False
                    continue
                if parent_frag == 'Thing': 
# This is the new code section, replacing the commented out below          # Note 4                   
                    var1 = getattr(rc, id_frag)
                    var2 = getattr(owl, parent_frag)
                    try:
                        var1.is_a.remove(var2)
                    except Exception:
#                        var1 = getattr(kko, id_frag)
#                        print(var1)
#                        var1.is_a.remove(owl.Thing)
#                        print('Last step in removing Thing')
                        continue
#                    print(var1, var2)
#                    if id in thing_list:                                     
#                        continue
#                    else:
#                        if id in kko_list:                                    
#                            var1 = getattr(kko, id_frag)
#                            thing_list.add(id)
#                        else:                                                 
#                            var1 = getattr(rc, id_frag)
#                            var2 = getattr(owl, parent_frag)
#                            if var2 == None:
#                                print('Empty Thing:')
#                                print('var1:', var1, 'var2:', var2)                            
#                            try:
#                                var1.is_a.remove(var2)
#                            except ValueError:
#                                print('PROBLEM:')
#                                print('var1:', var1, 'var2:', var2)                
#                                if len(thing_list) == 0:
#                                    print('thing_list is empty.')
#                                else:
#                                    print(*thing_list)
#                                break
#                        print(var1, var2)
#                        thing_list.append(id)
#                        thing_list.add(id)
    out_file = 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/build_stop.csv'
    with open(out_file, 'w', encoding='utf8') as f:
            print('KBpedia class structure build is complete.')
            f.write('KBpedia class structure build is complete.')                # Note 5
            f.close()

Our function call pulls up the same keyword argument passing that we discussed for the extraction routines earlier (1). The double asterisk (**build_deck) argument means to bring in any of that dictionary’s keyword values if referenced in the routine. We can readily pick up loop or lookup specifications by referencing a dictionary (2). The kko_list is a handy one since it gives us a basis for selecting between KKO objects and the reference concepts (RCs) in KBpedia. The revised routine above also brings in our new helper function (3).
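As a brief aside on that double-asterisk convention (generic names here, not cowpoke code), ** packs keyword arguments into a dictionary inside the function, and likewise unpacks a dictionary into keyword arguments when calling:

def show_kwargs(**kwargs):                                # ** packs the passed keywords into a dict
    print(kwargs.get('loop'), kwargs.get('ext'))

settings = {'loop': 'class_loop', 'ext': '.csv'}
show_kwargs(**settings)                                   # ** unpacks the dict back into keyword arguments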

Pretty much the next portions of the routine are as described in the last installment, until we come up to Pass #3 (4), which is where we hit a major roadblock (coming up around the next bend in the road). We also added some print statements (5) that give feedback when the routine is running.

To run this file locally you will need to have the cowpoke project installed and know where to find your build_ins/typologies directory. You also need to make sure your settings in config.py are properly set for your conditions. Assuming you have done so, you can invoke this routine (best with only a subset of your typology dictionary, assigned to, say, custom_dict):

class_struct_builder(**build_deck)

Realize everything has to be configured properly for this code to run. You will need to review earlier installments if you run into problems. Assuming you have gotten it to run to completion without error, you may want to then save it. We need to preface our ‘save’ statement with the ‘kb’ ontology identifier. I also have chosen to use the ‘working’ directory for saving these temporary results:

kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl', format='rdfxml') 

However, I ran into plenty of problems myself. Indeed, the large code block commented out above (4) caused me hours of fits trying to troubleshoot and get the routine to act as I wanted. This whole effort put up a roadblock in my plan, sufficient that I had to add another installment. I explain this detour next.

A Brief History of Going Crazy

If we set as an objective being able to specify multiple input files for a current build, a couple of issues immediately arise. Recall, we designed our typology extraction files to be self-contained, which means that every class used as an object must also be declared as its own class subject. To speed up our extractions, we do not keep track of the many objects so needing definitions. That means each encounter triggers the need for another class definition. Multiple duplicate declarations do not cause a problem when loading the ontology, but when used as specification inputs across multiple build passes, some tricky problems arise.

One obvious contributor to the difficulty is the need to identify and separately keep track of (and sometimes differentially process) our ‘kko’ and ‘rc’ namespaces. We need to account for this distinction in every loop and every assignment or removal that we make to the ontology while building it in memory. That can all be trapped for when in the class build cycle, which is the first two passes of the routine (first create the class, second add to parents), but gets decidedly tricky when removing the excess owl:Thing declarations.

To appreciate this issue a bit, here is the basic statement for removing a ‘Bird’ class from a parent ‘Reptile’:

rc.Bird.is_a.remove(rc.Reptile)

Our inputs can not be strings, but in loops our variables often become so, and they need to be evaluated back to their class types via getattr(), as in var1 = getattr(rc, id_frag) and var2 = getattr(rc, parent_frag).

Unfortunately, when we make a rc.Bird.is_a.remove(rc.Reptile) request once it has been previously removed, the relationship is empty and owlready2 throws an error (as does Python when trying to remove an undeclared object). So, while we are able to extract without keeping track, we eventually must keep track when it comes time to build. Thus, as each file is processed, we need to account for prior removals and make sure we do not make the request again.
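The plain-Python analog of the problem, and two ways to guard against it, looks like this (a minimal sketch with made-up values):

parents = ['rc.Reptile']
parents.remove('rc.Reptile')                              # first removal succeeds

if 'rc.Reptile' in parents:                               # option 1: check membership before removing again
    parents.remove('rc.Reptile')

try:                                                      # option 2: attempt the removal and trap the error
    parents.remove('rc.Reptile')
except ValueError:
    pass                                                  # already removed; move on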

The later part of the code listing above (4) kept processing most of the files well, but not when too many were processed. I had the curious error of seeing the routine fail on the first entry of some files. It appeared to me perhaps the list accumulator I was using to keep track of prior removals was limited in size in some manner (it is not) or some counter or loop was not being cleared or initialized in the right location. If it ain’t perfect, it don’t run.

As a newbie with no prior experience to fall back on, here are some of the things I looked at and tested in trying to debug this Pass #3 owl:Thing deletion routine:

  • memory – was it a memory problem? Well, there are some performance issues we continue with in the next installment, but, no, Python seems to grab the memory it needs and does (apparently) a fair job of garbage cleanup. It was also not a problem with the notebook memory
  • loops – there are lots of ways to initiate loops or iterate over different structures from lists, sets, dictionaries, length and counters, etc. How loops are called and incremented differ by the iterator type chosen. I suspect this is where the issue still resides, because I continue to not have a native feel for:
    • sets vs. lists (see the short comparison after this list)
    • clearing before loops
    • referencing the right loops
  • using the right fragment – the interplay of namespaces with scope is also not yet intuitive to me. Sometimes it is important to use the namespace prefixed reference to an object, other times not so. I am still learning about scope
  • not much worried about syntax because REPL was always running
  • list length limitations – I discussed this one above, as was able to eliminate it as the source
  • indentations – it is sometimes possible to put what one thinks is the closing statement to a routine at the wrong indentation, so that it runs, but is not affecting the correct code block. In my debugging efforts so far I often find this a source of the problem, especially when there is too much complexity or editing of the code. This is another reason to generalize duplicate code
  • code statement placement in order – in a similar way, counters and loop initializations can easily be placed into the wrong spots. The routine often may run, but still not do what you think it is, and
  • many others – I’m a newbie, right?
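For what it is worth, here is a minimal side-by-side of a list versus a set as an accumulator for already-processed items (names and values invented for illustration):

processed_list = []                                       # a list keeps duplicates and preserves order
processed_set = set()                                     # a set ignores duplicates; membership tests are fast

for item in ['rc.Bird', 'rc.Bird', 'rc.Mammal']:
    processed_list.append(item)
    processed_set.add(item)

print(processed_list)                                     # ['rc.Bird', 'rc.Bird', 'rc.Mammal']
print(processed_set)                                      # {'rc.Bird', 'rc.Mammal'} (order not guaranteed)
print('rc.Bird' in processed_set)                         # True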

It was so frustrating trying to get this correct because I could get most everything working like I wanted, but then perhaps the routine would fail in the midst of processing a long list or would complete, but, upon inspection, may have missed some items or treated them incorrectly.

What little I do know about such matters tells me to try to pinpoint and isolate the problem. When processing long lists, that means testing for possible error conditions and liberally sprinkling various print statements with different text and different echoing of current values to the screen. For example, in an else: condition of an if: statement, I might put a print like:

  print('In kko_list loop of None error trap:', var1, var2)

But pinpointing a problem does not indicate how to solve it, though it does help to narrow attention. I had done so in the routine above, but I was still erroring out of some files. Sometimes that would happen, but it was still unclear what the offending part might be. When Python errors like that, it provides an error message and traceback, but sometimes that information is cryptic. The failure point may occur any time after the last message to screen. Again, I was being pricked by needles in the haystack, but I still had not specifically found and removed them.

Error Trapping

I knew from my Python reading that it had a fairly good exception mechanism. Since print() statements were only taking me so far, I decided I needed to bite the bullet (for the needle pricks in my hand!) and start learning more about error trapping.

The basic approach for allowing a program to continue to run when an error condition is met is through the Python exception. It basically looks like this kind of routine:

statement1 = 1                                   # example values; the division below will fail
statement2 = 0
try:
    non_zero = statement1 / statement2
except Exception:                                # trap the error so the program can continue
    print('Oops, dividing by 0!')

I was exploring this more graceful way to treat errors when I realized, duh, that same approach also captured exactly what I was trying to accomplish with avoiding multiple deletions in the first place! That is, I could continue to ‘try’ to delete the next instance of the owl:Thing assignment, and if it had already been deleted (which caused it to throw an exception, that is, what I was trying to fix!), I could exit gracefully and move on. Further, this would allow me to embed specific print() statements at the exact point of failure.

After this aHa! I changed the code as shown above (4). I suspect it is a slow way to process the huge numbers I have, but it works. I will continue to look for better means, but at least with this approach I was able to move on with the project.

Still, whether for this reason or others not yet contemplated, once we start processing huge numbers with multiple KBpedia build files, I am seeing performance much slower than what I would like. We address those topics in the next installment, which will also cause us to detour still further before we can get back on track to completing our property structure additions to the build.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

The URI link reference to this post is: https://www.mkbergman.com/2379/cwpk-40-looping-and-multiple-structure-file-ingest/
Posted: September 21, 2020

Builds Are a More Complicated Workflow than Extractions

In installment CWPK #37 to this Cooking with Python and KBpedia series, we looked at roundtripping information flows and their relation to versions and broad directory structure. From a strict order-of-execution basis, one of the first steps in the build process is to check and vet the input files to make sure they are clean and conformant. Let’s keep that thought in mind, and it is a topic we will address in CWPK #45. But we are going to skip over that topic for now so that we do not add complexity to figuring out the basis of what we need from a backbone standpoint of a build.

By “build” I have been using a single term to skirt around a number of issues (or options). We can “build” our system from scratch at all times, which is the simplest conceptual one in that we go from A to Z, processing every necessary point in between. But, sometimes a “build” is not a wholesale replacement, but an incremental one. It might be incremental because we are simply adding, say, a single new typology or doing other interim updates. The routines or methods we need to write should accommodate these real-world use cases.

We also have the question of vetting our builds to ensure they are syntactically and logically coherent. The actual process of completing a vetted, ‘successful’ build may require multiple iterations and successively more refined tweaks to the input specifications in order to get our knowledge graphs to a point of ‘acceptance’. These last iterating steps of successive refinements all follow the same “build” steps, but ones which involve fewer and fewer fixes until ‘acceptance’ when the knowledge graph is deemed ready for public release. In my own experience following these general build steps over nearly a decade now, there may be tens or more “builds” necessary to bring a new version to public release. These steps may not be needed in your own circumstance, but a useful generic build process should anticipate them.

Further, as we explore these issues in some depth, we will find weaknesses or problems in our earlier extraction flows. We still want to find generic patterns and routines, but as we add these ones related to “builds” that will also cause some reachback into the assumptions and approaches we earlier used for extractions (CWPK #29 to CWPK #35). This installment thus begins with input-output (I/O) considerations from this “build” perspective.

Basic I/O Considerations

It seems like I spend much time in this series on code and directory architecting. I’m sure one reason is that it is easier for me to write than code! But, seriously, it is also the case that thinking through use cases or trying to capture prior workflows is the best way I know to write responsive code. So, in the case of our first leg regarding extraction, we had a fairly simple task: we had information in our knowledge graphs that we needed to get out in a usable file format for external applications (and roundtripping). If our file names and directory locations were not exactly correct, no big deal: We can easily manipulate these flat-text files and move them to other places as needed.

That is not so with the builds. First, we can only ingest our external files used in a build that can be read and “understood” by our knowledge graph. Second, as I touched on above, sometimes these builds may be total (or ‘full’ or ‘start’), sometimes they may be incremental (or ‘fix’). We want generic input-and-output routines to reflect these differences.

In a ‘full’ build scenario, we need to start with a core, bootstrap, skeletal ontology to which we add concepts (classes) and predicates (properties) until the structure is complete (or nearly so), and then to add annotations to all of that input. In a ‘fix’ build scenario, we need to start with an already constructed ‘full’ graph, and then make modifications (updates, deletes, additions) to it. How much cleaning or logical testing we may do may vary by these steps. We also need to invoke and write files to differing locations to embrace these options.

To make this complexity a bit simpler, like we have done before, we will make a couple of simplifying choices. First, we will use a directory where all build input files reside, which we call build_ins. This directory is the location where we first put the files extracted from a prior version to be used as the starting basis for the new version (see Figure 2 in CWPK #37). It is also the directory where we place our starting ontology files, the stubs, that bootstrap the locations for new properties and classes to be added. We also place our fixes inputs into this directory.

Second, the result of our various build steps will generally be placed into a single sub-directory, the targets directory. This directory is the source for all completed builds used for analysis and extractions for external uses and new builds. It is also the source of the knowledge graph input when we are in an incremental update or ‘fix’ mode, since we desire to modify the current build in-progress, not always start from scratch. The targets directory is also the appropriate location for logging, statistics, and working ‘scratchpad’ subdirectories while we are working on a given build.

To this structure I also add a sandbox directory for experiments, etc., that do not fall within a conventional build paradigm. The sandbox material can either be total scratch or copied manually to other locations if there is some other value.

Please see Figure 2 in CWPK #37 to see the complete enumeration of these directory structures.

Basic I/O Routines

Similar to what we did with the extraction side of the roundtrip, we will begin our structural builds (and the annotation ones two installments hence) in the interactive format of Jupyter Notebook. We will be able to progress cell-by-cell, running these cells either with the Run button or with the shift+enter convention. After our cleaning routines in CWPK #45, we will then be able to embed these interactive routines into build and clean modules in CWPK #47 as part of the cowpoke package.

From the get-go with the build module we need to have a more flexible load routine for cowpoke that enables us to specify different sources and targets for the specific build, the inputs-outputs, or I/O. We had already discovered in the extraction routines that we needed to bring three ontologies into our project namespace: KKO, the reference concepts of KBpedia, and SKOS. We may also need to differentiate ‘start’ vs. ‘fix’ wrinkles in our builds. That leads to three different combinations of source and target for our basic “build” I/O, namely ‘standard’ (same as ‘fixes’), ‘start’, and our optional ‘sandbox’:

Which environment? The specific load routine you should choose below depends on whether you are using the online MyBinder service (the ‘raw’ version) or local files. The example below is based on using local files (though replace with your own local directory specification).
from owlready2 import * 
from cowpoke.config import *
# from cowpoke.__main__ import *
import csv                                                # we import all modules used in subsequent steps

world = World()

kko = []
kb = []
rc = []
core = []
skos = []
kb_src = build_deck.get('kb_src')                         # we get the build setting from config.py

if kb_src is None:
    kb_src = 'standard'
if kb_src == 'sandbox':
    kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
elif kb_src == 'standard':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
elif kb_src == 'start':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kbpedia_rc_stub.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
else:
    print('You have entered an inaccurate source parameter for the build.')
skos_file = 'http://www.w3.org/2004/02/skos/core' 
    

(NOTE: We later add an ‘extract’ option to the above to integrate this with our earlier extraction routines.)

As I covered in CWPK #21, two tricky areas in this project are related to scope. The first tricky area relates to the internal Python scope rules of LEGB, which stands for local → enclosing → global → built-in, and means that objects declared at levels to the right are available to code at levels to the left, but not the reverse. Care is thus needed about how information gets passed between Python program components. So, yes, a bit of that trickiness is in play with this installment, but the broader issues pertain to the second tricky area.
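A tiny illustration of that ordering (generic names, nothing to do with cowpoke):

x = 'global'                                 # global scope

def outer():
    x = 'enclosing'                          # enclosing scope
    def inner():
        x = 'local'                          # local scope wins the name lookup
        return x
    return inner()

print(outer())                               # 'local'
print(x)                                     # still 'global'; the inner assignments never flow outward
print(len('abc'))                            # built-in names such as len() are visible everywhere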

The second area is the interplay of imports, ontologies, and namespaces within owlready2, plus its own internal ‘world’ namespace.

I have struggled to get these distinctions right, and I’m still not sure I have all of the nuances down or correct. But, here are some things I have learned in cowpoke.

First, when loading an ontology, I give it a ‘world’ namespace assigned to the ‘World’ internal global namespace for owlready2. Since I am only doing cowpoke-related development in Python at a given time, I can afford to claim the entire space and perhaps lessen other naming problems. Maybe this is superfluous, but I have found it to be a recipe that works for me.

Second, when one imports an ontology into the working ontology (declaring the working ontology being step one), all ontologies available to the import are available to the working ontology. However, if one wants to modify or add items to these imported ontologies, each one needs to be explicitly declared, as is done for skos and kko in our current effort.

Third, it is essential to declare the namespaces for these imports under the current working ontology. Then, from that point forward, it is also essential to be cognizant that these separate namespaces need to be addressed explicitly. In the case of cowpoke and KBpedia, for example, we have classes from our governing upper ontology, KKO (also with namespace ‘kko‘) and the reference concepts of the full KBpedia (namespace ‘rc‘). More than one namespace in the working ontology does complicate matters quite a bit, but that is also the more realistic architecture and design approach. Part of the nature of semantic technologies is to promote interoperability among multiple knowledge graphs or ontologies, each of which will have at least one of its own namespaces. To do meaningful work across ontologies, it is important to understand these ontology ← → namespace distinctions.

This is how these assignments needed to work out for our build routines based on these considerations:

kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')                # need to make sure we set the namespace

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')       # need to assign namespace to main onto ('kb')

Now that we have set up our initial build switches and defined our ontologies and related namespaces, we are ready to construct the code for our first build attempt. In this instance, we will be working with only a single class structure input file to the build, typol_AudioInfo.csv, which according to our ‘start’ build switch (see above) is found in the kbpedia/v300/build_ins/typologies/ directory under our project location.

The routine below needs to go through three different passes (at least as I have naively specified it!), and is fairly complicated. There are quite a few notes below the code listing explaining some of these steps. Also note we will be defining this code block as a function and the import types statement will be moved to the header in our eventual build module:

import types

src_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/typologies/typol_AudioInfo.csv'
kko_list = typol_dict.values()
with open(src_file, 'r', encoding='utf8') as csv_file:                 # Note 1
    is_first_row = True
    reader = csv.DictReader(csv_file, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])                 
    for row in reader:                                                 ## Note 2: Pass 1: register class
        id = row['id']                                                 # Note 3
        parent = row['parent']                                         # Note 3
        id = id.replace('http://kbpedia.org/kko/rc/', 'rc.')           # Note 4
        id = id.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        id_frag = id.replace('rc.', '')
        id_frag = id_frag.replace('kko.', '')
        parent = parent.replace('http://kbpedia.org/kko/rc/', 'rc.') 
        parent = parent.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        parent = parent.replace('owl:', 'owl.')
        parent_frag = parent.replace('rc.', '')
        parent_frag = parent_frag.replace('kko.', '')
        parent_frag = parent_frag.replace('owl.', '')
        if is_first_row:                                               # Note 5
            is_first_row = False
            continue      
        with rc:                                                       # Note 6
            kko_id = None
            kko_frag = None
            if parent_frag == 'Thing':                                 # Note 7                               
                if id in kko_list:                                     # Note 8
                    kko_id = id
                    kko_frag = id_frag
                else:    
                    id = types.new_class(id_frag, (Thing,))            # Note 6
        if kko_id != None:                                             # Note 8
            with kko:                                                  # same form as Note 6
                kko_id = types.new_class(kko_frag, (Thing,))  
with open(src_file, 'r', encoding='utf8') as csv_file:
    is_first_row = True
    reader = csv.DictReader(csv_file, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
    for row in reader:                                                 ## Note 2: Pass 2: assign parent
        id = row['id']
        parent = row['parent']
        id = id.replace('http://kbpedia.org/kko/rc/', 'rc.')           # Note 4
        id = id.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        id_frag = id.replace('rc.', '')
        id_frag = id_frag.replace('kko.', '')
        parent = parent.replace('http://kbpedia.org/kko/rc/', 'rc.') 
        parent = parent.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        parent = parent.replace('owl:', 'owl.')
        parent_frag = parent.replace('rc.', '')
        parent_frag = parent_frag.replace('kko.', '')
        parent_frag = parent_frag.replace('owl.', '')
        if is_first_row:
            is_first_row = False
            continue          
        with rc:
            kko_id = None                                              # Note 9
            kko_frag = None
            kko_parent = None
            kko_parent_frag = None
            if parent_frag != 'Thing':                                 # Note 10
                if parent in kko_list:
                    kko_id = id
                    kko_frag = id_frag
                    kko_parent = parent
                    kko_parent_frag = parent_frag
                else:   
                    var1 = getattr(rc, id_frag)                        # Note 11
                    var2 = getattr(rc, parent_frag)
                    if var2 == None:                                   # Note 12
                        continue
                    else:
                        var1.is_a.append(var2)                         # Note 13
        if kko_parent != None:                                         # Note 14        
            with kko:                
                if kko_id in kko_list:                                 # Note 15
                    continue
                else:
                    var1 = getattr(rc, kko_frag)                       # Note 16
                    var2 = getattr(kko, kko_parent_frag)
                    var1.is_a.append(var2)
thing_list = []                                                        # Note 17
with open(src_file, 'r', encoding='utf8') as csv_file:
    is_first_row = True
    reader = csv.DictReader(csv_file, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
    for row in reader:                                                 ## Note 2: Pass 3: remove owl.Thing
        id = row['id']
        parent = row['parent']
        id = id.replace('http://kbpedia.org/kko/rc/', 'rc.')           # Note 4
        id = id.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        id_frag = id.replace('rc.', '')
        id_frag = id_frag.replace('kko.', '')
        parent = parent.replace('http://kbpedia.org/kko/rc/', 'rc.') 
        parent = parent.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        parent = parent.replace('owl:', 'owl.')
        parent_frag = parent.replace('rc.', '')
        parent_frag = parent_frag.replace('kko.', '')
        parent_frag = parent_frag.replace('owl.', '')
        if is_first_row:
            is_first_row = False
            continue
        if parent_frag == 'Thing':                                     # Note 18
            if id in thing_list:                                       # Note 17
                continue
            else:
                if id in kko_list:                                     # Note 19
                    var1 = getattr(kko, id_frag)
                    thing_list.append(id)
                else:                                                  # Note 19
                    var1 = getattr(rc, id_frag)
                    var1.is_a.remove(owl.Thing)
                    thing_list.append(id)

The code block above was the most challenging to date in this CWPK series. Some of the lessons from working this out are offered in CWPK #21. Here are the notes that correspond to some of the statements made in the code above:

  1. This is a fairly standard CSV processing routine. However, note the ‘fieldnames’ that are assigned, which give us a basis as the routine proceeds to pick out individual column values by row

  2. Each file processed requires three passes: Pass #1 – registers each new item in the source file as a bona fide owl:Class; Pass #2 – each new item, now properly registered to the system, is assigned its parent class; and Pass #3 – each of the new items has its direct assignment to owl:Thing removed to provide a cleaner hierarchy layout

  3. We are assigning each row value to a local variable for processing during the loop

  4. In this, and in the lines to follow, we are reducing the class string and its parent string from potentially its full IRI string to prefix + Name. This gives us the flexibility to have different format input files. We will eventually pull this code, repeated in each loop, out into its own function

  5. This is a standard approach in CSV file processing to skip the first header row in the file

  6. There are a few methods apparently possible in owlready2 for assigning a class, but this form of looping over the ontology using the ‘rc‘ namespace is the only version I was able to get to work successfully, with the assignment statement as shown in the second part of this method. Note the assignment to ‘Thing’ is in the form of a tuple, which is why there is a trailing comma

  7. Via this check, we only pick up the initial class declarations in our input file, and skip over all of the others that set actual direct parents (which we deal with in Pass #2)

  8. We check all of our input rows to see if the row class is already in our kko dictionary (kko_list, set above the routine) or not. If it is a kko.Class, we assign the row information to a new variable, which we then process outside of the ‘rc’ loop so as to not get the namespaces confused

  9. Initializing all of this loop’s variables to ‘None’

  10. Same processing checks as for Pass #1, except now we are checking on the parent values

  11. This is an owlready2 tip, and a critical one, for getting a class type value from a string input; without this, the class assignment method (Note 13) fails

  12. If var2 is not in the ‘rc’ namespace (in other words, it is in ‘kko’), we skip the parent assignment in the ‘rc’ loop

  13. This is another owlready2 method for assigning a class to a parent class. In this loop given the checks performed, both parent and id are in the ‘rc‘ namespace

  14. As for Pass #1, we are now processing the ‘kko‘ namespace items outside of the ‘rc‘ namespace and in its own ‘kko‘ namespace

  15. We earlier picked up rows with parents in the ‘kko‘ namespace; via this call, we also exclude rows with a ‘kko‘ id as well, since our imported KKO ontology already has all kko class assignments set

  16. We use the same parent class assignment method as in Note #11, but now for ids in the ‘rc‘ namespace and parents in the ‘kko‘ namespace. However, the routine so far also results in a long listing of classes directly under owl:Thing root (1) in an ontology editor such as Protégé:

Figure 1: Class Import with Duplicate owl:Thing Assignments
  17. We use a ‘thing_list’, and assign it as an empty list at the beginning of the Pass #3 routine, because we will be deleting class assignments to owl:Thing. There may be multiple declarations in our build file, but we may only delete the assignment once from the knowledge base. The lookup to ‘thing_list’ prevents us from erroring when trying to delete a second or further time

  18. We are selecting on ‘Thing’ because we want to unassign all of the temporary owl:Thing class assignments needed to provide placeholders in Pass #1 (Note: recall in our structure extractor routines in CWPK #28 we added an extra assignment to add an owl:Thing class definition so that all classes in the extracted files could be recognized and loaded by external ontology editors)

  19. We differentiate between ‘rc’ and ‘kko’ concepts because the kko concepts are defined separately in the KKO ontology, used as one of our build stubs.

As you run this routine in real time from Jupyter Notebook, you can see what has been removed by listing:

list(thing_list)

We can now inspect this loading of an individual typology into our stub. We need to preface our ‘save’ statement with the ‘kb’ ontology identifier. I also have chosen to use the ‘working’ directory for saving these temporary results:

kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/build_ins/working/kbpedia_reference_concepts.owl', format="rdfxml") 

So, phew! After much time and trial, I was able to get this code running successfully! Here is the output of the full routine:

Figure 2: Class Import with Proper Hierarchical Placement

We can see that our flat listing under the root is now gone (1) and all concepts are properly organized according to the proper structure in the KKO hierarchy (2).

We now have a template for looping over multiple typologies to contribute to a build as well as to bring in KBpedia’s property structure. These are the topics of our next installment.

Additional Documentation

There were dozens of small issues and problems that arose in working out the routine above. Here are some resources that were especially helpful in informing that effort:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on September 21, 2020 at 10:16 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2378/cwpk-39-i-o-and-structural-ingest/
Posted:September 17, 2020

This Installment Defines KBpedia’s ‘Bootstraps’

We begin the build process with this installment in the Cooking with Python and KBpedia series. We do so by creating the ‘bootstraps‘ of ‘core’ ontologies that are the targets as we ingest new classes, properties, and annotations for KBpedia. The general process we outline herein is appropriate to building any large knowledge graph. You may swap out your own starting ontology and semantic scaffoldings to apply this process to different knowledge graphs.

In computer science, a ‘bootstrap’ is a core set of rudimentary instructions called at the immediate initialization of a program. This bootstrapped core provides all of the parent instructions called by the subsequent applications that actually do the desired computing tasks. The bootstrap is the way those applications can perform basic operations like allocating registers, creating files, pushing or popping instructions to the stack, and other low-level functions.

In the case of KBpedia and our approach to the build process for knowledge graphs, the ‘bootstrap’ is the basic calls to the semantic languages such as RDF or OWL and the creation of a top-level parental set of classes and properties to which we connect the subsequent knowledge graph content. We call these starting bootstraps ‘stubs’.

These ‘stubs’ are created outside of the build process, generally using an ontology IDE like Protégé. In our case, we have already created the ‘stubs’ used in the various KBpedia build processes. As we create new versions, we must make some minor modifications to these ‘stubs’. However, in general, the stubs are rather static in nature and may only rarely need to be changed in a material manner. As you will see from inspection, these stubs are minimal in structure and rather easy to create on your own with your own favorite ontology editor.

The KBpedia build processes use one core ontology stub, the KBpedia Knowledge Ontology (KKO), and two supporting stubs for use in building the full KBpedia knowledge graph or individual typologies.

Overview of the Build Process

We set up a new directory structure with appropriate starting files as the first activity. The build proper starts with a pre-ingest step of checking input files for proper encoding and running other ‘cleaning’ tests. Upon passing these checks, we are ready to continue with the build.

The build process begins by loading the stub. This loaded stub then becomes the target for all subsequent ingest steps.

The ingest process has two phases. In the first phase we ingest build files that specify the structural nature of the knowledge graph, in this case, KBpedia. This structural scaffolding consists of, first, class statements, and then object property or data property ‘is-a’ statements. In the case of classes, the binding predicate is the rdfs:subClassOf property. In the case of properties, it is the rdfs:subPropertyOf property.
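As a quick illustration of what these two binding predicates look like when asserted with owlready2, here is a sketch; the ontology IRI and the class and property names are hypothetical placeholders, not KBpedia identifiers:

from owlready2 import get_ontology, Thing, ObjectProperty

onto = get_ontology('http://example.org/demo#')      # throwaway ontology for illustration only

with onto:
    class Animal(Thing): pass
    class Mammal(Animal): pass                       # equivalent to an rdfs:subClassOf assertion

    class relatedTo(ObjectProperty): pass
    class parentOf(relatedTo): pass                  # equivalent to an rdfs:subPropertyOf assertion

print(Mammal.is_a)                                   # roughly: [demo.Animal]
print(parentOf.is_a)                                 # roughly: [demo.relatedTo]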

This phase sets the structure over which we can reason and infer with the knowledge graph. Thus, we also have the optional steps in this phase to check whether our ingests have been consistent and satisfiable. If the structural scaffolding meets these tests, we are ready for the second phase.

The second phase is to bring in the many annotations that we have gathered for the classes and properties. A description and preferred label are requirements for each item. These are best supplemented with alternative labels (synonyms in the broadest sense) and other properties. We can then load either mapping or additional annotation properties should we desire them.

These steps are not inviolate. Files that we know are clean can skip the pre-clean steps, for example. Or, we may already have a completed and vetted knowledge graph to which we only want to supplement some information. In other words, the build routines can also be used in different orders and with only partial input sets once we have a working system.

Steps to Prep

We will assume that you have already done your offline work to add to or modify your build input files. (As we proceed installment-by-installment during this build discussion we will provide a listing of required files as appropriate.) Depending on the given project, working on these offline build files may actually represent the bulk of your overall efforts. You might be querying outside sources to add to annotations, or changing or adding to your knowledge graph’s structure, or trying new top-level ontologies, etc., etc.

Once you deem this offline work to be complete, you need to do some prep to support the new build process (which in the simplest case are the extraction files we just discussed in this CWPK series). Your first task is to create a new skeletal directory structure under a new version parent, similar to what is shown in Figure 2 in the prior CWPK #37 installment. One way to avoid typing in all new directory names is to copy a prior version directory to the new version location, and then delete irrelevant files; see the sketch below. (Further, if you know you may do this multiple times, you may then copy this shell structure for later use for subsequent versions.)
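Here is a rough sketch of one way to script that copy-then-prune step; the paths are illustrative, and the helper simply copies the folder tree without any of the bulk files:

import os, shutil

src = r'C:\1-PythonProjects\kbpedia\v250'            # prior version (illustrative path)
dst = r'C:\1-PythonProjects\kbpedia\v300'            # new version skeleton

def dirs_only(directory, contents):
    # tell copytree to ignore every plain file, leaving just the directory skeleton
    return [c for c in contents
            if not os.path.isdir(os.path.join(directory, c))]

shutil.copytree(src, dst, ignore=dirs_only)          # copies the folder tree, no files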

You then need to copy over all of the prior stub files from the prior version to the new ‘stub’ directory. Depending on what you have been doing locally, you may need to make further changes to mirror your needed work preferences.

Each stub file then needs to be brought into an ontology editor (Protégé, of course, in our case) and updated for the new version number, as this diagram indicates:

Figure 1: Making Version Changes to KKO

Note that every ontology has a base IRI, and you should update the reference or version number (http://kbpedia.org/kbpedia/v250 in our case) (1) in the ontology URI field. You then need to copy the text under your current owl:versionInfo annotation, and paste it into a new owl:priorVersion (2) annotation. You may need to make some minor editing changes to reflect past tense for the prior version. Then, last, you need to update the owl:versionInfo (3) annotation.

You may, of course, make other ontology metadata changes at this time.
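If you prefer to make the base IRI change programmatically rather than in Protégé, a crude but serviceable sketch is a straight text substitution on the stub file; the path and version IRIs below are only examples:

stub = r'C:\1-PythonProjects\kbpedia\v300\build_ins\stubs\kko.owl'   # per our directory layout

with open(stub, 'r', encoding='utf8') as f:
    content = f.read()

# bump the version reference in the ontology IRI; strings shown are examples only
content = content.replace('http://kbpedia.org/kbpedia/v250',
                          'http://kbpedia.org/kbpedia/v300')

with open(stub, 'w', encoding='utf8') as f:
    f.write(content)

The owl:priorVersion and owl:versionInfo annotations involve free text, so those are still easier to edit by hand in Protégé.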

KKO: The Core Stub

The KKO stub is the core one for the build process. It represents its own standalone ontology, but also is the top-level ontology used by KBpedia.

KKO is also the most likely of the three stubs to need modification before a new run. Recall that KKO is organized under three main branches corresponding to the universal categories of Charles Sanders Peirce. Two of the branches, Monads and Particulars, do not participate in a KBpedia build. (Future releases of KKO may affect these branches, in which case the KKO stub should be updated.) But the third branch, Generals, is very much involved in a KBpedia build. All roots (parents) of KBpedia’s typologies tie in under the Generals branch.

You will need, then, to make changes to the Generals of KKO prior to starting a build if any of these conditions is met:

  1. You are dropping or removing any typologies or SuperTypes, or
  2. You are adding any typologies or SuperTypes.

If you are only modifying a typology, you need not change KKO. Loading the modified typology during the full build process will accomplish this modification.

Like the other two stubs, you also need to make sure you have updated your version references. As distributed with cowpoke as part of these CWPK installments, here is the KKO stub as used in this project (remember, to see the file choose Run from the notebook menu or press shift+enter when highlighting the cell):

Note: You may obtain the three ‘stub’ files used in this installment from https://github.com/Cognonto/CWPK/tree/master/sandbox/builds/stubs. Make sure and use the ones with the *.owl extension.
with open(r'C:\1-PythonProjects\kbpedia\v300\build_ins\stubs\kko.owl', 'r', encoding='utf8') as f:
    print(f.read())

The KBpedia Stub

The KBpedia stub is the ‘umbrella’ above the entire project. It incorporates the KKO stub, plus is the general target for all subsequent build steps in the full-build process. When looked at in code view, as the file below shows, this ‘umbrella’ is rather sparse. However, if you look at it in, say, Protégé, you will also see all of KKO due to its being imported.

Again, the KBpedia stub should have its version updated prior to a new version build:

with open(r'C:\1-PythonProjects\kbpedia\v300\build_ins\stubs\kbpedia_rc_stub.owl', 'r', encoding='utf8') as f:
    print(f.read())

The Typology Stub

The typology stub is the simplest of the three. Its use is merely to provide a ‘header’ sufficient for loading an individual typology into an editor such as Protégé.

However, despite being listed last, it is the typology stub we will first work with in developing our build routines, because it is our simplest possible starting point. Again, assuming you have made your version updates, here is the file:

with open(r'C:\1-PythonProjects\kbpedia\v300\build_ins\stubs\typology_stub.owl', 'r', encoding='utf8') as f:
    print(f.read())

OK, so our stubs are now updated and set up. We are ready to begin some ingest coding . . . .

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on September 17, 2020 at 10:30 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2377/cwpk-38-stubs-and-starting-files/
Posted:September 16, 2020

Refining Plans and Directories to Complete the Roundtrip

This installment begins a new major part in our Cooking with Python and KBpedia series. Starting with the format of our extraction files, which can come directly from our prior extraction routines or from other local editing or development efforts, we are now in a position to build a working KBpedia from scratch, and then to test it for consistency and satisfiability. This major part, of all parts in this CWPK series, is the one that most closely reflects our traditional build routines of KBpedia using Clojure.

But this current part is also about more than completing our roundtrip back to KBpedia. For, in bringing new assignments to our knowledge graph, we must also test it to ensure it is encoded properly and that it performs as promised. These additional requirements also mean we will be developing more than a build routine module in this part. The way we are structuring this effort will also add a testing module and a cleaning module (for checking encodings and the like). There are a dozen installments, including this one, in this part to cover this ground.

The addition of more modules, with the plan of still more to come thereafter, also compels us to look at how we are architecting our code and laying out our files. Thus, besides code development, we need to pay attention to organizational matters as well.

Starting a New Major Part

I am pretty pleased with how the cowpoke extraction module turned out, so I will be following something of the same pattern to build KBpedia in this part. Since we made the early call to bootstrap our project from the small, core top-level KBpedia Knowledge Ontology (KKO), we gained a lot of simplification. That is a good trade-off, since KKO itself is a value-neutral top-level ontology built from the semiotic perspective of C.S. Peirce regarding knowledge representation. Our basic design can also be adapted to any other top-level ontology. If that is your desire, how to bring in a different TLO is up to you to figure out, though I hope that would be pretty easy following the recipes in these CWPK installments.

Fortunately, as we make the turn to build routines in this part of the roundtrip, we are walking ground that we have been traveling for nearly a decade. We understand the build process and we understand the testing and acceptance criteria necessary to result in quality, publicly released knowledge graphs. We try to bring these learnings to our functions in this part.

But as I caution at the bottom of each of these installments, I am learning Python myself through this process. I am by no means a knowledgeable programmer, let alone an expert. I am an interested amateur who has had the good fortune to have worked with some of the best developers imaginable, and I only hope I picked up little bits of some good things here and there about how to approach a coding project. Chances are great you can improve on the code offered. I also, unfortunately, do not have the coding experience to provide commercial-grade code. Errors that should be trapped are likely not, cases that need to be accommodated are likely missed, and generalization and code readability are likely not what they could be. I’ll take that, if the result is to help others walk these paths at a better and brisker pace. Again, I hope you enjoy . . .

Organizing Objectives

I have learned some common-sense lessons through the years about how to approach a software project. One of those lessons, obvious through this series itself, is captured by John Bytheway’s quote, “Inch by inch, life’s a cinch. Yard by yard, life’s hard.” Knowing where you want to go and taking bite-sized chunks to get there almost always leads to some degree of success if you are willing to stick with the journey.

Another lesson is to conform to community practice. In the case of Python (and most modern languages, I assume), applications need to be ‘packaged’ in certain ways such that they can be readily brought into the current computing environment. From the source-code perspective, this means conforming to the ‘package’ standard and the organization of code into importable modules. All of this suggests a code base that is kept separate from any project that uses it, and organized and packaged in a way similar to other applications in that language.

An interesting lesson about knowledge graphs is that they are constantly changing — and need to do so. In this regard, knowledge artifacts are not terribly different than software artifacts. Both need to be updated frequently such that versioning and version control are essential. Versioning tells users the basis for the artifact; version control helps to maintain the versions and track differences between releases. Wherever we decide to store our artifacts, they should be organized and packaged such that different versions may be kept integral. Thus, we organize our project results under version number labels.

Then, in terms of this particular project where roundtripping is central and many outputs are driven from the KBpedia knowledge structure, I also thought it made sense to establish separate tracks between inputs (the ‘build’ side) and outputs (the ‘extraction’ side) and to seek parallelisms between the tracks where it makes sense. This informs much of the modular architecture put forward.

All of this needs to be supplemented with utilities for testing, logging, error trapping and messaging, and statistics. These all occur at the time of build or use, and so belong to this leg of our roundtrip. We will not develop code for all of these aspects in this part of the CWPK series, but we will try to organize in anticipation of these requirements for a complete project. We’ll fill in many of those pieces in major parts to come.

Anticipated Directory Structures

These considerations lead to two different directory structures. The first is where the source code resides, and is in keeping with typical approaches to Python packages. Two modules are in progress, at least at the design level, to complete our roundtripping per the objectives noted above. Two modules for logging and statistics are likely to get started in this current part. And another three are anticipated for efforts with KBpedia to come before we complete this CWPK series. Here is that updated directory structure, with the new modules flagged in the listing:

|-- PythonProject
    |-- Python
        |-- [Anaconda3 distribution]
        |-- Lib
            |-- site-packages
                |-- [many]
                |-- cowpoke
                    |-- __init__.py
                    |-- __main__.py
                    |-- analytics.py     # anticipated new module
                    |-- build.py         # in-progress module
                    |-- clean.py         # in-progress module
                    |-- config.py
                    |-- embeddings.py    # anticipated new module
                    |-- extract.py
                    |-- graphics.py      # anticipated new module
                    |-- logs.py          # likely new module
                    |-- stats.py         # likely new module
                    |-- utils.py         # likely new module
                |-- More
    |-- More
Figure 1: cowpoke Source Code (‘Package’) Directory Structure

In contrast, we need a different directory structure to host our KBpedia project, in which inputs to building KBpedia (‘build_ins’) are in one main branch, the results of a build (‘targets’) are in another main branch, ‘extractions’ from the current version are in a third, and a fourth (‘outputs’) holds the results of post-build use of the knowledge graph. Further, these four main branches are themselves listed under their respective version number. This design means individual versions may be readily zipped or shared on GitHub in a versioned repository.

(NB: The ‘sandbox’ directory below is one we have referenced many times in this CWPK series, and it is unique to this series. It houses some of the example starting files needed for this series. We will continue to use the ‘sandbox’ as one of our main directory options.)

As of this current stage in our work, here then is how the project-related directory structures currently look for KBpedia:

|-- PythonProject
    |-- kbpedia
        |-- sandbox
        |-- v250
            |-- etc.
        |-- v300
            |-- build_ins
                |-- classes
                    |-- classes_struct.csv
                    |-- classes_annot.csv
                |-- fixes
                    |-- TBD
                    |-- TBD
                |-- mappings
                    |-- etc.
                    |-- etc.
                |-- ontologies
                    |-- kbpedia-reference-concepts.owl
                    |-- kko.owl
                |-- properties
                    |-- annotation_properties
                    |-- data_properties
                    |-- object_properties
                |-- stubs
                    |-- kbpedia-reference-concepts.owl
                    |-- kko.owl
                |-- typologies
                    |-- ActionTypes
                    |-- Agents
                    |-- etc.
                |-- working
                    |-- TBD
                    |-- TBD
            |-- extractions
                |-- classes
                    |-- classes_struct.csv
                    |-- classes_annot.csv
                |-- mappings
                    |-- etc.
                    |-- etc.
                |-- properties
                    |-- annotation_properties
                    |-- data_properties
                    |-- object_properties
                |-- typologies
                    |-- ActionTypes
                    |-- Agents
                    |-- etc.
            |-- outputs
                |-- analytics
                    |-- TBD
                    |-- TBD
                |-- embeddings
                    |-- TBD
                    |-- TBD
                |-- training_sets
                    |-- TBD
                    |-- TBD
            |-- targets
                |-- logs
                    |-- TBD
                    |-- TBD
                |-- mappings
                    |-- etc.
                    |-- etc.
                |-- ontologies
                    |-- kbpedia-reference-concepts.owl
                    |-- kko.owl
                |-- stats
                    |-- TBD
                    |-- TBD
                |-- typologies
                    |-- ActionTypes
                    |-- Agents
                    |-- etc.
    |-- Python
        |-- etc.
        |-- etc.
    |-- More
Figure 2: KBpedia Project Directory Structure (by version)

You will notice that nearly all ‘extractions’ categories are in the ‘build’ categories as well, reflecting the roundtrip nature of the design. Some of the output categories remain a bit speculative. This area is likely the one to see further refinement as we proceed.

Some of the directories shown, such as ‘analytics’, ‘embeddings’, ‘mappings’, and ‘training_sets’, are placeholders for efforts to come. One directory, ‘working’, is a standard one we have adopted over the years to place all of the background working files (some of an intermediate nature leading to the formal build inputs) in one location. Thus, as we progress version-to-version, we can look to this directory to help remind us of the primary activities and changes that were integral to that particular build. When it comes time for a public release, we may remove some of these working or intermediate directories from what is published at GitHub, but we retain this record locally to help document our prior work.

Overall, then, what we have is a build that begins with extractions from a prior build or starting raw files, with those files being modified within the ‘build_ins’ directory during development. Once the build input files are ready, the build processes are initiated to write the new knowledge graph to the ‘targets’ directory. Once the build has met all logic and build tests, it is then the source for new ‘extractions’ to be used in a subsequent build, or to conduct analysis or staging of results for other ‘outputs’. Figure 3 is an illustration of this workflow:

Figure 3: Workflow by Version and Directory Structure

A new version thus starts by copying the directory structure to a new version branch, and copying over the extractions and stubs from the prior version.

We now have the framework for moving on to the next installment in our CWPK series, wherein we begin the return leg of the roundtrip, what is shown as the ‘build_ins’ → ‘targets’ path in Figure 3.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on September 16, 2020 at 10:41 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2376/cwpk-37-organizing-the-code-base/
Posted:September 15, 2020

vlookup is Your Friend

One major benefit of large-scale extracts from KBpedia (or any knowledge graph, for that matter) is to produce bulk files that may be manipulated offline more effectively than working directly with the ontology or with an ontology IDE like Protégé. We address bulk manipulation techniques and tips in this current installment in the Cooking with Python and KBpedia series. This current installment also wraps up our mini-series on the cowpoke extraction module as well as completes our third major part on extraction and module routines in our CWPK series.

Typically, during the course of a major revision to KBpedia, I tend to spend more time working on offline files than in directly working with an ontology editor. However, now that we have our new extraction routines working to our liking, I can also foresee adding new, flexible steps to my workflow. With the extraction routines, I now have the choice of making changes directly in Protégé OR in bulk files. Prior to this point with our Clojure build codes, all such changes needed to be made offline in the bulk files. Now that we can readily extract changes made directly within an ontology editor, we have gained much desired flexibility. This flexibility also means we may work off of a central representation that HTML Web forms may interact with and modify. We can now put our ontologies directly in the center of production workflows.

These bulk files, which are offline comma-separated value (CSV) extraction files in our standard UTF-8 encoding, are well suited for:

  • Bulk additions
  • Bulk deletions
  • Bulk re-factoring, including modularization
  • Consistent treatment of annotations
  • Staging mapping files
  • Acting as consolidation points for new datasets resulting from external queries or databases, and
  • Duplicates identification or removal.

In the sections below I discuss preliminaries to working with bulk files, use of the important vlookup function in spreadsheets, and miscellaneous tips and guidance for working with bulk files in general. There are hundreds of valuable references on these topics on the Web. I conclude this installment with a few (among many) useful references for discovering more about working with CSV files.

Preliminaries to Working with Bulk Files

In the CWPK #27 installment on roundtripping, I made three relevant points. First, CSV files are a simple and easy flat-text format for flexibly interchanging data. Second, while there are conventions, there are no reliable standards for the specific format used, importantly for quoting text and using delimiters other than commas (which, if they appear in longer text, also need to be properly ignored, or “escaped”). And, third, lacking standards, CSV files used for an internal project should adhere to their own standards, beginning with the UTF-8 encoding useful to international languages. You must always be mindful of these internal standards of comma delimitation, quoted long strings, and use of UTF-8.
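In Python terms, these internal conventions translate into a consistent set of csv module settings. Here is a hedged sketch of the kind of read/write pattern implied; the file names are illustrative only:

import csv

in_file  = r'C:\1-PythonProjects\kbpedia\sandbox\classes_annot.csv'       # illustrative paths
out_file = r'C:\1-PythonProjects\kbpedia\sandbox\classes_annot_out.csv'

with open(in_file, 'r', encoding='utf8') as f_in, \
     open(out_file, 'w', encoding='utf8', newline='') as f_out:
    reader = csv.DictReader(f_in, delimiter=',', quotechar='"')           # comma-delimited, double-quoted
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames,
                            delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)                    # quote long strings as needed
    writer.writeheader()
    for row in reader:
        writer.writerow(row)                                              # pass-through; modify rows here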

The reason CSV is the most common data exchange format is likely due to the prevalence of Microsoft Excel, since CSV is the simplest flat-file option offered. Unfortunately, Excel does not do a good job of making CSV usable in standard ways and often imposes its own settings that can creep up in the background and corrupt files, especially due to encoding switches. One can obviously work with Excel to do these things since thousands do so and have made CSV as popular as it is. But for me, personally, working constantly with CSV files, I wanted a better approach.

I have found the open-source LibreOffice (a fork of what originally began as OpenOffice, made after that project was acquired by Oracle) to be a superior alternative for CSV purposes, sufficient for me to completely abandon MS Office. The next screen capture shows opening a file in LibreOffice and the three considerations that make doing so safe for CSV:

Figure 1: Opening a CSV File with LibreOffice

The first thing to look for is that the encoding is UTF-8 (1). There are actually multiple UTF options, so make sure to pick the ‘-8’ option, not ‘-16’ or ‘-32’. Second, do not use fixed length for the input, but use delimiters (“separators”) with the comma and double-quoted strings (2). And third, especially when you are working with a new file, scan (3) the first records (up to 1000 may be displayed) in the open screen window to see if there are any encoding problems. If so, do not open the file, and see if you can look at it in a straight text editor. You can follow these same steps in Excel; it is just more out-of-the-way to do so. LibreOffice always presents this screen for review when opening a CSV file.

I have emphasized these precautions because it is really, really painful to correct a corrupted file, especially one that can grow to thousands of rows long. Thus, the other precaution I recommend is to frequently back up your files, and to give them date stamps in their file names (I append -YYYYMMDD to the end of the file name because it always sorts in date order).

These admonishments really amount to best practices. These are good checks to follow and will save you potential heartache down the road. Take it from a voice of experience.

vlookup in Detail

Once in CSV form, our bulk files can be ingested into a spreadsheet, with all of the powers that brings of string manipulations, formatting, conditional changes, and block moves and activities. It is not really my intent to provide a tutorial on the use of spreadsheets for flat data files. There are numerous sources online that provide such assistance, and most of us have been working with spreadsheets in our daily work activities. I do want to highlight the most useful function available to work with bulk data files, vlookup, in this section, and then to offer a couple of lesser-known tips in the next. I also add some Additional Documentation in the concluding section.

vlookup is a method for mapping items (by matching and copying) in one block of items to the equivalent items in a different block. (Note: the practice of naming blocks of cells in a spreadsheet is a very good one for many other spreadsheet activities, which I’ll touch upon in the next section.) The vlookup mapping routine is one of the most important available to you since it is the method for integrating together two sets (blocks) of related information.

While one can map items between sheets using vlookup, I do not recommend it, since I find it more useful to see and compare the mapping results on one sheet. We illustrate this use of two blocks on a sheet with this Figure 2:

Figure 2: Named Blocks Support vlookup

When one has a source block of information to which we want to map information, we first highlight our standard block of vetted information (2) by giving it a name, say ‘standard’, in the block range cell (1) at the immediate upper left of the spreadsheet. Besides normally showing the coordinates of the row and cell references in the highlighted block, this cell also accepts a name such as ‘standard’. This ‘standard’, once typed in later in this same box (1) or picked from a named list of blocks, will cause our ‘standard’ block to be highlighted again. Then, we have potential information (3) that we want to ‘map’ to items in that ‘standard’ named block. As used here, by convention, items MUST appear in the first column of the ‘map’ block (3), and only match other items found in any given column of the ‘standard’ (2) block. In the case of KBpedia and its files, the ‘standard’ block (2) is typically the information in one of our extraction files, to which we may want to ‘map’ another extraction file (3) or a source of external information (3).

(NB: A similar function called hlookup applies to rows v columns, but I never use it because our source info is all individuated by rows.)

(NB2: Of course, we can also map in the reverse order from ‘standard’ to ‘map’. Reciprocal mapping, for instance, is one way to determine whether both sets overlap in coverage or not.)

So, only two things need to be known to operate vlookup: 1) both source (‘standard’) and target (‘map’) need to be in named blocks; and 2) the items matched in the ‘standard’ block need to be in the first column of the ‘map’ block. Once those conditions are met, any column entry from the ‘map’ block may be copied to the cell where the vlookup function was called. Once you have the formula working as you wish, you then can copy that vlookup cell reference down all of the rows of the ‘standard’ block, thereby checking the mapping for the entire source ‘standard’ block.

When I set these up, I put the initial vlookup formula into an empty top cell to either the left or right of the ‘standard’ block, depending on whether the possibly matching item is on the left or right of the block. (It’s easier to see the results of the lookup that way.) Each vlookup only looks at one column in the ‘standard’ for items to match against the first column in the ‘map’, and then returns the value of one of the columns in the ‘map’.

The ‘map’ block may only be a single column, in which case we are merely checking for intersections (and therefore, differences) between the blocks. Thus, one quick way to check if two files returned the same set of results is to copy the identifiers in one source as a ‘map’ block to a ‘standard’ source. If, after testing the formula and copying vlookup down all rows of the adjacent ‘standard’, we see values returned for all cells, we know that all of the items in the ‘standard’ block (2) are included in the items of the ‘map’ block (3).

Alternatively, the ‘map’ block may contain multiple columns, in which case what is in the column designated (1 to N) is the value of what gets matched and copied over. This approach provides a method, column by column, to add additional items to a given row record.

Here is the way the formula looks when entered into a cell:

  =VLOOKUP(A1,map,2,0)

In this example, A1 is the item to be looked up in the ‘standard’ block. If we copy this formula down all rows of the ‘standard’ block, all items tested for matches will be in column A. The map reference in the formula refers to the ‘map’ named block. The 2 (in reference to the 1 to N above) tells the formula to return the information in column 2 of ‘map’ if a match occurs in column 1 (which is always a condition of vlookup). The 0 is a flag in the formula indicating only an exact match will return a value. If no match occurs, the formula indicates #N/A, otherwise the value is the content of the column cell (2 in this case) matched from ‘map’.
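The same kind of exact-match lookup can also be scripted if you would rather stay within Python. This is a rough pandas analog of the =VLOOKUP(A1,map,2,0) pattern; the file names and the ‘id’ column name are assumptions for illustration, not our actual extraction headers:

import pandas as pd

standard = pd.read_csv('standard.csv', encoding='utf8')      # the vetted 'standard' block
mapping  = pd.read_csv('map.csv', encoding='utf8')           # the 'map' block; match column must be first

# bring column 2 of 'map' over to 'standard' where the ids match exactly;
# unmatched rows get NaN, the analog of vlookup's #N/A
merged = standard.merge(mapping.iloc[:, :2], how='left',
                        left_on='id', right_on=mapping.columns[0])

print(merged.head())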

If, after doing a complete vlookup I find the results not satisfactory, I can undo. If I find the results satisfactory, I highlight the entire vlookup column, copy it, and then paste it back into the same place with text and results only. This converts the formulas to actual transferred values and then I can proceed to next steps, such as moving the column into the block, adding some prefixes, fixing some string differences, etc. After incorporation of the accepted results, it is important to make sure our ‘standard’ block reflects the additional column information.

Particularly when dealing with annotations, where some columns may contain quite long strings, I do two things, occasioned by the fact that opening a CSV file causes column widths to adjust to the longest entry. First, I do not allow any of the cells to word wrap. This prevents rows becoming variable heights, which I find difficult to use. Second, I highlight the entire spreadsheet (via the upper left open header cell), and then set all columns to the same width. This solves the pain of scrolling left or right where some columns are too wide.

It takes a few iterations to get the hang of the vlookup function, but, once you do, you will be using it for many of the bulk activities listed in the intro. vlookup is a powerful way to check unions (do single-column lookups both ways), intersections, differences, duplicates, and the transfer of new values to incorporate into records.

Like other bulk activities, also be attentive to backups and saving of results as you proceed through multi-step manipulations.

Other General Spreadsheet Tips

Here are some other general tips for using spreadsheets, organized by topic.

Sorts

Named blocks are a good best practice, especially for sorts, which are a frequent activity during bulk manipulations. However, sorts done wrong have the potential to totally screw up your information. Remember, our extracts from KBpedia are, at minimum, a semantic triple, and in the case of annotation extractions, multiple values per subject. This extracted information is written out as records, one after another, row by row. The correspondence of items to one another, in its most basic form the s-p-o, is a basic statement or assertion. If we do not keep these parts of subject – verb – object together, our statements become gibberish. Let’s illustrate this by highlighting one record — one statement — in an example KBpedia extraction table:

Figure 3: Rows are Sacrosanct

However, if we sort this information by the object field in column C alone, we can see we have now broken our record, in the process making gibberish out of all of our statements:

Figure 4: Sorting without a Named Block

We prevent the breakage in Figure 4 from occurring by making sure we never sort on single columns, but on entire blocks. We could still sort on column C without breakage by first invoking our named ‘standard’ block (or whatever name we have chosen for it) before we enter our sort parameters (see Search and Replace further below).

Here’s another tip for ‘standard’ or master blocks: Add a column with a row sequence number for each row. This will enable you to re-sort on this column and restore the block’s original order (despite how other columns of the block may alphabetize). To create this index, put ‘1’ in the top cell, ‘=C1+1’ in the cell below (assuming our index is in Col C), and copy it down all rows. Then copy the column, and paste it back in place with text + values only.

Duplicates

A quick way to find duplicates in a block is to have all of its subjects or identifiers in Col B, and sort the block. Then, in the column immediately to the left of the block (Col A), enter the =EXACT(B1,B2) formula in the second row (its two arguments are the identifier cell to the immediate right and the one above that). If the contents of B1 and B2 are exactly the same, the formula will evaluate to TRUE, if not FALSE. Copy that formula down all rows adjacent to the block.

Every row marked with TRUE is a duplicate with respect to Col B. If you want to remove these duplicates, copy the entire formula column, paste it back as text and values only, and then sort that column and your source block. You can then delete en masse all rows with duplicates (TRUE).

You can test for duplicate matter across columns with the same technique. Using the =CONCATENATE() operator, you may temporarily combine values from multiple columns. Create this synthetic concatenation in its own column, copy it down all block rows, and then test for duplicates with the =EXACT() operator as above.
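Should you prefer to do these duplicate checks in code rather than in the spreadsheet, pandas offers the same tests directly; the file and column names below are assumptions for illustration:

import pandas as pd

df = pd.read_csv('classes_struct.csv', encoding='utf8')                  # illustrative file

dupes_single = df[df.duplicated(subset=['id'], keep=False)]              # duplicates on one column
dupes_multi  = df[df.duplicated(subset=['id', 'parent'], keep=False)]    # duplicates across columns

df_clean = df.drop_duplicates(subset=['id', 'parent'])                   # en masse removal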

Search and Replace

The search function in spreadsheets goes well beyond normal text and includes attributes (like bolding), structural characters like tabs or line feeds, and regular expressions (regex). Regex is a particularly powerful capability that few know well. However, an exploration of regex is beyond the scope of this CWPK series. I have found simple things like recognizing capitalization or conditional replacements to be very helpful, but a basic understanding, let alone mastery, of regex requires a substantial learning commitment.

I use the search box below the actual spreadsheet for repeat search items. So, while I will use the search dialog for complicated purposes, I put the repeated search queries here. To screen against false matches, I also use the capitalization switch and try to find larger substrings that embed the fragment I am seeking but exclude adjacent text that fails my query needs.

Another useful technique is to only search within a selection, which is selected by a radiobutton on the search dialog. Highlighting a single column, for example, or some other selection boundary like a block, enables local replacements without affecting other areas of the sheet.

String Manipulations

One intimidating factor of spreadsheets is the number of functions they have. However, hidden in this library are many string manipulation capabilities, generally all found under the ‘Text’ category of functions. I have already mentioned =CONCATENATE() and =EXACT(). Other string functions I have found useful are =TRIM() (removes extra spaces), =CLEAN() (removes unprintable characters), =FIND() (finds substrings, useful for flagging entries with shared characteristics), and =RIGHT() (tests the last character in a string). These kinds of functions can be helpful in cleaning up entries as well as finding stuff within large, bulk files.

There are quite a few string functions for changing case and converting formats; I tend to use these less than the many main menu options found under Format → Text.

These functions can often be combined in surprising ways. Here are two examples of string manipulations that are quite useful (you may need to adjust cell references):

For switching person first and last names (where the target is in A17):

  =MID(A17,FIND(" ",A17)+1,1024)&", "&LEFT(A17,FIND(" ",A17)-1)

For singularizing most plurals (-ies to -y not covered, for example):

  =IF(OR(RIGHT(A1,1)="s",RIGHT(A1,2)="es"),IF(RIGHT(A1,2)="es",LEFT(A1,(LEN(A1)-2)),LEFT(A1,(LEN(A1)-1))),A1)

This does not capture all plural variants, but others may be added given the pattern.

Often a bit of online searching will turn up other gems, depending on what your immediate string manipulation needs may be.

Other

One very useful capability, but one close to buried in LibreOffice, is the Data → Text to Columns option. It is useful for splitting a column into two or more cells based on a given character or attribute, helpful with long strings or other jumbled content. Invoke the dialog on your own spreadsheet to see the settings for this option. There are many ways to specify how the splitting character is recognized, each occurrence of which causes a new cell to be populated to the right. It is therefore best to have the target column with the long strings at the far right of your block, since if existing columns of information lie to the right of the splits, they will become jagged and out of sync.

Data Representations in Python

We are already importing the csv module into cowpoke. However, there is a supplement to that standard module, called CleverCSV, that provides a bit more functionality; I have not tested it. There is also a way to combine multiple CSV files using glob, which relates more to pandas and is also something I have not used; see the sketch below.
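For illustration only (again, not something adopted in cowpoke), the glob approach to combining CSV files looks roughly like this; the path is a placeholder:

import glob
import pandas as pd

files = glob.glob(r'C:\1-PythonProjects\kbpedia\v300\extractions\typologies\*.csv')   # placeholder path

combined = pd.concat((pd.read_csv(f, encoding='utf8') for f in files),
                     ignore_index=True)                        # stack all matching CSV files

combined.to_csv('combined_typologies.csv', index=False, encoding='utf8')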

Please note there are additional data representation tips involving Python in CWPK #21.

Tips for Some Other Tools

As we alluded to in CWPK #25, Wikipedia, DBpedia, and Wikidata are already mapped to KBpedia and provide rich repositories of instance data retrievable via SPARQL. (A later installment will address this topic.) The results sets from these queries may be downloaded as flat files that can be manipulated with all of these CSV techniques. Indeed, retrievals from these sources have been a key source for populating much of the annotation information already in KBpedia.

You can follow this same method to begin creating your own typologies or add instances or expand the breadth or depth of a given topic area. The basic process is to direct a SPARQL query to the source, download the results, and then manipulate the CSV file for incorporation into one of your knowledge graph’s extraction files for the next build iteration.

Sample Wikidata queries, numbering into the hundreds, are great study points for SPARQL and sometimes serve as templates for your own queries. I also indicated in CWPK #25 how the SPARQL VALUES statement may be used to list identifiers for bulk retrievals from these sources.
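Here is a hedged sketch of that bulk-retrieval pattern against the Wikidata endpoint using SPARQLWrapper; the Q identifiers listed are placeholders, and the results could be written out to CSV with the techniques above:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper('https://query.wikidata.org/sparql')
sparql.setQuery("""
SELECT ?item ?itemLabel WHERE {
  VALUES ?item { wd:Q146 wd:Q144 }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results['results']['bindings']:
    print(row['item']['value'], row['itemLabel']['value'])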

You can also use the Wikipedia and Wikidata Tools plug-in for Google spreadsheets to help populate tables that can be exported as CSV for incorporation. You should also check out OpenRefine for data wrangling tasks. OpenRefine is very popular with some practitioners, and I have used it on occasion when some of the other tools listed could not automate my task.

Though listed last, text editors are often the best tool for changes to bulk files. In these cases, we are editing the flat file directly, and not through a column-and-row presentation in the spreadsheet. As long as we are cognizant and do not overwrite comma delimiters and quoted long strings, the text and control attributes such as tabs or carriage returns can be manipulated with the different functions these applications bring.

A Conclusion to this Part

The completion of this installment means we have made the turn on our roundtrip quest. We have completed our first extraction module and have explained a bit how we can modify and manipulate the bulk files that result from our extraction routines.

In our next major part of the CWPK series we will use what we have learned to lay out a more complete organization of the project, as well as to complete our roundtripping with the addition of build routines.

Additional Documentation

As noted, there are multiple sources from multiple venues to discuss how to use spreadsheets effectively, many with specific reference to CSV files. We also have many online sources that provide guidance on getting data from external endpoints using SPARQL, the mapping of results from which is one of the major reasons for making bulk modifications to our extraction files. Here are a few additional sources directly relevant to these topics:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on September 15, 2020 at 10:02 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2374/cwpk-36-bulk-modification-techniques/