We Build Up Our Ingest Routine to All Structure
Now that we have a template for structure builds in this Cooking with Python and KBpedia series, we continue to refine that template to generalize the routine and expand it to looping over multiple input files and to apply it to property structure as well. These are the topics we cover in this current installment, with a detour as I explain below.
In order to prep for today’s material, I encourage you to go back and look at the large routine we developed in the last installment. We can see three areas we need to address in order to generalize this routine:
- First, last installment’s structure build routine (as designed) requires three passes to complete file ingest. Each one of those passes has a duplicate code section to convert our file input forms to required shorter versions. We would like to extract these duplicates as a helper function in order to lessen code complexity and improve readability.
- Second, we need a more generic way of specifying the input file or files to be processed by the routine, preferably including being able to loop over and process all of the files in a given input dictionary (as housed in
- Third, we would like to generalize the approach to dealing with class hierarchical structure to also deal with property ingest and hierarchical structure.
So, with these objectives in mind, let’s begin.
Adding a Helper Function
For reference, here is the code block in the prior installment that we repeat three times, and for which we would like to develop a helper function (BTW, this code block will not run here in isolation):
id = row['id']
parent = row['parent']
id = id.replace('http://kbpedia.org/kko/rc/', 'rc.')
id = id.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
id_frag = id.replace('rc.', '')
id_frag = id_frag.replace('kko.', '')
parent = parent.replace('http://kbpedia.org/kko/rc/', 'rc.')
parent = parent.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
parent = parent.replace('owl:', 'owl.')
parent_frag = parent.replace('rc.', '')
parent_frag = parent_frag.replace('kko.', '')
parent_frag = parent_frag.replace('owl.', '')
We will call our helper function row_clean since its purpose is to convert the full IRIs of the CSV input rows to shorter forms required by owlready2 (sometimes object names with a namespace prefix, other times just with the shortened object name). We also need these to work on either the subject of the row (‘id’) or the object of the row (‘parent’ in this case). That leads to four combinations of 2 row objects by 2 shortened forms.
Note that the second argument (‘iss’) passed to the function below is a keyword argument, always shown with the equal sign in the function definition. Also note that if, rather than an empty string as shown, you assign the keyword argument a legitimate value in the definition, that value becomes the default for that keyword, and the keyword then does not need a value assigned to it when the function is called. (NB: Indeed, many built-in Python functions have multiple arguments that are infrequently exposed. I have found it frequently helpful to do a dir() on functions to discover their broader capabilities.)
### Here is the helper function

def row_clean(value, iss=''):                                # arg values come from calling code
    if iss == 'i_id':                                        # check to see which replacement method
        value = value.replace('http://kbpedia.org/kko/rc/', 'rc.')
        value = value.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        return value                                         # returns the calculated value to calling code
    if iss == 'i_id_frag':
        value = value.replace('http://kbpedia.org/kko/rc/', '')
        value = value.replace('http://kbpedia.org/ontologies/kko#', '')
        return value
    if iss == 'i_parent':
        value = value.replace('http://kbpedia.org/kko/rc/', 'rc.')
        value = value.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        value = value.replace('owl:', 'owl.')
        return value
    if iss == 'i_parent_frag':
        value = value.replace('http://kbpedia.org/kko/rc/', '')
        value = value.replace('http://kbpedia.org/ontologies/kko#', '')
        value = value.replace('owl:', '')
        return value

### Here is the code we will put in the main calling routine:

# r_id = row['id']                                           # this is the version we will actually keep
r_id = 'http://kbpedia.org/kko/rc/AlarmSignal'               # temporary assignment just to test code
# r_parent = row['parent']
r_parent = 'http://kbpedia.org/kko/rc/SoundsByType'
id = row_clean(r_id, iss='i_id')                             # send the two arguments to helper function
id_frag = row_clean(r_id, iss='i_id_frag')
parent = row_clean(r_parent, iss='i_parent')
parent_frag = row_clean(r_parent, iss='i_parent_frag')

print('id:', id)                                             # temporary print to check if results OK
print('id_frag', id_frag)
print('parent:', parent)
print('parent_frag:', parent_frag)
Because we have entered some direct assignments the code block above does Run (or
Note in the main calling routine code that to get our routine values we are calling the row_clean function and passing the required two arguments: the value for either the ‘id’ or ‘parent’ in that row, and whether we want prefixed or shortened fragments.
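The keyword-argument behavior described above can be seen in a small, self-contained sketch. The function name shorten and its style keyword are hypothetical, made up only for this illustration; they are not part of cowpoke.

```python
def shorten(value, style='prefixed'):
    """Hypothetical helper: 'style' defaults to 'prefixed' when omitted."""
    base = 'http://kbpedia.org/kko/rc/'
    if style == 'prefixed':
        return value.replace(base, 'rc.')   # keep a namespace prefix
    return value.replace(base, '')          # bare fragment only

iri = 'http://kbpedia.org/kko/rc/AlarmSignal'
prefixed = shorten(iri)                     # default keyword value is used
fragment = shorten(iri, style='fragment')   # keyword overridden in the call
print(prefixed, fragment)
```

Because the default here is a real value rather than an empty string, the first call works without naming the keyword at all.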
I strongly suspect there are better and shorter ways to remove this duplicate code, but this approach with a helper function, even in a less optimal form, still has cut the original code length in half (36 lines to 18 lines due to three duplicates). Expect to see a similar form to this in our code going forward. (NB: I am finding that looking for these duplicate code blocks is forcing me to learn function definitions and seek shorter but more expressive forms.)
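As one possible shorter form, here is a minimal sketch that drives the same four replacements from a lookup table. The names REPLACEMENTS and row_clean_table are my own assumptions for illustration, not what the series uses going forward. One behavioral difference to be aware of: for an unrecognized iss value this version returns the input unchanged, whereas the helper above returns None.

```python
# IRI prefixes taken from the helper function above.
RC = 'http://kbpedia.org/kko/rc/'
KKO = 'http://kbpedia.org/ontologies/kko#'

# Hypothetical lookup table: each 'iss' key maps to (old, new) replacement pairs.
REPLACEMENTS = {
    'i_id':          [(RC, 'rc.'), (KKO, 'kko.')],
    'i_id_frag':     [(RC, ''), (KKO, '')],
    'i_parent':      [(RC, 'rc.'), (KKO, 'kko.'), ('owl:', 'owl.')],
    'i_parent_frag': [(RC, ''), (KKO, ''), ('owl:', '')],
}

def row_clean_table(value, iss=''):
    """Apply, in order, every replacement pair registered for 'iss'."""
    for old, new in REPLACEMENTS.get(iss, []):
        value = value.replace(old, new)
    return value

print(row_clean_table('http://kbpedia.org/kko/rc/AlarmSignal', iss='i_id'))
print(row_clean_table('owl:Thing', iss='i_parent_frag'))
```

Adding a fifth form would then mean adding one dictionary entry rather than another if block.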
Looping Over Files
If you recall our extraction steps of getting flat CSV files out of KBpedia in CWPK #28 to CWPK #35, we can end up with close to 100 extraction files. These splits encourage modularity and are easier to work on or substitute. Still, when it comes time to build KBpedia back up again after we complete a roundtrip, a complete build requires that we process many files. We thus need looping routines across our build files to automate this process.
The first thought is to simply put groupings of files in individual directories and then point the routine at a directory and instruct it to loop over all files. If we have concerns that the directories may have more file types than we want to process with our current routine, we could also introduce some file name string checks to filter by name, fragment, or extension. These options would enable us to generalize a file looping routine to apply to many conditions.
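For comparison, the directory-driven option just described might look something like the following sketch. It uses only the standard library; the function name find_build_files and the sample file names are assumptions for the demonstration, which runs against a throwaway temporary directory rather than a real KBpedia build folder.

```python
from pathlib import Path
import tempfile

def find_build_files(directory, prefix='', ext='.csv'):
    """Return sorted paths in 'directory' filtered by name prefix and extension."""
    return sorted(p for p in Path(directory).iterdir()
                  if p.is_file() and p.name.startswith(prefix) and p.suffix == ext)

# Demonstration with made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('typol_AudioInfo.csv', 'typol_Animals.csv', 'notes.txt'):
        (Path(tmp) / name).write_text('id,subClassOf,parent\n', encoding='utf8')
    found = [p.name for p in find_build_files(tmp, prefix='typol_')]
    print(found)   # only the typol_*.csv files are picked up
```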
But, I’ve decided to make a different choice. Since our extractions are driven by Python dictionaries, and we can direct those extractions to any directory prefix, we can re-use these same specifications for build processes. Should we later discover that a general file harvester makes sense, we can generalize at that time from this dictionary design. Also, by applying the same dictionary approach to extraction or building, we help reinforce our roundtripping mindset in how we name and process files.
So, we already have the unique names that distinguish our input classes (in the typol_dict dictionary in config.py) and our properties (in the prop_dict dictionary), and foresee using additional dictionaries going forward in this CWPK series. We only need enter a directory root and the appropriate dictionary to loop over the unique terms associated with our various building blocks. For classes, the typology listing is a great lookup.
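To make the dictionary-driven idea concrete, here is a minimal sketch of turning dictionary values into input file names. The sample_dict contents, the base prefix and the function name build_paths are all illustrative assumptions; the real typol_dict lives in config.py and may be shaped differently.

```python
# Stand-in for a slice of typol_dict; keys and values are illustrative only.
sample_dict = {'AudioInfo': 'kko.AudioInfo', 'Animals': 'kko.Animals'}

def build_paths(lookup, base, ext):
    """Combine a directory/file prefix, each term's fragment, and an extension."""
    paths = []
    for term in lookup.values():
        frag = term.replace('kko.', '')     # drop the namespace prefix
        paths.append(base + frag + ext)
    return paths

paths = build_paths(sample_dict, 'build_ins/typologies/typol_', '.csv')
for p in paths:
    print(p)
```

The same dictionary that named the extraction outputs can thus name the build inputs, which is the roundtripping point made above.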
We will take our generic class build template from the last installment, and put it into a function that loops over opening our file set, running the routine, and then saving to our desired output location. For now, to get the logic right, I will just set this up as a wrapper before actually plopping in the full build loop routine. (Note: we have to import a couple of modules because we have not yet fully set the environment for today’s installment):
from cowpoke.config import *
import csv

def class_builder(**build_deck):
    print('Beginning KBpedia class structure build . . .')
    r_default = ''
    r_label = ''
    r_iri = ''
    # probably want the run specification here (see CWPK #35 for render in struct_extractor)
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    class_loop = build_deck.get('class_loop')
    base = build_deck.get('base')
    ext = build_deck.get('ext')
    if loop != 'class_loop':                                 # compare string values, not object identity
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)
        frag = loopval.replace('kko.','')
        in_file = (base + frag + ext)
        x = 1
        with open(in_file, mode='r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:
## Here is where we place the real class build routine
                if x <= 2:
                    r_id = row['id']
                    r_parent = row['parent']
                    print(r_id, r_parent)
                    x = x + 1
        input.close()

class_builder(**build_deck)
OK. We now know how to loop over our class build input files. Now, we can select Kernel → Restart & Clear Output and then confirm with Restart and Clear All Outputs (which should be a familiar red button to you if using Jupyter Notebook) to get ourselves to a clean starting place to begin setting up our structure build environment.
Setting Up the Build Environment
As before with our extract routines, we now have a build_deck dictionary of build configuration settings in config.py. If you see some unfamiliar switches as we proceed through this build process, you may want to inspect that file. The settings are pretty close analogs to the same types of settings for our extractions, as specified in the run_deck dictionary. Most all of this code will migrate to the new
We begin by importing our necessary modules and setting our file settings for the build:
from owlready2 import *
from cowpoke.config import *
# from cowpoke.__main__ import *
import csv
import types

world = World()

kb_src = every_deck.get('kb_src')                         # we get the build setting from config.py
#kb_src = 'standard'                                      # we can also do quick tests with an override

if kb_src is None:
    kb_src = 'standard'
if kb_src == 'sandbox':
    kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
elif kb_src == 'standard':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
elif kb_src == 'start':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kbpedia_rc_stub.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
else:
    print('You have entered an inaccurate source parameter for the build.')
skos_file = 'http://www.w3.org/2004/02/skos/core'
We load our ontologies into owlready2 and set our namespaces:
kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')

#skos = world.get_ontology(skos_file).load()
#kb.imported_ontologies.append(skos)
#core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')
Since we’ve cleared memory and our workspace, we again add back in our new row_clean helper function:
def row_clean(value, iss=''):                                # arg values come from calling code
    if iss == 'i_id':                                        # check to see which replacement method
        value = value.replace('http://kbpedia.org/kko/rc/', 'rc.')
        value = value.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        return value                                         # returns the calculated value to calling code
    if iss == 'i_id_frag':
        value = value.replace('http://kbpedia.org/kko/rc/', '')
        value = value.replace('http://kbpedia.org/ontologies/kko#', '')
        return value
    if iss == 'i_parent':
        value = value.replace('http://kbpedia.org/kko/rc/', 'rc.')
        value = value.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        value = value.replace('owl:', 'owl.')
        return value
    if iss == 'i_parent_frag':
        value = value.replace('http://kbpedia.org/kko/rc/', '')
        value = value.replace('http://kbpedia.org/ontologies/kko#', '')
        value = value.replace('owl:', '')
        return value
Running the Complete Class Build
And then add our class build template to our new routine for iterating over all of our class input build files. CAUTION: to process all inputs to KBpedia, best done with the single assignment of the Generals typology (since all other typologies not already included in KKO are children of it), takes about 70 min on a conventional desktop.
You may notice that we made some slight changes to named variables in the draft template developed in the last installment:
And, we have placed it into a defined function,
def class_struct_builder(**build_deck):                                    # Note 1
    print('Beginning KBpedia class structure build . . .')                 # Note 5
    kko_list = typol_dict.values()                                         # Note 2
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    class_loop = build_deck.get('class_loop')
    base = build_deck.get('base')
    ext = build_deck.get('ext')
    if loop != 'class_loop':                                               # compare string values, not identity
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)                              # Note 5
        frag = loopval.replace('kko.','')
        in_file = (base + frag + ext)
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:
                r_id = row['id']
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')                           # Note 3
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:
                    is_first_row = False
                    continue
                with rc:
                    kko_id = None
                    kko_frag = None
                    if parent_frag == 'Thing':
                        if id in kko_list:
                            kko_id = id
                            kko_frag = id_frag
                        else:
                            id = types.new_class(id_frag, (Thing,))
                if kko_id is not None:
                    with kko:
                        kko_id = types.new_class(kko_frag, (Thing,))
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:
                r_id = row['id']
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:
                    is_first_row = False
                    continue
                with rc:
                    kko_id = None
                    kko_frag = None
                    kko_parent = None
                    kko_parent_frag = None
                    if parent_frag != 'Thing':                             # compare string values, not identity
                        if id in kko_list:
                            continue
                        elif parent in kko_list:
                            kko_id = id
                            kko_frag = id_frag
                            kko_parent = parent
                            kko_parent_frag = parent_frag
                        else:
                            var1 = getattr(rc, id_frag)
                            var2 = getattr(rc, parent_frag)
                            if var2 is None:
                                continue
                            else:
                                var1.is_a.append(var2)
                if kko_parent is not None:
                    with kko:
                        if kko_id in kko_list:
                            continue
                        else:
                            var1 = getattr(rc, kko_frag)
                            var2 = getattr(kko, kko_parent_frag)
                            var1.is_a.append(var2)
        with open(in_file, 'r', encoding='utf8') as input:                 # Note 4
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:
                r_id = row['id']
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:
                    is_first_row = False
                    continue
                if parent_frag == 'Thing':   # This is the new code section, replacing the commented out below   # Note 4
                    var1 = getattr(rc, id_frag)
                    var2 = getattr(owl, parent_frag)
                    try:
                        var1.is_a.remove(var2)
                    except Exception:
#                        var1 = getattr(kko, id_frag)
#                        print(var1)
#                        var1.is_a.remove(owl.Thing)
#                        print('Last step in removing Thing')
                        continue
#                    print(var1, var2)
#                    if id in thing_list:
#                        continue
#                    else:
#                        if id in kko_list:
#                            var1 = getattr(kko, id_frag)
#                            thing_list.add(id)
#                        else:
#                            var1 = getattr(rc, id_frag)
#                        var2 = getattr(owl, parent_frag)
#                        if var2 == None:
#                            print('Empty Thing:')
#                            print('var1:', var1, 'var2:', var2)
#                        try:
#                            var1.is_a.remove(var2)
#                        except ValueError:
#                            print('PROBLEM:')
#                            print('var1:', var1, 'var2:', var2)
#                            if len(thing_list) == 0:
#                                print('thing_list is empty.')
#                            else:
#                                print(*thing_list)
#                            break
#                        print(var1, var2)
#                        thing_list.append(id)
#                thing_list.add(id)
    out_file = 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/build_stop.csv'
    with open(out_file, 'w', encoding='utf8') as f:
        print('KBpedia class structure build is complete.')
        f.write('KBpedia class structure build is complete.')              # Note 5
        f.close()
Our function call pulls up the same keyword argument passing that we discussed for the extraction routines earlier (1). The double asterisk (**build_deck) argument means to bring in any of that dictionary’s keyword values if referenced in the routine. We can readily pick up loop or lookup specifications by referencing a dictionary (2). The kko_list is a handy one since it gives us a basis for selecting between KKO objects and the reference concepts (RCs) in KBpedia. The revised routine above also brings in our new helper function (3).
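As a small illustration of the double-asterisk mechanism, the following sketch passes a dictionary into a function and reads settings from it with .get(). The demo_deck contents and the function name show_settings are stand-ins invented for this example; the real build_deck is defined in config.py.

```python
# Illustrative settings only; the real build_deck lives in config.py.
demo_deck = {'loop': 'class_loop', 'base': 'typol_', 'ext': '.csv', 'unused': 42}

def show_settings(**deck):
    """Receive every key of the unpacked dictionary as a keyword argument."""
    loop = deck.get('loop')
    base = deck.get('base')
    missing = deck.get('not_there')    # .get() returns None for absent keys
    return loop, base, missing

result = show_settings(**demo_deck)    # ** unpacks the dictionary into keywords
print(result)
```

Keys the routine never asks for (such as 'unused' here) are simply ignored, which is why one shared dictionary can serve several different routines.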
Pretty much the next portions of the routine are as described in the last installment, until we come up to Pass #3 (4), which is where we hit a major roadblock (coming up around the next bend in the road). We also added some print statements (5) that give feedback when the routine is running.
To run this file locally you will need to have the cowpoke project installed and know where to find your build_ins/typology directory. You also need to make sure your settings in config.py are properly set for your conditions. Assuming you have done so, you can invoke this routine (best with only a subset of your typology dictionary, assigned to, say,
Realize everything has to be configured properly for this code to run. You will need to review earlier installments if you run into problems. Assuming you have gotten it to run to completion without error, you may want to then save it. We need to preface our ‘save’ statement with the ‘kb’ ontology identifier. I also have chosen to use the ‘working’ directory for saving these temporary results:
kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl', format='rdfxml')
However, I ran into plenty of problems myself. Indeed, the large code block commented out above (4) caused me hours of fits trying to troubleshoot and get the routine to act as I wanted. This whole effort put up a roadblock in my plan, sufficient that I had to add another installment. I explain this detour next.
A Brief History of Going Crazy
If we set as an objective being able to specify multiple input files for a current build, a couple of issues immediately arise. Recall, we designed our typology extraction files to be self-contained, which means that every class used as an object must also be declared as its own class subject. To speed up our extractions, we do not keep track of the many objects so needing definitions. That means each encounter triggers the need for another class definition. Multiple duplicate declarations do not cause a problem when loading the ontology, but when used as a specification input when doing multiple passes some tricky problems arise.
One obvious contributor to the difficulty is the need to identify and separately keep track of (and sometimes differentially process) our ‘kko’ and ‘rc’ namespaces. We need to account for this distinction in every loop and every assignment or removal that we make to the ontology while building it in memory. That can all be trapped for when in the class build cycle, which is the first two passes of the routine (first create the class, second add to parents), but gets decidedly tricky when removing the excess
To appreciate this issue a bit, here is the basic statement for removing a ‘Bird’ class from a parent ‘Reptile’:
rc.Bird.is_a.remove(rc.Reptile)
Our inputs can not be strings, but in loops variables often become so, and need to be evaluated to their type via the
Unfortunately, when we make a rc.Bird.is_a.remove(rc.Reptile) request once it has been previously removed, the relationship is empty and owlready2 throws an error (as does Python when trying to remove an undeclared object). So, while we are able to extract without keeping track of duplicates, we eventually must do so when it comes time to build. Thus, as each file is processed, we need to account for prior removals and make sure we do not make the request again.
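The double-removal problem can be reproduced without any ontology loaded. In this sketch a plain Python list of strings stands in for the is_a list; that substitution is my assumption for illustration, and owlready2's own list class may differ in detail.

```python
# A plain list standing in for rc.Bird.is_a (assumption for illustration).
parents = ['Reptile', 'Animal']

parents.remove('Reptile')              # first removal succeeds
try:
    parents.remove('Reptile')          # second removal: item is no longer there
    outcome = 'removed twice'
except ValueError:
    outcome = 'second removal raised ValueError'
print(outcome)

# One way to avoid the error: check membership before asking again.
if 'Reptile' in parents:
    parents.remove('Reptile')
print(parents)
```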
The later part of the code listing above (4) kept processing most of the files well, but not when too many were processed. I had the curious error of seeing the routine fail on the first entry of some files. It appeared to me perhaps the list accumulator I was using to keep track of prior removals was limited in size in some manner (it is not) or some counter or loop was not being cleared or initialized in the right location. If it ain’t perfect, it don’t run.
As a newbie with no prior experience to fall back on, here are some of the things I looked at and tested in trying to debug this Pass #3 owl:Thing deletion routine:
- memory – was it a memory problem? Well, there are some performance issues we continue with in the next installment, but, no, Python seems to grab the memory it needs and does (apparently) a fair job of garbage cleanup. It was also not a problem with the notebook memory
- loops – there are lots of ways to initiate loops or iterate over different structures from lists, sets, dictionaries, length and counters, etc. How loops are called and incremented differ by the iterator type chosen. I suspect this is where the issue still resides, because I continue to not have a native feel for:
- sets v lists
- clearing before loops
- referencing the right loops
- using the right fragment – the interplay of namespaces with scope is also not yet intuitive to me. Sometimes it is important to use the namespace prefixed reference to an object, other times not so. I am still learning about scope
- not much worried about syntax because REPL was always running
- list length limitations – I discussed this one above, and was able to eliminate it as the source
- indentations – it is sometimes possible to put what one thinks is the closing statement to a routine at the wrong indentation, so that it runs, but is not affecting the correct code block. In my debugging efforts so far I often find this a source of the problem, especially when there is too much complexity or editing of the code. This is another reason to generalize duplicate code
- code statement placement in order – in a similar way, counters and loop initializations can easily be placed into the wrong spots. The routine often may run, but still not do what you think it is, and
- many others – I’m a newbie, right?
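On the sets versus lists point in the list above, here is a brief sketch of using a set as an accumulator of items already handled. The identifiers are made-up sample values. The main thing it shows is that sets grow with add() while lists grow with append(), which is an easy pair to mix up.

```python
ids = ['rc.Bird', 'rc.Fish', 'rc.Bird']   # sample values with one duplicate

seen = set()          # initialized once, before the loop starts
first_time = []
for item in ids:
    if item in seen:  # membership tests on a set are fast
        continue
    seen.add(item)            # sets use add()
    first_time.append(item)   # lists use append()

print(first_time)
```

Where the accumulator is initialized matters: placed inside the loop, it would be reset on every pass and never catch a duplicate.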
It was so frustrating trying to get this correct because I could get most everything working like I wanted, but then perhaps the routine would fail in the midst of processing a long list or would complete, but, upon inspection, may have missed some items or treated them incorrectly.
What little I do know about such matters tells me to try to pinpoint and isolate the problem. When processing long lists, that means testing for possible error conditions and liberally sprinkling various print statements with different text and different echoing of current values to the screen. For example, in an else: condition of an if: statement, I might put a print like:
print('In kko_list loop of None error trap:', var1, var2)
But pinpointing a problem does not indicate how to solve it, though it does help to narrow attention. I had done so in the routine above, but I was still erroring out of some files. Sometimes that would happen, but it was still unclear what the offending part might be. When Python errors like that, it provides an error message and traceback, but sometimes that information is cryptic. The failure point may occur any time after the last message to screen. Again, I was being pricked by needles in the haystack, but I still had not specifically found and removed them.
I knew from my Python reading that it had a fairly good exception mechanism. Since print() statements were only taking me so far, I decided I needed to bite the bullet (for the needle pricks in my hand!) and start learning more about error trapping.
The basic approach for allowing a program to continue to run when an error condition is met is through the Python exception. It basically looks like this kind of routine:
try:
    non_zero = statement1 / statement2
except ZeroDivisionError:
    print('Oops, dividing by 0!')
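A runnable variant of this pattern, sketched with made-up sample rows, also records which row failed and the traceback text, so the routine can keep going while still leaving a trail to the exact point of failure.

```python
import traceback

# Made-up sample rows: (name, numerator, denominator)
rows = [('A', 4, 2), ('B', 1, 0), ('C', 9, 3)]

results = []
failures = []
for line_no, (name, num, den) in enumerate(rows, start=1):
    try:
        results.append(num / den)
    except ZeroDivisionError:
        # Keep the row number, its label, and the full traceback text.
        failures.append((line_no, name, traceback.format_exc()))
        continue

print(results)
for line_no, name, tb in failures:
    print('Row', line_no, name, 'failed:', tb.splitlines()[-1])
```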
I was exploring this more graceful way to treat errors when I realized, duh, that same approach also captured exactly what I was trying to accomplish with avoiding multiple deletions in the first place! That is, I could continue to ‘try’ to delete the next instance of the owl:Thing assignment, and if it had already been deleted (which caused it to throw an exception, that is, what I was trying to fix!), I could exit gracefully and move on. Further, this would allow me to embed specific print() statements at the exact point of failure.
After this aHa! I changed the code as shown above (4). I suspect it is a slow way to process the huge numbers I have, but it works. I will continue to look for better means, but at least with this approach I was able to move on with the project.
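The idea can be tried in isolation with stand-in objects. In this sketch FakeClass is a hypothetical substitute for an ontology class, with a plain list for is_a and strings for parents; it is not owlready2 itself. It catches ValueError, which is what a plain list raises, whereas the routine above catches the broader Exception.

```python
class FakeClass:
    """Hypothetical stand-in for an ontology class with an is_a parent list."""
    def __init__(self, name, parents):
        self.name = name
        self.is_a = list(parents)

bird = FakeClass('Bird', ['Thing', 'Animal'])
rows = [bird, bird, bird]     # same class declared in three input files

removed = 0
skipped = 0
for cls in rows:
    try:
        cls.is_a.remove('Thing')   # succeeds only the first time
        removed += 1
    except ValueError:
        skipped += 1               # already removed earlier; move on
        continue

print(removed, skipped, bird.is_a)
```

No accumulator of prior removals is needed here, since the failed attempt itself signals that the work was already done.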
Still, whether for this reason or others not yet contemplated, once we start processing huge numbers with multiple KBpedia build files, I am seeing performance much slower than what I would like. We address those topics in the next installment, which will also cause us to detour still further before we can get back on track to completing our property structure additions to the build.
3 thoughts on “CWPK #40: Looping and Multiple Structure File Ingest”
I have noticed throughout the past CWPKs that whenever I was looping through the typologies, I would get error messages such as
“* Owlready2 * Warning: ignoring cyclic subclass of/subproperty of, involving:”
with some reference concepts linked below this error. One of the specific concepts that were linked here was http://kbpedia.org/kko/rc/Person and http://kbpedia.org/kko/rc/HomoSapiens. Moreover, the large code block for the function class_struct_builder did not work for me and raised a TypeError in the Pass #2 part of the code block with rc, where the code reads as var1.is_a.append(var_2). The error reads as: TypeError: a __bases__ item causes an inheritance cycle.
For reference, my build_deck has the base as ‘kbpedia/v300/build_ins/typologies/typol_’ in relation to CWPK 39 running the similar code block with src_file = ‘kbpedia/v300/build_ins/typologies/typol_AudioInfo.csv’. Since there isn’t really a v300 folder on the github for kbpedia, I used the typologies folder from the sandbox folder in CWPK. Perhaps I wasn’t supposed to do this.
What does the owlready2 warning mean? I will try to figure this out on my own, but if you have come across this error in your troubleshooting, I would appreciate the guidance on how to fix it.
Yes, as I mentioned first in CWPK #25, you can ignore these warning messages. In the case of Person and HomoSapiens, this warning is the result of a purposeful design decision where we represent humans with two concepts, one related to ‘personhood’ and the other related to ‘biological animals’. This separation enables us to treat the Persons and Animals typologies as distinct. Some might argue with this design decision, but we chose to take it because the scope of each of those typologies is distinct in our view. In early versions of owlready2 such cyclic references caused the code to throw an error. But, for similar reasons to what we do, the developer changed the code to merely show a warning. Again, you may ignore (or, for your own KGs, make sure that both concepts are not asserted as subclasses of the other, which will remove the cycle and the warnings).
As for the v300 reference, that comes about because of the ongoing development of the code base. We will eventually be producing a new version 3.00 in this series, but have not yet gotten to that installment. To make sure this routine works, make sure that all of the typologies called by the dictionary are indeed in the folder you are referencing. I suspect you are missing one or more, or perhaps have a name mismatch. If that does not solve the problem, let me know and I can work with you offline to make sure your environment is clean.
(BTW, unfortunately, the working integrity of the code base may need to await the completion of the series when all files are written and covered. I’m trying to make sure things work every step of the way, but that is kind of hard with the dynamic changes happening daily. 😉 )
I have checked out if all of the typologies in typol_dict are actually in the folder that I am referencing, and it turns out that it indeed has all of the typologies. I will troubleshoot a bit more till the weekend to try to more accurately catch the issue. If it still doesn’t work, I will send you an email.
BTW, I noticed that you are using print statements as some type of progress update in a lot of these code blocks. I would recommend using the tqdm package, which displays the progress of your work: https://github.com/tqdm/tqdm. It only has you change your code by turning loop_list into tqdm(loop_list) and gives you a nice progress bar for these large looping jobs.