Posted: September 29, 2020

CWPK #44: Annotation Ingest

More Fields, But Less Complexity

We now tackle the ingest of annotations for classes and properties in this installment of the Cooking with Python and KBpedia series. In prior installments we built the structural aspects of KBpedia. We now add the labels, definitions, and other assignments to them.

As with the extraction routines, we will split these efforts into class annotations and then property annotations. Our actual load routines are fairly straightforward, and we have no real logic concerns in how these annotations get added. The most complex wrinkle we will need to address involves those annotation fields, altLabels and notes in particular, where we have potentially many assignments for a single reference concept (RC) or property. As we saw with the extraction routines, for these items we will need to set up additional internal loops to segregate and assign the items for loading based on our standard double-pipe (‘||’) delimiter.
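As a minimal sketch of the double-pipe handling described above, the helper below splits one multi-valued field into its items. The name split_multi and the sample values are hypothetical, and the whitespace stripping is an added assumption; the build routines later in this installment split the string inline without stripping.

```python
def split_multi(value, delimiter='||'):
    """Split a multi-valued annotation field into a list of clean items."""
    # An empty field yields an empty list rather than ['']
    return [item.strip() for item in value.split(delimiter) if item.strip()]


# Hypothetical sample values, for illustration only
print(split_multi('mammals||Mammalia||mammalian'))
print(split_multi(''))
```

Returning an empty list for an empty field means the calling loop simply does nothing, which avoids a separate test against [''].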

The two functions we develop in this installment, class_annot_build and prop_annot_build, will be added to the build.py module.

Start-up

Since we are in an active part of the build cycle, we want to continue with our main knowledge graph in-progress for our load routine, so please make sure that kb_src is set to ‘standard’ in your config.py configuration. We then invoke our standard start-up:

from cowpoke.__main__ import *
from cowpoke.config import *

Loading Class Annotations

Class annotations potentially consist of the item’s prefLabel, altLabels, definition, and editorialNote. The first item is mandatory, and the next two should be provided to adhere to best practices. The last is optional. There are, of course, other standard annotations possible. Should your own conventions require or encourage them, you will likely need to modify the procedure below to account for that fact.
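One way to check an input row against these conventions before loading is sketched here. The function name check_class_row and the sample row are hypothetical, and the check is not part of cowpoke; it only assumes that a row is a dictionary of strings keyed by the annotation names above.

```python
def check_class_row(row):
    """Return a list of warnings for one annotation row (a dict of strings)."""
    problems = []
    # prefLabel is the only mandatory annotation
    if not row.get('prefLabel', '').strip():
        problems.append('missing mandatory prefLabel')
    # altLabel and definition are recommended by best practice
    for field in ('altLabel', 'definition'):
        if not row.get(field, '').strip():
            problems.append('missing recommended ' + field)
    # editorialNote is optional, so it is never reported
    return problems


# Hypothetical sample row, for illustration only
sample = {'id': 'Mammal', 'prefLabel': 'mammal', 'altLabel': '',
          'definition': 'A warm-blooded vertebrate.', 'editorialNote': ''}
print(check_class_row(sample))
```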

As with the earlier methods, we provide a header showing ‘typical’ configuration settings (in config.py), and then proceed with a method that loops through all of the rows in the input file. Here is the basic class annotation build procedure. There are no new wrinkles in this routine from what has been seen previously:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                  
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : file_dict.values(),                           # see 'in_file'
# 'loop'          : 'class_loop',
# 'in_file'       : 'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/Generals_annot_out.csv',
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts_test.csv',


import csv    # harmless if the start-up imports already provide it

def class_annot_build(**build_deck):
    print('Beginning KBpedia class annotation build . . .')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    class_loop = build_deck.get('class_loop')
#    r_id = ''
#    r_pref = ''
#    r_def = ''
#    r_alt = ''
#    r_note = ''
    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval) 
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            # No fieldnames are passed, so the CSV header row supplies the column
            # names (assumes the header uses the names read below)
            reader = csv.DictReader(input, delimiter=',')
            for row in reader:
                r_id_frag = row['id']
                id = getattr(rc, r_id_frag)
                if id is None:
                    # No matching reference concept; report it and move on
                    print(r_id_frag)
                    continue
                r_pref = row['prefLabel']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                id.prefLabel.append(r_pref)
                id.definition.append(r_def)
                i_alt = r_alt.split('||')
                if i_alt != ['']: 
                    for item in i_alt:
                        id.altLabel.append(item)
                i_note = r_note.split('||')
                if i_note != ['']:   
                    for item in i_note:
                        id.editorialNote.append(item)
    print('KBpedia class annotation build is complete.')               
class_annot_build(**build_deck)
kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts_test.owl', format='rdfxml') 

BTW, when we commit this method to our build.py module, we will add the save routine at the end.
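The row handling in the routine above can be tried without a loaded knowledge graph. The sketch below reads a small in-memory sample with csv.DictReader, lets the header row supply the field names, and splits the multi-valued fields. The sample data and the function name read_annotations are hypothetical, and nothing here touches owlready2 or the rc namespace.

```python
import csv
import io

# Hypothetical two-row sample standing in for an annotation input file
sample_csv = (
    'id,prefLabel,altLabel,definition,editorialNote\n'
    'Mammal,mammal,mammals||Mammalia,A warm-blooded vertebrate.,\n'
    'Bird,bird,,A feathered vertebrate.,first note||second note\n'
)


def read_annotations(handle):
    """Read annotation rows, splitting the multi-valued fields into lists."""
    rows = []
    for row in csv.DictReader(handle):      # header row supplies field names
        for field in ('altLabel', 'editorialNote'):
            # Empty fields become empty lists rather than ['']
            row[field] = [v for v in row[field].split('||') if v]
        rows.append(row)
    return rows


rows = read_annotations(io.StringIO(sample_csv))
for row in rows:
    print(row['id'], row['altLabel'], row['editorialNote'])
```

Working from an in-memory sample like this makes it easy to confirm how empty and multi-valued fields behave before pointing the real routine at a full input file.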

Loading Property Annotations

We now turn our attention to annotations of properties:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                  
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : prop_dict.values(),                           # see 'in_file'
# 'loop'          : 'property_loop',
# 'in_file'       : 'C:/1-PythonProjects/kbpedia/v300/build_ins/properties/prop_annot_out.csv',
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts_test.csv',

def prop_annot_build(**build_deck):
    print('Beginning KBpedia property annotation build . . .')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    out_file = build_deck.get('out_file')
    if loop != 'property_loop':
        print("Needs to be a 'property_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval) 
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'prefLabel', 'subPropertyOf', 'domain',  
                                   'range', 'functional', 'altLabel', 'definition', 'editorialNote'])                 
            for row in reader:
                # Skip the header row first, so that a failed lookup on the
                # header does not cause the first data row to be skipped
                if is_first_row:
                    is_first_row = False
                    continue
                r_id = row['id']
                r_pref = row['prefLabel']
                r_dom = row['domain']
                r_rng = row['range']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                r_id = r_id.replace('rc.', '')
                id = getattr(rc, r_id)
                if id is None:
                    print(r_id)
                    continue
                id.prefLabel.append(r_pref)
                i_dom = r_dom.split('||')
                if i_dom != ['']: 
                    for item in i_dom:
                        id.domain.append(item)
                if 'owl.' in r_rng:
                    r_rng = r_rng.replace('owl.', '')
                    r_rng = getattr(owl, r_rng)
                    id.range.append(r_rng)
                else:
                    # Empty or non-OWL ranges are left unassigned for now
                    # id.range.append(r_rng)
                    pass
                i_alt = r_alt.split('||')    
                if i_alt != ['']: 
                    for item in i_alt:
                        id.altLabel.append(item)
                id.definition.append(r_def)        
                i_note = r_note.split('||')
                if i_note != ['']:   
                    for item in i_note:
                        id.editorialNote.append(item)
    print('KBpedia property annotation build is complete.') 
prop_annot_build(**build_deck)

Hmmm. One of the things we notice in this routine is that our domain and range assignments have not been adequately picked up in our earlier KBpedia version 2.50 build routines (the ones undertaken in Clojure before this CWPK series). As a result, we cannot adequately test range and will need to address this oversight before our series is over.
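To get a feel for how much of the range column the routine above would actually assign, one could tally the range strings by how they are handled. This is only a sketch: classify_range and the sample strings are hypothetical, and the three categories simply mirror the branches in prop_annot_build.

```python
from collections import Counter


def classify_range(value):
    """Label a range string by the branch that would handle it."""
    if not value.strip():
        return 'empty'          # nothing to assign
    if 'owl.' in value:
        return 'owl'            # resolved against the owl namespace
    return 'other'              # currently left unassigned


# Hypothetical range strings, for illustration only
sample_ranges = ['owl.Thing', '', 'xsd.string', 'owl.Thing', 'rc.Person']
counts = Counter(classify_range(v) for v in sample_ranges)
print(counts)
```

A large 'other' count would indicate how many range assignments still need a resolution step before they can be tested.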

As before, we will add our ‘save’ routine as well when we commit the method to the build.py module.

kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts_test.owl', format='rdfxml') 

We now have all of the building blocks to create our extract-build roundtrip. We summarize the formal steps and configuration settings in CWPK #47. But, first, we need to return to cleaning our input files and instituting some unit tests.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available either as an online interactive file or as a direct download to use locally. Make sure to pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

6 thoughts on “CWPK #44: Annotation Ingest”

  1. I am a bit confused about having loop_list be referring to file_dict.values() because in config.py file_dict only has ‘wikipedia-categories’ as is shown in https://github.com/Cognonto/cowpoke/blob/master/config.py

    Should we have in_file not be the items in loop_list for both of these methods? It seems like file_dict won’t refer to these. I guess what may have been intended is to have file_dict refer to the in_file in both methods.

  2. Hi Varun,

    Another good catch. I think the proper value here is prop_dict.values(). I must have not updated the setting after an earlier test. I will update the Binder files tomorrow when I post the next installment. I have changed the blog posting here.

    Thanks, Mike

  3. I got a bit busy during this past month and just began going through the CWPK series again. Trying to read this again, and I couldn’t understand what file_dict.values() should be for class_annot_build.

    Moreover, I was a bit confused why in both functions we use in_file = loopval here. The reason I’m confused is that we want to have ‘in_file’ as part of the build_deck dictionary. So according to the code in here (and in cowpoke), we’d want to open the file which is loopval (in the case for prop_annot_build), so this means that we are trying to open ‘kko.predicateProperties’, ‘kko.predicateDataProperties’, ‘kko.representations’. I thought that we’d want in_file to be C:/1-PythonProjects/kbpedia/v300/build_ins/other_stuff.

    Sorry for asking about this again, I’m just a bit confused.

  4. Hi Varun,

    Good; I’m glad you figured it out. However, I would not be surprised if some of the header instruction information is wrong or lists unused variables. (Perhaps a leftover from cut-and-paste.) If you care to cite the specific CWPK and routine, I will look at it to make sure the instructions are accurate.

    Thanks!

  5. Well I was looking at CWPK 47 and was looking at both prop2_annot_build and class2_annot_build. The loop_list for both of these are supposed to be the one item in in_file, correct?
