Posted: September 4, 2020

CWPK #30: Extracting Annotations

Everything Can Be Annotated in a Knowledge Graph

We’ve seen in the previous two installments of this Cooking with Python and KBpedia series various ways to specify a subset population for driving an iterative process for extracting structure from KBpedia. We’re going to retain that iterative approach, only change it now to extract annotations. Classes, properties, and instances (individuals) may all be annotated in OWL. We thus need to derive generalized approaches that can apply to any entity in a knowledge graph.

Annotations are information applied to a given entity in order to point to it, describe it, or identify it. As a best practices matter, there are certain fields we recommend be universally applied to annotate any given entity:

  • A preferred label (prefLabel) that is the standard name or title for a thing
  • Multiple alternative labels (altLabel) that capture any of the ways a given thing may be referred to, including synonyms, acronyms, jargon, etc.
  • A definition of the thing (definition)
  • All labels should be tagged with a language tag in order to more readily support translation and use in multiple languages.

We may also find comments or notes associated with particular items. Further, in the case of object or data properties, we may have additional characterizations such as domain or range or functionality assigned to the item. We could have retrieved these characterizations as part of our structural extractions, but decided to include them rather in an annotation extraction pass (even though those characterizations are not annotative).

Items to be Extracted During Annotation Pass

We can thus assemble a list of items that may be extracted during an annotation extraction pass. We could do these extractions in parts, since that is often the better approach during the inverse process of building our knowledge graph. However, given the number of annotation and related items that may be extracted, and the number of combinations of them, we decide as a matter of simplicity to extract all such information as a single record for each subject entity. We can later manipulate the large flat files so generated if we need to focus on subsets of them. We may revisit this question once we tackle the build side of this roundtripping process.

Some of the items that we will extract have multiple entries per subject. Parental class is one such item, as are alternative labels, which may number into the tens for a rather complete characterization. From our experience in the last installments we know we will need to set up some inner loops to accommodate such multiple entries. So, with these understandings, we can now compile a list of items that may be extracted on an annotation extraction pass, including whether the item is limited to a single entry or may have many:

  • IRI fragment name: single
  • prefLabel: single
  • altLabel: many
  • superclass: many
  • definition: single
  • editorialNote: many
  • mapping properties: many (a characterization that will grow over time)
  • comment: many
  • domain: single (object and data properties, only)
  • range: single (object and data properties, only)
  • functional type: single (object and data properties, only)

So, we decide to develop two variants of our code block: a standard one, and an expanded one that includes the object and data property additions. The IRI fragment name is the alias used internally in our Python programs and what gets concatenated with the base IRI to form the full IRI for the entity.

Also, to maintain the idea of a single line per subject entity, we decide that: 1) we will separate multiple entries for a given item with the ‘||’ (“double pipe”) separator, which we use because it is never used in the wild and it is easy to spot when scanning code; and 2) we will not use full IRIs in order to aid record readability.

(BTW, if we decide over time to add other standard characterizations to our items we will adjust our routines accordingly.)
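
The single-line record convention described above can be sketched with plain Python strings. This is only an illustration: the variable names are hypothetical, and an ordinary list stands in for the values that will later come from the knowledge graph.

```python
SEP = '||'                                   # the "double pipe" separator

alt_labels = ['bag', 'bags', 'luggage']      # stand-in for a many-valued item
field = SEP.join(alt_labels)                 # collapse the entries into one field
print(field)                                 # bag||bags||luggage

restored = field.split(SEP)                  # recover the individual entries
print(restored == alt_labels)                # True
```

Because the separator does not normally occur inside labels, splitting on it recovers the original entries, which matters once these files are read back in on the build side.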

Starting and Load

We again begin with our standard opening routine, except we have now substituted ‘kbpedia’ for ‘main’ in the first line, to make our reference going forward more specific:

Which environment? The specific load routine you should choose below depends on whether you are using the online MyBinder service (the ‘raw’ version) or local files. The example below is based on using local files (though replace with your own local directory specification). If loading from MyBinder, replace with the lines that are commented (#) out.
kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
# kbpedia = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core' 
kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
# kko_file = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kko.owl'


from owlready2 import *
world = World()
kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')               

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

As always, we execute each cell as we progress down this notebook page by pressing shift+enter for the highlighted cell or by choosing Run from the notebook menu.

Basic Extraction Set-up

We tackle the smaller (non-property) variant of our code block first, treating the extracted items listed above as the members of a Python list. We also choose to prefix our variables with annot_. We will start with a single item, forgoing the loop for the moment, to test whether we have gotten our correspondences right. For the class set-up we’ll use the relatively small rc.Luggage class. (You may substitute any KBpedia RC as this item.)

s_item = rc.Luggage
annot_pref = s_item.prefLabel
annot_sup  = ''
# annot_sup  = s_item.superclass  # maybe it should be is_a
annot_alt  = s_item.altLabel
annot_def  = s_item.definition
annot_note = s_item.editorialNote
annot = [annot_pref, annot_sup, annot_alt, annot_def, annot_note]
print(annot)
[['baggage'], '', ['bag', 'bags', 'luggage'], ['This ‘class’ product category is drawn from UNSPSC, readily converted into other widely used systems. This product ‘class’ is an aggregate of luggage. This product category corresponds to the UNSPSC code: 53121500.'], []]

We need to add a few items to deal with specific property characteristics including domain, range, and functional type (which is blank in all of our cases):

s_item = kko.representations
annot_pref = s_item.prefLabel
annot_sup  = s_item.is_a
annot_dom  = s_item.domain
annot_rng  = s_item.range
annot_func = ''
annot_alt  = s_item.altLabel
annot_def  = s_item.definition
annot_note = s_item.editorialNote
annot = [annot_pref, annot_sup, annot_dom, annot_rng, annot_func, annot_alt, annot_def, annot_note]
print(annot)
[['representations'], [owl.AnnotationProperty], [], [], '', ['annotations', 'indexicals', 'metadata'], ['Pointers or indicators, including symbolic ones such as text or URLs, that draw attention to the actual or dynamic object.'], []]

KBpedia does not use functional properties at present. I leave a placeholder above, but have not worked out the owlready2 access methods.

Working Out the Code Block

A quick inspection of these outputs flags a few areas of concern. We see that items are often enclosed in square brackets (list notation in Python), we have many quoted items, and we have (as we knew) multiple entries for some fields, especially altLabel and parents. In order to test our code block, we will need to have a test set loaded. We decide to keep on with rc.Luggage, but I throw in a length count. You can substitute any non-leaf RC into the code if you want a larger or smaller or different domain test set.

root = rc.Luggage
s_set=root.descendants()

len(s_set)
25

For the iteration part for the multiple entries, we begin with the code blocks used for the inner loops dealing with the structural backbone issues in CWPK #28 and CWPK #29. But the purpose of tracing inheritance is different from retrieving values for multiple attributes of a single entity. So perhaps we should first tackle what seems to be the easier concern: removing the enclosing brackets (‘[ ]’).

I also decide as we test out these code blocks that I would shorten the variable names to reduce the amount of typing and to reflect a more general procedure. So, all of the annot_ prefixes from above become a_.

Poking around I first find a string replacement example, followed by the .join method for strings:

for s_item in s_set:
  a_pref = s_item.prefLabel
  a_sup  = s_item.is_a
  a_alt  = s_item.altLabel
  a_def  = s_item.definition
  a_note = s_item.editorialNote
  a_     = [a_pref,a_sup,a_alt,a_def,a_note]
  def listToStringWithoutBrackets(a_):
    return str(a_).replace('[','').replace(']','')
  listToStringWithoutBrackets(a_)
  print( ','.join( repr(e) for e in a_ ) )
len(s_set)

I try another string substitution example with similarly disappointing results:

for s_item in s_set:
  a_pref = s_item.prefLabel
  a_sup  = s_item.is_a
  a_alt  = s_item.altLabel
  a_def  = s_item.definition
  a_note = s_item.editorialNote
  a_     = [s_item,a_pref,a_sup,a_alt,a_def,a_note]
  a_     = [str(i) for i in a_]
  a_out  = ','.join(a_).strip('[]') 
  print(a_out)
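
To see what these two attempts actually produce, here is a small sketch with plain Python lists standing in for the values owlready2 returns (an assumption for illustration; the real values are owlready2 list-like objects, and the sample record is abbreviated):

```python
# Stand-in for one record: each field is itself a list
a_ = [['baggage'], ['bag', 'bags', 'luggage'], []]

# Attempt 1: repr() of each field keeps that field's own brackets
first = ','.join(repr(e) for e in a_)
print(first)       # ['baggage'],['bag', 'bags', 'luggage'],[]

# Attempt 2: strip('[]') only trims brackets at the two ends of the joined string
second = ','.join(str(i) for i in a_).strip('[]')
print(second)      # 'baggage'],['bag', 'bags', 'luggage'],
```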

The reason these and multiple other string approaches failed is that we are dealing with result sets with multiple entries. It seemed like the safest way to ensure the fields were treated as strings was to explicitly declare them as such, and then manipulate the string directly. So, in the code below, we grab the property from the entity, convert it to a string, and then remove the first and last characters of the entire string, which in our case are the brackets. Note in this test code that I also (temporarily) comment out the two fields where we have possibly multiple items that we want to loop over and concatenate into a single string entry:

a_item = ''
for s_item in s_set:
  a_pref = s_item.prefLabel
  a_pref  = str(a_pref)[1:-1]                   # this is one way to remove opening and closing characters ([ ])
#  a_sup  = s_item.is_a
#  a_alt  = s_item.altLabel
  a_def  = s_item.definition
  a_def  = str(a_def)[1:-1]
  a_note = s_item.editorialNote
  a_note  = str(a_note)[1:-1]
  print(s_item,a_pref,a_sup,a_alt,a_def,a_note, sep=',')

We still see brackets in the listing, but those are for the two properties we commented out. All other brackets are now gone. While I really do not like repeating the same bracket removal code multiple times, it works, and after spending perhaps more time than I care to admit trying to find a more elegant solution, I decide to accept the workable over the perfect. I am hoping when we loop over the elements for the two fields commented out that we will be extracting the element from each pass of the loop, and thus via processing will see the removal of their brackets. (That apparently is what happens in the loop steps below.)
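
For readers who share the dislike of repeated code, one possible refactoring is a small helper function. This is a sketch rather than the approach used in this installment: the name strip_brackets is hypothetical, and plain lists stand in for owlready2 values.

```python
def strip_brackets(value):
    """Convert a list-like value to a string and drop the enclosing '[' and ']'."""
    return str(value)[1:-1]

print(strip_brackets(['baggage']))   # 'baggage'  (the quotes remain, as in the listing)
print(strip_brackets([]))            # prints an empty line
```

The behavior is the same as the inline slicing; the helper only gathers the repeated step in one place.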

Now, it is time to tackle the harder question of collapsing (also called ‘flattening’) a field with multiple entries. The basic idea of this inner loop is to treat all elements as strings, loop over the multiple elements, grab the first one and end if there is only one, but to continue looping if there is more than one until the number of elements is met, and to add a ‘double pipe’ (‘||’) character string to the previously built elements before concatenating the current element. This order of putting the delimiter at the beginning of each loop result is to make sure our final string with all concatenated results does not end with a delimiter. The skipping of the first pass means no delimiter is added at the beginning of the first element, also good if there is only one element for a given entity, which is often the case.

There are very robust for and while operators in Python. The one I settled on for this example uses enumerate, which yields (index, element) tuples so that we get both the current element and its numeric index:

a_item = ''
for s_item in s_set:
  a_pref = s_item.prefLabel
  a_pref  = str(a_pref)[1:-1]
#  a_sup  = s_item.is_a
  a_alt  = s_item.altLabel
  for a_id, a in enumerate(a_alt):            # here is the added inner loop as explained in text
    a_item = str(a)
    if a_id > 0:                              # every entry after the first gets a leading '||'
        a_item = a_alt + '||' + str(a)
    a_alt  = a_item                           # a_alt now holds the string built so far
  a_def  = s_item.definition
  a_def  = str(a_def)[1:-1]
  a_note = s_item.editorialNote
  a_note  = str(a_note)[1:-1]
  print(s_item,a_pref,a_sup,a_alt,a_def,a_note, sep=',')

Now that the inner loop example is working we can duplicate the approach for the other inner loop and move on to putting a full working code block together.
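
The delimiter logic described above can also be written as a standalone function, which makes it easy to check against Python's built-in str.join. This is a sketch under assumptions: flatten is a hypothetical name, and plain lists stand in for the owlready2 values.

```python
def flatten(values, sep='||'):
    """Concatenate entries, placing sep before every entry after the first."""
    out = ''
    for idx, value in enumerate(values):
        if idx > 0:                 # no delimiter on the first pass
            out = out + sep
        out = out + str(value)
    return out

labels = ['bag', 'bags', 'luggage']
print(flatten(labels))                                         # bag||bags||luggage
print(flatten(['tote']))                                       # tote
print(flatten(labels) == '||'.join(str(v) for v in labels))    # True
```

Because the delimiter is only added from the second entry onward, a single-entry field comes back unchanged and no result ends with a trailing delimiter.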

Class Annotations

OK, so we appear ready to start finalizing the code block. We will start with class annotations because they have fewer fields to capture. The first step we want to do is to remove the pesky rc. namespace prefix in our output. Remember, this came from a tip in our last installment:

def render_using_label(entity):
    return entity.label.first() or entity.name

set_render_func(render_using_label)

(How to set it back to the default is described in the prior installment.)

We also pick a class and its descendants to use in our prototype example. I also add a len statement in the code to indicate how many classes we will be processing in this example:

root = rc.Luggage
s_set=root.descendants()

len(s_set)

We now expand our code block to set our initial iterator to an empty string, fix (remove) the brackets, and process the two inner loops of the altLabels and parent classes putting the “double pipe” (‘||’) between entries:

a_item = ''
for s_item in s_set:
  a_pref = s_item.prefLabel
  a_pref  = str(a_pref)[1:-1]
  a_sup  = s_item.is_a
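  for a_id, a in enumerate(a_sup):
    a_item = str(a)
    if a_id > 0:                              # every entry after the first gets a leading '||'
        a_item = a_sup + '||' + str(a)        # (with 'a_id > 1' the first of several entries is lost, as in the sample listing below)
    a_sup  = a_item
  a_alt  = s_item.altLabel
  for a_id, a in enumerate(a_alt):
    a_item = str(a)
    if a_id > 0:
        a_item = a_alt + '||' + str(a)
    a_alt  = a_item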
  a_def  = s_item.definition
  a_def  = str(a_def)[1:-1]
  a_note = s_item.editorialNote
  a_note  = str(a_note)[1:-1]
  print(s_item,a_pref,a_sup,a_alt,a_def,a_note, sep=',')
EveningBag,'evening bag',Purse,evening bags,'The collection of all evening bags. A type of Purse. The collection EveningBag is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
Gucci,'Gucci',Luggage,GUCCI,'Gucci (/ɡuːtʃi/; Italian pronunciation: [ˈɡuttʃi]) is an Italian luxury brand of fashion and leather goods, part of the Gucci Group, which is owned by the French holding company Kering. Gucci was founded by Guccio Gucci in Florence in 1921.Gucci generated about €4.2 billion in revenue worldwide in 2008 according to BusinessWeek and climbed to 41st position in the magazine\'s annual 2009 \\"Top Global 100 Brands\\" chart created by Interbrand; it ranked retained that rank in Interbrand\'s 2014 index. Gucci is also the biggest-selling Italian brand. Gucci operates about 278 directly operated stores worldwide as of September 2009, and it wholesales its products through franchisees and upscale department stores. In the year 2013, the brand was valued at US$12.1 billion, with sales of US$4.7 billion.',
Briefcase,'briefcase',CarryingCase||Device-OfficeProduct-NonConsumable||OfficeProductMarketCategory||Box-Container,Attache cases||Attaché case||Attaché cases||Brief case||Handlebox||Portfolio (briefcase)||briefcases,'The collection of all briefcases, which are small portable cases designed to carry hardcopies of documents. A type of Luggage and Box_Container. The collection Briefcase is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType. This ‘commodity’ product category is drawn from UNSPSC, readily converted into other widely used systems. This commodity category is for briefcase. This product category corresponds to the UNSPSC code: 53121701.',
DuffelBag,'duffel bag',Luggage,Duffel bags||Duffle||Duffle Bag||Duffle bag||Dufflebag||Kit bag||Kit-bag||Seabag||The dufflebag||duffel bags,'The collection of all duffel bags. A type of Luggage. The collection DuffelBag is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
ShoulderBag,'shoulder bag',Luggage,shoulder bags,'The collection of all shoulder bags -- luggage bags with straps for carrying over the shoulder. A type of Bag and Luggage. The concept ShoulderBag is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
GolfBag,'golf bag',CarryingCase,golf bags,'Cylindrical bag about one meter long, open on one end; for carrying golf clubs. Has shoulder strap.',
SkiCarrier,'ski carrier',MechanicalDevice||ContainerArtifact,ski carriers,'The collection of all ski carriers. A type of ContainerArtifact and MechanicalDevice. The concept SkiCarrier is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
Backpack,'knapsack',PurseHandbagBag||SomethingToWear||Luggage,Back packs||Backbacks||Backpacks||Book bags||Book sack||Bookbags||Booksack||Day pack||Daypack||Haver sack||Haver sacks||Haversacks||Knap sack||Knap sacks||Knapsack||Pack sack||Pack sacks||Packsack||Packsacks||Ruck sack||Rucksack||Rucksacks||Schoolbag||backpack||backpacks||knapsacks||rucksack||rucksacks,'The collection of all backpacks. A type of Luggage, NonPoweredDevice, and SomethingToWear. The collection Backpack is an ArtifactTypeByFunction and a SpatiallyDisjointObjectType.',
Luggage,'baggage',TransportationContainerProduct,bags||luggage,'This ‘class’ product category is drawn from UNSPSC, readily converted into other widely used systems. This product ‘class’ is an aggregate of luggage. This product category corresponds to the UNSPSC code: 53121500.',
TravelAccessory,'travel accessory',Luggage,travel gear,'Instances are tangible products which are intended for use during travelling. Often they are lighter-weight, more-compact versions of products people normally use at home.',
DiaperBag,'diaper bag',ShoulderBag,Nappy bag||diaper bags,'The collection of all diaper bags. A DiaperBag is a bag which the caretaker of a baby may put Diapers, wipes, and other baby-related products into to bring with him or her when traveling with a HumanInfant or HumanToddler.',
LuggageTag,'luggage tag',TravelAccessory,Luggage tags||luggage tags,'The collection of all luggage tags. A type of Tag_IBO and TravelAccessory. The collection LuggageTag is an #$AerodromeCollection, a SpatiallyDisjointObjectType, and an ArtifactTypeByGenericCategory.',
WomensToteBag,'tote',PurseHandbagBag,satchels||totes||women's satchel||women's satchels||women's tote||women's tote bag||women's tote bags||women's totes,'The collection of all totes. A type of WomensBag. The collection WomensToteBag is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
LuggageSet,'luggage ensemble',Luggage,luggage set||luggage sets,'The collection of all luggage ensembles. A type of group of baggage, Artifact, and DurableGood. LuggageSet is an ArtifactTypeByGenericCategory.',
Suitcase,'suitcase',Luggage,Suitcases||Swedish lunchbox||Travel bag||Trolley case||Valise||grip||grips||suitcases||travelling bag||travelling bags||valise||valises,'Retangular piece of luggage with handle that is used when you are on a trip.',
PurseHandbagBag,'purse or handbag or bag',ContainerArtifact||LuggageHandbagPack,Clutch (handbag)||Evening bag||Hand bag||Hand-bag||Hand-bags||Handbags||Man bag||Man purse||Man-bag||Manbag||Manpurse,'This ‘class’ product category is drawn from UNSPSC, readily converted into other widely used systems. This product ‘class’ is an aggregate of purse or handbag or bag. This product category corresponds to the UNSPSC code: 53121600.',
PortfolioCase,'portfolio case',CarryingCase,portfolio cases||portfolios,'The collection of all portfolio cases, which are cases desinged for carrying portfolios of art or writing. A type of CarryingCase. The collection PortfolioCase is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
Purse,'purse',NonPoweredDevice||PersonalDevice,Purses||purses,"The collection of all purses. A type of women's clothing accessory, PersonalDevice, and WomensBag. The collection Purse is a SomethingToWearTypeByGenericCategory and a SpatiallyDisjointObjectType.",
ComputerBag,'computer carrying case',Luggage,computer bags||computer carrying cases||laptop bag||laptop bags||laptop carrying case||laptop carrying cases||laptop case||laptop cases,'The collection of all computer carrying cases. A type of accessories for laptop computers and CarryingCase. The collection ComputerBag is an ArtifactTypeByFunction and a SpatiallyDisjointObjectType.',
CarryOnLuggage,'carry on luggage',Luggage,carry on bag||carry on bags||carry-on||carry-on bag||carry-on baggage||carry-on bags||carry-on luggage,'The collection of all carry-on luggage. A type of Luggage. The collection CarryOnLuggage is an ArtifactTypeByFunction, a SpatiallyDisjointObjectType, and an #$AerodromeCollection.',
VoltageConverter,'voltage converter',TravelAccessory,voltage adapters||voltage adaptor||voltage adaptors||voltage converters,'The collection of all voltage converters. A type of TravelAccessory, ElectricalDevice, and travel accessory. VoltageConverter is a SpatiallyDisjointObjectType.',
CarryingCase,'carrying case',Luggage,carrying cases,'The collection of all carrying cases. A type of Luggage and Box_Container. The collection CarryingCase is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
GarmentBag,'hanging bag',Luggage,garment bags||hanging bags||suit bag,'The collection of all garment bags. A type of Luggage. The collection GarmentBag is an ArtifactTypeByFunction and a SpatiallyDisjointObjectType.',
CoinPurse,'coin purse',PurseHandbagBag,[],'This ‘commodity’ product category is drawn from UNSPSC, readily converted into other widely used systems. This commodity category is for coin purse. This product category corresponds to the UNSPSC code: 53121605.',
LuggageOrganizer,'luggage organizer',Organizer||ContainerArtifact,luggage organisers||luggage organizers,'This is the collection of products which fit into luggage whose purpose is to structure the space inside in such as way that the items stored therein are more organized and perhaps better protected.',

The routine now seems to be working how we want it, so we move on to accommodate the properties as well.

Property Annotations

Again, we set the renderer to the ‘clean’ setting and now pick a property and its sub-properties to populate our working set:

def render_using_label(entity):
    return entity.label.first() or entity.name

set_render_func(render_using_label)
root = kko.representations
p_set=root.descendants()

len(p_set)
2901

This example has nearly 3000 sub-properties! That should make for an interesting example. We add our three new fields (domain, range, and functional type) to the prior code block. We also make another change, which is to substitute the p_ prefix (for properties) for the prior s_ prefix (for subjects, whether classes or individuals):

a_item = ''
for p_item in p_set:
  a_pref = p_item.prefLabel
  a_pref  = str(a_pref)[1:-1]
  a_sup  = p_item.is_a
  for a_id, a in enumerate(a_sup):
    a_item = str(a)
    if a_id > 0:                              # every entry after the first gets a leading '||'
        a_item = a_sup + '||' + str(a)
    a_sup  = a_item
  a_dom  = p_item.domain
  a_dom  = str(a_dom)[1:-1]
  a_rng  = p_item.range
  a_rng  = str(a_rng)[1:-1]
  a_func = ''                                 # placeholder; KBpedia does not use functional properties at present
  a_alt  = p_item.altLabel
  for a_id, a in enumerate(a_alt):
    a_item = str(a)
    if a_id > 0:
        a_item = a_alt + '||' + str(a)
    a_alt  = a_item
  a_def  = p_item.definition
  a_def  = str(a_def)[1:-1]
  a_note = p_item.editorialNote
  a_note  = str(a_note)[1:-1]
  print(p_item,a_pref,a_sup,a_dom,a_rng,a_func,a_alt,a_def,a_note, sep=',')

Fantastic! It seems that our basic annotation retrieval mechanisms are working properly.
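
Since the class and property blocks differ only in which fields they read, the two variants could in principle share one routine driven by a field list. The following is a sketch, not this installment's code: annotation_record, CLASS_FIELDS, and PROPERTY_FIELDS are hypothetical names, the sample values are abbreviated stand-ins, and types.SimpleNamespace objects take the place of owlready2 entities so the example runs without loading KBpedia. Unlike the blocks above, single-valued fields here take the first entry rather than slicing off brackets.

```python
from types import SimpleNamespace

# (field name, 'single' or 'many'); hypothetical specifications
CLASS_FIELDS = [('prefLabel', 'single'), ('is_a', 'many'), ('altLabel', 'many'),
                ('definition', 'single'), ('editorialNote', 'many')]
PROPERTY_FIELDS = (CLASS_FIELDS[:2]
                   + [('domain', 'single'), ('range', 'single')]
                   + CLASS_FIELDS[2:])

def annotation_record(entity, fields, sep='||'):
    """Build one flat record (a list of strings) for a stand-in entity."""
    record = [entity.name]
    for field, kind in fields:
        values = getattr(entity, field, [])
        if kind == 'single':
            record.append(str(values[0]) if values else '')
        else:
            record.append(sep.join(str(v) for v in values))
    return record

# Stand-in entities with abbreviated sample values
luggage = SimpleNamespace(name='Luggage', prefLabel=['baggage'],
                          is_a=['TransportationContainerProduct'],
                          altLabel=['bag', 'bags', 'luggage'],
                          definition=['An aggregate of luggage.'],
                          editorialNote=[])
representations = SimpleNamespace(name='representations',
                                  prefLabel=['representations'],
                                  is_a=['AnnotationProperty'],
                                  domain=[], range=[],
                                  altLabel=['annotations', 'indexicals', 'metadata'],
                                  definition=['Pointers or indicators.'],
                                  editorialNote=[])

class_rec = annotation_record(luggage, CLASS_FIELDS)
prop_rec = annotation_record(representations, PROPERTY_FIELDS)
print(class_rec)
print(prop_rec)
```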

You may have noted the sep=',' argument in the print statement. It means to add a comma separator between the output variables in the listing, a useful addition in Python 3 especially given our reliance on comma-separated value (CSV) files.
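
One caveat with a bare comma separator is that several of the definitions above themselves contain commas, so field boundaries in the printed listing can be ambiguous. As a hedged sketch (the sample record is a stand-in, and this is not necessarily the approach taken later in the series), Python's standard csv module quotes such fields automatically:

```python
import csv
import io

# Stand-in record; the last field contains commas
record = ['Luggage', 'baggage', 'bag||bags||luggage',
          'Bags, cases, and containers for travel.']

print(*record, sep=',')          # naive output: the definition's commas look like field breaks

buffer = io.StringIO()           # in-memory stand-in for an output file
csv.writer(buffer).writerow(record)
line = buffer.getvalue()
print(line, end='')              # the definition is now wrapped in double quotes

parsed = next(csv.reader(io.StringIO(line)))
print(parsed == record)          # True: the four fields are recovered intact
```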

We are now largely done with the logic of our extractors. But, before we get to how to assemble the pieces in a working module, it is time for us to take a brief detour to learn about naming and writing output and saving to and reading from files. Since we will be using CSV files heavily, we also work that into next installment’s discussion.

Additional Documentation

The routines in this installment required much background reading and examples having to do with Python loops and string processing. Here are a few I found informative for today’s CWPK installment:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file and as a direct download to use locally. Make sure to pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.
