Posted:September 9, 2020

CWPK #32: Iterating Over a Full Extraction

It is Time to Explore Python Dictionaries and Packaging

In our last coding installments in this Cooking with Python and KBpedia series, prior to our single-installment detour to learn about files, we developed extraction routines for both structure (rdfs:subClassOf) and annotations (of various properties) using the fantastic package owlready2. In practice, these generic routines will loop over populations of certain object types in KBpedia, such as typologies or property types. We want a way to feed these variations to the generic routines in an efficient and understandable way.

Python lists are one way to do so, and we have begun to gain a bit of experience in our prior work with lists and sets. But there is another structure in Python called a ‘dictionary’ that sets up key-value pairs of 2-tuples that promises more flexibility and power. The 2-tuple sets up a relationship between an attribute name (a variable name) with a value, quite similar to the associative arrays in JSON. The values in a dictionary can be any object in Python, including functions or other dictionaries, the latter which allows ‘record’-like data structures. However, there may not be duplicate names for keys within a given dictionary (but names may be used again in other dictionaries without global reference).

Dictionaries ('dicts') are like Python lists except list elements are accessed by their position in the list using a numeric index, while we access dict elements via keys. This makes tracing the code easier. We have also indicated that dictionary structures may be forthcoming in other uses of KBpedia, such as CSV or master data. So, I decided to start gaining experience with 'dicts' in this installment.

(Other apparent advantages of dictionaries not directly related to our immediate needs include:

Dictionaries can be expanded without altering what is already there
From Python 3.7 onward, the order entered into a dict is preserved in loops
Dictionaries can handle extremely large data sets
Dicts are fast because they are implemented as a hash table, and
They can be directly related to a Pandas DataFrame should we go that route.)

We can inspect this method with our standard statement:

dir(dict)

The Basic Iteration Approach

In installments CWPK #28, CWPK #29, and CWPK #30, we created generic prototype routines for extracting structure from typologies and properties and then annotations from classes (including typologies as a subset) and properties as well. We thus have generic extraction routines for:

Structure	Annotations
classes	classes
typologies	typologies (possible)
properties	properties

Our basic iteration approach, then, is to define dictionaries for the root objects in these categories and loop over them invoking these generic routines. In the process we want to write out results for each iteration, provide some progress messages, and then complete the looping elements for each root object. Labels and internal lookups to the namespace objects come from the dictionary. In generic terms, then, here is how we want these methods to be structured:

Initialize method
Message: starting method
Get dict iterator:
- Message: iterating current element
- Get owlready2 set iterator for element:
  - Populate row
  - Print to file
Return to prompt without error message.

Starting and Load

To demonstrate this progression, we begin with our standard opening routine:

Which environment? The specific load routine you should choose below depends on whether you are using the online MyBinder service (the ‘raw’ version) or local files. The example below is based on using local files (though replace with your own local directory specification). If loading from MyBinder, replace with the lines that are commented (#) out.

kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
# kbpedia = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core' 
kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
# kko_file = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kko.owl'

from owlready2 import *
world = World()
kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')               

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

Like always, we execute each cell as we progress down this notebook page by pressing shift+enter for the highlighted cell or by choosing Run from the notebook menu.

Creating the Dictionaries

We will now create dictionaries for typologies and properties. We will construct them using our standard internal name as the ‘key’ for each element, with the value being the internal reference including the namespace prefix (easier than always concatenating using strings). I’ll first begin with the smaller properties dictionary and explain the sytax afterwards:

prop_dict = {
        'objectProperties'    : 'kko.predicateProperties',
        'dataProperties'      : 'kko.predicateDataProperties',
        'annotationProperties': 'kko.representations',
}

A dictionary is declared either with the curly brackets ({ }) with the colon separator for key:value, or by using the d = dict([(<key>, <value>)]) form. The ‘key’ field is normally quoted, except where the variable is globally defined. The ‘value’ field in this instance is the internal owlready2 notation of <namespace> + <class>. There is no need to align the colons except to enhance readability.

Our longer listing is the typology one:

typol_dict = {
        'ActionTypes'           : 'kko.ActionTypes',
        'AdjunctualAttributes'  : 'kko.AdjunctualAttributes',
        'Agents'                : 'kko.Agents',
        'Animals'               : 'kko.Animals',
        'AreaRegion'            : 'kko.AreaRegion',
        'Artifacts'             : 'kko.Artifacts',
        'Associatives'          : 'kko.Associatives',
        'AtomsElements'         : 'kko.AtomsElements',
        'AttributeTypes'        : 'kko.AttributeTypes',
        'AudioInfo'             : 'kko.AudioInfo',
        'AVInfo'                : 'kko.AVInfo',
        'BiologicalProcesses'   : 'kko.BiologicalProcesses',
        'Chemistry'             : 'kko.Chemistry',
        'Concepts'              : 'kko.Concepts',
        'ConceptualSystems'     : 'kko.ConceptualSystems',
        'Constituents'          : 'kko.Constituents',
        'ContextualAttributes'  : 'kko.ContextualAttributes',
        'CopulativeRelations'   : 'kko.CopulativeRelations',
        'Denotatives'           : 'kko.Denotatives',
        'DirectRelations'       : 'kko.DirectRelations',
        'Diseases'              : 'kko.Diseases',
        'Drugs'                 : 'kko.Drugs',
        'EconomicSystems'       : 'kko.EconomicSystems',
        'EmergentKnowledge'     : 'kko.EmergentKnowledge',
        'Eukaryotes'            : 'kko.Eukaryotes',
        'EventTypes'            : 'kko.EventTypes',
        'Facilities'            : 'kko.Facilities',
        'FoodDrink'             : 'kko.FoodDrink',
        'Forms'                 : 'kko.Forms',
        'Generals'              : 'kko.Generals',
        'Geopolitical'          : 'kko.Geopolitical',
        'Indexes'               : 'kko.Indexes',
        'Information'           : 'kko.Information',
        'InquiryMethods'        : 'kko.InquiryMethods',
        'IntrinsicAttributes'   : 'kko.IntrinsicAttributes',
        'KnowledgeDomains'      : 'kko.KnowledgeDomains',
        'LearningProcesses'     : 'kko.LearningProcesses',
        'LivingThings'          : 'kko.LivingThings',
        'LocationPlace'         : 'kko.LocationPlace',
        'Manifestations'        : 'kko.Manifestations',
        'MediativeRelations'    : 'kko.MediativeRelations',
        'Methodeutic'           : 'kko.Methodeutic',
        'NaturalMatter'         : 'kko.NaturalMatter',
        'NaturalPhenomena'      : 'kko.NaturalPhenomena',
        'NaturalSubstances'     : 'kko.NaturalSubstances',
        'OrganicChemistry'      : 'kko.OrganicChemistry',
        'OrganicMatter'         : 'kko.OrganicMatter',
        'Organizations'         : 'kko.Organizations',
        'Persons'               : 'kko.Persons',
        'Places'                : 'kko.Places',
        'Plants'                : 'kko.Plants',
        'Predications'          : 'kko.Predications',
        'PrimarySectorProduct'  : 'kko.PrimarySectorProduct',
        'Products'              : 'kko.Products',
        'Prokaryotes'           : 'kko.Prokaryotes',
        'ProtistsFungus'        : 'kko.ProtistsFungus',
        'RelationTypes'         : 'kko.RelationTypes',
        'RepresentationTypes'   : 'kko.RepresentationTypes',
        'SecondarySectorProduct': 'kko.SecondarySectorProduct',
        'Shapes'                : 'kko.Shapes',
        'SituationTypes'        : 'kko.SituationTypes',
        'SocialSystems'         : 'kko.SocialSystems',
        'Society'               : 'kko.Society',
        'SpaceTypes'            : 'kko.SpaceTypes',
        'StructuredInfo'        : 'kko.StructuredInfo',
        'Symbolic'              : 'kko.Symbolic',
        'Systems'               : 'kko.Systems',
        'TertiarySectorService' : 'kko.TertiarySectorService',
        'Times'                 : 'kko.Times',
        'TimeTypes'             : 'kko.TimeTypes',
        'TopicsCategories'      : 'kko.TopicsCategories',
        'VisualInfo'            : 'kko.VisualInfo',
        'WrittenInfo'           : 'kko.WrittenInfo'
}

To get a listing of entries in a dictionary, simply reference its name and run:

prop_dict

There are a variety of methods for nesting or merging dictionaries. We do not have need at present for that, but one example shows how we can create a new dictionary, relate it to an existing one, and then update (or merge) another dictionary with it, using the two dictionaries from above as examples:

total_dict = dict(typol_dict)
total_dict.update(prop_dict)
print(total_dict)

This now gives us a merged dictionary. However, whether keys match or vary in number means specific cases need to be evaluated individually. The .update may not always be an appropriate approach.

In these dicts, we now have the population of items (sets) from which we want to obtain all of their members and get the individual extractions. We also have them organized into dictionaries that we can iterate over to complete a full extraction from KBpedia.

Marrying Iterators and Routines

We can now return to our generic extraction prototypes and enhance them a bit to loop over these iterators. Let’s take the structure extraction of rdfs:subPropertyOf from CWPK #29 to extract out structural aspects of our properties. I will keep the form from the earlier installment and comment all lines of code added to accommodate the iterations loops and message feedback. First we will add the iterator:

for value in prop_dict.values():      # iterates over dictionary 'values' with each occurence a 'value'
  root = eval(value)                  # need to convert value 'string' to internal variable
  p_set=root.descendants()

#  o_frag = set()                     # left over from prior development; commented out
#  s_frag = set()                     # left over from prior development; commented out
  p_item = 'rdfs:subPropertyOf'
  for s_item in p_set:
    o_set = s_item.is_a
    for o_item in o_set:
       print(s_item,',',p_item,',',o_item,'.','\n', sep='', end='')
#       o_frag.add(o_item)            # left over from prior development; commented out
#    s_frag.add(s_item)               # left over from prior development; commented out

You could do a len() to test output lines or make other tests to ensure you are iterating over the property groupings.

The eval() function submits the string represented by value to the resident Python code base and in this case returns the owlready2 property object, which then allows proper processing of the .descendants() code. My understanding is that in open settings eval() can pose some security holes. I think it is OK in our case since we are doing local or internal processing, and not exposing this as a public method.

We’ll continue with this code block, but now print to file and remove the commented lines:

out_file = 'C:/1-PythonProjects/kbpedia/sandbox/prop_struct_out.csv'                 # variable to physical file
with open(out_file, mode='w', encoding='utf8') as out_put:                        # std file declaration (CWPK #31)
  for value in prop_dict.values():      
    root = eval(value)                  
    p_set=root.descendants()
    p_item = 'rdfs:subPropertyOf'
    for s_item in p_set:
      o_set = s_item.is_a
      for o_item in o_set:
        print(s_item,',',p_item,',',o_item,'.','\n', sep='', end='', file=out_put) # add output file here

And, then, we’ll add some messages to the screen to see output as it whizzes by:

print('Beginning property structure extraction . . .')                            # print message
out_file = 'C:/1-PythonProjects/kbpedia/sandbox/prop_struct_out.csv'
with open(out_file, mode='w', encoding='utf8') as out_put:
  for value in prop_dict.values():
    print('   . . . processing', value)                                           # loop print message
    root = eval(value)                  
    p_set=root.descendants()
    p_item = 'rdfs:subPropertyOf'
    for s_item in p_set:
      o_set = s_item.is_a
      for o_item in o_set:
        print(s_item,',',p_item,',',o_item,'.','\n', sep='', end='', file=out_put)

Beginning property structure extraction . . .
   . . . processing kko.predicateProperties
   . . . processing kko.predicateDataProperties
   . . . processing kko.representations

OK, so this looks to be a complete routine as we desire. However, we are starting to accumulate a fair number of lines in our routines, and we need additional routines very similar to what is above for extracting classes, typologies and annotations.

It is time to bring a bit more formality to our code writing and management, which I address in the next installment.

Additional Documentation

Here is additional documentation related to today’s CWPK installment:

dict reference
RealPython’s entry on dictionaries
Using Python dictionary as a database
RealPython’s how to interate dictionaries.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

NOTE: This CWPK installment is available both as an online interactive file

or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.

I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted:September 8, 2020

CWPK #31: Reading and Writing Files

We Are Now Generating Info That Requires Persistence

We have been opening and using files for many installments in this Cooking with Python and KBpedia series. However, in the past few installments, our methods are also beginning to generate serious results, some of thousands of lines. It is time for us to learn the basic commands for writing and reading files. We will also weave in a new Python import into our workspace, the CSV module. It will be important for us as we begin to import the basic N3 and CSV files that underly KBpedia‘s build process.

Another point we weave through today’s installment is the usefulness of following Python’s lead in composing the names of our files and methods in a hierarchical and logically named way. Such patterns mean we can more readily find and save the information we want to keep persistent. This patterning, along with some directory structure guidance we will address a few installments from now, help set up a logical way to manage and utilize the information assets in KBpedia (or your own domain extensions of it). Logical organization helps in a system designed for subset selections, semantic technology analysis, and machine learning, where the system itself is built from large, external files and we roundtrip with extractions from the current state of the graph.

But reading and writing files is not a new subject for this series. Since early with our exposure to owlready2 and Jupyter Notebook we have been loading and inspecting parts of KBpedia. You may recognize this call from CWPK #19 regarding the smaller KKO (Kbpedia Knowledge Ontology) kko.owl. We have not set up this notebook page sufficiently yet in this installment, so if you run this cell you will get an error:

onto.save(file = "C:/1-PythonProjects/kbpedia/sandbox/kko-test.owl", format = "rdfxml")

File Object and Methods

To get started, let’s first focus on the file object and the methods that may be applied to it in Python. A ‘file object’ has the following components:

A physical file address, which can be referenced by a variable name
A stated form of text encoding, which we standardize on as UTF-8 in KBpedia
A method to be used when opening, whether read or write or append or others noted below.

A ‘file object’ is given a variable name, which in this section let’s simply call f. Python’s open() and close() methods only apply to file objects. Thus, one can not simply ‘open’ a physical file address without first associating that physical file to a file object.

Once a ‘file object’ is defined by assigning it a variable name, besides f.open() and f.close() certain other methods may be applied to the object:

f.read(size) – returns the text or binary document up to the ‘size’ indicated, which if omitted (f.read()) returns the entire file
f.readline() – returns a single line from the file with a new line character (\n) appended to each line
f.write(string) – writes the contents that must be a string or a string variable, and returns the number of characters in that string
f.tell() – returns an integer giving the file object’s current position in the file
f.seek(offset, where) – re-sets the file object’s location by an offset where the reference point is by convention either the beginning of the file (0), the current position (1) or the end of the file (2), with 0 the default
f.truncate() – see help(file) below and for many other more obscure ones:

help(file)

The print() Function

A quite different but major way to get output from a Python program is the print() built-in function. Because it is often the first statement we learn in a language (‘Hello KBpedia!’), we tend to overlook the complexity and power of this function.

A print() statement provides two kinds of output from Python. The first is the result from an evaluated expression (or code block), say the calculation of a formula. The second kind is for strings (that is, text and written numbers), which can be manipulated in many ways. Either output type may be directed to a file object. If all of the values passed to a print function are strings (or converted to such prior), then virtually all string str functions are available to manipulate the entire list of value information passed to the function.

I encourage you to keep an eye out on the manifest ways the print() function is used in code examples you might inspect.

help(print)

Open, Close and Reading Files

It may not be obvious, but the underlying method to the Owlready2 get.ontology(file) method above is a wrapper around the standard Python file open method.

Note: If you open a file in write ('w') mode the file will be created if one does not already exist, but it will OVERWRITE one if it does exist!! Proceed with caution!

The open method works with different formats:

file = open('C:/1-PythonProjects/kbpedia/sandbox/kko.owl','r', encoding='utf8') 
print(file)

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-70fb0322830f> in <module>
----> 1 file = open('C:/1-PythonProjects/kbpedia/sandbox/kko.owl','r', encoding='utf8')
      2 print(file)

FileNotFoundError: [Errno 2] No such file or directory: 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'

Now, if you repeat the command above, but remove the encoding argument and run again, you are likely to get an output format that indicates encoding='cp1252'. This kind of default encoding assignment can be DISASTROUS in working with KBpedia, where all input files and all output files must be in the UTF-8 encoding. It is best practice to specify encoding directly whenever opening or writing to files.

Here is a slightly different format for opening a file now using a file object method of read():

filename = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
file=open(filename, 'r', encoding='utf8', errors='ignore')
file.read()
print(file)

And, here is a third format using the ‘with open’ pattern of nested statement:

with open('C:/1-PythonProjects/kbpedia/sandbox/kko.owl') as file:
  file.read()
print(file)

Oops! Did not work. Again, however, because we did not specify our encoding, we again get the default. We need to make the encoding explicit. Another thing to look out for is the separation of arguments by commas, which if missing, will throw another error:

with open('C:/1-PythonProjects/kbpedia/sandbox/kko.owl', encoding='utf-8') as file:
  file.read()
print(file)

The with open statement above is the PREFERRED option because the ‘with’ format includes a graceful ‘close’ of the file should anything interrupt your work or the with open routine completes. Under the other options, a file can be left in an open state when a program terminates unexpectedly, possibly affecting the file integrity.

In looking across the options above, let’s make a couple of other points besides the need to specify the encoding. The first is that the file is opened by default in 'r' read-only mode. You can see that mode specified in the print output even when not assigned directly. Other mode options include 'w' when you wish to write or overwrite to the file, and 'a' for append when you wish to add to the end of a file. Here is the complete suite of file mode options:

'r' – opens a file for reading only
'r+' – opens a file for both reading and writing
'rb' – opens a file for reading only in binary format
'rb+' – opens a file for both reading and writing in binary format
'w' – opens a file for writing only
'a' – open for writing; the file is created if it does not exist
'a+' – open for reading and writing; the file is created if it does not exist.

Another thing to observe is that Python may accept more than one name alias for the encoding. Our examples above, for example, use both 'utf8' or 'utf-8' for the encoding argument.

Also, as I admonished early in this CWPK series, try to always assign logical names to your physical file paths. As I noted earlier, there are tricky ways that Windows handles file names versus other operating systems and keeping (and then testing) proper file recognition in a separate assignment means you can develop and work with your code without worrying about file locations and paths. You may also do things programmatically to update or change the file referent for these logical names such that the actual file opened may be specified to a different physical location depending on context.

A last thing to notice is that things like encoding or mode need not be specified as arguments in a given method command. When a default value is given at time of definition of the method (notably something to inspect for ‘built-in’ Python methods such as file), that argument can be left off what is actually written in code, with the default assignment being used. It is thus important to understand the commands you use, the options you may assign directly as an argument assignment, and the defaults they have. Whenever you get into trouble, first try to understand the full scope of the statements and their arguments available to you using the dir and help methods.

Proper exiting of an application or writing to file generally requires you to close() the files you have opened. Again, if you open with the with open pattern, you should generally close gracefully. Nonetheless, here is the formal command, taking advantage of the fact we gave the physical file the logical name of ‘file’:

file.close()
print(file)

Output Options

Well, apparently we have the KKO file object loaded, and we have seen the system recognize the file, but we still see nothing about what the file contains. Generally, of course, we need not inspect contents so long as our programs can access and use the data in them. But in some cases, like now when we are developing routines and we are validating steps, we want to make sure everything has opened properly for reading or to receive our outputs.

At this point, we believe by running the cells above, that we have the kko.owl file in memory using the UTF-8 encoding. Let’s test this premise.

filename = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
file = open(filename, 'r', encoding='utf8')
print(file.read())

Again, while we specified the 'r' reading option, that was strictly unnecessary since that is the default for the argument. But, if in doubt, there is no harm in specifying again.

Here is another format for looping through a file line-by-line, now using an explicit for loop and using a logical filename for our physical file address:

filename = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'

with open(filename, encoding='utf-8') as file:
  lines = file.readlines()
for line in lines:
  print(line)

Hmmm, now that is interesting. The file appearance seems to skip every other line. That is because there is a whitespace character at the end of each line as we noted above under the file discussion. There is a string method for taking care of that, .rstrip() that we add to our routine:

filename = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'

with open(filename, encoding='utf8') as file:
  lines = file.readlines()
for line in lines:
  print(line.rstrip())

The latter iteration option results in us being able to manipulate a string object in the line-by-line display, whcih means we may invoke many str options. Besides the example, two related methods are .lstrip() to remove leading whitespace and .strip() to remove both leading and trailing spaces.

There are as many ways to iterate through the lines of a text file as there are ways to specify loops and iteration sequences in Python using for, while and other iteration forms. Also, there are many ways to conduct string manipulations including case changes, substitutions, counts, character manipulation, etc. To see some of these string (str) options, let’s try the dir() command again:

dir(str)

Writing Files

Like the options for reading a file, there are a number of ways to write output to a file.

In the ‘write’ examples below I have switched our variable file name from ‘file’ to ‘my_file’. Though it is the case there are some Python keywords that you may not use (they will throw an error if used as variable names) and ‘file’ is NOT one of them (search on ‘Python keywords‘ to find a listing of them). ‘file’ is also a not uncommon argument for some methods, including for the print statement. So, to prevent confusion, we’ll switch to ‘my_file’.

Some programmers shorten such variable references to single letters, as we did so ourselves in the last installment (where the variable prefix went from annot_ to a_). That style is OK for generic routines and ones perhaps using internal standards, but more descriptive variable names are helpful when your code is being used for learning or heuristic purposes, as is this case.

OK, so let’s look at some of these writing options:

filename = 'C:/1-PythonProjects/kbpedia/sandbox/write-test.txt'

my_file = open(filename, 'w', encoding='utf8')
print('Hello KBpedia!', file=my_file)

my_file.close()

Note in this form we are continuing to specify the encoding and have changed the default 'r' argument switch to 'w' because we now want to be able to write to the file. (Note also we have changed the filename to a name something other than our existing files so that we do not inadvertently overwrite it.) We also need a close() statement to complete the write action and to properly close the file. After you run this cell, go to your standard directory where you first stored your local knowledge graphs and see the print statement in the new file.

The next format uses our preferred form (though if the file is only being created and opened for immediate writing the above form is fine):

filename = 'C:/1-PythonProjects/kbpedia/sandbox/write-test.txt'

with open(filename, 'w', encoding='utf8') as my_file:
  print('Hello KBpedia, again!', file=my_file)

Once you have made your file declarations, you may also just write your statements as generated to the file. Notice for the third write statement below that we needed to mix our single and double quotes in order to include a possessive apostrophe in the statement.

filename = 'C:/1-PythonProjects/kbpedia/sandbox/write-test.txt'
my_file = open(filename, 'w', encoding='utf8')

my_file.write('Slipping in a reference to KBpedia.')
# More Python stuff
my_file.write('And, then, another reference to KBpedia.')
# More Python stuff
my_file.write("Because I can't stop talking about this stuff!")
# More Python stuff
my_file.close()

But, when we run this cell, we find the file has its text all on one line. Since we don’t want that, we make modifications to the output statement, similar to what we might do for print. In this revision, we want to add a new line to the end of each string:

filename = 'C:/1-PythonProjects/kbpedia/sandbox/write-test.txt'
my_file = open(filename, 'w', encoding='utf8')

my_file.write('Slipping in a reference to KBpedia.', '\n')
# More Python stuff
my_file.write('And, then, another reference to KBpedia.', '\n')
# More Python stuff
my_file.write("Because I can't stop talking about this stuff!", "\n")
# More Python stuff
my_file.close()

Grrr, I guess the .write method does not work the same as print(). But the type error indicates we can only pose a single argument to the statement, so we need to get rid of the second argument designated by the comma. Since we are working only with strings here, we can concatenate to get our statement down to a single argument (because what is first evaluated is between the parentheses, which results in a single value passed to the call):

filename = 'C:/1-PythonProjects/kbpedia/sandbox/write-test.txt'
my_file = open(filename, 'w', encoding='utf8')

my_file.write('Slipping in a reference to KBpedia.' + '\n')
# More Python stuff
my_file.write('And, then, another reference to KBpedia.' + '\n')
# More Python stuff
my_file.write("Because I can't stop talking about this stuff!" + "\n")
# More Python stuff
my_file.close()

Better, that is more like it as our output file is formatted as we desire.

You can also generate write statements that join together strings in various ways (this snippet does not work alone):

my_file.write(' '.join(('Hello KBpedia!', str(var2), 'etc')))

This brief overview points to either file object methods or the print function as two ways to get output out of your programs. Further, within each of these two major ways there are many styles and approaches that might be taken to get to your desired output goal.

In the case of KBpedia where we use flat CSV files as our canonical exchange form, which are themselves by definition built from strings, we will tend to use write() function as our preferred way to prepare our strings for output. However, when reading the external files, we tend to use the file object read methods.

Let me offer one final note on output considerations: Since we have only a relatively few generic processing steps for either extracting or building KBpedia, but ones that repeat across multiple modules or semantic object types, we will try to find ways to compose our file names from meaningful building block elements for consistency and understandability. We will start to see this in a fragmented way initially with our function and output definitions. When we get to the project-wide considerations, though, toward the concluding installments, we will be consolidating these fragments and building block considerations in a way that hopefully makes overall sense.

Using the CSV Module

Though CSV files are easy to generate, manage, and inspect, and there is a formal standard with RFC4180, actual implementation is more like the Wild West. Delimiters other than commas or tabs may be used (semi-colons, etc.) to separate values. Specific purposes may add local specifications, such as the ‘double pipe’ (‘||’) convention we have adopted for multiple entries in a cell. Treatment of quoted strings, including what to quote and how to quote, may differ between applications. We have also discussed the importance of standard encodings, the failure of which to use may lead to disastrous file corruptions.

CSV files in implementation have a standard layout of rows and columns, which is good, sometimes with headers and sometimes not. Though Microsoft Excel is a huge application for CSV files, Excel does not use UTF-8 as its standard and sometimes does other interesting things to its cell contents. It would be nice, for example, to have recognized templates that would enable us to move from one CSV environment to another. At minimum, we want to impose rigor and consistency to how we handle CSV files to prevent encoding mismatches or other discontinuities.

To help overcome some of these challenges we are using the Python csv module. Let’s first look and explore what functions this module has:

import csv

dir(csv)

Here are some of the attractive features of using the CSV module as our intermediary for data exchange. The CSV module:

Uses the same file object functions as standard Python, including an expanded csv.reader and csv.writer
Recognizes ‘dialects’, which are templates of processing specifications that can be defined or link to existing applications like Excel
Has a sniffer function to discover dialect regularities in new, wild files
Allows different quoting stringency levels to be set (all strings, multi-word strings, etc.)
Allows different delimiters to be set
Allows headers to be used or not
Recognizes field names for specific data columns
Enables Python dictionaries to mediate field names to master data.

Though we have not yet come to our ingestion (build) steps, when we do we will have need for some fields to iterate multiple items to store in a single field name and to process it with a different delimiter (‘||’). This and the master data dictionary aspects look promising.

Pandas and NumPy are additional CSV options with much more functionality should that be warranted.

Saving from Jupyter Notebook

To get output from your Jupyter Notebook, pick File → Download as to get full notebook outputs in nine different text-based formats and PDF. Of course, individual cells may also have their code blocks outputted via the Python functions discussed above.

Reading List and Additional Documentation

There are many fine online series and books and many excellent printed ones with basic Python documentation. Citations at the bottom of many of these CWPK installments have links to some of them.

If you are to follow this series closely I heartily recommend that you do so with a printed Python manual by your side that you can consult for specific commands or functions. I have spent some time looking, and have yet to find a single ‘go-to’ source for Python information. My most frequent sources are:

Eric Matthes, Python Crash Course, 2nd Edition, 2019. No Starch Press, San Francisco, CA, 530 pp. ISBN-10: 1-59327-928-0

Mark Lutz, Learning Python, 3rd Edition, 2008. O’Reilly Media, Inc., Sebastapol, CA, 706 pp. ISBN: 978-0-596-51398-6

David Beasley and Brian K. Jones, Python Cookbook: Recipes for Mastering Python 3, 3rd Edition, 2013. O’Reilly Media, Inc., Sebastapol, CA, 692 pp. ISBN: 978-1-449-34037-7

Bill Lubanovic, Introducing Python: Modern Computing in Simple Packages, 2nd Edition, 2020. O’Reilly Media, Inc., Sebastapol, CA, 602 pp. ISBN: 978-1-492-05136-7

Here are additional links useful to today’s CWPK installment:

Input and output from the official Python documentation
RealPython’s Guide to the Python print() Function
DataCamps’s Python print() function.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

NOTE: This CWPK installment is available both as an online interactive file

Posted:September 4, 2020

CWPK #30: Extracting Annotations

Everything Can Be Annotated in a Knowledge Graph

We’ve seen in the previous two installments of this Cooking with Python and KBpedia series various ways to specify a subset population for driving an iterative process for extracting structure from KBpedia. We’re going to retain that iterative approach, only change it now to extract annotations. Classes, properties, and instances (individuals) may all be annotated in OWL. We thus need to derive generalized approaches that can apply to any entity in a knowledge graph.

Annotations are information applied to a given entity in order to point to it, describe it, or identify it. As a best practices matter, there are certain fields we recommend be universally applied to annotate any given entity:

A preferred label (prefLabel) that is the standard name or title for a thing
A multiple of alternative labels (altLabel) that capture any of the ways a given thing may be referred to, including synonyms, acronyms, jargon, etc.
A definition of the thing (definition)
All labels should be tagged with a language tag in order to more readily support translation and use in multiple languages.

We may also find comments or notes associated with particular items. Further, in the case of object or data properties, we may have additional characterizations such as domain or range or functionality assigned to the item. We could have retrieved these characterizations as part of our structural extractions, but decided to include them rather in an annotation extraction pass (even though those characterizations are not annotative).

Items to be Extracted During Annotation Pass

We can thus assemble up a list of items that may be extracted during an annotation extraction pass. We could do these extractions in parts, since that is often the better approach during the inverse process of building our knowledge graph. However, given the number of annotation and related items that may be extracted, and the number of combinations of same, we decide as a matter of simplicity to extract all such information as a single record for each subject entity. We can later manipulate the large flat files so generated if we need to focus on subsets of them. We may revisit this question once we tackle the build side of this roundtripping process.

Some of the items that we will extract have multiple entries per subject. Parental class is one such item, as are alternative labels, which may number into the tens for a rather complete characterization. From our experience in the last installments we know we will need to set up some inner loops to accommodate such multiple entries. So, with these understandings, we can now compile up a list of items that may be extracted on an annotation extraction pass, including whether the item is limited to a single entry, or may have many:

IRI fragment name: single
prefLabel: single
altLabel: many
superclass: many
definition: single
editorialNote: many
mapping properties: many (a characterization that will grow over time)
comment: many
domain: single (object and data properties, only)
range: single (object and data properties, only)
functional type: single (object and data properties, only)

So, we decide to develop two variants of our code block. A standard one, and an expanded one that includes the object and data property additions. The IRI fragment name is the alias used internally in our Python programs and what gets concatenated with the base IRI to form the full IRI for the entity.

Also, to maintain the idea of a single line per subject entity, we decide that: 1) we will separate multiple entries for a given item with the ‘||’ (“double pipe”) separator, which we use because it is never used in the wild and it is easy to spot when scanning code; and 2) we will not use full IRIs in order to aid record readability.

(BTW, if we decide over time to add other standard characterizations to our items we will adjust our routines accordingly.)

Starting and Load

We again begin with our standard opening routine, except we have now substituted ‘kbpedia’ for ‘main’ in the first line, to make our reference going forward more specific:

kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
# kbpedia = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core' 
kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
# kko_file = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kko.owl'


from owlready2 import *
world = World()
kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')               

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

Like always, we execute each cell as we progress down this notebook page by pressing shift+enter for the highlighted cell or by choosing Run from the notebook menu.

Basic Extraction Set-up

We tackle the smaller (non-property) variant of our code block first, treating the extracted items listed above as the members of a Python set specification. We also choose to prefix our variables with annot_. We will first start with a single item, foregoing the loop for the moment, to test if we have gotten our correspondences right. For the class set-up we’ll use the relatively small rc.Luggage class. (You may substitute any KBpedia RC as this item.)

s_item = rc.Luggage
annot_pref = s_item.prefLabel
annot_sup  = ''
# annot_sup  = s_item.superclass  # maybe it should be is_a
annot_alt  = s_item.altLabel
annot_def  = s_item.definition
annot_note = s_item.editorialNote
annot = [annot_pref, annot_sup, annot_alt, annot_def, annot_note]
print(annot)

[['baggage'], '', ['bag', 'bags', 'luggage'], ['This ‘class’ product category is drawn from UNSPSC, readily converted into other widely used systems. This product ‘class’ is an aggregate of luggage. This product category corresponds to the UNSPSC code: 53121500.'], []]

We need to add a few items to deal with specific property characteristics including domain, range, and functional type (which is blank in all of our cases):

s_item = kko.representations
annot_pref = s_item.prefLabel
annot_sup  = s_item.is_a
annot_dom  = s_item.domain
annot_rng  = s_item.range
annot_func = ''
annot_alt  = s_item.altLabel
annot_def  = s_item.definition
annot_note = item.editorialNote
annot = [annot_pref, annot_sup, annot_dom, annot_rng, annot_func, annot_alt, annot_def, annot_note]
print(annot)

[['representations'], [owl.AnnotationProperty], [], [], '', ['annotations', 'indexicals', 'metadata'], ['Pointers or indicators, including symbolic ones such as text or URLs, that draw attention to the actual or dynamic object.'], []]

KBpedia does not use functional properties at present. I leave a placeholder above, but have not worked out the owlready2 access methods.

Working Out the Code Block

A quick inspection of these outputs flags a few areas of concern. We see that items are often enclosed in square brackets (a set notation in Python), we have many quoted items, and we have (as we knew) mutiple entries for some fields, especially altLabel and parents. In order to test our code block out, we will need to have a test set loaded. We decide to keep on with rc.Luggage, but I throw in a length count. You can substitute any non-leaf RC into the code if you want a larger or smaller or different domain test set.

root = rc.Luggage
s_set=root.descendants()

len(s_set)

For the iteration part for the multiple entries, we begin with the code blocks used for the inner loops dealing with the structural backbone issues in CWPK #28 and CWPK #29. But the purpose of tracing inheritance is different than retrieving values for multiple attributes of a single entity. Maybe we should tackle what seems to be an easier concern to remove the enclosing brackets (‘[ ]’).

I also decide as we test out these code blocks that I would shorten the variable names to reduce the amount of typing and to reflect a more general procedure. So, all of the annot_ prefixes from above become a_.

Poking around I first find a string replacement example, followed by the .join method for strings:

for s_item in s_set:
  a_pref = s_item.prefLabel
  a_sup  = s_item.is_a
  a_alt  = s_item.altLabel
  a_def  = s_item.definition
  a_note = s_item.editorialNote
  a_     = [a_pref,a_sup,a_alt,a_def,a_note]
  def listToStringWithoutBrackets(a_):
    return str(a_).replace('[','').replace(']','')
  listToStringWithoutBrackets(a_)
  print( ','.join( repr(e) for e in a_ ) )
len(s_set)

I try another string substitution example with similarly disappointing results:

for s_item in s_set:
  a_pref = s_item.prefLabel
  a_sup  = s_item.is_a
  a_alt  = s_item.altLabel
  a_def  = s_item.definition
  a_note = s_item.editorialNote
  a_     = [s_item,a_pref,a_sup,a_alt,a_def,a_note]
  a_     = [str(i) for i in a_]
  a_out  = ','.join(a_).strip('[]') 
  print(a_out)

The reason these, and multiple other string approaches failed, was that we are dealing with results sets with multiple entries. It seemed like the safest way to ensure the fields were treated as strings was to explicitly declare them as such, and then manipulate the string directly. So, in the code below, we grab the property from the entity, convert it to a string, and then remove the first and last characters of the entire string, which in our case are the brackets. Note in this test code that I also (temporarily) comment out the two fields where we have possibly multiple items that we want to loop over and concatenate into a single string entry:

a_item = ''
for s_item in s_set:
  a_pref = s_item.prefLabel
  a_pref  = str(a_pref)[1:-1]                   # this is one way to remove opening and closing characters ([ ])
#  a_sup  = s_item.is_a
#  a_alt  = s_item.altLabel
  a_def  = s_item.definition
  a_def  = str(a_def)[1:-1]
  a_note = s_item.editorialNote
  a_note  = str(a_note)[1:-1]
  print(s_item,a_pref,a_sup,a_alt,a_def,a_note, sep=',')

We still see brackets in the listing, but those are for the two properties we commented out. All other brackets are now gone. While I really do not like repeating the same bracket removal code multiple times, it works, and after spending perhaps more time than I care to admit trying to find a more elegant solution, I decide to accept the workable over the perfect. I am hoping when we loop over the elements for the two fields commented out that we will be extracting the element from each pass of the loop, and thus via processiing will see the removal of their brackets. (That apparently is what happens in the loop steps below.)

Now, it is time to tackle the harder question of collapsing (also called ‘flattening’) a field with multiple entries. The basic idea of this inner loop is to treat all elements as strings, loop over the multiple elements, grab the first one and end if there is only one, but to continue looping if there is more than one until the number of elements is met, and to add a ‘double pipe’ (‘||’) character string to the previously built elements before concatenating the current element. This order of putting the delimiter at the beginning of each loop result is to make sure our final string with all concatenated results does not end with a delimiter. The skipping of the first pass means no delimiter is added at the beginning of the first element, also good if there is only one element for a given entity, which is often the case.

There are very robust for and while operators in Python. The one I settled on for this example uses an id,enumerate tuple where we get both the current element item and its numeric index:

a_item = ''
for s_item in s_set:
  a_pref = s_item.prefLabel
  a_pref  = str(a_pref)[1:-1]
#  a_sup  = s_item.is_a
  a_alt  = s_item.altLabel
  for a_id, a in enumerate(a_alt):            # here is the added inner loop as explained in text
    a_item = str(a)
    if a_id > 1:
        a_item = a_item + '||' + str(a)
  a_alt  = a_item
  a_def  = s_item.definition
  a_def  = str(a_def)[1:-1]
  a_note = s_item.editorialNote
  a_note  = str(a_note)[1:-1]
  print(s_item,a_pref,a_sup,a_alt,a_def,a_note, sep=',')

Now that the inner loop example is working we can duplicate the approach for the other inner loop and move on to putting a full working code block together.

Class Annotations

OK, so we appear ready to start finalizing the code block. We will start with class annotations because they have fewer fields to capture. The first step we want to do is to remove the pesky rc. namespace prefix in our output. Remember, this came from a tip in our last installment:

def render_using_label(entity):
    return entity.label.first() or entity.name

set_render_func(render_using_label)

(How to set it back to the default is described in the prior installment.)

We also pick a class and its descendants to use in our prototype example. I also add a len statement in the code to indicate how many classes we will be processing in this example:

root = rc.Luggage
s_set=root.descendants()

len(s_set)

We now expand our code block to set our initial iterator to an empty string, fix (remove) the brackets, and process the two inner loops of the altLabels and parent classes putting the “double pipe” (‘||’) between entries:

a_item = ''
for s_item in s_set:
  a_pref = s_item.prefLabel
  a_pref  = str(a_pref)[1:-1]
  a_sup  = s_item.is_a
  for a_id, a in enumerate(a_sup): 
    a_item = str(a)
    if a_id > 1:
        a_item = a_sup + '||' + str(a)
    a_sup  = a_item
  a_alt  = s_item.altLabel
  for a_id, a in enumerate(a_alt): 
    a_item = str(a)
    if a_id > 1:
        a_item = a_alt + '||' + str(a)
    a_alt  = a_item
  a_def  = s_item.definition
  a_def  = str(a_def)[1:-1]
  a_note = s_item.editorialNote
  a_note  = str(a_note)[1:-1]
  print(s_item,a_pref,a_sup,a_alt,a_def,a_note, sep=',')

EveningBag,'evening bag',Purse,evening bags,'The collection of all evening bags. A type of Purse. The collection EveningBag is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
Gucci,'Gucci',Luggage,GUCCI,'Gucci (/ɡuːtʃi/; Italian pronunciation: [ˈɡuttʃi]) is an Italian luxury brand of fashion and leather goods, part of the Gucci Group, which is owned by the French holding company Kering. Gucci was founded by Guccio Gucci in Florence in 1921.Gucci generated about €4.2 billion in revenue worldwide in 2008 according to BusinessWeek and climbed to 41st position in the magazine\'s annual 2009 \\"Top Global 100 Brands\\" chart created by Interbrand; it ranked retained that rank in Interbrand\'s 2014 index. Gucci is also the biggest-selling Italian brand. Gucci operates about 278 directly operated stores worldwide as of September 2009, and it wholesales its products through franchisees and upscale department stores. In the year 2013, the brand was valued at US$12.1 billion, with sales of US$4.7 billion.',
Briefcase,'briefcase',CarryingCase||Device-OfficeProduct-NonConsumable||OfficeProductMarketCategory||Box-Container,Attache cases||Attaché case||Attaché cases||Brief case||Handlebox||Portfolio (briefcase)||briefcases,'The collection of all briefcases, which are small portable cases designed to carry hardcopies of documents. A type of Luggage and Box_Container. The collection Briefcase is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType. This ‘commodity’ product category is drawn from UNSPSC, readily converted into other widely used systems. This commodity category is for briefcase. This product category corresponds to the UNSPSC code: 53121701.',
DuffelBag,'duffel bag',Luggage,Duffel bags||Duffle||Duffle Bag||Duffle bag||Dufflebag||Kit bag||Kit-bag||Seabag||The dufflebag||duffel bags,'The collection of all duffel bags. A type of Luggage. The collection DuffelBag is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
ShoulderBag,'shoulder bag',Luggage,shoulder bags,'The collection of all shoulder bags -- luggage bags with straps for carrying over the shoulder. A type of Bag and Luggage. The concept ShoulderBag is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
GolfBag,'golf bag',CarryingCase,golf bags,'Cylindrical bag about one meter long, open on one end; for carrying golf clubs. Has shoulder strap.',
SkiCarrier,'ski carrier',MechanicalDevice||ContainerArtifact,ski carriers,'The collection of all ski carriers. A type of ContainerArtifact and MechanicalDevice. The concept SkiCarrier is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
Backpack,'knapsack',PurseHandbagBag||SomethingToWear||Luggage,Back packs||Backbacks||Backpacks||Book bags||Book sack||Bookbags||Booksack||Day pack||Daypack||Haver sack||Haver sacks||Haversacks||Knap sack||Knap sacks||Knapsack||Pack sack||Pack sacks||Packsack||Packsacks||Ruck sack||Rucksack||Rucksacks||Schoolbag||backpack||backpacks||knapsacks||rucksack||rucksacks,'The collection of all backpacks. A type of Luggage, NonPoweredDevice, and SomethingToWear. The collection Backpack is an ArtifactTypeByFunction and a SpatiallyDisjointObjectType.',
Luggage,'baggage',TransportationContainerProduct,bags||luggage,'This ‘class’ product category is drawn from UNSPSC, readily converted into other widely used systems. This product ‘class’ is an aggregate of luggage. This product category corresponds to the UNSPSC code: 53121500.',
TravelAccessory,'travel accessory',Luggage,travel gear,'Instances are tangible products which are intended for use during travelling. Often they are lighter-weight, more-compact versions of products people normally use at home.',
DiaperBag,'diaper bag',ShoulderBag,Nappy bag||diaper bags,'The collection of all diaper bags. A DiaperBag is a bag which the caretaker of a baby may put Diapers, wipes, and other baby-related products into to bring with him or her when traveling with a HumanInfant or HumanToddler.',
LuggageTag,'luggage tag',TravelAccessory,Luggage tags||luggage tags,'The collection of all luggage tags. A type of Tag_IBO and TravelAccessory. The collection LuggageTag is an #$AerodromeCollection, a SpatiallyDisjointObjectType, and an ArtifactTypeByGenericCategory.',
WomensToteBag,'tote',PurseHandbagBag,satchels||totes||women's satchel||women's satchels||women's tote||women's tote bag||women's tote bags||women's totes,'The collection of all totes. A type of WomensBag. The collection WomensToteBag is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
LuggageSet,'luggage ensemble',Luggage,luggage set||luggage sets,'The collection of all luggage ensembles. A type of group of baggage, Artifact, and DurableGood. LuggageSet is an ArtifactTypeByGenericCategory.',
Suitcase,'suitcase',Luggage,Suitcases||Swedish lunchbox||Travel bag||Trolley case||Valise||grip||grips||suitcases||travelling bag||travelling bags||valise||valises,'Retangular piece of luggage with handle that is used when you are on a trip.',
PurseHandbagBag,'purse or handbag or bag',ContainerArtifact||LuggageHandbagPack,Clutch (handbag)||Evening bag||Hand bag||Hand-bag||Hand-bags||Handbags||Man bag||Man purse||Man-bag||Manbag||Manpurse,'This ‘class’ product category is drawn from UNSPSC, readily converted into other widely used systems. This product ‘class’ is an aggregate of purse or handbag or bag. This product category corresponds to the UNSPSC code: 53121600.',
PortfolioCase,'portfolio case',CarryingCase,portfolio cases||portfolios,'The collection of all portfolio cases, which are cases desinged for carrying portfolios of art or writing. A type of CarryingCase. The collection PortfolioCase is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
Purse,'purse',NonPoweredDevice||PersonalDevice,Purses||purses,"The collection of all purses. A type of women's clothing accessory, PersonalDevice, and WomensBag. The collection Purse is a SomethingToWearTypeByGenericCategory and a SpatiallyDisjointObjectType.",
ComputerBag,'computer carrying case',Luggage,computer bags||computer carrying cases||laptop bag||laptop bags||laptop carrying case||laptop carrying cases||laptop case||laptop cases,'The collection of all computer carrying cases. A type of accessories for laptop computers and CarryingCase. The collection ComputerBag is an ArtifactTypeByFunction and a SpatiallyDisjointObjectType.',
CarryOnLuggage,'carry on luggage',Luggage,carry on bag||carry on bags||carry-on||carry-on bag||carry-on baggage||carry-on bags||carry-on luggage,'The collection of all carry-on luggage. A type of Luggage. The collection CarryOnLuggage is an ArtifactTypeByFunction, a SpatiallyDisjointObjectType, and an #$AerodromeCollection.',
VoltageConverter,'voltage converter',TravelAccessory,voltage adapters||voltage adaptor||voltage adaptors||voltage converters,'The collection of all voltage converters. A type of TravelAccessory, ElectricalDevice, and travel accessory. VoltageConverter is a SpatiallyDisjointObjectType.',
CarryingCase,'carrying case',Luggage,carrying cases,'The collection of all carrying cases. A type of Luggage and Box_Container. The collection CarryingCase is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
GarmentBag,'hanging bag',Luggage,garment bags||hanging bags||suit bag,'The collection of all garment bags. A type of Luggage. The collection GarmentBag is an ArtifactTypeByFunction and a SpatiallyDisjointObjectType.',
CoinPurse,'coin purse',PurseHandbagBag,[],'This ‘commodity’ product category is drawn from UNSPSC, readily converted into other widely used systems. This commodity category is for coin purse. This product category corresponds to the UNSPSC code: 53121605.',
LuggageOrganizer,'luggage organizer',Organizer||ContainerArtifact,luggage organisers||luggage organizers,'This is the collection of products which fit into luggage whose purpose is to structure the space inside in such as way that the items stored therein are more organized and perhaps better protected.',

The routine now seems to be working how we want it, so we move on to accommodate the properties as well.

Property Annotations

Again, we set the renderer to the ‘clean’ setting and now pick a property and its sub-properties to populate our working set:

def render_using_label(entity):
    return entity.label.first() or entity.name

set_render_func(render_using_label)

root = kko.representations
p_set=root.descendants()

len(p_set)

This example has nearly 3000 sub-properties! That should make for an interesting example. We add our three new properties to the prior code block. We also make another change, which is to substitute the p_ prefix (for properties) over the prior s_ prefix for subject (classes or individual):

p_item = ''
for p_item in p_set:
  a_pref = p_item.prefLabel
  a_pref  = str(a_pref)[1:-1]
  a_sup  = p_item.is_a
  for a_id, a in enumerate(a_sup): 
    a_item = str(a)
    if a_id > 1:
        a_item = a_item + '||' + str(a)
  a_sup  = a_item
  a_dom  = p_item.domain
  a_dom  = str(a_dom)[1:-1]
  a_rng  = p_item.range
  a_rng  = str(a_rng)[1:-1]
  a_func = ''
  a_alt  = p_item.altLabel
  for a_id, a in enumerate(a_alt): 
    a_item = str(a)
    if a_id > 1:
        a_item = a_item + '||' + str(a)
  a_alt  = a_item
  a_def  = p_item.definition
  a_def  = str(a_def)[1:-1]
  a_note = p_item.editorialNote
  a_note  = str(a_note)[1:-1]
  print(p_item,a_pref,a_sup,a_dom,a_rng,a_func,a_alt,a_def,a_note, sep=',')

Fantastic! It seems that our basic annotation retrieval mechanisms are working properly.

You may have noted the sep=',' argument in the print statement. It means to add a comma separator between the output variables in the listing, a useful addition in Python 3 especially given our reliance on comma-separated value (CSV) files.

We are now largely done with the logic of our extractors. But, before we get to how to assemble the pieces in a working module, it is time for us to take a brief detour to learn about naming and writing output and saving to and reading from files. Since we will be using CSV files heavily, we also work that into next installment’s discussion.

Additional Documentation

The routines in this installment required much background reading and examples having to do with Python loops and string processing. Here are a few I found informative for today’s CWPK installment:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

NOTE: This CWPK installment is available both as an online interactive file

Posted:September 3, 2020

CWPK #29: Extracting Object and Data Properties

We Continue the Theme of Structural Extraction

In this installment of the Cooking with Python and KBpedia series, we continue the theme of extracting the structural backbone to KBpedia. Our attention shifts now from classes to properties, the predicates found in the middle of a subject – predicate – object semantic triple. A s-p-o triple is the basic assertion in the RDF and OWL languages.

There are three types of predicate properties in OWL. Object properties relate a subject to another named entity, one which may be found at an IRI address, local or on the Web. Data properties are a value characterization of the subject, and may be represented by strings (labels or text) or date, time, location, or numeric values. Data properties are represented by datatypes, not IRIs. Annotation properties are pointers or descriptors to the subject and may be either a datatype or an IRI, but there is no reasoning over annotations across subjects. All property types can be represented in hierarchies using a subPropertyOf predicate similar to subClassOf for classes.

In addition, KBpedia uses a triadic split of predicates based on the universal categories of Charles Sanders Peirce. These map fairly closely with the OWL splits, but with some minor differences (not important to our current processing tasks). Representations are pointers, indicators, or descriptors to the thing at hand, the subject. These map closely to the OWL annotation properties. Attributes are the characterizations of the subject, intensional in nature, and are predominantly data properties (though it is not a violation to assign object properties where a value is one of an enumerated list). Direct relations are extensional relations between two entities, where the object in s-p-o must be an object property, represented by an IRI.

In most of today’s practice subPropertyOf is little used, though KBpedia is becoming active in exploring this area. In terms of semantic inheritance, properties are classes, though with important distinctions. Object and data properties may have functional roles, restrictions as to the size and nature of their sets, and specifications as to what types of subject they may represent (domain) or what type of object they may connect (range).

Though supporting these restrictions, owlready2 has less robust support for properties than classes. In the last installment’s work on the class backbone we saw the advantage of the .descendant() method for collecting children or grandchildren throughout the subsumption tree of class descent. Owlready2 does not document or expose this method for properties, but with properties a sub-class of class in the owlready2 code, I found I could use many of the class methods. Woohoo!

What I outline below is a parallel structure extraction to what we saw in the last installment regarding classes. In the next installment we will transition from structure extraction to annotation extraction.

Starting and Load

We begin with our standard opening routine:

main = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
# main = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core' 
kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
# kko_file = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kko.owl'

from owlready2 import *
world = World()
kb = world.get_ontology(main).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')               

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

Again, we execute each cell as we progress down this notebook page by pressing shift+enter for the highlighted cell or by choosing Run from the notebook menu.

Let’s first begin by inspecting the populated lists of our three types of properties, beginning with object (prefix po_), and then data (prefix pd_) and annotation (prefix pa_) properties, checking the length for the number of records as well:

po_set = list(world.object_properties())
list(po_set)

len(po_set)

pd_set = list(world.data_properties())
list(pd_set)

len(pd_set)

pa_set = list(world.annotation_properties())
list(pa_set)

len(pa_set)

You may want to Cell → All Output → Clear to remove these long listings from your notebook.

Getting the Subsets Right

When we inspect these lists, however, we see that many of the predicates are ‘standard’ ones that we have in our core KBpedia Knowledge Ontology (see the KKO image). Recall that our design has us nucleating our knowledge graph build efforts with a starting ontology. In KBpedia’s case that is KKO.

Now we could just build all of the properties each time from scratch. But, similar to our typology design for a modular class structure, we very much like our more direct mapping to predicates to Peirce’s universal categories.

So, we test whether we can use the same .descendants() approach we used in the prior installment, only now applied to properties. In the case of the annotations property, that corresponds to our kko.representations predicate. So, we test this:

root = kko.representations
pa_set=root.descendants()

len(pa_set)

We can see that we dropped 11 predicates that were in our first approach.

We can list the set and verify that nearly all of our descendant properties are indeed in the reference concept (rc) namespace (we will address the minor exceptions in some installments to come), so we have successfully separated our additions from the core KKO starting point:

list(pa_set)

Since I like keeping the core ontology design idea, I will continue to use this more specific way to set the roots for KBpedia properties for these extraction routines. It adds a few more files to process down the road, but it can all be automated and I am better able to keep the distnction between KKO and the specific classes and properties that populate it for the domain at hand. It does mean that all new properties introduced to the system must be made a rdfs:subPropertyOf of one of the tie-in roots, but that also enforces the explicit treatment of new properties in relation to the Peircean universal categories.

Under this approach, the root for annotation properties is kko.representations as noted. For object properties, the root is kko.predicateProperties. (The other two main branches are kko.mappingProeperties and skos.skosProperties, which we consider central to KKO.) For data properties, the root is kko.predicateDataProperties. The other data properties are also built in to KKO.

If one wanted to adopt the code base in this CWPK series for other purposes, perhaps with a different core or bootstrap, other design choices could be made. But this approach feels correct for the design and architecture of KBpedia.

Iterating Sub Properties

Now that we have decided this scope question, let’s try the final code block from the last installment (also based on .descendants() and is_a) so see if and how it works in the property context. We make two changes to the last installment routine in that we now specify the rdfs:subPropertyOf property and replace our iterated set with pa_set:

o_frag = set()
s_frag = set()
p_item = 'rdfs:subPropertyOf'
for s_item in pa_set:
  o_set = s_item.is_a
  for o_item in o_set:
     print(s_item,',',p_item,',',o_item,'.','\n', sep='', end='')
     o_frag.add(o_item)
  s_frag.add(s_item)

Great, again! Our prior logic is directly transferable. The nice thing about this code applied to properties is that we also get the specifications for creating a new property, useful when roundtripping the information for build routines.

So, we clear out the currently active cell and are ready to move on. But first, we also made some nice discoveries in working out today’s installment, so I will end today’s installment with a couple of tips.

Bonus Tip

While doing the research for this installment, I came across a nifty method within owlready2 for controlling how these extraction retrievals display, with full IRIs, namespaces, or not. First, run the original script for listing the pa_set above. Then, for contrast, try these two options:

def render_using_label(entity):
    return entity.label.first() or entity.name

set_render_func(render_using_label)
list(pa_set)

def render_using_iri(entity):
    return entity.iri

set_render_func(render_using_iri)
list(pa_set)

These two suggestions came from the owlready2 documentation. But, after trying them, I wanted to get back to the original (default) formatting. But the documentation is silent on this question. After poking through the code a bit, I found this initialization method for returning to the default. Again, try it:

set_render_func(default_render_func)
list(pa_set)

Bonus Tip #2

Here is a nice method for getting a listing of all of the properties applied to a given class:

rc.Mammal.get_class_properties()

Additional Documentation

Owlready2 property documentation, including for restrictions.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

NOTE: This CWPK installment is available both as an online interactive file

Posted:September 2, 2020

CWPK #28: Extracting Structure for Typologies

We Extract a Typology Scaffolding from an Active KG

In this installment of the Cooking with Python and KBpedia series, we work out in a Python code block how to extract a single typology from the KBpedia knowledge graph. To refresh your memory, KBpedia has an upper, ‘core’ ontology, the KBpedia Knowledge Ontology (KKO) that has a bit fewer than 200 top-level concepts. About half of these concepts are connecting points we call ‘SuperTypes’, that also function as tie-in points to underlying tree structures of reference concepts (RCs). (Remember there are about 58,000 RCs across all of KBpedia.)

We call each tree structure a ‘typology’, which has a root concept that is one of the upper SuperType concepts. The tree structures in each typology are built from rdfs:subClassOf relations, also known as ‘is-a‘. The typologies range in size from a few hundred RCs to multiple thousands in some cases. The combination of the upper KKO structure and its supporting 70 or so typologies provide the conceptual backbone to KBpedia. We discussed this general terminology in our earlier CWPK #18 installment.

Each typology extracted from KBpedia can be inspected as a standalone ontology in something like the Protégé IDE. Typologies can be created or modified offline and then imported back into KBpedia, steps we will address in later installments. The individual typologies are modular in nature, and a bit easier to inspect and maintain when dealt with independently of the entire KBpedia structure.

Starting and Load

We begin with our standard opening routine, though we are a bit more specific about identifying prefixes in our name spaces:

main = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
# main = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core' 
kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
# kko_file = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kko.owl'

from owlready2 import *
world = World()
kb = world.get_ontology(main).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')               

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

Like always, we execute each cell as we progress down this notebook page by pressing shift+enter for the highlighted cell or by choosing Run from the notebook menu.

We will start by picking one of our smaller typologies on InquiryMethods since its listing is a little easier to handle than one of the bigger typologies (such as Products or Animals). Unlike most all of the other RCs which are labeled in the singular, note we use plural names for these SuperType RCs.

The SuperType is also the ‘root’ of the typology. What we are going to do is use the owlready2 built-in descendants() method for extracting out a listing of all children, grandchildren, etc., starting with our root. (Another method, ancestors() navigates in the opposite direction to grab parents, grandparents, etc., all the way up to the ultimate root of any OWL ontology, owl:Thing.) Note in these commands that we are also removing the starting node from our listing as shown in the last statement:

root = kko.InquiryMethods
s_set=root.descendants()
s_set.remove(root)

* Owlready2 * Warning: ignoring cyclic subclass of/subproperty of, involving:
  http://kbpedia.org/kko/rc/Cognition
  http://kbpedia.org/kko/rc/AnimalCognition

Owlready2 has an alternate way to not include the starting class in its listing, using the include_self = False argument. You may want to clear your memory to test this one:

root = kko.InquiryMethods
s_set=root.descendants(include_self = False)

We can then see the members of s_set:

list(s_set)

[rc.DriverVisionTest,
 rc.StemCellResearch,
 rc.AnalyticNumberTheory,
 rc.ComputationalGroupTheory,
 rc.HeuristicSearching,
 rc.MedicalResearch,
 rc.Comparing,
 rc.YachtDesign,
 rc.PGroups,
 rc.SolarSystemModel,
 rc.AirNavigation,
 rc.CriticismOfMarriage,
 rc.ScientificObservation,
 rc.PokerStrategy,
 rc.MesoscopicPhysics,
 rc.Reasoning,
 rc.SalesContractNegotiation,
 rc.SocraticDialogue,
 rc.ArgumentFromMorality,
 rc.GramStainTest,
 rc.Checking-Evaluating,
 rc.TwinStudies,
 rc.ComputationalNumberTheory,
 rc.Surveillance,
 rc.MethodsOfProof,
 rc.InfiniteGroupTheory,
 rc.Examination-Investigation,
 rc.MedicalEvaluationWithImaging,
 rc.Diagnosing,
 rc.TragedyOfTheCommons,
 rc.Survey,
 rc.RepresentationTheory,
 rc.SportsTraining,
 rc.CelestialNavigation,
 rc.Metatheorem,
 rc.ModelingAndSimulation,
 rc.CriticismOfMormonism,
 rc.QuantumPhase,
 rc.Evaluating,
 rc.LatticeModel,
 rc.BreastCancerScreening,
 rc.SolvingAProblem,
 rc.NetworkTheory,
 rc.AnalyzingSomething,
 rc.TransfiniteCardinal,
 rc.PointGroup,
 rc.CriminalInvestigation,
 rc.AuthenticationEvent,
 rc.FailingSomething,
 rc.BargainingTheory,
 rc.AdministrativeCourt,
 rc.Circumnavigation,
 rc.AcademicTesting,
 rc.CriticismOfTheUnitedNations,
 rc.ScientificTheory,
 rc.NavalIntelligence,
 rc.InterpretationsOfQuantumMechanics,
 rc.AtomicModel,
 rc.UndercoverOperation-LawEnforcement,
 rc.HearingTest,
 rc.IntegerSequence,
 rc.ThoughtExperimentsInQuantumMechanics,
 rc.Models,
 rc.AdditiveCategory,
 rc.UnitedStatesDiplomaticCablesLeak,
 rc.CausalFallacy,
 rc.ResearchEthics,
 rc.VerificationOfCredit,
 rc.FundamentalStockAnalysis,
 rc.Gentrification,
 rc.EvolutionaryGameTheory,
 rc.CategoryTheoreticCategory,
 rc.Geolocation,
 rc.WeaponsTesting,
 rc.AtmosphericDispersionModeling,
 rc.FilmCriticismOnline,
 rc.MathematicalTheory,
 rc.ProbabilityAssessment,
 rc.SetTheory,
 rc.MathematicalQuantization,
 rc.RapidStrepTest,
 rc.Contrast,
 rc.ForensicToxicology,
 rc.RandomGraph,
 rc.MedicalTesting,
 rc.MonteCarloMethod,
 rc.CategoricalLogic,
 rc.PopulationModel,
 rc.CognitiveBias,
 rc.AmericanCollegeTestingProgramAssessment,
 rc.VettingASource,
 rc.TomographyScan,
 rc.BodyFarm,
 rc.ClosedCategory,
 rc.EurovisionSongThatScoredNoPoints,
 rc.TheoreticalPhysics,
 rc.CosmologicalSimulation,
 rc.StochasticProcess,
 rc.NonlinearSystem,
 rc.HiddenVariableTheory,
 rc.SurveillanceScandal,
 rc.DrugTestWithUrine,
 rc.LatticePoint,
 rc.GraduateManagementAdmissionTest,
 rc.SystemsThinking,
 rc.NeutralBuoyancyTraining,
 rc.ClinicalHumanDrugTrial,
 rc.ProbabilityInterpretation,
 rc.ScientificModeling,
 rc.InductiveInferenceProcess,
 rc.TheoryOfProbabilityDistribution,
 rc.UrbanExploration,
 rc.SchroedingerEquation,
 rc.ChoiceModelling,
 rc.MedicalResearchProject,
 rc.MedicalPhotographyAndIllustration,
 rc.AuditingFinancialRecords,
 rc.ClinicalTrial,
 rc.ElementaryNumberTheory,
 rc.DaggerCategory,
 rc.RealTimeSimulation,
 rc.SyntheticApertureRadar,
 rc.VerificationOfTruth,
 rc.LocalAuthoritySearch,
 rc.BiomedicalResearchService,
 rc.RequestingInformation,
 rc.DualityTheory,
 rc.FiniteModelTheory,
 rc.CriticismOfIslamism,
 rc.TheoryOfGravitation,
 rc.FinancialRatio,
 rc.QuantumMeasurement,
 rc.MedicalUltrasonography,
 rc.Experimenting,
 rc.ForensicPhotography,
 rc.ModularArithmetic,
 rc.GroupAutomorphism,
 rc.JobInterview,
 rc.SatelliteMeteorologyAndRemoteSensing,
 rc.PathologyResearchService,
 rc.Functor,
 rc.RobotNavigation,
 rc.Evaluation,
 rc.HiddenMarkovModel,
 rc.CriticismOfMonotheism,
 rc.RegressionDiagnostic,
 rc.ExteriorInspection,
 rc.PositronEmissionTomography,
 rc.QuadraticForm,
 rc.ForensicEntomology,
 rc.UniversalAlgebra,
 rc.WebBasedSimulation,
 rc.PropositionalFallacy,
 rc.Staring,
 rc.HumanAttributeTesting,
 rc.BritishNuclearTestsAtMaralinga,
 rc.HigherCategoryTheory,
 rc.Intention,
 rc.PreclassicalEconomics,
 rc.AbductiveInferenceProcess,
 rc.NonparametricRegression,
 rc.DrugTest,
 rc.ModularForm,
 rc.FoundationalQuantumPhysics,
 rc.SimulationSoftware,
 rc.Radiography,
 rc.DiracEquation,
 rc.GraduateRecordExamination,
 rc.FreeAlgebraicStructure,
 rc.PsychiatricModel,
 rc.ClinicalResearch,
 rc.VerificationOfEmployment,
 rc.DrugEvaluation,
 rc.DecisionTheory,
 rc.LimitsCategoryTheory,
 rc.CriticalThinking,
 rc.WoodenArchitecture,
 rc.RegressionWithTimeSeriesStructure,
 rc.TheoryOfRelativity,
 rc.Rejecting-CommunicationAct,
 rc.Thinking-NonPurposeful,
 rc.InfiniteGraph,
 rc.ScientificMethod,
 rc.Scrutiny,
 rc.TechnologyDevelopment,
 rc.CuringADisease,
 rc.GaugeTheory,
 rc.DigitalForensics,
 rc.HomologicalAlgebra,
 rc.LatentVariableModel,
 rc.LegalReasoning,
 rc.BiblicalCriticism,
 rc.AutomaticIdentificationAndDataCapture,
 rc.PerformanceReview,
 rc.Morphism,
 rc.LanguageModeling,
 rc.CriticismOfCreationism,
 rc.RobustRegression,
 rc.PsychologicalTesting,
 rc.Discipline,
 rc.ElectroweakTheory,
 rc.DeductiveInferenceProcess,
 rc.ProbabilityFallacy,
 rc.Remedy,
 rc.AlternativesToAnimalTesting,
 rc.Parastatistics,
 rc.Verification,
 rc.MedicalCollegeAdmissionTest,
 rc.NeuropsychologicalTest,
 rc.BirdWatching,
 rc.InformationAnalysis,
 rc.MassIntelligenceGatheringSystem,
 rc.Census,
 rc.Negotiating,
 rc.TheoryOfConstraints,
 rc.CriticismOfWelfare,
 rc.RegressionVariableSelection,
 rc.TypeTheory,
 rc.GroupTheory,
 rc.IntegrableSystem,
 rc.PublicOwnership,
 rc.ChildrensLiteratureCriticism,
 rc.Evidence,
 rc.Declaring-Evaluating,
 rc.ExperimentalMedicineService,
 rc.Supersymmetry,
 rc.BusinessIntelligence,
 rc.SubgroupProperty,
 rc.QuantumLatticeModel,
 rc.ArchitecturalElement,
 rc.NuclearProgram,
 rc.RejectingSomething,
 rc.ErgodicTheory,
 rc.SheafTheory,
 rc.ThoughtExperimenting,
 rc.MakingAPlan,
 rc.NewCriticism,
 rc.AutomaticNumberPlateRecognition,
 rc.ComputerModeling,
 rc.StatisticalOutlier,
 rc.SelfOrganization,
 rc.StandardModel,
 rc.QuantumOptics,
 rc.Simulation-Activity,
 rc.Modeling,
 rc.DatabaseSearching,
 rc.CivilianChemicalResearchProgram,
 rc.FinancialRiskEvaluation,
 rc.HIVVaccineResearch,
 rc.Exploration,
 rc.MoonshineTheory,
 rc.PrerogativeWrit,
 rc.Criticism,
 rc.Argument,
 rc.ProbabilityTheoryParadox,
 rc.ToposTheory,
 rc.CreditScoring,
 rc.VisualThinking,
 rc.TheoryOfDeduction,
 rc.TheatreCriticism,
 rc.InspectingOfHome,
 rc.AxiomOfSetTheory,
 rc.PauliExclusionPrinciple,
 rc.WatchingSomething,
 rc.EnergyDevelopment,
 rc.EmailAuthentication,
 rc.StoolTest,
 rc.IntelligenceAnalysisProcess,
 rc.BasicConceptsInSetTheory,
 rc.IntelligenceGathering,
 rc.CombinatorialGroupTheory,
 rc.SpinModel,
 rc.Deontic-AgencyReasoning,
 rc.ArchitecturalTheory,
 rc.ArgumentsForTheExistenceOfGod,
 rc.LogicalFallacy,
 rc.GraduateSchoolEntranceTest,
 rc.AlgebraicGraphTheory,
 rc.Imagination,
 rc.BusinessProcessModelling,
 rc.CriticismOfJehovahsWitnesses,
 rc.AlternativeMedicalDiagnosticMethod,
 rc.CategoryTheory,
 rc.Apprenticeship,
 rc.GraphRewriting,
 rc.InternetSearching,
 rc.GenomeProject,
 rc.UrineTest,
 rc.PerformanceTesting,
 rc.IntelligenceTest,
 rc.ProductRecall,
 rc.Inquiry,
 rc.HypothesisTesting,
 rc.ResearchProject,
 rc.TypeOfScientificFallacy,
 rc.Swarming,
 rc.ComputationalProblemsInGraphTheory,
 rc.TheoryOfCryptography,
 rc.TRIZ,
 rc.PhilosophicalTheory,
 rc.ChaoticMap,
 rc.GraphTheory,
 rc.TestDrive,
 rc.MagneticMonopole,
 rc.NuclearPhysics,
 rc.MilitaryChemicalWeaponsProgram,
 rc.FairIsaacCreditScoring,
 rc.RevealingTrueInformation,
 rc.ResearchAndDevelopment,
 rc.Canceling-Declaring-Evaluating,
 rc.OilfieldProductionModel,
 rc.ElectronicStructureMethod,
 rc.Teleportation,
 rc.ComputerModel,
 rc.MethodsInSociology,
 rc.Testimony,
 rc.ProposedEnergyProject,
 rc.TelevisionProgramming,
 rc.ProblemSolving,
 rc.FloatingArchitecture,
 rc.ResearchByField,
 rc.MonoidalCategory,
 rc.Explanation-Thinking,
 rc.EconomicsTheorem,
 rc.CriticismOfTheBible,
 rc.CohortStudyMethod,
 rc.FormalTheoriesOfArithmetic,
 rc.InventingSomething,
 rc.LanglandsProgram,
 rc.QuantumState,
 rc.WitchHunt,
 rc.AnthropologicalStudy,
 rc.SocialConstructionism,
 rc.Counting,
 rc.MedicalEthics,
 rc.PhenomenologicalMethodology,
 rc.FunctionalSubgroup,
 rc.EconomicTheory,
 rc.Skepticism,
 rc.FrenchLiteraryCriticism,
 rc.OpenProblem,
 rc.ScientificTechnique,
 rc.ProbabilityTheorem,
 rc.ObjectCategoryTheory,
 rc.MarketFailure,
 rc.FinancialChart,
 rc.ReconnaissanceInForce-MilitaryOperation,
 rc.Consumption-Economics,
 rc.ArchitecturalDesign,
 rc.NumberTheory,
 rc.MagicalThinking,
 rc.MultiplicativeFunction,
 rc.AtomicPhysics,
 rc.RegressionAnalysis,
 rc.DreamInterpretation,
 rc.GaloisTheory,
 rc.ClinicalPsychologyTest,
 rc.TermLogic,
 rc.ArchitectureRecord,
 rc.ResearchAdministration,
 rc.ComputerSurveillance,
 rc.BiochemistryMethod,
 rc.NuclearIsomer,
 rc.DempsterShaferTheory,
 rc.ExtensionsAndGeneralizationsOfGraphs,
 rc.Thought,
 rc.PolarExploration,
 rc.UrbanRenewal,
 rc.ConsistencyModel,
 rc.AppliedLearning,
 rc.CriticalPhenomena,
 rc.DensityFunctionalTheory,
 rc.EnergyModel,
 rc.Magnification-Process,
 rc.Inspecting,
 rc.GeometricGroupTheory,
 rc.CognitiveTest,
 rc.Architecture,
 rc.ArchitecturalCommunication,
 rc.OffenderProfiling,
 rc.MassSurveillance,
 rc.RandomMatrix,
 rc.ExtremalGraphTheory,
 rc.PolynesianNavigation,
 rc.Voyage,
 rc.EconometricModel,
 rc.SemiempiricalQuantumChemistryMethod,
 rc.Reliabilism,
 rc.LearningThat,
 rc.Spinor,
 rc.PerturbationTheory,
 rc.Investigation,
 rc.ExactlySolvableModel,
 rc.CommunicationOfFalsehood,
 rc.SocialResearch,
 rc.CannabisResearch,
 rc.CardinalNumber,
 rc.UrbanAndRegionalPlanning,
 rc.ArchitecturalCompetition,
 rc.SearchAndSeizure,
 rc.GeometricGraphTheory,
 rc.ChartPattern,
 rc.AgeOfDiscovery,
 rc.SustainableArchitecture,
 rc.SubstanceTheory,
 rc.StatisticalFieldTheory,
 rc.Hypothesis,
 rc.Research,
 rc.ModelTheory,
 rc.EnvironmentalResearch,
 rc.SocialEngineering-PoliticalScience,
 rc.Electrocardiogram,
 rc.CancerResearch,
 rc.Determinacy,
 rc.IntelligenceTesting,
 rc.QuantumModel,
 rc.Negotiation,
 rc.AnimalTesting,
 rc.Crystallizing,
 rc.GraphColoring,
 rc.CandlestickPattern,
 rc.ScientificExploration,
 rc.BuildingInformationModeling,
 rc.RadarNetwork,
 rc.ForensicScience,
 rc.LearningByDoing,
 rc.DescriptiveSetTheory,
 rc.FramingSocialSciences,
 rc.ResearchMethod,
 rc.ContractNegotiation,
 rc.Theorizing,
 rc.SocialEngineering-Security,
 rc.MammographyExam,
 rc.MilitaryNuclearWeaponsProgram,
 rc.ForcingMathematics,
 rc.ConceptualDistinction,
 rc.BridgeDesign,
 rc.CollegeEntranceTest,
 rc.GraphConnectivity,
 rc.Amniocentesis,
 rc.GeneralizedLinearModel,
 rc.MedicalImaging,
 rc.Memorizing,
 rc.DiophantineEquation,
 rc.ScholasticAptitudeTest,
 rc.FirstOrderMethod,
 rc.MineralModel,
 rc.Bargaining,
 rc.MilitaryWMDProgram,
 rc.PapSmearTest,
 rc.InnerModelTheory,
 rc.ElectronicDataSearching,
 rc.ConceptualAbstraction,
 rc.CensusInPeru,
 rc.LandscapeArchitecture,
 rc.Voyeurism,
 rc.LawSchoolAdmissionTest,
 rc.GraphEnumeration,
 rc.ControllingSomething-Experimenting,
 rc.BloodPressureTest,
 rc.EstimationTheory,
 rc.NuclearWeaponsTesting,
 rc.AnomaliesInPhysics,
 rc.ForensicMeteorology,
 rc.RevealingInformation,
 rc.LogLinearModel,
 rc.StringBasedSearching,
 rc.PregnancyTest,
 rc.MeasureSetTheory,
 rc.IntelligenceGatheringDiscipline,
 rc.VetoingSomething,
 rc.AchievementTest,
 rc.Ordering,
 rc.TheoryOfAging,
 rc.NavalArchitecture,
 rc.Psychopathy,
 rc.GoOpening,
 rc.GraphMinorTheory,
 rc.MaritimePilotage,
 rc.TrueOrFalseTest,
 rc.MarkovModel,
 rc.VideoSurveillance,
 rc.QuantumFieldTheory,
 rc.FieldResearch,
 rc.GameTheory,
 rc.LearningToRead,
 rc.ConformalFieldTheory,
 rc.StochasticModel,
 rc.OrnithologicalEquipmentOrMethod,
 rc.EyeContact,
 rc.ThroatCultureTest,
 rc.Niche,
 rc.OrdinalNumber,
 rc.EngineProblem,
 rc.Polytely,
 rc.ScientificControl,
 rc.ReligiousArchitecture,
 rc.ProbabilisticArgument,
 rc.InfraredImaging,
 rc.Aleph-1,
 rc.ExoticProbability,
 rc.GraphOperation,
 rc.RealEstateValuation,
 rc.DiastolicBloodPressureTest,
 rc.MathematicalModeling,
 rc.UrbanPlanning,
 rc.CIAActivitiesInTheAmericas,
 rc.QuantumMechanics,
 rc.BiologicalWeaponsTesting,
 rc.Matching,
 rc.Theories,
 rc.Bias,
 rc.AstronomyProject,
 rc.InternationalCriminalCourtInvestigation,
 rc.LifeExtension,
 rc.IndependenceResult,
 rc.CounterIntelligence,
 rc.MemoryTest,
 rc.MediaProgramming,
 rc.TheoreticalBiology,
 rc.TeleologicalArgument,
 rc.GeochronologicalDatingMethod,
 rc.LeastSquares,
 rc.GraphInvariant,
 rc.ChartOverlay,
 rc.KnowledgeSharing,
 rc.EyeTest,
 rc.OilfieldDrillingModel,
 rc.FormalMethod,
 rc.HolonomicBrainTheory,
 rc.LanguageAcquisition,
 rc.StringTheory,
 rc.Rationalization,
 rc.DeterminingInterrelationship,
 rc.Appraising,
 rc.AlzheimersDiseaseResearch,
 rc.SetTheoreticUniverse,
 rc.PersonalityTesting,
 rc.DiscoveringSomething,
 rc.TheoreticalChemistry,
 rc.ProbabilisticModel,
 rc.DeductiveReasoning,
 rc.ComputerSimulation,
 rc.RegressionAndCurveFittingSoftware,
 rc.TechnicalIndicator,
 rc.EconomicsQuantitativeMethod,
 rc.ThyroidologicalMethod,
 rc.DiophantineApproximation,
 rc.Identification,
 rc.Analysis,
 rc.ChaosTheory,
 rc.Comparison-Examination,
 rc.MilitaryBiologicalWeaponsProgram,
 rc.SystemsOfSetTheory,
 rc.PersonalityTest,
 rc.Practicing-Preparing,
 rc.MathematicalEconomics,
 rc.SyllogisticFallacy,
 rc.MacroeconomicsAndMonetaryEconomics,
 rc.Thinking,
 rc.BusinessModel,
 rc.DynamicSystemsDevelopmentMethod,
 rc.SpecialRelativityMt,
 rc.GraphTheoryObject,
 rc.ForensicPathology,
 rc.OilfieldEconomicModel,
 rc.Simulation,
 rc.Syllogism,
 rc.AstronomySurvey,
 rc.Urelement,
 rc.RorschachTest,
 rc.AdministrativeHearing,
 rc.ComputabilityTheory,
 rc.ForestModelling,
 rc.Kantianism,
 rc.Biosimulation,
 rc.CentralLimitTheorem,
 rc.ProbabilityTheory,
 rc.GreatNorthernExpedition,
 rc.SpaceGroup,
 rc.LearningMethod,
 rc.Counterintelligence,
 rc.ChemicalWeaponsTesting,
 rc.ArithmeticFunction,
 rc.Superstring,
 rc.RemoteSensing,
 rc.ArgumentsAgainstTheExistenceOfGod,
 rc.MedicalScience,
 rc.Wellfoundedness,
 rc.InvalidatingSomething,
 rc.TerroristPlot,
 rc.InductiveReasoning,
 rc.LargeDeviationsTheory,
 rc.UniversityEntryTest,
 rc.Observing,
 rc.MammographicBreastCancerScreening,
 rc.QuantumBiology,
 rc.InformationGathering,
 rc.ConceptualModel,
 rc.SocialEngineering,
 rc.DomainDecompositionMethod,
 rc.CholesterolTest,
 rc.ContinuedFraction,
 rc.ForensicAnthropology,
 rc.RoboticsProject,
 rc.InductiveFallacy,
 rc.PsychiatricResearch,
 rc.GameArtificialIntelligence,
 rc.Interviewing,
 rc.AbelianGroupTheory,
 rc.StatisticalModel,
 rc.ComputationalLearningTheory,
 rc.CriticismOfAtheism,
 rc.Designing,
 rc.HilbertSpace,
 rc.Wiretap,
 rc.SurveyMethodology,
 rc.HIVTest,
 rc.SchoolOfThought,
 rc.GeometryOfNumbers,
 rc.ForensicPalynology,
 rc.CivilianEnergyProgram,
 rc.ReligiousCriticism,
 rc.SystolicBloodPressureTest,
 rc.Navigating,
 rc.ChessTheory,
 rc.PublicInquiry,
 rc.PreliminaryHearing,
 rc.Productivity,
 rc.CriticismOfCapitalism,
 rc.ProbabilisticInequality,
 rc.DrugTestWithBlood,
 rc.BloodTest,
 rc.Annulment,
 rc.CrossExamination,
 rc.CivilianBiogeneticsProgram,
 rc.BreastExam,
 rc.Hearing-LegalProceeding,
 rc.ForensicPsychology,
 rc.AlgebraicNumberTheory,
 rc.Zero-Number,
 rc.PoliticalEconomicModel,
 rc.MagneticResonanceImaging,
 rc.CriticismOfBullfighting,
 rc.TechnicalStockAnalysis,
 rc.CombinatorialGameTheory,
 rc.CreditScore-UnitedStates,
 rc.AidsToNavigation,
 rc.PersonalityTheory,
 rc.CriticismOfFeminism,
 rc.LiverFunctionTest,
 rc.StettingSomething]

After doing some counts (len(s_set) for example) and inspections of the list, we determine that the code block so far is providing the entire list of sub-classes under the root. Now we want to start formatting our output similar to the flat files we are using. We begin by prefixing our variable names with s_, p_, o_ to correspond to our subject – predicate – object triples close to the native N3 format. We’ll continue to see this pattern over multiple variables in multiple code blocks for multiple installments.

We also set up an iterator to loop over the s_set, generating an s_item for each element encountered in the list. We add a print to generate back to screen each line:

o_frag = list()
s_frag = list()
p_item = 'rdfs:subClassOf'
for s_item in s_set:
   o_item = s_item.is_a
   print(s_item,p_item,o_item)

Hmm, we see many of the o_item entries are in fact sets with more than one member. This means, of course, that a given entry has multiple parents. For input specification purposes, each one of those variants needs to have its own triple assertion. Thus, we also need to iterate over the o_set entries to generate another single assignment. So, we need to insert another for iteration loop, and indent it as Python expects. Notice, too, that the calls within these loops all terminate with a ‘:’.

o_frag = list()
s_frag = list()
p_item = 'rdfs:subClassOf'
for s_item in s_set:
   o_set = s_item.is_a
   for o_item in o_set:
       print(s_item,p_item,o_item)
       o_frag.append(o_item)
       s_frag.append(s_item)

We test with the length (len) argument to see if we have picked up items.

len(o_frag)

Hmmm, that’s not good. The size of o_frag and s_frag are showing to be the same, but we already saw there were multiple objects for the subjects. Clearly, we’re still not counting and processing this right.

So, we need to make two final changes to this routine. First, we want to get the population of our sets correct. We can see in our prior example that we were counting o_frag and s_frag as part of the same loop, but that is not correct. The s_frag needs to be linked with processing the subject set. We change the indent to assign this correctly. (Testing this may require you to Kernel → Restart & Clear Output and then running all of the above cells.)

The second change we want is for our output to begin to conform to a CSV file with leading and trailing white spaces removed and entries separated by commas, moving us again toward a N3 format. Here are the resulting changes:

o_frag = set()
s_frag = set()
p_item = 'rdfs:subClassOf'
for s_item in s_set:
   o_set = s_item.is_a
   for o_item in o_set:
       print(s_item,',',p_item,',',o_item,'.','\n', sep='', end='')
       o_frag.add(o_item)
   s_frag.add(s_item)

Getting rid of the leading and training white spaces is a little tricky. Indeed the sep ='' argument above is not yet widely used since it was only recently added to Python. Versions 3.3 or earlier do not support this argument and would fail. Since I have no legacy Python code I can afford to rely on the latest versions of the language. But little nuances such as this are something to be aware of as you research various methods, commands and arguments.

We can also check counts again to ensure everything is now correct:

len(s_frag)

And we can start playing around with some of the set methods, in this case the .intersection between our too sets:

len(o_frag.intersection(s_frag))

This is all looking pretty good, though we have not yet dealt with putting the full URIs into the triples. That is straightforward so we can afford to put that off until we are ready to generate the actual typologies. But we realize we also have missed one final piece of the logic necessary to have our typologies readable as separate ontologies: declaring all of our classes as such under the standard owl:Thing. These new classes correspond to each of the entries in the s_frag set, so we add another line in a print statement to do so.

o_frag = set()
s_frag = set()
p_item = 'rdfs:subClassOf'
new_class = 'owl:Thing'
for s_item in s_set:
   o_set = s_item.is_a
   for o_item in o_set:
       if o_item in s_set:
         print(s_item,',',p_item,',',o_item,'.','\n', sep='', end='')
         o_frag.add(o_item)
   s_frag.add(s_item)
   print(s_item,',','a',',',new_class,'.','\n', sep='', end='')
len(s_frag)

Great, our logic appears correct and our counts do, too. So we can consider this code block as developed enough for assembly into a formal method and then module. Let’s now move on to prototyping other components in the KBpedia structure.

Additional Documentation

Here are some other interactive resources related to today’s CWPK installment:

Nice Stack Overflow discussion
2D lists
Arrays.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

NOTE: This CWPK installment is available both as an online interactive file

Main Links

Search

It is Time to Explore Python Dictionaries and Packaging

The Basic Iteration Approach

Starting and Load

Creating the Dictionaries

Marrying Iterators and Routines

Additional Documentation

We Are Now Generating Info That Requires Persistence

File Object and Methods

The print() Function

Open, Close and Reading Files

Output Options

Writing Files

Using the CSV Module

Saving from Jupyter Notebook

Reading List and Additional Documentation

Everything Can Be Annotated in a Knowledge Graph

Items to be Extracted During Annotation Pass

Starting and Load

Basic Extraction Set-up

Working Out the Code Block

Class Annotations

Property Annotations

Additional Documentation

We Continue the Theme of Structural Extraction

Starting and Load

Getting the Subsets Right

Iterating Sub Properties

Bonus Tip

Bonus Tip #2

Additional Documentation

We Extract a Typology Scaffolding from an Active KG

Starting and Load

Additional Documentation