These are Two Recommended Utilities
This installment in our Cooking with Python and KBpedia series covers two useful (essential?) utilities for any substantial project:
- stats refers to internal program or knowledge graph metrics, not a generalized statistical analysis package, and
- logging is a longstanding Python module that provides persistence and superior control over using simple print statements.
On the stats side, we will emphasize capturing metrics not already available when using Protégé, which provides its own set of useful baseline statistics. (See Figure 1.) These metrics are mostly simple counts, with some sums and averages. The results of these metrics are some of the numerical data points that we will use in the next installment on charting.
On the logging front, we will edit all of our existing routines to log to file, as well as print to screen. We can embed these routines in existing functions so that we may better track our efforts.
An Internal Stats Module
In our earlier extract-and-build routines we have already put in place the basic file and set processing steps necessary to capture additional metrics. We will add to these here, in the process creating an internal stats module in our cowpoke package.
First, there is no need to duplicate the information that already comes to us when using Protégé. Here are the standard stats provided on the main start-up screen:
We are loading here (1) our in-progress KBpedia v. 300. We can see that Protégé gives us counts (2) of classes (58,200), object properties (1,316), data properties (802), annotation properties (2,919), and a few other metrics.
We will take these values as givens, and will enter them as part of the initialization for our own internal procedures (for checking totals and calculating percentages).
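As a minimal sketch of that idea (the dictionary and function names here are illustrative, not part of cowpoke), we might record the Protégé baseline counts and compute percentage shares against them:

```python
# Hypothetical baseline totals taken from the Protégé start-up screen,
# used for checking totals and calculating percentages in our own stats.
protege_baseline = {
    'classes': 58200,
    'object_properties': 1316,
    'data_properties': 802,
    'annotation_properties': 2919,
}

def percent_of_total(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Sum the three property counts to get a property grand total
total_properties = (protege_baseline['object_properties']
                    + protege_baseline['data_properties']
                    + protege_baseline['annotation_properties'])

print(percent_of_total(protege_baseline['object_properties'], total_properties))  # → 26.1
```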
Pyflakes is a simple Python code checker that you may want to consider. If you also want stylistic checks, flake8 combines Pyflakes with style checks against PEP 8 (via pycodestyle). Pylint is another static code and style checker.
```python
from cowpoke.__main__ import *
from cowpoke.config import *
```
```python
### KEY CONFIG SETTINGS (see build_deck in config.py) ###
# 'kb_src'  : 'standard'
# count     : 14                                                 # Note 1
# out_file  : 'C:/1-PythonProjects/kbpedia/v300/targets/stats/kko_typol_stats.csv'

from itertools import combinations                               # Note 2

def typol_stats(**build_deck):
    kko_list = typol_dict.values()
    count = build_deck.get('count')
    out_file = build_deck.get('out_file')
    with open(out_file, 'w', encoding='utf8') as output:
        print('count,size_1,kko_1,size_2,kko_2,intersect RCs', file=output)
        for i in combinations(kko_list, 2):
            kko_1 = i[0]
            kko_2 = i[1]
            kko_1_frag = kko_1.replace('kko.', '')
            kko_1 = getattr(kko, kko_1_frag)
            print(kko_1_frag)
            kko_2_frag = kko_2.replace('kko.', '')
            kko_2 = getattr(kko, kko_2_frag)
            descent_1 = kko_1.descendants(include_self = False)
            descent_1 = set(descent_1)
            size_1 = len(descent_1)
            descent_2 = kko_2.descendants(include_self = False)
            descent_2 = set(descent_2)
            size_2 = len(descent_2)
            intersect = descent_1.intersection(descent_2)
            num = len(intersect)
            if num <= count:
                print(num, size_1, kko_1, size_2, kko_2, intersect, sep=',', file=output)
            else:
                print(num, size_1, kko_1, size_2, kko_2, sep=',', file=output)
    print('KKO typology intersection analysis is done.')
```
The procedure above takes a few minutes to run. You can inspect what the routine produces in the out_file location given in the configuration settings.
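If helpful, here is a small sketch (not part of cowpoke) for previewing the first few rows of a CSV like the one the routine writes, using Python's standard csv module:

```python
import csv

def preview_rows(fileobj, limit=3):
    """Return the first `limit` data rows of a CSV file as dictionaries."""
    reader = csv.DictReader(fileobj)
    rows = []
    for n, row in enumerate(reader):
        if n >= limit:
            break
        rows.append(row)
    return rows

# Usage against the out_file written above, e.g.:
# with open('C:/1-PythonProjects/kbpedia/v300/targets/stats/kko_typol_stats.csv',
#           newline='', encoding='utf8') as f:
#     for row in preview_rows(f, limit=10):
#         print(row['count'], row['kko_1'], row['kko_2'])
```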
We can also get summary statistics from the knowledge graph using the rdflib package. Here is a modification of one of the library's routines to obtain some VoID statistics:
```python
import collections
from rdflib import URIRef, Graph, Literal
from rdflib.namespace import VOID, RDF

graph = world.as_rdflib_graph()
g = graph

def generate2VoID(g, dataset=None, res=None, distinctForPartitions=True):
    """
    Returns a VoID description of the passed dataset

    For more info on Vocabulary of Interlinked Datasets (VoID), see:
    http://vocab.deri.ie/void

    This only makes two passes through the triples (once to detect
    the types of things)

    The tradeoff is that lots of temporary structures are built up
    in memory meaning lots of memory may be consumed :)

    distinctSubjects/objects are tracked for each class/propertyPartition;
    this requires more memory again
    """
    typeMap = collections.defaultdict(set)
    classes = collections.defaultdict(set)
    for e, c in g.subject_objects(RDF.type):
        classes[c].add(e)
        typeMap[e].add(c)

    triples = 0
    subjects = set()
    objects = set()
    properties = set()
    classCount = collections.defaultdict(int)
    propCount = collections.defaultdict(int)

    classProps = collections.defaultdict(set)
    classObjects = collections.defaultdict(set)
    propSubjects = collections.defaultdict(set)
    propObjects = collections.defaultdict(set)
    num_classObjects = 0
    num_propSubjects = 0
    num_propObjects = 0

    for s, p, o in g:
        triples += 1
        subjects.add(s)
        properties.add(p)
        objects.add(o)

        # class partitions
        if s in typeMap:
            for c in typeMap[s]:
                classCount[c] += 1
                if distinctForPartitions:
                    classObjects[c].add(o)
                    classProps[c].add(p)

        # property partitions
        propCount[p] += 1
        if distinctForPartitions:
            propObjects[p].add(o)
            propSubjects[p].add(s)

    if not dataset:
        dataset = URIRef('http://kbpedia.org/kko/rc/')

    if not res:
        res = Graph()

    res.add((dataset, RDF.type, VOID.Dataset))

    # basic stats
    res.add((dataset, VOID.triples, Literal(triples)))
    res.add((dataset, VOID.classes, Literal(len(classes))))
    res.add((dataset, VOID.distinctObjects, Literal(len(objects))))
    res.add((dataset, VOID.distinctSubjects, Literal(len(subjects))))
    res.add((dataset, VOID.properties, Literal(len(properties))))

    for i, c in enumerate(classes):
        part = URIRef(dataset + "_class%d" % i)
        res.add((dataset, VOID.classPartition, part))
        res.add((part, RDF.type, VOID.Dataset))
        res.add((part, VOID.triples, Literal(classCount[c])))
        res.add((part, VOID.classes, Literal(1)))
        res.add((part, VOID["class"], c))
        res.add((part, VOID.entities, Literal(len(classes[c]))))
        res.add((part, VOID.distinctSubjects, Literal(len(classes[c]))))
        if distinctForPartitions:
            res.add((part, VOID.properties, Literal(len(classProps[c]))))
            res.add((part, VOID.distinctObjects, Literal(len(classObjects[c]))))
            num_classObjects = num_classObjects + len(classObjects[c])

    for i, p in enumerate(properties):
        part = URIRef(dataset + "_property%d" % i)
        res.add((dataset, VOID.propertyPartition, part))
        res.add((part, RDF.type, VOID.Dataset))
        res.add((part, VOID.triples, Literal(propCount[p])))
        res.add((part, VOID.properties, Literal(1)))
        res.add((part, VOID.property, p))
        if distinctForPartitions:
            entities = 0
            propClasses = set()
            for s in propSubjects[p]:
                if s in typeMap:
                    entities += 1
                for c in typeMap[s]:
                    propClasses.add(c)
            res.add((part, VOID.entities, Literal(entities)))
            res.add((part, VOID.classes, Literal(len(propClasses))))
            res.add((part, VOID.distinctSubjects, Literal(len(propSubjects[p]))))
            res.add((part, VOID.distinctObjects, Literal(len(propObjects[p]))))
            num_propSubjects = num_propSubjects + len(propSubjects[p])
            num_propObjects = num_propObjects + len(propObjects[p])

    print('triples:', triples)
    print('subjects:', len(subjects))
    print('objects:', len(objects))
    print('classObjects:', num_classObjects)
    print('propObjects:', num_propObjects)
    print('propSubjects:', num_propSubjects)

    return res, dataset
```
```python
generate2VoID(g, dataset=None, res=None, distinctForPartitions=True)
```
(<Graph identifier=Na47c69e2f7b84d9b911c46e2cdf0fe11 (<class 'rdflib.graph.Graph'>)>,
These metrics can go into the pot with the summary statistics we also gain from Protégé. We’ll see some graphic reports on these numbers in the next installment.
I think an honest appraisal may straddle the fence about whether logging makes sense for the cowpoke package. On the one hand, we have begun to assemble a fair degree of code within the package, which would normally argue for logging. On the other hand, we run the various scripts only sporadically, and in pieces when we do. There is no continuous production function in what we have done so far.
If we were to introduce this code into a production setting or get multiple developers involved, I would definitely argue for logging. Consider the current cowpoke code base as roughly the transition point at which this question arises. However, since logging is good practice, and we are close to that point, let's go ahead and invoke the capability nonetheless.
One chooses logging over simple print statements for several reasons:
- The ability to time stamp our logging messages
- The ability to keep our logging messages persistent
- We can generate messages constantly in the background for later inspection, and
- We can better organize our logging messages.
The logging module that comes with Python is quite mature and has further advantages:
- We can control the warning level of the messages and what warning levels trigger logging
- We can format the messages as we wish, and
- We can send our messages to file, screen, or socket.
By default, the Python logging module has five pre-set warning levels:
- debug – detailed information, typically of interest only when diagnosing problems
- info – confirmation that things are working as expected
- warning – an indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected
- error – due to a more serious problem, the software has not been able to perform some function, or
- critical – a serious error, indicating that the program itself may be unable to continue running.
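A standalone sketch of how these levels act as thresholds (using only the standard logging module; the messages themselves are illustrative):

```python
import logging

# With the root logger's level set to WARNING (Python's default),
# only warning-and-above messages are emitted.
logging.basicConfig(level=logging.WARNING)

logging.debug('suppressed')       # below the threshold, not shown
logging.info('suppressed')        # below the threshold, not shown
logging.warning('this appears')   # prints 'WARNING:root:this appears'
```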
We’ll see in the following steps how we can configure the logger and set it up for working with our existing functions.
Logging is organized as a tree, with the root being the system level. For a single package, it is best to set up a separate main logging branch under the root so that warnings and log messages can be handled consistently throughout the package. This design, for example, allows warning messages and logging levels to be set with a single call across the entire package (sub-branches may have their own conditions). This is what is called adding a 'custom' logger to your system.
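A minimal sketch of that tree idea (the sub-logger names here are illustrative):

```python
import logging

# One package-level logger ('cowpoke') sits directly under the root;
# sub-branch loggers inherit its level unless they override it.
pkg_logger = logging.getLogger('cowpoke')             # main branch for the package
pkg_logger.setLevel(logging.INFO)                     # one call governs the whole branch

build_logger = logging.getLogger('cowpoke.build')     # sub-branches
extract_logger = logging.getLogger('cowpoke.extract')
```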
Configurations may be set in Python code (the method we will use, because it is the simplest) or via a separate .ini file. Configuration settings include most of the items specified below.
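For comparison, here is a sketch of the declarative alternative using logging.config.dictConfig (the same structure could equally live in an external configuration file; the handler and logger names here are illustrative, not part of cowpoke):

```python
import logging
import logging.config

# A declarative configuration equivalent to wiring up handlers in code
LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'console': {'format': '%(name)s - %(levelname)s - %(message)s'},
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'level': 'WARNING',
            'formatter': 'console',
        },
    },
    'loggers': {
        'cowpoke': {'handlers': ['console'], 'level': 'DEBUG'},
    },
}

logging.config.dictConfig(LOGGING_CONFIG)
logger = logging.getLogger('cowpoke')
logger.warning('configured declaratively')
```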
You can set up logging messages to go to console (screen) or file. In our examples below, we will do both.
You can set up how your messages are formatted. We can also format console vs. file messages differently, as our examples below show.
Whenever we insert a logging message, beside setting severity level, we may also assign a message unique to that part of the code. However, if we choose not to assign a new, specific message, the message invoked will be the default one defined in our configuration.
Since our set-up is straightforward, we will put our configuration settings into our existing config.py file and write our logging messages to the log subdirectory. Here is how our set-up looks (with some in-line commentary):
```python
import logging

# Create a custom logger
logger = logging.getLogger(__name__)      # Will invoke name of current module

# Create handlers
log_file = 'C:/1-PythonProjects/kbpedia/v300/targets/logs/kbpedia_logging.log'
#logging.basicConfig(filename=log_file, level=logging.DEBUG)
c_handler = logging.StreamHandler()       # Separate c_ and f_ handlers
f_handler = logging.FileHandler(log_file)
c_handler.setLevel(logging.WARNING)
f_handler.setLevel(logging.DEBUG)

# Create formatters and add them to the handlers
# File logs include time stamp, console does not
c_format = logging.Formatter('%(name)s - %(levelname)s - %(message)s')
f_format = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
c_handler.setFormatter(c_format)
f_handler.setFormatter(f_format)

# Add handlers to the logger
logger.addHandler(c_handler)
logger.addHandler(f_handler)

# Note: these module-level logging.* calls go through the root logger;
# use logger.debug(), etc., to route through the custom handlers above
logging.debug('This is a debug message.')
logging.info('This is an informational message.')
logging.warning('Warning! Something does not look right.')
logging.error('You have encountered an error.')
logging.critical('You have experienced a critical problem.')
```
WARNING:root:Warning! Something does not look right.
ERROR:root:You have encountered an error.
CRITICAL:root:You have experienced a critical problem.
Make sure you have the import logging statement at the top of all of your cowpoke files.
Then, as you write or update your routines, use the appropriate logging.severity() statement (e.g., logging.info() or logging.warning()) where you previously were using print statements.
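As an illustration of that conversion (row_clean is a hypothetical function, not part of cowpoke), a routine moved from print statements to leveled logging calls might look like:

```python
import logging

logger = logging.getLogger(__name__)

def row_clean(rows):
    """Strip whitespace from rows, logging rather than printing progress."""
    cleaned = []
    for row in rows:
        if row is None:
            logger.warning('Empty row encountered; skipping.')  # was: print(...)
            continue
        cleaned.append(row.strip())
    logger.info('Cleaned %d rows.', len(cleaned))               # was: print(...)
    return cleaned
```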
Here is some supporting documentation for today’s installment: