First in an Occasional Series of KBpedia Best Practices
One of my favorite sayings regarding the semantic Web is from James Hendler, now a professor and program director at RPI, and a longstanding contributor to the semantic space whose notable contributions include co-authoring the seminal paper, “The Semantic Web,” in Scientific American in 2001. His statement was “A little semantics goes a long way,” and I wholeheartedly support that view. I previously gave a shoutout to this saying in my book. In this ‘best practice’ note regarding KBpedia and creating and maintaining knowledge graphs, I want to point out two simple techniques that can immediately benefit your own knowledge representation efforts.
The two items I want to highlight are the use of ‘semsets’ (similar to the synsets used by WordNet) and emphasizing subsumption hierarchies in your knowledge graph design. The actual practice of these items involves, as much as anything, embracing a mindset that is attentive to the twin ideas of semantics and inference.
With this article, I’m also pleased to introduce an occasional series on best practices when creating, applying or maintaining knowledge graphs, using KBpedia as the reference knowledge system. I will be presenting this series throughout 2020 coincident with some exciting expansions and applications of the system. These ‘best practice’ articles are not intended to be detailed pieces, as is my normal practice. Rather, I try to present a brief overview of the item, and then describe the process and benefits of applying it.
The fundamental premise of semantic technologies is “things, not strings.” Labels are only the pointers to a thing, and things may be referred to in many different ways, including, of course, many different languages. Is your ‘happy’ the same as my ‘glad’? Examples abound, as language is an ambiguous affair with meaning often dependent on context.
A single term can refer to different things and a single thing can be (and is!!) referred to by many different labels. The lexical database of WordNet helped attack this problem decades ago, by creating what it called ‘synsets’ to aggregate the multiple ways (terms) by which a given thing may be referred. The portmanteau of this name comes from the ‘synset’ being an aggregation of synonyms. In keeping with Charles Peirce’s framing of indexes to a given thing as anything which points to or draws attention to it, we have broadened the idea to include any term or phrase that points to a given thing. This is a broadened semantic sense, so we have given this aggregation of terms the name ‘semset’, a portmanteau using semantics. Elsewhere, I have very broadly defined a semset as including: synonyms, abbreviations, acronyms, aliases, argot, buzzwords, cognomens, derogatives, diminutives, epithets, hypocorisms, idioms, jargon, lingo, metonyms, misspellings, nicknames, non-standard terms (e.g., Twitter), pejoratives, pen names, pseudonyms, redirects, slang, sobriquets, stage names, or synsets. Note this listing is itself a semset for semset.
So, the best practice is this. Whenever adding a new relation or entity or concept to a knowledge graph, give it as broad a semset as you can assemble with reasonable effort. Redirects in Wikipedia and altLabels from Wikidata are two useful starting sources. (You may need to discover other sources for specific domains.) You can see these as the altLabels within the KBpedia knowledge base; see, as examples, abominable snowman, bird, or cake. altLabels are part of the many useful constructs in the SKOS (Simple Knowledge Organization System) RDF language, another best practice to apply to your knowledge graphs.
Then, when querying or retrieving data, one can specify standard prefLabels alone (the single, canonical identifier for the entity) for narrow retrievals, or greatly broaden the query by including the altLabels. In our own deployments, we also often include a standard text search engine such as Lucene or Elasticsearch for such retrievals, which opens up even more control and flexibility. Semsets are an easily deployed way to bridge your semantics from ‘strings’ to ‘things’.
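The narrow-versus-broad retrieval described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Concept` class, the `find_concepts` function, and the sample alternative labels are hypothetical and are not taken from KBpedia or any SKOS library. A production system would more likely store the labels as SKOS properties and query them through a triple store or a text search engine.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Concept:
    """A 'thing' with one canonical label plus a semset of alternative labels."""
    pref_label: str
    alt_labels: set[str] = field(default_factory=set)


def _norm(label: str) -> str:
    # Compare labels without regard to case or extra whitespace.
    return " ".join(label.lower().split())


def find_concepts(concepts, term, broad=False):
    """Match term against prefLabels only, or also against altLabels if broad."""
    wanted = _norm(term)
    hits = []
    for concept in concepts:
        labels = {concept.pref_label}
        if broad:
            labels |= concept.alt_labels
        if wanted in {_norm(label) for label in labels}:
            hits.append(concept)
    return hits


# Illustrative data only; these altLabels are assumptions, not KBpedia's entries.
CONCEPTS = [
    Concept("abominable snowman", {"yeti", "Abominable Snow Man"}),
    Concept("bird", {"birds", "Aves"}),
]

narrow = find_concepts(CONCEPTS, "yeti")             # prefLabel only: no match
broad = find_concepts(CONCEPTS, "yeti", broad=True)  # semset included: one match
```

The same string fails the narrow query and succeeds on the broad one, which is the practical payoff of a well-populated semset.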
Subsumption hierarchies simply mean that a parent concept embraces (or ‘subsumes’) child concepts. The subsumption relationship can be one of intensionality, extensionality, inheritance, or mereology. In intensionality, the child has attributes embraced by the parent, such as a bear having hair like other mammals. In extensionality, class members belong to an enumerable group, as in lions and tigers and bears all being mammals. In inheritance, an actual child is subsumed under a parent. In mereology, a composite thing like a car engine has parts such as pistons, rods, or a timing device. In the W3C standards of RDF or OWL, which we use in KBpedia to capture our semantic knowledge representations, the ‘class’ construct and its related properties are used to express subsumption hierarchies.
The ‘hierarchy’ idea arises from establishing a tree scaffolding of linked items. In this way, subsets of your knowledge graph resemble taxonomies (or tree-like structures) that proceed from the most general at the top (the ‘root’) to the most specific at the bottom (the ‘leaf’). Different types of subsumption relationships are best represented by their own trees. Using such subsumption relations does not preclude other connections or relations in your knowledge graph.
When consistently and logically constructed, a practice that can be learned and can be tested, subsumption hierarchies enable one to infer class memberships. For instance, using the ‘mammal’ example means we can infer a bear is a mammal without so specifying, or, alternatively, we can discover that lions and tigers are also mammals if we know that a bear is a mammal. Subsumption hierarchies are an efficient way to specify group memberships, and a powerful way to overcome imprecise query specifications or to discover implicit relationships.
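A minimal sketch of this kind of inference, assuming a toy hierarchy stored as a plain Python dictionary, follows. The `SUBCLASS_OF` table, the `ancestors` and `descendants` helpers, and the extra ‘animal’ class are hypothetical and added only for illustration. In RDF or OWL the same links would be asserted as class relationships and the inference left to a reasoner.

```python
# Each class maps to its direct parents; 'animal' is an assumed extra level.
SUBCLASS_OF = {
    "bear": {"mammal"},
    "lion": {"mammal"},
    "tiger": {"mammal"},
    "mammal": {"animal"},
}


def ancestors(cls, subclass_of):
    """Return every class that subsumes cls, following parent links transitively."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def descendants(cls, subclass_of):
    """Return every class subsumed by cls, directly or indirectly."""
    return {child for child in subclass_of if cls in ancestors(child, subclass_of)}


bear_groups = ancestors("bear", SUBCLASS_OF)      # inferred memberships of 'bear'
mammal_kinds = descendants("mammal", SUBCLASS_OF)  # siblings discovered via the parent
```

Only the direct links are stated, yet a bear is inferred to be an animal, and asking for the kinds of mammal surfaces lions and tigers alongside bears.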
Semsets and subsumption hierarchies are easy techniques for incorporating semantics into your knowledge graphs. These two simple techniques (among a few others) readily demonstrate the truth of Hendler’s “a little semantics goes a long way” in improving your knowledge representations.
hiddenLabel), which is a best practice to include, and the W3C standards allow for internationalization of all labels by use of the language tag for labels.