Posted: July 28, 2020

A Peek at Forthcoming Topics in the Cooking with Python and KBpedia Series

Throughout this Cooking with Python and KBpedia series I want to cover some major topics about how to use, maintain, build, and extend knowledge graphs, using our KBpedia knowledge system as the model. KBpedia is a good stalking horse for these exercises — really, more like recipes — because it has broad domain coverage and a modular design useful for modification or extension for your own domain purposes. At appropriate times throughout these exercises you may want to fork your own version of KBpedia and the accompanying code to whack at for your own needs.

Today, as I begin these first articles, I am anticipating scores of individual installments in our CWPK series. Some 55 of these installments and the associated code have already been written. Though the later portion of the series gets more complicated, I am hoping that the four months or so it will take to publish the anticipated 75 installments, at the rate of one per business day, will give me sufficient time to finish writing the remainder. We shall see.

My intent is not to provide a universal guide, since I will be documenting steps using a specific knowledge graph, a specific environment (Windows 10 and some AWS Ubuntu), and a specific domain (knowledge management and representation). As the series name indicates, we will only be working with the Python language. We’ve made these choices because of familiarity, and because of our desire to produce a code base at the series’ conclusion that better enables users to modify and extend KBpedia from scratch. We also are focusing on the exploring newbie more than the operating enterprise. I do not touch on issues of security, code optimization, or scalability. My emphasis is on simplicity and learning, not performance or efficiency. Those with different interests may want to skip some installments and consult the suggested resources. Still, to my knowledge, there is truly nothing like this series out there: a multi-installment, interactive, and didactic environment for learning ‘soup-to-nuts’ practical things about semantic technologies and knowledge graphs.

The first part of the installments deals with the design intent of the series, architecture, and selection and installation of tools and applications. Once we have the baseline system loaded and working, we explore basic use and maintenance tasks for a knowledge graph. We show how those are done with a common ontology development environment, the wonderful Protégé open-source IDE, as well as programmatically through Python. At various times we will interact with the knowledge base using Python programmatically or via the command line, electronic notebooks, or Web page templates. There are times when Protégé is absolutely the first tool of choice. But to extract the most value from a knowledge graph we also need to drive it programmatically, sometimes analyze it as a graph, do natural language tasks, or use it to stage training sets or corpora for supervised or unsupervised machine learning. Further, to best utilize knowledge graphs, we need to embed them in our daily workflows, which means interacting with them in a distributed, multiple-application way, as opposed to a standalone IDE. This is why we must learn tools that go beyond navigation and inspection. The scripts we will learn from these basic use and maintenance tasks will help us get familiar with our new Python surroundings and set a foundation for the topics to come.

In our next part, we begin working with Python in earnest. Of particular importance is finding a library or set of APIs to work directly with OWL, the language of the KBpedia knowledge graph. There are many modifications and uses of KBpedia that existing Python tools can aid. Having the proper interface or API to talk directly to the object types within the knowledge graph is essential. There are multiple options for how to approach this question, and no single, ‘true’ answer. Once we have selected and installed this library, we then need to sketch out the exact ways we intend to access, use and modify KBpedia. These actions then set our development agenda for finding and scripting Python tools into our workflow.
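As a concrete taste of what such an interface looks like, here is a minimal sketch using owlready2, one commonly used Python library for working with OWL ontologies. The library choice and the local file path are illustrative assumptions at this point; the series settles on its actual library in a later installment.

```python
# Minimal sketch: load an OWL export of KBpedia and inspect a few classes.
# The path below is a placeholder for wherever the distribution file sits.
from owlready2 import get_ontology

onto = get_ontology("file://C:/kbpedia/kbpedia-reference-concepts.owl").load()

# Every class in the ontology is exposed as a Python object
for cls in list(onto.classes())[:10]:
    print(cls.iri, cls.label)
```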

There are off-the-shelf Python tools for querying the knowledge graph (SPARQL), adding rules to the graph (SWRL), and visualizing the outputs. Because we are also using Python to manipulate KBpedia, we also need to understand how to write outputs to file, necessary as inputs to other third-party tools and advanced applications. Since this is the first concentrated section involved in finding and modifying existing Python code, we’ll also use a couple of installments to assemble and document useful Python coding fragments and tips.
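For instance, a hedged sketch of a SPARQL query run locally with the rdflib library might look like the following; the file name is a placeholder for the N3 file in the KBpedia distribution, and the query simply lists a few concepts with their preferred labels.

```python
# Run a small SPARQL query against a local copy of the knowledge graph.
from rdflib import Graph

g = Graph()
g.parse("kbpedia-reference-concepts.n3", format="n3")   # placeholder file name

query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?concept ?label
WHERE {
  ?concept rdfs:subClassOf ?parent ;
           skos:prefLabel ?label .
}
LIMIT 10
"""

for row in g.query(query):
    print(row.concept, row.label)
```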

Armed with this background, and having gotten our feet wet a bit with Python, we are now positioned to begin writing our own Python code to achieve our desired functionality. We begin this process by focusing on some standard knowledge graph modification tasks: adding or changing nodes, properties or labels (annotations) within our knowledge graph. Of course, these capabilities are available directly from Protégé. However, we want to develop our own codes for this process in line with how we build and test these knowledge graphs in the first place. These skills set the foundation for how we can filter and make subset selections as training sets for machine learning. One key installment from this part of the CWPK series involves how we set up comma-separated values (CSV) files in UTF-8 format, the standard we have adopted for staging for use by third-party tools and in KBpedia’s build process. We also discuss the disjoint property in OWL and its relation to the modular typology design used by KBpedia. Via our earlier Python-OWL mappings we will see how we can use the Python language itself to manipulate these OWL constructs.
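As a small illustration of the CSV side of this, the following sketch writes a UTF-8 staging file with Python's standard csv module. The column layout and concept identifiers are made up for illustration; they are not the actual KBpedia input format, which later installments document.

```python
import csv

# Hypothetical staging rows: identifier, preferred label, alternative labels
rows = [
    ("rc:SomeNewConcept", "some new concept", "new concept|fresh concept"),
    ("rc:AnotherConcept", "another concept", "other concept"),
]

# newline="" and encoding="utf-8" are the key details for clean CSV output
with open("staging_concepts.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "prefLabel", "altLabel"])
    writer.writerows(rows)
```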

The installments in this initial part of the series are about setting up and learning about our tools and environment so that we can begin doing new, real work against our knowledge graphs. The first real work to which we will apply these tools is the extraction-and-build routines by which we can produce a knowledge graph from scratch. One of the unusual aspects of KBpedia is that the knowledge graph, as large and as comprehensive as it may be, is built entirely from a slate of flat (CSV, in N3 format) input files. KBpedia, in its present version 2.50, has about five standard input files, plus five optional files for various fixes or overrides, and about 30 external mapping files (which vary, obviously, by the number of external sources we integrate). These files can be easily and rapidly edited and treated in bulk, and are then used as the inputs to the build process. The build process also integrates a number of syntax and logical checks to make sure the completed knowledge graph is consistent and satisfiable, with standardized syntax. As errors are surfaced, modifications get made, until the build finally passes its logic tests. Multiple build iterations are necessary for any final public release. One of the reasons we wanted a more direct Python approach in this series was to bootstrap the routines and code necessary to enable this semi-automated build process. The build portion, in particular, has been part of KBpedia’s special sauce from the beginning, but is a capability we have not yet documented or shared with the world. Since its open sourcing in 2018, each release of KBpedia has been the result of multiple build cycles to produce a tested, vetted knowledge graph.
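To give a flavor of the build direction, here is a toy sketch (not the actual KBpedia build code, which later installments develop) that reads hypothetical parent-child rows from a UTF-8 CSV file and serializes them as rdfs:subClassOf triples in N3 using rdflib; the RC namespace shown is an assumption patterned on KBpedia's identifiers.

```python
import csv
from rdflib import Graph, Namespace, RDFS

RC = Namespace("http://kbpedia.org/kko/rc/")   # assumed KBpedia RC namespace

g = Graph()
g.bind("rc", RC)

# Hypothetical input file of (child, parent) identifier pairs
with open("parent_child.csv", encoding="utf-8", newline="") as f:
    for child, parent in csv.reader(f):
        g.add((RC[child], RDFS.subClassOf, RC[parent]))

g.serialize(destination="subclass_structure.n3", format="n3")
```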

In exposing these standard methods, we also needed to add to them a complementary set of extraction routines. Getting familiar with extracting resources from a knowledge graph has many practical applications and is an easier way to start learning Python. This substantive portion of CWPK is new, and gives us the ability to break down a knowledge graph into a similar set of simple, constituent input files. We thus end up with a unique roundtripping environment that, while specific to KBpedia as expressed in this series, can be modified through the Python code that accompanies this series for potentially any knowledge graph. It has also resulted in a more generalized approach, suitable for a wider range of knowledge graphs than our original internal design supported. We now have a re-factored, second-generation set of knowledge graph extract-and-build routines. Some modifications may be needed for other types of starting knowledge graphs, but we hope the steps documented can be readily adapted for those purposes based on how we have described and coded them. Because of these advances, we will witness a new version release of KBpedia along this path.
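The extraction direction is the mirror image of the build sketch above. Another toy sketch, again with rdflib and with placeholder file names, pulls the subclass pairs back out of a loaded graph and writes them to a flat CSV file:

```python
import csv
from rdflib import Graph, RDFS

g = Graph()
g.parse("kbpedia-reference-concepts.n3", format="n3")   # placeholder file name

with open("extracted_subclass_pairs.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["child", "parent"])
    for child, parent in g.subject_objects(RDFS.subClassOf):
        writer.writerow([str(child), str(parent)])
```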

The last major portion of installments provides some recipes on how to use and leverage a knowledge graph. A portion of those installments involve creating machine learning training sets and corpora. We will tie into some leading Python applications and libraries to conduct a variety of supervised and unsupervised learning tasks including categorization, clustering, fine-grained entity recognition, and sentiment and relation extraction. We will touch upon the leading Python toolsets available, and gain recipes for general ways to work with these systems. We will generate and use graph and word embedding models to also support those tasks, plus new ones like summarization. We will undertake graph analytics and do random walks over the KBpedia graph to probe network concepts like community, influence, and centrality. We will tie into general natural language processing (NLP) toolkits and show the value for these uses that may be extracted from KBpedia. This rather lengthy part of our series includes how to set up an endpoint for our knowledge graph and how to tie into graphing and charting libraries for various visualizations.
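As a foretaste of the simpler end of those recipes, the sketch below clusters a handful of concepts by their definitions using scikit-learn; the concepts and definitions are invented placeholders standing in for text that would be extracted from the knowledge graph.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

concepts = ["rc:Bear", "rc:Lion", "rc:Cake", "rc:Pie"]   # placeholder identifiers
definitions = [
    "a large heavy mammal with thick fur",
    "a large carnivorous feline mammal",
    "a sweet baked dessert made from flour and sugar",
    "a baked dessert with a pastry crust and a filling",
]

# Vectorize the definition text and group it into two clusters
X = TfidfVectorizer(stop_words="english").fit_transform(definitions)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

for concept, cluster in zip(concepts, labels):
    print(concept, "-> cluster", cluster)
```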

Since I am not a professional programmer and certainly am no expert in Python, the codes that are produced and distributed in this series are intended as starting points. Perhaps others may develop more polished and performant code for these purposes over time. I welcome such input and will do what we can to bring awareness and distribution mechanisms to any such improvements. But, crude and simple as they may be, all of the Python tools we build during this series, plus the instructions on why and how to do so as demonstrated through interactive Jupyter Notebooks, can help start you on the path to modify, delete, or extend what exists in KBpedia with your own domain graphs and data.

We devote the concluding installments in our CWPK series to how you may leverage these resources to tackle your own needs and domain. We also try to provide additional resources each step of the way to aid your own learning. In the aggregate, these installments cover an awful lot of ground. However, inch by inch, we will make reaching our end goal a cinch.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

Posted by AI3's author, Mike Bergman, on July 28, 2020 at 9:39 am in CWPK, KBpedia, Semantic Web Tools | Comments (2)
The URI link reference to this post is: https://www.mkbergman.com/2327/cwpk-2-what-to-expect/
Posted: July 27, 2020

Intro to an Ongoing Series of More than 70 Recipes to Work with KBpedia

We decided to open source the KBpedia knowledge graph in October 2018. KBpedia is a unique knowledge system that intertwines seven ‘core’ public knowledge bases: Wikipedia, Wikidata, schema.org, DBpedia, GeoNames, OpenCyc, and the UNSPSC products and services standard. KBpedia’s explicit purpose is to provide a computable scaffolding and design for data interoperability and knowledge-based artificial intelligence (KBAI).

Written primarily in OWL 2, KBpedia includes more than 58,000 reference concepts, mapped linkages to about 40 million entities (most from Wikidata), and 5,000 relations and properties, all organized according to about 70 modular typologies. KBpedia’s upper structure is the KBpedia Knowledge Ontology (KKO), which weaves the major concepts from these seven core knowledge bases into an integrated whole based on the universal categories and knowledge representation insights of the great 19th century American logician, polymath and scientist, Charles Sanders Peirce.

We have continued to expand KBpedia’s scope and refine its design since it was first released. Though the entire structure has been available for download through a number of version releases, it is fair to say that only experienced semantic technologists have known how to install and utilize these files to their fullest. Further, we have an innovative ‘build-from-scratch’ design in KBpedia that has not yet been shared. Our objective in this ongoing series is to publish daily ‘recipes’ over a period of about four months on how the general public may learn to use the system and also to build it. With that knowledge, it should be easier to modify KBpedia for other specific purposes.

The mindset we have adopted to undertake this series is that of a focused, needful ‘newbie.’ The individual we have in mind may not know programming and may not know ontologies, but is willing to learn enough about these matters in order to move forward with productive work using KBpedia (or derivatives of it). Perhaps our newbie knows some machine learning, but has not been able to bring multiple approaches and tools together using a consistent text- and structure-rich resource. Perhaps our newbie is a knowledge manager or worker desirous of expanding their professional horizons. This focus leads us to the very beginning of deciding what resources to learn; these early decisions are some of the hardest and most impactful for whether ultimate aims are met. Those with more experience may skip these first installments, but may find some value in a quick scan nonetheless.

The first installments in our series begin with those initial decisions, move on to tools useful throughout the process, and frame how to load and begin understanding the baseline resources of the KBpedia open-source distribution. We then discuss standard knowledge management tasks that may be applied to the resources. One truism about keeping a knowledge system relevant and dynamic is to make sure the effort put into it continues to deliver value. We then begin to conduct work with the system in useful areas that grow in complexity from intelligent retrieval, to entity and relation extractions, natural language understanding, and machine learning. The intermediate part of our series deals with how to build KBpedia from scratch, how to test it logically, and how to modify it for your own purposes. Our estimate going in is that we will offer about 75 installments in this series, to conclude before US Thanksgiving. Aside from a few skipped days on holidays and such, we will post a new installment every business day between now and then.

The ‘P’ in CWPK comes from using the Python language. In our next installment, we discuss why we chose it for this series over KBpedia’s original Clojure language roots. Because Python is a new language for us, throughout this series we document relevant aspects of learning that language as well. Adding new language aspects to the mix is consistent with our mindset to embrace the newbie. Even if one already knows a programming language, extracting maximum advantage from a knowledge graph well populated with entities and text is likely new. As we go through the process of Python earning its role in the CWPK acronym, we will take some side paths and find interesting applications or fun wrinkles to throw into the mix.

We will also be bringing this series to you via the perspective of our existing systems: Windows 10 on desktops and laptops, and Linux Ubuntu on Amazon Web Services (AWS) cloud servers. These are broadly representative systems. Unfortunately, our guidance and series will have less direct applicability to Apple systems or to other Linux distributions.

Look for the next installment tomorrow. As we put out an installment per business day over the next four months, we’ll learn much together through this process. Please let me know how it is going or what you would like to learn. Let the journey begin . . . .

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

Posted by AI3's author, Mike Bergman, on July 27, 2020 at 1:13 pm in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2324/cwpk-1-cooking-with-python-and-kbpedia/
Posted: July 13, 2020

New KBpedia ID Property and Mappings Added to Wikidata

Wikidata editors last week approved adding a new KBpedia ID property (P8408) to their system, and we have therefore followed up by adding nearly 40,000 mappings to the Wikidata knowledge base. We have another 5,000 to 6,000 mappings still forthcoming, which we will be adding in the coming weeks. Thereafter, we will continue to increase the cross-links, as we partially document below.

This milestone is one I have had in my sights for at least the past few years. We want to both: 1) provide a computable overlay to Wikidata; and 2) increase our reliance on and use of Wikidata’s unparalleled structured data resources as we move KBpedia forward. Below I give a brief overview of the status of Wikidata, share some high-level views of our experience in putting forward and then mapping a new Wikidata property, and conclude with some thoughts on where we might go next.

The Status of Wikidata

Wikidata is the structured data initiative of the Wikimedia Foundation, the open-source group that oversees Wikipedia and many other notable Web-wide information resources. Since its founding in 2012, Wikidata’s use and prominence have exceeded expectations. Today, Wikidata is a multi-lingual resource with structured data for more than 95 million items, characterized by nearly 10,000 properties. Items are being added to Wikidata at a rate of nearly 5 million per month. A rich ecosystem of human editors and bots patrols the knowledge base and its entries to enforce data quality and consistency. The ecosystem includes tools for bulk loading of data with error checks, search including structured SPARQL queries, and navigation and visualization. Errors and mistakes in the data occur, but the system ensures such problems are removed or corrected as discovered. Thus, as the data have grown, quality and usefulness have improved as well.

From KBpedia’s standpoint, Wikidata represents the most complete complementary instance data and characterization resource available. As such, it is the driving wheel and stalking horse (to mix eras and technologies) to guide where and how we need to incorporate data and its types. These have been the overall motivators for us to embrace a closer relationship with Wikidata.

As an open governance system, Wikidata has adopted its own data models, policies, and approval and ingest procedures for adopting new data or characterizations (properties). You might find it interesting to review the process and ongoing dialog that accompanied our proposal for a KBpedia ID as a property in Wikidata. As of one week ago, KBpedia ID was assigned Wikidata property P8408. To date, more than 60% of Wikidata properties have been such external identifiers, and IDs are the fastest-growing category of properties. Since most properties that relate to internal entity characteristics have already been identified and adopted, we anticipate mappings to external systems will continue to be a dominant feature of the growth in Wikidata properties to come.

Our Mapping Experience

There are many individuals who spend considerable time monitoring and overseeing Wikidata. I am not one of them. I had never before proposed a new property to Wikidata, and had only proposed one actual Q item (Q is the standard prefix for an entity or concept in Wikidata) for KBpedia prior to proposing our new property.

Like much else in the Wikimedia ecosystem, there are specific templates in place for proposing a new Q item or a new property (see the examples of external identifiers). Since there are about 10,000 times more Q items than properties, the path for getting a new property approved is more stringent.

Then, once a new property is granted, there are specific paths like QuickStatements or others that need to be followed in order to submit new data items (Q ids) or characteristics (property assignments by Q id). I made some newbie mistakes in my first bulk submissions, and fortunately had a knowledgeable administrator (@Epidosis) help guide me through making the fixes. For example, we had to back off about 10,000 updates because I had used the wrong form for referencing a claim. Once those were corrected, we were able to upload the mappings again.

As one might imagine, updates and changes are being submitted by multiple human agents and (some) bots at all times into the system. The facilities like QuickStatements are designed to enable batch uploads, and allow re-submissions due to errors. You might want to see what is currently active on the system by checking out this current status.

With multiple inputs and submitters, it takes time for large claims to be uploaded. In the case of our 40,000 mappings, we also accompanied each of those with a source and update date characterization, leading to a total upload of more than 120,000 claims. We split our submissions over multiple parts or batches, and then re-submitted if initial claims errored out (for example, if the base claim had not been fully registered, the next subsidiary claims might error due to lack of the registered subject; upon a second pass, the subject would be there and so no error). We ran our batches at off times for both Europe and North America, but the runs still took about 12 hours in total.
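For readers curious what such a batch looks like, here is a hedged sketch of generating claim lines in Python, assuming QuickStatements' V1 tab-separated item / property / value format (consult the QuickStatements documentation for the exact syntax of references and dates, which are omitted here). The QIDs and KBpedia IDs are placeholders.

```python
# Placeholder mappings from Wikidata QIDs to KBpedia reference concept IDs
mappings = {
    "Q146": "Cat",
    "Q144": "Dog",
}

with open("kbpedia_id_batch.tsv", "w", encoding="utf-8") as f:
    for qid, kbpedia_id in mappings.items():
        # String-valued claims are quoted; fields are tab-separated
        f.write(f'{qid}\tP8408\t"{kbpedia_id}"\n')
```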

Once loaded, the internal quality controls of Wikidata kick in. There are both bots and human editors that monitor concepts, both of which can flag (and revert) the mapping assignments made. After three days of being active on Wikidata, we had a dozen reverts of the initially uploaded mappings, representing about 0.03% of our suggested mappings, which is gratifyingly low. Still, we expect to hear of more such errors, and we are committed to fixing all that are identified. But, at this juncture, it appears our initial mappings were of pretty high quality.

We had a rewarding learning experience in uploading mappings to Wikidata and found much good will and assistance from knowledgeable members. Undoubtedly, everything should be checked in advance to ensure quality assertions when preparing uploads to Wikidata. But, if that is done, the system and its editors also appear quite capable of identifying and enforcing quality controls and constraints as encountered. Overall, I found the entire data upload process to be impressive and rewarding. I am quite optimistic that this ecosystem will continue to improve moving forward.

The result of our external ID uploads and mappings can be inspected with SPARQL queries against the KBpedia ID property on Wikidata.
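For example, a simple count of items carrying the property can be run in Python with the SPARQLWrapper library against the public query endpoint (the user-agent string below is just a placeholder):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="kbpedia-mapping-check/0.1")  # placeholder agent
sparql.setQuery("""
SELECT (COUNT(?item) AS ?count)
WHERE { ?item wdt:P8408 ?kbpediaID . }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
print(results["results"]["bindings"][0]["count"]["value"])
```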

As of this writing, the KBpedia ID is now about the 500th most prevalent property on Wikidata.

What is Next?

Wikidata is clearly a dynamic data environment. Not only are new items being added by the millions, but existing items are being better characterized and related to external sources. Dealing with the immense scales involved requires automated quality-checking bots along with human editors committed to the data integrity of their domains and items. To engage in a large-scale mapping such as KBpedia’s also requires a commitment to the Wikidata ecosystem and model.

Initiatives that appear immediately relevant to what we have put in place relating to Wikidata include the following:

  • Extend the current direct KBpedia mappings to fix initial mis-assignments and to extend coverage to remaining sections of KBpedia
  • Add additional cross-mappings that exist in KBpedia but have not yet been asserted in Wikidata (for example, there are nearly 6,000 such UNSPSC IDs)
  • Add equivalent class (P1709) and possible superproperties (P2235) and subproperties (P2236) already defined in KBpedia
  • Where useful mappings are desirable, add missing Q items used in KBpedia to Wikidata
  • And, most generally, also extend mappings to the 5,000 or so shared properties between Wikidata and KBpedia.

I have been impressed as a user of Wikidata for some years now. This most recent experience also makes me enthused about contributing data and data characterizations directly.

To Learn More

The KBpedia Web site provides a working KBpedia explorer and demo of how the system may be applied to local content for tagging or analysis. KBpedia splits between entities and concepts, on the one hand, and splits in predicates based on attributes, external relations, and pointers or indexes, all informed by Charles Peirce’s writings related to knowledge representation. KBpedia was first released in October 2016 with some open source aspects, and was made fully open in 2018. KBpedia is partially sponsored by Cognonto Corporation. All resources are available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Posted: June 15, 2020

New Version Finally Meets the Hurdle of Initial Vision

I am pleased to announce that we released a powerful new version of KBpedia today with e-commerce and logistics capabilities, as well as other significant refinements. The enhancement comes from adding the United Nations Standard Products and Services Code (UNSPSC) as KBpedia’s seventh core knowledge base. UNSPSC is a comprehensive and logically organized taxonomy for products and services, organized into four levels, with third-party crosswalks to economic and demographic data sources. It is a leading standard for many industrial, e-commerce, and logistics applications.

This was a heavy lift for us. Given the time and effort involved, Fred Giasson, KBpedia’s co-editor, and I decided to also tackle a host of other refinements we had on our plate. All told, we devoted many thousands of person-hours and more than 200 complete builds from scratch to bring this new version to fruition. Proudly I can say that this version finally meets the starting vision we had when we first began KBpedia’s development. It is a solid baseline to build from for all sorts of applications and to make broad outreach for adoption in 2020. Because of the extent of changes in this new version, we have leapfrogged KBpedia’s version numbering from 2.21 to 2.50.

KBpedia is a knowledge graph that provides a computable overlay for interoperating and conducting machine learning across its constituent public knowledge bases of Wikipedia, Wikidata, GeoNames, DBpedia, schema.org, OpenCyc, and, now, UNSPSC. KBpedia now contains more than 58,000 reference concepts and their mappings to these knowledge bases, structured into a logically consistent knowledge graph that may be reasoned over and manipulated. KBpedia acts as a computable scaffolding over these broad knowledge bases with the twin goals of data interoperability and knowledge-based artificial intelligence (KBAI).

KBpedia is built from an expandable set of simple text ‘triples’ files, specified as tuples of subject-predicate-object (EAVs to some, such as Kingsley Idehen), that enable the entire knowledge graph to be constructed from scratch. This process enables many syntax and logical tests, especially consistency, coherency, and satisfiability, to be invoked at build time. A build may take from one to a few hours on a commodity workstation, depending on the tests. The build process outputs validated ontology (knowledge graph) files in the standard W3C OWL 2 semantic language and mappings to individual instances in the contributing knowledge bases.
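As a hedged sketch of the kind of logic test invoked at build time, the snippet below loads a built OWL file with owlready2 and asks its bundled HermiT reasoner whether any classes are unsatisfiable. A local Java runtime is required, the file path is a placeholder, and this is not the project's actual build code.

```python
from owlready2 import get_ontology, sync_reasoner, default_world

onto = get_ontology("file:///builds/kbpedia-reference-concepts.owl").load()

with onto:
    # Raises an error if the ontology as a whole is inconsistent
    sync_reasoner()

# Any classes inferred to be equivalent to Nothing are unsatisfiable
print("unsatisfiable classes:", list(default_world.inconsistent_classes()))
```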

As Fred notes, we continue to streamline and improve our build procedures. Major changes like what we have just gone through, be it adding a main source like UNSPSC or swapping out or adding a new SuperType (or typology), often require multiple build iterations to pass the system’s consistency and satisfiability checks. We need these build processes to be as easy and efficient as possible, which also was a focus of our latest efforts. One of our next major objectives is to release KBpedia’s build and maintenance codes, perhaps including a Python option.

Incorporation of UNSPSC

Though UNSPSC is consistent with KBpedia’s existing three-sector economic model (raw products, manufactured products, services), adding it did require structural changes throughout the system. With more than 150,000 listed products and services in UNSPSC, incorporating it needed to be balanced against KBpedia’s existing generality and scope. The approach was to include 100% of the top three levels of UNSPSC (segments, families, and classes), plus more common and expected product and service ‘commodities’ in its fourth level. This design maintains balance while providing a framework to tie in any remaining UNSPSC commodities of interest to specific domains or industries. This approach led to integrating 56 segments, 412 families, 3,700+ classes, and 2,400+ commodities into KBpedia. Since some 1,300 of these additions overlapped with existing KBpedia reference concepts, we checked, consolidated, and reconciled all duplicates.

We fully specified and integrated all added reference concepts (RCs) into the existing KBpedia structure, and then mapped these new RCs to all seven of KBpedia’s core knowledge bases. Through this process, for example, we are able to greatly expand the coverage of UNSPSC items on Wikidata from 1,000 or so Q (entity) identifiers to more than 6,500. Contributing such mappings back to the community is another effort our KBpedia project will undertake next.

Lastly with respect to UNSPSC, I will be providing a separate article on why we selected it as KBpedia’s products and services template, how we did the integration, and what we found along the way. For now, the quick point is that UNSPSC is well-structured and organized according to the three-sector model of the economy, which matches well with Peirce’s three universal categories underlying our design of KBpedia.

Other Major Refinements

These changes were broad in scope. Effecting them took time and broke open core structures. Opportunities to rebuild the structure in cleaner ways arise when the Tinkertoys get scattered and then re-assembled. Some of the other major refinements the project undertook during the builds necessary to create this version were to:

  • Further analyze and refine the disjointedness between KBpedia’s 70 or so typologies. Disjoint assertions are a key mechanism for sub-set selections, various machine learning tasks, querying, and reasoning (see the short sketch following this list)
  • Increase the number of disjointedness assertions by 62% over the prior version, resulting in better modularity. (However, note that the number of RCs actually affected by these improvements is lower than this percentage suggests, since many were already specified in prior disjoint pools)
  • Add 37% more external mappings to the system (DBpedia and UNSPSC, principally)
  • Complete 100% of the definitions for RCs across KBpedia
  • Greatly expand the altLabel entries for thousands of RCs
  • Improve the naming consistency across RC identifiers
  • Further clean the structure to ensure that a given RC is specified only once to its proper parent in an inheritance (subsumption) chain, which removes redundant assertions and improves maintainability, readability, and inference efficiency
  • Expand and update the explanations within the demo of the upper KBpedia Knowledge Ontology (KKO) (see kko-demo.n3). This non-working ontology included in the distro makes it easier to relate the KKO upper structure to the universal categories of Charles Sanders Peirce, which provides the basic organizational framework for KKO and KBpedia, and
  • Integrate the mapping properties for core knowledge bases within KBpedia’s formal ontology (as opposed to only offering as separate mapping files); see kbpedia-reference-concepts-mappings.n3 in the distro.
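The disjointness sketch referenced in the first bullet above might look like the following in Python with owlready2; the toy typology class names are placeholders, not the actual KBpedia typologies.

```python
from owlready2 import get_ontology, Thing, AllDisjoint

onto = get_ontology("http://example.org/toy-kbpedia#")

with onto:
    class Animals(Thing): pass
    class Facilities(Thing): pass
    class Products(Thing): pass

    # With this assertion, a reasoner will flag any class placed under
    # two of these typologies at once
    AllDisjoint([Animals, Facilities, Products])
```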

Current Status of the Knowledge Graph

The result of these structural and scope changes was to add about 6,000 new reference concepts to KBpedia and then remove the duplicates, resulting in a total of more than 58,200 RCs in the system. This has increased KBpedia’s size by about 9% over the prior release. KBpedia is now structured into about 73 mostly disjoint typologies under the scaffolding of the KKO upper ontology. KBpedia has fully vetted, unique mappings (nearly all one-to-one) to these key sources:

  • Wikipedia – 53,323 (including some categories)
  • DBpedia – 44,476
  • Wikidata – 43,766
  • OpenCyc – 31,154
  • UNSPSC – 6,553
  • schema.org – 842
  • DBpedia ontology – 764
  • GeoNames – 680
  • Extended vocabularies – 249.

The mappings to Wikidata alone link to more than 40 million unique Q instance identifiers. These mappings may be found in the KBpedia distro. Most of the class mappings are owl:equivalentClass, but a minority are subClass, superClass, or isAbout predicates.

KBpedia also includes about 5,000 properties, organized into a multi-level hierarchy of attributes, external relations, and representations, most derived from Wikidata and schema.org. Exploiting these properties and sub-properties is also one of the next priorities for KBpedia.

To Learn More

The KBpedia Web site provides a working KBpedia explorer and demo of how the system may be applied to local content for tagging or analysis. KBpedia splits between entities and concepts, on the one hand, and splits in predicates based on attributes, external relations, and pointers or indexes, all informed by Charles Peirce’s prescient theories of knowledge representation. Mappings to all external sources are provided in the linkages to the external resources file in the KBpedia downloads. (A larger inferred version is also available.) The external sources keep their own record files; KBpedia distributions provide the links. However, you can access these entities through the KBpedia explorer on the project’s Web site (see these entity examples for cameras, cakes, and canyons; clicking on any of the individual entity links will bring up the full instance record. Such reach-throughs are straightforward to construct.) See the GitHub site for further downloads.

KBpedia was first released in October 2016 with some open source aspects, and was made fully open in 2018. KBpedia is partially sponsored by Cognonto Corporation. All resources are available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Posted: February 5, 2020

First in an Occasional Series of KBpedia Best Practices

One of my favorite sayings regarding the semantic Web is from James Hendler, now a professor and program director at RPI, but a longstanding contributor to the semantic space, including, among other notable contributions, as a co-author of the seminal paper, “The Semantic Web,” in Scientific American in 2001. His statement was “A little semantics goes a long way,” and I wholeheartedly support that view. I previously gave a shoutout to this saying in my book [1]. In this ‘best practice’ note regarding KBpedia and creating and maintaining knowledge graphs, I want to point out two simple techniques that can immediately benefit your own knowledge representation efforts.

The two items I want to highlight are the use of ‘semsets’ (similar to the synsets used by WordNet) and emphasizing subsumption hierarchies in your knowledge graph design. The actual practice of these items involves, as much as anything, embracing a mindset that is attentive to the twin ideas of semantics and inference.

With this article, I’m also pleased to introduce an occasional series on best practices when creating, applying or maintaining knowledge graphs, using KBpedia as the reference knowledge system. I will be presenting this series throughout 2020 coincident with some exciting expansions and applications of the system. These ‘best practice’ articles are not intended to be the detailed pieces that are my normal practice. Rather, I try to present a brief overview of the item, and then describe the process and benefits of applying it.

Semsets

The fundamental premise of semantic technologies is “things, not strings.” Labels are only the pointers to a thing, and things may be referred to in many different ways, including, of course, many different languages. Is your ‘happy’ the same as my ‘glad’? Examples abound, as language is an ambiguous affair with meaning often dependent on context.

A single term can refer to different things and a single thing can be (and is!!) referred to by many different labels. The lexical database of WordNet helped attack this problem decades ago by creating what it called ‘synsets’ to aggregate the multiple ways (terms) by which a given thing may be referred to. The portmanteau of this name comes from the ‘synset’ being an aggregation of synonyms. In keeping with Charles Peirce’s framing of indexes to a given thing as anything which points to or draws attention to it, we have broadened the idea to include any term or phrase that points to a given thing. This is a broadened semantic sense, so we have given this aggregation of terms the name ‘semset’, a portmanteau using semantics. Elsewhere [2], I have very broadly defined a semset as including: synonyms, abbreviations, acronyms, aliases, argot, buzzwords, cognomens, derogatives, diminutives, epithets, hypocorisms, idioms, jargon, lingo, metonyms, misspellings, nicknames, non-standard terms (e.g., Twitter), pejoratives, pen names, pseudonyms, redirects, slang, sobriquets, stage names, or synsets. Note this listing is itself a semset for semset.

So, the best practice is this. Whenever adding a new relation or entity or concept to a knowledge graph, give it as broad an enumeration of its semset as you can assemble with reasonable effort [3]. Redirects in Wikipedia and altLabels from Wikidata are two useful starting sources. (You may need to discover other sources for specific domains.) You can see these via the altLabels within the KBpedia knowledge base; see, as examples, abominable snowman, bird, or cake. altLabels are one of the many useful constructs in the SKOS (Simple Knowledge Organization System) RDF language, another best practice to apply to your knowledge graphs.

Then, when querying or retrieving data, one can specify standard prefLabels alone (the single, canonical identifier for the entity) for narrow retrievals, or greatly broaden the query by including the altLabels. In our own deployments, we also often include a standard text search engine such as Lucene or Elasticsearch for such retrievals, which opens up even more control and flexibility. Semsets are an easily deployed way to bridge your semantics from ‘strings’ to ‘things’.
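As a small sketch of that broadened retrieval, the following SPARQL query (run here with rdflib against a local copy; the file name is a placeholder) matches a concept by either its prefLabel or any of its altLabels:

```python
from rdflib import Graph

g = Graph()
g.parse("kbpedia-reference-concepts.n3", format="n3")   # placeholder file name

query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?concept
WHERE {
  { ?concept skos:prefLabel ?lbl } UNION { ?concept skos:altLabel ?lbl }
  FILTER (LCASE(STR(?lbl)) = "abominable snowman")
}
"""

for row in g.query(query):
    print(row.concept)
```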

Subsumption Hierarchies

Subsumption hierarchies simply mean that a parent concept embraces (or ‘subsumes’) child concepts [4]. The subsumption relationship can be one of intensionality, extensionality, inheritance, or mereology. In intensionality, the child has attributes embraced by the parent, such as a bear having hair like other mammals. In extensionality, class members belong to an enumerable group, as in lions and tigers and bears all being mammals. In inheritance, an actual child is subsumed under a parent. In mereology, a composite thing like a car engine has parts such as pistons, rods, or timing device. In the W3C standards of RDF or OWL, what we use in KBpedia to capture our semantic knowledge representations, the ‘class’ construct and its related properties are used to express subsumption hierarchies.

The ‘hierarchy’ idea arises from establishing a tree scaffolding of linked items. In this way, subsets of your knowledge graph resemble taxonomies (or tree-like structures) that proceed from the most general at the top (the ‘root’) to the most specific at the bottom (the ‘leaf’). Different types of subsumption relationships are best represented by their own trees. Using such subsumption relations does not preclude other connections or relations in your knowledge graph.

When consistently and logically constructed, a practice that can be learned and can be tested, subsumption hierarchies enable one to infer class memberships. For instance, using the ‘mammal’ example means we can infer a bear is a mammal without so specifying, or, alternatively, we can discover that lions and tigers are also mammals if we know that a bear is a mammal. Subsumption hierarchies are an efficient way to specify group memberships, and a powerful way to overcome imprecise query specifications or to discover implicit relationships.
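A minimal sketch of that kind of inference uses a SPARQL 1.1 property path, which walks the whole rdfs:subClassOf chain so that ancestors such as ‘mammal’ are recovered without being asserted directly on the concept; the concept IRI below is a placeholder patterned on KBpedia's reference concept identifiers.

```python
from rdflib import Graph

g = Graph()
g.parse("kbpedia-reference-concepts.n3", format="n3")   # placeholder file name

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?ancestor
WHERE {
  # The * operator follows zero or more subClassOf links
  <http://kbpedia.org/kko/rc/Bear> rdfs:subClassOf* ?ancestor .
}
"""

for row in g.query(query):
    print(row.ancestor)
```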

Semsets and subsumption hierarchies are easy techniques for incorporating semantics into your knowledge graphs. These two simple techniques (among a few others) readily demonstrate the truth of Hendler’s “a little semantics goes a long way” in improving your knowledge representations.

NOTE: This is the first article in an occasional series about KBpedia best practices to coincide with new advances, uses, and applications of KBpedia throughout 2020.


[1] Bergman, M. K. Building Out the System. in A Knowledge Representation Practionary: Guidelines Based on Charles Sanders Peirce (ed. Bergman, M. K.) 273–294 (Springer International Publishing, 2018). doi:10.1007/978-3-319-98092-8_13
[2] See the Glossary in [1].
[3] SKOS also provides a property for capturing misspellings (hiddenLabel), which is a best practice to include, and the W3C standards allow for internationalization of all labels by use of the language tag for labels.
[4] In actual language use, one can say a parent ‘subsumes’ a child. Alternatively, one can say a child ‘is subsumed by’ or ‘is subsumed under’ the parent.

Posted by AI3's author, Mike Bergman, on February 5, 2020 at 12:22 pm in KBpedia Best Practices | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2293/a-little-semantics-goes-a-long-way/