Posted: July 13, 2020

KBpedia: New KBpedia ID Property and Mappings Added to Wikidata

Wikidata editors last week approved adding a new KBpedia ID property (P8408) to their system, and we have therefore followed up by adding nearly 40,000 mappings to the Wikidata knowledge base. We have another 5000 to 6000 mappings still forthcoming that we will be adding in the coming weeks. Thereafter, we will continue to increase the cross-links, as we partially document below.

This milestone is one I have had in my sights for at least the past few years. We want to both: 1) provide a computable overlay to Wikidata; and 2) increase our reliance on and use of Wikidata’s unparalleled structured data resources as we move KBpedia forward. Below I give a brief overview of the status of Wikidata, share some high-level views of our experience in putting forward and then mapping a new Wikidata property, and conclude with some thoughts on where we might go next.

The Status of Wikidata

Wikidata is the structured data initiative of the Wikimedia Foundation, the open-source group that oversees Wikipedia and many other notable Web-wide information resources. Since its founding in 2012, Wikidata’s use and prominence have exceeded expectations. Today, Wikidata is a multi-lingual resource with structured data for more than 95 million items, characterized by nearly 10,000 properties. Items are being added to Wikidata at a rate of nearly 5 million per month. A rich ecosystem of human editors and bots patrols the knowledge base and its entries to enforce data quality and consistency. The ecosystem includes tools for bulk loading of data with error checks, search including structured SPARQL queries, and navigation and visualization. Errors and mistakes in the data occur, but the system ensures such problems are removed or corrected as discovered. Thus, as the data has grown, so have its quality and usefulness.

From KBpedia’s standpoint, Wikidata represents the most complete complementary instance data and characterization resource available. As such, it is the driving wheel and stalking horse (to mix eras and technologies) to guide where and how we need to incorporate data and its types. These have been the overall motivators for us to embrace a closer relationship with Wikidata.

As an open governance system, Wikidata has adopted its own data models, policies, and approval and ingest procedures for adopting new data or characterizations (properties). You might find it interesting to review the process and ongoing dialog that accompanied our proposal for a KBpedia ID as a property in Wikidata. As of one week ago, KBpedia ID was assigned Wikidata property P8408. To date, more than 60% of Wikidata properties have been such external identifiers, and IDs are the largest growing category of properties. Since most properties that relate to internal entity characteristics have already been identified and adopted, we anticipate mappings to external systems will continue to be a dominant feature of the growth in Wikidata properties to come.

Our Mapping Experience

There are many individuals who spend considerable time monitoring and overseeing Wikidata. I am not one of them. I had never before proposed a new property to Wikidata, and had only proposed one actual Q item (Q is the standard prefix for an entity or concept in Wikidata) for KBpedia prior to proposing our new property.

Like much else in the Wikimedia ecosystem, there are specific templates put in place for proposing a new Q item or P proposal (see the examples of external identifiers, here). Since there are about 10,000 times more Q items than properties, the path for getting a new property approved is more stringent.

Then, once a new property is granted, there are specific paths like QuickStatements or others that need to be followed in order to submit new data items (Q ids) or characteristics (property by Q ids). I made some newbie mistakes in my first bulk submissions, and fortunately had a knowledgeable administrator (@Epidosis) help guide me through making the fixes. For example, we had to back off about 10,000 updates because I had used the wrong form for referencing a claim. Once those claims were corrected, we were able to upload the mappings again.

As one might imagine, updates and changes are being submitted by multiple human agents and (some) bots at all times into the system. The facilities like QuickStatements are designed to enable batch uploads, and allow re-submissions due to errors. You might want to see what is currently active on the system by checking out this current status.

With multiple inputs and submitters, it takes time for large sets of claims to be uploaded. In the case of our 40,000 mappings, we also accompanied each of those with a source and update data characterization, leading to a total upload of more than 120,000 claims. We split our submissions over multiple parts or batches, and then re-submitted if initial claims errored out (for example, if the base claim had not been fully registered, the next subsidiary claims might error due to lack of the registered subject; upon a second pass, the subject would be there and so no error). We ran our batches at off times for both Europe and North America, but the runs still took about 12 hours in total.
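The shape of such a batch upload can be sketched roughly as follows. This is a minimal illustration, not the script we actually ran: it assumes the QuickStatements V1 tab-separated syntax, and it assumes the source and date characterizations were expressed as "stated in" (P248) and "retrieved" (P813) references. The Q IDs, KBpedia identifiers, and source item below are made-up placeholders.

```python
from datetime import date

KBPEDIA_ID = "P8408"    # KBpedia ID property (from the post)
STATED_IN = "S248"      # reference form of P248 "stated in" (assumed choice)
RETRIEVED = "S813"      # reference form of P813 "retrieved" (assumed choice)
SOURCE_ITEM = "Q00000"  # hypothetical placeholder for the item describing the source
CLAIMS_PER_MAPPING = 3  # base claim + source + date


def statement_line(qid, kbpedia_id, retrieved):
    """Build one tab-separated QuickStatements V1 line with two reference parts."""
    day = f"+{retrieved.isoformat()}T00:00:00Z/11"  # /11 means day precision
    return "\t".join(
        [qid, KBPEDIA_ID, f'"{kbpedia_id}"', STATED_IN, SOURCE_ITEM, RETRIEVED, day]
    )


def batches(lines, size):
    """Split the upload into fixed-size batches that can be re-run after errors."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


# Placeholder mappings: neither the Q IDs nor the KBpedia IDs are real assignments
mappings = {
    "Q111": "ExampleConceptA",
    "Q222": "ExampleConceptB",
    "Q333": "ExampleConceptC",
}
lines = [statement_line(q, k, date(2020, 7, 13)) for q, k in mappings.items()]
parts = batches(lines, size=2)

print(lines[0])
print(len(parts))                   # 2
print(40_000 * CLAIMS_PER_MAPPING)  # 120000
```

Keeping each batch small makes a second pass cheap when a subsidiary claim fails because its base claim has not yet registered.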

Once loaded, the internal quality controls of Wikidata kick in. There are both bots and human editors that monitor concepts, both of which can flag (and revert) the mapping assignments made. After three days of being active on Wikidata, we had a dozen reverts of initial uploaded mappings, representing about 0.03% of our suggested mappings, which is gratifyingly low. Still, we expect to hear of more such errors, and we are committed to fixing all that are identified. But, at this juncture, it appears our initial mappings were of pretty high quality.

Uploading mappings to Wikidata was a rewarding learning experience, and we found much good will and assistance from knowledgeable members. Undoubtedly, everything should be checked in advance to ensure quality assertions when preparing uploads to Wikidata. But, once that is done, the system and its editors also appear quite capable of identifying and enforcing quality control and constraints as issues are encountered. Overall, I found the entire data upload process to be impressive and rewarding. I am quite optimistic that this ecosystem will continue to improve.

The result of our external ID uploads and mappings can be seen in these SPARQL queries regarding the KBpedia ID property on Wikidata:
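As a minimal sketch of what such queries can look like, assuming the public Wikidata Query Service endpoint and its predefined wdt: prefix (these are illustrative queries, not necessarily the ones referenced above):

```python
import urllib.parse

ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata query service

# Count the items that carry a KBpedia ID (P8408) statement
COUNT_QUERY = """
SELECT (COUNT(DISTINCT ?item) AS ?mapped) WHERE {
  ?item wdt:P8408 ?kbpediaId .
}
"""

# List a handful of items together with their KBpedia ID values
SAMPLE_QUERY = """
SELECT ?item ?kbpediaId WHERE {
  ?item wdt:P8408 ?kbpediaId .
}
LIMIT 10
"""


def query_url(sparql):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {"query": sparql.strip(), "format": "json"}
    return ENDPOINT + "?" + urllib.parse.urlencode(params)


print(query_url(COUNT_QUERY))
```

The printed URL can be opened in a browser or fetched with any HTTP client; the same query text can also be pasted directly into the query service's web form.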

As of this writing, the KBpedia ID is now about the 500th most prevalent property on Wikidata.

What is Next?

Wikidata is clearly a dynamic data environment. Not only are new items being added by the millions, but existing items are being better characterized and related to external sources. Dealing with the immense scales involved requires automated quality-checking bots along with human editors committed to the data integrity of their domains and items. Engaging in a large-scale mapping such as KBpedia's also requires a commitment to the Wikidata ecosystem and model.

Initiatives that appear immediately relevant to what we have put in place relating to Wikidata are to:

  • Extend the current direct KBpedia mappings to fix initial mis-assignments and to extend coverage to remaining sections of KBpedia
  • Add additional cross-mappings that exist in KBpedia but have not yet been asserted in Wikidata (for example, there are nearly 6,000 such UNSPSC IDs)
  • Add equivalent class (P1709) and possible superproperties (P2235) and subproperties (P2236) already defined in KBpedia
  • Where useful mappings are desirable, add missing Q items used in KBpedia to Wikidata
  • And, most generally, also extend mappings to the 5,000 or so shared properties between Wikidata and KBpedia.

I have been impressed as a user of Wikidata for some years now. This most recent experience also makes me enthused about contributing data and data characterizations directly.

To Learn More

The KBpedia Web site provides a working KBpedia explorer and demo of how the system may be applied to local content for tagging or analysis. KBpedia splits between entities and concepts, on the one hand, and splits in predicates based on attributes, external relations, and pointers or indexes, all informed by Charles Peirce’s writings related to knowledge representation. KBpedia was first released in October 2016 with some open source aspects, and was made fully open in 2018. KBpedia is partially sponsored by Cognonto Corporation. All resources are available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Posted: June 15, 2020

KBpedia: New Version Finally Meets the Hurdle of Initial Vision

I am pleased to announce that we released a powerful new version of KBpedia today with e-commerce and logistics capabilities, as well as significant other refinements. The enhancement comes from adding the United Nations Standard Products and Services Code as KBpedia’s seventh core knowledge base. UNSPSC is a comprehensive and logically organized taxonomy for products and services, organized into four levels, with third-party crosswalks to economic and demographic data sources. It is a leading standard for many industrial, e-commerce, and logistics applications.

This was a heavy lift for us. Given the time and effort involved, Fred Giasson, KBpedia’s co-editor, and I decided to also tackle a host of other refinements we had on our plate. All told, we devoted many thousands of person-hours and more than 200 complete builds from scratch to bring this new version to fruition. Proudly I can say that this version finally meets the starting vision we had when we first began KBpedia’s development. It is a solid baseline to build from for all sorts of applications and to make broad outreach for adoption in 2020. Because of the extent of changes in this new version, we have leapfrogged KBpedia’s version numbering from 2.21 to 2.50.

KBpedia is a knowledge graph that provides a computable overlay for interoperating and conducting machine learning across its constituent public knowledge bases of Wikipedia, Wikidata, GeoNames, DBpedia, schema.org, OpenCyc, and, now, UNSPSC. KBpedia now contains more than 58,000 reference concepts and their mappings to these knowledge bases, structured into a logically consistent knowledge graph that may be reasoned over and manipulated. KBpedia acts as a computable scaffolding over these broad knowledge bases with the twin goals of data interoperability and knowledge-based artificial intelligence (KBAI).

KBpedia is built from an expandable set of simple text ‘triples’ files, specified as tuples of subject-predicate-object (EAVs to some, such as Kingsley Idehen) that enable the entire knowledge graph to be constructed from scratch. This process enables many syntax and logical tests, especially consistency, coherency, and satisfiability, to be invoked at build time. A build may take from one to a few hours on a commodity workstation, depending on the tests. The build process outputs validated ontology (knowledge graph) files in the standard W3C OWL 2 semantic language and mappings to individual instances in the contributing knowledge bases.
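To make the idea of building from triples files concrete, here is a minimal sketch. It assumes a comma-separated layout with one subject, predicate, and object per line; the actual KBpedia file layout and identifiers are not shown in this post, so the sample rows and the helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; identifiers and layout are illustrative only
SAMPLE = """\
rc:Mammal,rdfs:subClassOf,rc:Animal
rc:Dog,rdfs:subClassOf,rc:Mammal
rc:Dog,skos:prefLabel,dog
"""


def load_triples(text):
    """Parse subject,predicate,object rows, rejecting malformed lines (a syntax test)."""
    triples = []
    for n, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"line {n}: expected 3 fields, got {len(row)}")
        triples.append(tuple(field.strip() for field in row))
    return triples


def build_graph(triples):
    """Index the triples as subject -> predicate -> set of objects."""
    graph = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        graph[s][p].add(o)
    return graph


graph = build_graph(load_triples(SAMPLE))
print(sorted(graph["rc:Dog"]["rdfs:subClassOf"]))  # ['rc:Mammal']
```

Because the whole graph is rebuilt from flat files each time, any malformed row stops the build early rather than surfacing later as a logical error.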

As Fred notes, we continue to streamline and improve our build procedures. Major changes like what we have just gone through, be it adding a main source like UNSPSC or swapping out or adding a new SuperType (or typology), often require multiple build iterations to pass the system’s consistency and satisfiability checks. We need these build processes to be as easy and efficient as possible, which also was a focus of our latest efforts. One of our next major objectives is to release KBpedia’s build and maintenance codes, perhaps including a Python option.
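One kind of satisfiability check can be illustrated with a small sketch: a concept that ends up beneath two typologies declared disjoint cannot have any members. The code below is a simplified stand-in under that assumption, not our build code (which relies on OWL reasoning); the concept names and function names are hypothetical.

```python
def ancestors(node, parents):
    """Collect every concept reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def disjoint_violations(parents, disjoint_pairs):
    """Map each concept to the disjoint pairs it falls under simultaneously."""
    violations = {}
    for concept in parents:
        reach = ancestors(concept, parents) | {concept}
        hits = sorted(tuple(sorted(pair)) for pair in disjoint_pairs if pair <= reach)
        if hits:
            violations[concept] = hits
    return violations


# Hypothetical fragment: two typologies asserted to be disjoint
parents = {
    "Animals": set(),
    "Facilities": set(),
    "Dog": {"Animals"},
    "Kennel": {"Facilities"},
    "DogKennel": {"Dog", "Kennel"},  # wrongly placed under both branches
}
disjoint_pairs = {frozenset({"Animals", "Facilities"})}

print(disjoint_violations(parents, disjoint_pairs))
# {'DogKennel': [('Animals', 'Facilities')]}
```

Each such hit typically means either a parent assignment or a disjointness assertion needs to change, which is why major structural changes take several build iterations.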

Incorporation of UNSPSC

Though UNSPSC is consistent with KBpedia’s existing three-sector economic model (raw products, manufactured products, services), adding it did require structural changes throughout the system. With more than 150,000 listed products and services in UNSPSC, incorporating it needed to balance with KBpedia’s existing generality and scope. The approach was to include 100% of the top three levels of UNSPSC — segments, families, and classes — plus more common and expected product and service ‘commodities’ in its fourth level. This design maintains balance while providing a framework to tie-in any remaining UNSPSC commodities of interest to specific domains or industries. This approach led to integrating 56 segments, 412 families, 3700+ classes, and 2400+ commodities into KBpedia. Since some 1300 of these additions overlapped with existing KBpedia reference concepts, we checked, consolidated, and reconciled all duplicates.
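The selection rule can be sketched in a few lines. The sketch assumes the commonly published eight-digit UNSPSC code layout, with two digits per level and trailing zero pairs for unused lower levels; that layout is not spelled out in this post, and the example codes and the `keep` helper are illustrative only.

```python
LEVELS = ("segment", "family", "class", "commodity")


def unspsc_level(code):
    """Return the level of an 8-digit code from its deepest non-zero digit pair."""
    if len(code) != 8 or not code.isdigit():
        raise ValueError(f"expected an 8-digit code, got {code!r}")
    pairs = [code[i:i + 2] for i in range(0, 8, 2)]
    depth = max((i for i, pair in enumerate(pairs) if pair != "00"), default=-1)
    if depth < 0:
        raise ValueError("code has no non-zero level")
    return LEVELS[depth]


def keep(code, selected_commodities):
    """Keep all segments, families, and classes; keep only selected commodities."""
    return unspsc_level(code) != "commodity" or code in selected_commodities


codes = ["43000000", "43210000", "43211500", "43211503", "43211507"]
selected = {"43211503"}  # hypothetical choice of a 'common' commodity

print([unspsc_level(c) for c in codes])
print([c for c in codes if keep(c, selected)])
```

Under this rule the top three levels always pass through, while fourth-level commodities pass only if they appear in the curated selection.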

We fully specified and integrated all added reference concepts (RCs) into the existing KBpedia structure, and then mapped these new RCs to all seven of KBpedia’s core knowledge bases. Through this process, for example, we are able to greatly expand the coverage of UNSPSC items on Wikidata from 1000 or so Q (entity) identifiers to more than 6500. Contributing such mappings back to the community is another effort our KBpedia project will undertake next.

Lastly with respect to UNSPSC, I will be providing a separate article on why we selected it as KBpedia’s products and services template, and how we did the integration and what we found as we did. For now, the quick point is that UNSPSC is well-structured and organized according to the three-sector model of the economy, which matches well with Peirce’s three universal categories underlying our design of KBpedia.

Other Major Refinements

These changes were broad in scope. Effecting them took time and broke open core structures. Opportunities to rebuild the structure in cleaner ways arise when the Tinkertoys get scattered and then re-assembled. Some of the other major refinements the project undertook during the builds necessary to create this version were to:

  • Further analyze and refine the disjointedness between KBpedia’s 70 or so typologies. Disjoint assertions are a key mechanism for sub-set selections, various machine learning tasks, querying, and reasoning
  • Increase the number of disjointedness assertions 62% over the prior version, resulting in better modularity. (However, note that the number of RCs actually affected by these improvements is lower than this percentage suggests, since many were already specified in prior disjoint pools)
  • Add 37% more external mappings to the system (DBpedia and UNSPSC, principally)
  • Complete 100% of the definitions for RCs across KBpedia
  • Greatly expand the altLabel entries for thousands of RCs
  • Improve the naming consistency across RC identifiers
  • Further clean the structure to ensure that a given RC is specified only once to its proper parent in an inheritance (subsumption) chain, which removes redundant assertions and improves maintainability, readability, and inference efficiency
  • Expand and update the explanations within the demo of the upper KBpedia Knowledge Ontology (KKO) (see kko-demo.n3). This non-working ontology included in the distro makes it easier to relate the KKO upper structure to the universal categories of Charles Sanders Peirce, which provides the basic organizational framework for KKO and KBpedia, and
  • Integrate the mapping properties for core knowledge bases within KBpedia’s formal ontology (as opposed to only offering as separate mapping files); see kbpedia-reference-concepts-mappings.n3 in the distro.
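The clean-up of redundant parent assertions mentioned in the list above can be illustrated with a small sketch. It flags a direct parent as redundant when that parent is already implied through another direct parent; this is a simplified illustration with hypothetical concept and function names, not the procedure we actually used.

```python
def ancestors(node, parents):
    """Collect every concept reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def redundant_parents(parents):
    """Return (child, parent) pairs already implied via another direct parent."""
    redundant = set()
    for child, direct in parents.items():
        for parent in direct:
            others = direct - {parent}
            if any(parent in ancestors(other, parents) for other in others):
                redundant.add((child, parent))
    return redundant


# Hypothetical fragment: Dog is asserted under both Mammal and Animal
parents = {
    "Animal": set(),
    "Mammal": {"Animal"},
    "Dog": {"Mammal", "Animal"},
}

print(redundant_parents(parents))  # {('Dog', 'Animal')}
```

Dropping the flagged pairs leaves the inferred hierarchy unchanged, since the removed links are recovered through subsumption.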

Current Status of the Knowledge Graph

The result of these structural and scope changes was to add about 6,000 new reference concepts to KBpedia, then remove the duplicates, resulting in a total of more than 58,200 RCs in the system. This has increased KBpedia’s size about 9% over the prior release. KBpedia is now structured into about 73 mostly disjoint typologies under the scaffolding of the KKO upper ontology. KBpedia has fully vetted, unique mappings (nearly all one-to-one) to these key sources:

  • Wikipedia – 53,323 (including some categories)
  • DBpedia – 44,476
  • Wikidata – 43,766
  • OpenCyc – 31,154
  • UNSPSC – 6,553
  • schema.org – 842
  • DBpedia ontology – 764
  • GeoNames – 680
  • Extended vocabularies – 249.

The mappings to Wikidata alone link to more than 40 million unique Q instance identifiers. These mappings may be found in the KBpedia distro. Most of the class mappings are owl:equivalentClass, but a minority may be subClass or superClass or isAbout predicates as well.

KBpedia also includes about 5,000 properties, organized into a multi-level hierarchy of attributes, external relations, and representations, most derived from Wikidata and schema.org. Exploiting these properties and sub-properties is also one of the next priorities for KBpedia.

To Learn More

The KBpedia Web site provides a working KBpedia explorer and demo of how the system may be applied to local content for tagging or analysis. KBpedia splits between entities and concepts, on the one hand, and splits in predicates based on attributes, external relations, and pointers or indexes, all informed by Charles Peirce’s prescient theories of knowledge representation. Mappings to all external sources are provided in the linkages to the external resources file in the KBpedia downloads. (A larger inferred version is also available.) The external sources keep their own record files. KBpedia distributions provide the links. However, you can access these entities through the KBpedia explorer on the project’s Web site (see these entity examples for cameras, cakes, and canyons; clicking on any of the individual entity links will bring up the full instance record). Such reach-throughs are straightforward to construct. See the GitHub site for further downloads.

KBpedia was first released in October 2016 with some open source aspects, and was made fully open in 2018. KBpedia is partially sponsored by Cognonto Corporation. All resources are available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Posted: December 4, 2019

KBpedia: Version 2.20 of the Knowledge Graph Now Prepped for Release on Public Repositories

Fred Giasson and I, as co-editors, are pleased to announce today the release of version 2.20 of the open-source KBpedia system. KBpedia is a knowledge graph that provides an overlay for interoperating and conducting machine learning across its constituent public knowledge bases of Wikipedia, Wikidata, schema.org, DBpedia, GeoNames, and OpenCyc. KBpedia contains more than 53,000 reference concepts and their mappings to these knowledge bases, structured into a logically consistent knowledge graph that may be reasoned over and manipulated. KBpedia acts as a computable scaffolding over these broad knowledge bases.

We are preparing to register KBpedia on many public repository sites, and we wanted to make sure quality was as high as possible as we begin this process. Since KBpedia is a system built from many constituent knowledge bases, duplicates and inconsistencies can arise when combining them. The rationale for this release was to conduct a comprehensive manual review to identify and remove most of these issues.

We made about 10,000 changes in this newest release. The major changes we made to KBpedia resulting from this inspection include:

  • Removal of about 2,000 reference concepts (RCs) and their mappings and definitions pertaining to individual plant and animal species, which was an imbalance in relation to the other generic RCs in the system;
  • Manual inspection and fixes to the 70 or so typologies (for instance, Animals or Facilities) that are used to cluster the RCs into logical groupings;
  • Removal of references to UMBEL, one of KBpedia’s earlier constituent knowledge bases, due to retirement of the UMBEL system;
  • Fixes due to user comments and suggestions since the prior release of version 2.10 in April 2019; and
  • Adding some select new RCs in order to improve the connectivity and fill gaps with the earlier version.

Without a doubt this is now the cleanest and highest quality release for the knowledge graph. We are now in position to extend the system to new mappings, which will be the focus of future releases. (Expect the next after the first of the year.) The number and structure of KBpedia’s typologies remain unchanged from prior versions. The number of RCs now stands at 53,465, smaller than the 55,301 reference concepts in the prior version.

Besides combining the six major public knowledge bases of Wikipedia, Wikidata, schema.org, DBpedia, GeoNames, and OpenCyc, KBpedia includes mappings to more than a score of additional leading vocabularies. The entire KBpedia structure is computable, meaning it can be reasoned over and logically sliced-and-diced to produce training sets and reference standards for machine learning and data interoperability. KBpedia provides a coherent overlay for retrieving and organizing Wikipedia or Wikidata content. KBpedia greatly reduces the time and effort traditionally required for knowledge-based artificial intelligence (KBAI) tasks. KBpedia was first released in October 2016 with some open source aspects, and was made fully open in 2018. KBpedia is sponsored by Cognonto Corporation.

The KBpedia Web site provides a working KBpedia explorer and demo of how the system may be applied to local content for tagging or analysis. KBpedia splits between entities and concepts, on the one hand, and splits in predicates (or relations) based on attributes, external relations, and pointers or indexes, all informed by Charles Peirce’s prescient theories of knowledge representation. Mappings to all external sources are provided in the linkages to the external resources file in the KBpedia downloads. (A larger inferred version is also available.) The external sources keep their own record files. KBpedia distributions provide the links. However, you can access these entities through the KBpedia explorer on the project’s Web site (see these entity examples for cameras, cakes, and canyons; clicking on any of the individual entity links will bring up the full instance record). Such reach-throughs are straightforward to construct. See the GitHub site for further downloads. All resources are available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Posted: April 15, 2019

KBpedia: New Release Includes Manually Vetted Wikidata Mapping

One of the reasons for releasing KBpedia as open source last October was the emerging usefulness of one of its main constituent knowledge bases, Wikidata. Wikidata now contains about 45 million useful entities and concepts (so-called Q identifiers) and more than a quarter billion data assertions across scores of languages [1]. Many of the efforts undertaken for KBpedia’s open-source release and others since then have been to increase coverage of Wikidata in KBpedia [2]. With the release of KBpedia v 2.10, we have extended the mappings to Wikidata instances to more than 98%. We also have increased coverage of other aspects of structure and properties within Wikidata to very high percentages. In this version 2.10 release we also manually inspected all 45,000 mappings of KBpedia reference concepts to Wikidata instances, resulting in many changes and improvements. The quality of mappings in KBpedia has never been higher.

KBpedia, as you recall, is a computable knowledge graph that sits astride Wikipedia and Wikidata and other leading knowledge bases. Its baseline 55,000 reference concepts provide a flexible and expandable means for relating your own data records to a common basis for reasoning and inferring logical relations and for mapping to virtually any external data source or schema. The framework is a clean starting basis for doing knowledge-based artificial intelligence (KBAI) and to train and use virtual agents. KBpedia combines seven major public knowledge bases — Wikipedia, Wikidata, schema.org, DBpedia, GeoNames, OpenCyc, and UMBEL. KBpedia supplements these core KBs with mappings to more than a score of additional leading vocabularies. The entire KBpedia structure is computable, meaning it can be reasoned over and logically sliced-and-diced to produce training sets and reference standards for machine learning and data interoperability. KBpedia provides a coherent overlay for retrieving and organizing Wikipedia or Wikidata content. KBpedia greatly reduces the time and effort traditionally required for KBAI tasks.

KBpedia is also a comprehensive knowledge structure for promoting data interoperability. KBpedia’s upper structure, the KBpedia Knowledge Ontology (KKO), is based on the universal categories and knowledge representation theories of the great 19th century American logician, philosopher, polymath and scientist, Charles Sanders Peirce. This design provides a logical and coherent underpinning to the entire KBpedia structure. The design is also modular and fairly straightforward to adapt to enterprise or domain purposes. KBpedia provides a powerful reference scaffolding for bringing together your own internal data stovepipes into a comprehensive whole. KBpedia, and extensions specific to your own domain needs, can be deployed incrementally, gaining benefits each step of the way, until you have a computable overlay tying together all of your valuable information assets.

Major Activities for Version 2.10

Almost all efforts related to KBpedia v 2.10 were focused on Wikidata, though, given the close alliance between the two, many changes were also reflected in the Wikipedia mappings. As noted with the v 2.00 release, our first effort was to map Q items (IDs) that have much instance coverage, but were lacking in prior mappings. This attention resulted in adding a net 973 Q IDs to KBpedia. This number is a bit misleading, however, since in the manual inspection phases many duplicates were removed from the system (approx. 2100) and earlier mappings to category Q IDs (approx. 2700) were upgraded to their more specific Q ID instance. Thus, nearly 6,000 Q IDs are now different in this version compared to the prior version 2.00. Since many of the Q IDs also have a direct mapping to a Wikipedia counterpart, these mappings were updated as well. Besides incidental improvements to definitions, linkages and labels that arise when doing such inspections, which were also attended to whenever encountered, no further major changes were made to this newest release.

We are now in very good shape with respect to our mapping and coverage of Wikidata (with a similar profile for Wikipedia). Across a breadth of measures, here is now where we stand with respect to Wikidata coverage [3], with implementation notes provided in the endnotes section:

Wikidata Item    No. Items      No. Mapped Items    Coverage [3]
Q IDs            45,306,576     45,882              0.1% [4]
Q instances      45,306,576     44,458,015          98.1% [4]
Q classes        2,493,795      2,312,116           92.7% [5]
Properties       5,910          3,970               67.2% [6]
P Statements     256,298,963    246,055,199         96.0% [7]
P Qualifiers     38,866,255     31,756,937          81.7% [7,8]
P References     24,582,259     20,121,794          81.9% [7,9]

One of the first observations that jumps out of the table is how relatively few mappings (~ 45 K, or 0.1%) are sufficient to capture nearly all (98%) of the instances contained in Wikidata. This is because a Q ID may be an individual instance or a parent to multiple instances. The KBpedia mappings focus on the parents, through which the individual instances may be obtained. By virtue of the additions and Q mapping improvements in this version, KBpedia has expanded its instance reach from about 30 million entities to now 45 million entities.
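The effect can be illustrated with a toy sketch: mapping a single parent node reaches every instance beneath it. The hierarchy, instance names, and `covered_instances` helper are hypothetical, and the sketch assumes that instances are gathered by walking down subclass links from each mapped node, which may differ from how the counts in the table were retrieved. The last two lines simply recompute two of the table's percentages from its own figures.

```python
def covered_instances(mapped, subclasses, instances):
    """Gather instances of each mapped class and of all of its descendants."""
    covered = set()
    seen = set()
    stack = list(mapped)
    while stack:
        cls = stack.pop()
        if cls in seen:
            continue
        seen.add(cls)
        covered |= instances.get(cls, set())
        stack.extend(subclasses.get(cls, ()))
    return covered


# Hypothetical toy hierarchy and instances
subclasses = {"Animal": {"Mammal"}, "Mammal": {"Dog"}}
instances = {
    "Mammal": {"whale-1"},
    "Dog": {"rex", "fido"},
    "Building": {"tower-1"},
}
mapped = {"Animal"}  # a single mapped parent node

covered = covered_instances(mapped, subclasses, instances)
total = set().union(*instances.values())
print(f"{100 * len(covered) / len(total):.1f}%")  # 75.0%

# Recomputing two rows of the table above from its own counts
print(round(100 * 45_882 / 45_306_576, 1))      # 0.1
print(round(100 * 44_458_015 / 45_306_576, 1))  # 98.1
```

One mapped node out of four classes here reaches three of the four instances, which is the same leverage, at a much smaller scale, that the table reports.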

Another observation is that we are also capturing a significant portion of the structure of Wikidata (93%) as provided by the mappings to Q IDs with significant subClassOf connections (P279), which is where the taxonomy of the knowledge base is defined. A third summary observation is that we have similarly high levels of coverage to Wikidata properties. However, at present, this is the least developed area of KBpedia with respect to use cases or cross-knowledge base mappings.

A minor change, but useful to the KBpedia Web site, has been our downgrading of the OpenCyc and UMBEL mapped items. They are still mapped in the knowledge structure, but the Web site removes their links in order to highlight the most popular knowledge bases.

Despite these upgrades and enhancements, the coverage of KBpedia in my new book, A Knowledge Representation Practionary: Guidelines Based on Charles Sanders Peirce (Springer), remains current. The book emphasizes theory, architecture and design, which remains unchanged in this current new release of KBpedia. Also note that future areas of improvement were listed in the KBpedia v 2.00 release notice.

Getting the System

The KBpedia Web site provides a working KBpedia explorer and demo of how the system may be applied to local content for tagging or analysis. KBpedia splits between entities and concepts, on the one hand, and splits in predicates based on attributes, external relations, and pointers or indexes, all informed by Charles Peirce’s prescient theories of knowledge representation.

Mappings to all external sources are provided in the linkages to the external resources file in the KBpedia downloads. (A larger inferred version is also available.) The external sources keep their own record files. KBpedia distributions provide the links. However, you can access these entities through the KBpedia explorer on the project’s Web site (see these entity examples for cameras, cakes, and canyons; clicking on any of the individual entity links will bring up the full instance record). Such reach-throughs are straightforward to construct.

Here are the various KBpedia resources that you may download or use for free with attribution:

  • The complete KBpedia v 2.10 knowledge graph (8.5 MB, zipped). This download is likely your most useful starting point
  • KBpedia’s upper ontology, KKO (332 KB), which is easily inspected and navigated in an editor
  • The annotated KKO (321 KB). This is NOT an active ontology, but it has the upper concepts annotated to more clearly show the Peircean categories of Firstness (1ns), Secondness (2ns), and Thirdness (3ns)
  • The 68 individual KBpedia typologies in N3 format
  • The KBpedia mappings to the seven core knowledge bases and the additional extended knowledge bases in N3 format
  • A version of the full KBpedia knowledge graph extended with linkages to the external resources (10.5 MB, zipped), and
  • A version of the full KBpedia knowledge graph extended with inferences and linkages (14.7 MB, zipped).

The last two resources require time and sufficient memory to load. We invite and welcome contributions or commentary on any of these resources.

All resources are available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. KBpedia’s development to date has been sponsored by Cognonto Corporation. We welcome suggestions for further enhancements or tackling your own improvements. Please let me know what ideas you may have.

Notes

[1] Useful mappings exclude mappings to internal Wikimedia sources (such as templates, categories, or infoboxes on Wikipedia and Wikidata) and scholarly articles (linked in other manners). There are about 45 million ‘useful’ records in the current Wikidata based on these filters.
[2] ‘Coverage’ is understood to be the percentage of useful instances in a source knowledge base to KBpedia that are actually mapped to a specific KBpedia reference concept or property. These source instances are not included in the KBpedia distribution. They are accessed from the source knowledge base directly. Manipulation of the KBpedia knowledge graph results in the identification of this external source data.
[3] The number of items shown for Wikidata does not reflect the total items on the service, but only those that are useful and relevant after administrative categories and such are removed.
[4] See the text where we describe how choosing to map appropriate structural nodes in Wikidata, which themselves have many child instances, leads to large percentage coverage of all available instances. Instance relationships are obtained from the P31 Wikidata property. The Q IDs were obtained from a Feb 19, 2019 Wikidata retrieval.
[5] Like the instance (P31) retrievals, the subClassOf (P279) data was obtained by a SPARQL query to the Wikidata query endpoint. Try it!
[6] The properties data was obtained from the SQID Wikidata service on April 4, 2019. Note, if you try this link, be patient for all of the data to load.
[7] A Wikidata statement pairs a property with a value for a given entity. It is equivalent to an assertion. It is the most basic factual statement in Wikidata.
[8] Wikidata qualifiers allow statements to be expanded on, annotated, or contextualized beyond what can be expressed in just a simple property-value pair.
[9] Wikidata references are used to point to specific sources that back up the data provided in a statement.
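
Notes [4] and [5] mention SPARQL retrievals of instance (P31) and subClassOf (P279) relationships from the Wikidata query endpoint. The exact queries are not reproduced here, so the following is only a minimal sketch of what such a request could look like. The helper names are hypothetical, and Q5 (human) is used purely as an illustrative target. The code builds the query text and request URL without sending anything over the network.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def direct_link_query(property_id: str, target_qid: str, limit: int = 10) -> str:
    """Return SPARQL for items linked to target_qid via property_id.

    Hypothetical helper; e.g. P31 (instance of) or P279 (subclass of).
    """
    return (
        "SELECT ?item WHERE { "
        f"?item wdt:{property_id} wd:{target_qid} . "
        f"}} LIMIT {limit}"
    )


def request_url(query: str) -> str:
    """Build a GET URL asking the endpoint for JSON results."""
    return f"{WDQS_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"


# Illustrative only: direct subclasses of Q5.
query = direct_link_query("P279", "Q5", limit=5)
print(query)
print(request_url(query))
```

Swapping "P279" for "P31" gives the instance version of the same pattern.
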
Posted:February 6, 2019

KBpedia Release Constitutes What We Consider As First, Complete Open-Source Baseline

We first released KBpedia as open source in October 2018 with version 1.60. We needed to release it then because of the pending release of my new book, A Knowledge Representation Practionary: Guidelines Based on Charles Sanders Peirce (Springer), which has liberal ties to KBpedia. We were pleased with that first open-source release of KBpedia, but did not have time to complete our full list of what we considered to be a proper baseline for the initial release. We have spent the past few months completing that list and are now pleased to announce version 2.00 of KBpedia, what we consider to be the first complete, open-source baseline of this knowledge artifact.

KBpedia is a computable knowledge graph that sits astride Wikipedia and Wikidata and other leading knowledge bases. Its baseline 55,000 reference concepts provide a flexible and expandable means for relating your own data records to a common basis for reasoning and inferring logical relations and for mapping to virtually any external data source or schema. The framework is a clean starting basis for doing knowledge-based artificial intelligence (KBAI) and to train and use virtual agents. KBpedia combines seven major public knowledge bases — Wikipedia, Wikidata, schema.org, DBpedia, GeoNames, OpenCyc, and UMBEL. KBpedia supplements these core KBs with mappings to more than a score of additional leading vocabularies. The entire KBpedia structure is computable, meaning it can be reasoned over and logically sliced-and-diced to produce training sets and reference standards for machine learning and data interoperability. KBpedia provides a coherent overlay for retrieving and organizing Wikipedia or Wikidata content. KBpedia greatly reduces the time and effort traditionally required for knowledge-based artificial intelligence (KBAI) tasks.

KBpedia is also a comprehensive knowledge structure for promoting data interoperability. KBpedia’s upper structure, the KBpedia Knowledge Ontology (KKO), is based on the universal categories and knowledge representation theories of the great 19th century American logician, philosopher, polymath and scientist, Charles Sanders Peirce. This design provides a logical and coherent underpinning to the entire KBpedia structure. The design is also modular and fairly straightforward to adapt to enterprise or domain purposes. KBpedia provides a powerful reference scaffolding for bringing together your own internal data stovepipes into a comprehensive whole. KBpedia, and extensions specific to your own domain needs, can be deployed incrementally, gaining benefits each step of the way, until you have a computable overlay tying together all of your valuable information assets.

Major Activities to Complete the Baseline

Some areas received major attention and some were largely ignored in completing this open-source baseline of KBpedia. For example, no changes (other than minor cleanup often related to other changes) were made to the property scope of KBpedia or their mappings to Wikidata or schema.org. The typologies were also not adjusted or expanded (except for minor cleanup related to other changes). The general scope of KBpedia remained virtually unchanged. However, a number of areas were targeted for specific attention and improvement. Notably:

  • Definitions were completed for 100% of the 55,000 reference concepts. Since the decision to open source KBpedia, the number of concepts with definitions grew by nearly 40%, or new definitions for about 15,000 entries;
  • Mappings to instances and classes in Wikidata were greatly expanded. Mappings now exist to 32 million entities in Wikidata, representing over 80% of the useful data in that system [1]. Over 80% of KBpedia’s 55 K reference concepts are also now mapped to specific Wikidata entries;
  • Mappings to Wikipedia also grew and kept pace with this Wikidata mapping. Total mappings to Wikipedia only grew 10% because of the larger number of prior mappings. Still, coverage of Wikipedia is also now about 80% based on either mapped RCs or coverage of Wikipedia articles;
  • Due to early mapping choices, KBpedia was not consistent in its use of plural vs. singular terms. We inspected and converted about 4500 plural concepts into singular expressions, consistent with what we see as best naming practices;
  • Because of this mixed naming, and some other synonym issues, we had a pool of reference concept (RC) duplicates in the system that totaled nearly 1400 items, which were consolidated and then removed. The overall size of KBpedia, however, did not change much, since all of these inspections also resulted in the addition of about 1200 new concepts, often at intermediate layers that improved the overall graph connectivity; and
  • Since the initiation of KBpedia, about 21,000 new concepts have been added over the starting OpenCyc RC structure. Each of these 21,000 RCs was reviewed, with about 5,000 flagged for detailed scrutiny. All of these flagged items were further reviewed, frequently resulting in new definitions, new parental assignments, new altLabels, or the addition of other property relations.

Despite these massive efforts, we are certainly not claiming an error-free structure. Logic and consistency tests are a constant activity, and the addition or deletion of concepts also requires testing and sometimes changes. Nonetheless, we are proud of this version 2.00 and believe KBpedia to be the cleanest it has ever been.

Statistics on the KBpedia v 200 Release

I show in the following table the statistics and changes compared to the first open-source release of KBpedia (v. 160) and the prior and last proprietary release (v. 151). The comparison to v 151 represents the total changes in the move to open source. Please note in the table that we measure coverage as the larger of either: a) the percent of external concepts mapped; or b) the percent of KBpedia RCs mapped to the external source (predominantly unique).

| Structure | Value | % Change from 1.60 | % Change from 1.51 | Coverage |
|---|---|---|---|---|
| No. of RCs | 54,713 | -0.3% | 2.4% | |
| · KKO | 173 | 0.0% | -0.6% | |
| · Standard RCs | 54,540 | -0.3% | 2.4% | |
| · Std RCs w/ definitions | 54,537 | 33.2% | 38.4% | 100% |
| No. of mapped vocabularies | 23 | 0.0% | -14.8% | |
| · Core KBs | 7 | 0.0% | 16.7% | |
| · Extended vocabs | 16 | 0.0% | -23.8% | |
| No. of typologies | 68 | 0.0% | 7.9% | |
| · Core entity types | 33 | 0.0% | 0.0% | |
| · Other core types | 5 | 0.0% | 0.0% | |
| · Extended types | 30 | 0.0% | 20.0% | |
| No. of properties | 4,847 | 0.0% | 92.4% | |
| RC Mappings | 158,789 | 14.0% | 38.0% | |
| · Wikipedia | 44,342 | 5.3% | 9.9% | 81% |
| · Wikidata | 44,909 | 63.8% | 794.4% | 82% |
| · schema.org | 845 | 0.0% | 15.1% | 99% |
| · DBpedia ontology | 764 | 0.0% | 0.0% | 99% |
| · GeoNames | 918 | 0.0% | 0.0% | 99% |
| · OpenCyc | 33,372 | -0.5% | -0.5% | 61% |
| · UMBEL | 33,390 | -0.3% | -0.3% | 61% |
| · Extended vocabs | 249 | 0.0% | -4.2% | |
| Property Mappings | 4,847 | 0.0% | 92.4% | |
| · Wikidata | 3,970 | 0.0% | 57.6% | 86% |
| · schema.org | 877 | 0.0% | N/A | 92% |
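
The coverage rule stated before the table (the larger of the external share mapped or the KBpedia RC share mapped) can be sketched in a few lines of Python. This is an illustration rather than the actual build code: the function name is hypothetical, and using the total RC count of 54,713 as the denominator is an assumption, though it reproduces the table's rounded Wikipedia, Wikidata, and OpenCyc figures. The external-source totals are not given in the table, so the second measure is shown only as an optional argument.

```python
from typing import Optional

TOTAL_RCS = 54_713  # "No. of RCs" from the table; assumed denominator


def coverage(
    mapped_rcs: int,
    total_rcs: int = TOTAL_RCS,
    mapped_external: Optional[int] = None,
    total_external: Optional[int] = None,
) -> float:
    """Return coverage as the larger of the two shares (as a fraction)."""
    rc_share = mapped_rcs / total_rcs
    if mapped_external is None or total_external is None:
        # Without external totals, only the RC-side share can be computed.
        return rc_share
    return max(rc_share, mapped_external / total_external)


# RC mapping counts taken from the table.
for name, mapped in [("Wikipedia", 44_342), ("Wikidata", 44_909), ("OpenCyc", 33_372)]:
    print(f"{name}: {coverage(mapped):.0%}")
```

For sources such as schema.org, where few RCs are mapped but nearly the whole external vocabulary is covered, the external share would be the larger of the two and would set the reported figure.
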

The table shows the significant improvements made to KBpedia since the decision to release it as open source. The property mappings nearly doubled, now with significant mappings to both Wikidata and schema.org properties. The number of mappings to Wikidata entities (Q items) increased nearly eight-fold (8x), with coverage now more than 80 percent for both Wikidata and Wikipedia. The structure is fairly clean and consistent, with all reference concepts now including a definition, and most with a slew of alternative labels to improve matching and retrieval. Through its mapped sources, KBpedia links to more than 30 million entities, almost all with data attributes (Wikidata) and complete articles (Wikipedia). The system is inherently designed for expansion into multiple languages.
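
The percentage changes in the table can also be inverted to estimate the earlier counts. The short Python sketch below does that arithmetic; the function name is hypothetical, and because the table's percentages are rounded, the results are approximate rather than the actual v 1.51 figures.

```python
def implied_prior(current: float, pct_change: float) -> float:
    """Back out the earlier value from a current value and a percent change."""
    return current / (1 + pct_change / 100)


# Current values and changes relative to v 1.51, taken from the table.
wikidata_rc_prior = implied_prior(44_909, 794.4)
property_prior = implied_prior(4_847, 92.4)
wikidata_property_prior = implied_prior(3_970, 57.6)

print(f"Implied Wikidata RC mappings in v 1.51: about {wikidata_rc_prior:.0f}")
print(f"Implied property mappings in v 1.51: about {property_prior:.0f}")
print(f"Implied Wikidata property mappings in v 1.51: about {wikidata_property_prior:.0f}")
```

The two property estimates come out essentially equal, which is consistent with the N/A entry for schema.org properties in that column: under this reading, the earlier property mappings were all to Wikidata.
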

Moving Beyond the Baseline

Of course, a knowledge artifact like KBpedia can be bounded in many ways. It is somewhat arbitrary what we define as a proper baseline. Our general image was a clean and computable framework adhering to best practices that maps to at least 80% of both Wikipedia and Wikidata. We have accomplished this baseline in the current release.

But our ambitions for KBpedia do not end there. Here are some of the major areas we will be working on for future versions:

  • Still better definitions for many concepts, particularly those with short or limited definitions. A few thousand candidates exist for this attention;
  • Adding another 1,000 or so new Wikidata Q items will increase instance coverage to more than 97% and raise class coverage to over 80%;
  • Complete the products and services mappings to the UNSPSC (United Nations Standard Products and Services Code) classification scheme, plus the likely split of the Products typology into three distinct branches;
  • Improved automatic tests for errors and oversights. We will be documenting our mapping experiences, among other topics, in a new ‘In the Trenches’ blog series I will begin early this year;
  • Test marginal overlaps between SuperTypes (typologies) for various reference concepts in order to improve assignments and increase disjointness assertions even further;
  • Cross-check existing mappings from external sources to Wikidata against KBpedia assignments (GeoNames features, for example) and reconcile differences;
  • Create various vector files for the KBpedia reference nodes using techniques such as explicit semantic analysis (ESA), word2vec, GloVe, and perhaps others; and
  • Open source the build code for KBpedia.

Quite a while back I estimated that KBpedia might eventually grow to 85 K reference concepts or so in order to provide an equivalent, complete baseline coverage of topics across human knowledge domains. After this most recent detailed review, I think those prior numbers were an overestimate. After detailed inspection and comparison with Wikipedia and Wikidata, I now suspect a ‘complete’ structure may require only 60 K to 65 K reference concepts. (Of course, the depth or breadth of KBpedia can be expanded to capture virtually any knowledge domain.) This reduced estimate also reflects that the present KBpedia has perhaps 1000 to 2000 unduly specific items (lists of individual species, for example) that probably should be culled to bring the overall structure into balance.

In any case, we welcome suggestions for further enhancements or tackling your own improvements. Please let me know what ideas you may have.

To Get the Goodies

The KBpedia Web site provides a working KBpedia explorer and a demo of how the system may be applied to local content for tagging or analysis. KBpedia distinguishes entities from concepts, on the one hand, and splits predicates into attributes, external relations, and pointers or indexes, on the other, all informed by Charles Peirce’s prescient theories of knowledge representation.

Mappings to all external sources are provided in the linkages to the external resources file in the KBpedia downloads. (A larger inferred version is also available.) The external sources keep their own record files; KBpedia distributions provide only the links. However, you can access these entities through the KBpedia explorer on the project’s Web site (see these entity examples for cameras, cakes, and canyons; clicking on any of the individual entity links will bring up the full instance record). Such reachthroughs are straightforward to construct.
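
As one illustration of a reachthrough, a mapped Wikidata Q ID can be turned into a URL for the full entity record using Wikidata's public Special:EntityData page. This is a minimal Python sketch, not the code behind the KBpedia explorer; the function name is hypothetical, and Q42 is just a well-known example ID, not a claim about any particular KBpedia mapping.

```python
import re

# Wikidata's public per-entity data endpoint (JSON serialization).
ENTITY_DATA_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"

# Q IDs are the letter Q followed by a positive integer.
_QID_PATTERN = re.compile(r"^Q[1-9]\d*$")


def wikidata_record_url(qid: str) -> str:
    """Return the URL of the JSON record for a Wikidata Q ID (hypothetical helper)."""
    if not _QID_PATTERN.match(qid):
        raise ValueError(f"Not a valid Wikidata Q ID: {qid!r}")
    return ENTITY_DATA_URL.format(qid=qid)


print(wikidata_record_url("Q42"))
```

Fetching that URL with any HTTP client returns the instance record from Wikidata itself, which matches the division of labor described above: the distribution carries the link, and the source keeps the data.
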

Here are the various KBpedia resources that you may download or use for free with attribution:

  • The complete KBpedia v 200 knowledge graph (8.5 MB, zipped). This download is likely your most useful starting point
  • KBpedia’s upper ontology, KKO (332 KB), which is easily inspected and navigated in an editor
  • The annotated KKO (321 KB). This is NOT an active ontology, but it has the upper concepts annotated to more clearly show the Peircean categories of Firstness (1ns), Secondness (2ns), and Thirdness (3ns)
  • The 68 individual KBpedia typologies in N3 format
  • The KBpedia mappings to the seven core knowledge bases and the additional extended knowledge bases in N3 format
  • A version of the full KBpedia knowledge graph extended with linkages to the external resources (10.5 MB, zipped), and
  • A version of the full KBpedia knowledge graph extended with inferences and linkages (14.7 MB, zipped).

The last two resources require time and sufficient memory to load. We invite and welcome contributions or commentary on any of these resources.

All resources are available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. KBpedia’s development to date has been sponsored by Cognonto Corporation.

Notes

[1] Useful mappings exclude mappings to internal Wikimedia sources (such as templates, categories, or infoboxes on Wikipedia and Wikidata) and scholarly articles (linked in other manners). There are about 44 million ‘useful’ records in the current Wikipedia based on these filters.