We first released KBpedia as open source in October 2018 with version 1.60. We needed to release it then because of the pending release of my new book, A Knowledge Representation Practionary: Guidelines Based on Charles Sanders Peirce (Springer), which has liberal ties to KBpedia. We were pleased with that first open-source release of KBpedia, but did not have time to complete our full list of what we considered to be a proper baseline for the initial release. We have spent the past few months completing that list and are now pleased to announce version 2.00 of KBpedia, what we consider to be the first complete, open-source baseline of this knowledge artifact.
KBpedia is a computable knowledge graph that sits astride Wikipedia and Wikidata and other leading knowledge bases. Its baseline 55,000 reference concepts provide a flexible and expandable means for relating your own data records to a common basis for reasoning and inferring logical relations and for mapping to virtually any external data source or schema. The framework is a clean starting basis for doing knowledge-based artificial intelligence (KBAI) and for training and using virtual agents. KBpedia combines seven major public knowledge bases: Wikipedia, Wikidata, schema.org, DBpedia, GeoNames, OpenCyc, and UMBEL. KBpedia supplements these core KBs with mappings to more than a score of additional leading vocabularies. The entire KBpedia structure is computable, meaning it can be reasoned over and logically sliced-and-diced to produce training sets and reference standards for machine learning and data interoperability. KBpedia provides a coherent overlay for retrieving and organizing Wikipedia or Wikidata content. KBpedia greatly reduces the time and effort traditionally required for KBAI tasks.
KBpedia is also a comprehensive knowledge structure for promoting data interoperability. KBpedia’s upper structure, the KBpedia Knowledge Ontology (KKO), is based on the universal categories and knowledge representation theories of the great 19th-century American logician, philosopher, polymath and scientist, Charles Sanders Peirce. This design provides a logical and coherent underpinning to the entire KBpedia structure. The design is also modular and fairly straightforward to adapt to enterprise or domain purposes. KBpedia provides a powerful reference scaffolding for bringing together your own internal data stovepipes into a comprehensive whole. KBpedia, and extensions specific to your own domain needs, can be deployed incrementally, gaining benefits each step of the way, until you have a computable overlay tying together all of your valuable information assets.
Major Activities to Complete the Baseline
Some areas received major attention and some were largely ignored in completing this open-source baseline of KBpedia. For example, no changes (other than minor cleanup often related to other changes) were made to the property scope of KBpedia or to the property mappings to Wikidata or schema.org. The typologies were also not adjusted or expanded (except for minor cleanup related to other changes). The general scope of KBpedia remained virtually unchanged. However, a number of areas were targeted for specific attention and improvement. Notably:
- Definitions were completed for 100% of the 55,000 reference concepts. Since the decision to open source KBpedia, the number of concepts with definitions grew by nearly 40%, or new definitions for about 15,000 entries;
- Mappings to instances and classes in Wikidata were greatly expanded. Mappings now exist to 32 million entities in Wikidata, representing over 80% of the useful data in that system. Over 80% of KBpedia’s 55 K reference concepts are also now mapped to specific Wikidata entries;
- Mappings to Wikipedia also grew and kept pace with this Wikidata mapping. Total mappings to Wikipedia only grew 10% because of the larger number of prior mappings. Still, coverage of Wikipedia is also now about 80% based on either mapped RCs or coverage of Wikipedia articles;
- Due to early mapping choices, KBpedia was not consistent in the use of plural versus singular terms. We inspected and converted about 4500 plural concepts into singular expressions, consistent with what we see as best naming practices;
- Because of this mixed naming, and some other synonym issues, we had a pool of reference concept (RC) duplicates in the system that totaled nearly 1400 items, which were consolidated and then removed. The overall size of KBpedia, however, did not change much, since all of these inspections also resulted in the addition of about 1200 new concepts, often at intermediate layers that improved the overall graph connectivity; and
- Since the initiation of KBpedia, about 21,000 new concepts have been added over the starting OpenCyc RC structure. Each of these 21,000 RCs was reviewed, with about 5,000 flagged for detailed scrutiny. All of these flagged items were further reviewed, frequently resulting in new definitions, new parental assignments, new altLabels, or the addition of other property relations.
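As a rough sanity check on the rounded counts in the list above, the arithmetic can be sketched in a few lines of Python. The constant and function names are illustrative only, and the inputs are the approximate figures quoted in the bullets, not exact counts from the build.

```python
# Rounded figures quoted in the list above (all approximate, not exact counts).
TOTAL_RCS = 55_000           # reference concepts, all now with definitions
NEW_DEFINITIONS = 15_000     # definitions added since the open-source decision
DUPLICATES_REMOVED = 1_400   # consolidated duplicate RCs
CONCEPTS_ADDED = 1_200       # new, often intermediate, concepts


def pct(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Net effect of removing duplicates and adding new concepts.
net_change = CONCEPTS_ADDED - DUPLICATES_REMOVED
net_change_pct = pct(net_change, TOTAL_RCS)

# If every RC now has a definition, the earlier defined count is the difference.
prior_defined = TOTAL_RCS - NEW_DEFINITIONS
definition_growth_pct = pct(NEW_DEFINITIONS, prior_defined)

print(f"Net RC change: {net_change} ({net_change_pct:.2f}%)")
print(f"Growth in defined RCs: {definition_growth_pct:.1f}%")
```

With these rounded inputs the net change is a small fraction of one percent of the graph, and the growth in definitions lands close to the "nearly 40%" quoted above.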
Despite these massive efforts, we are certainly not claiming an error-free structure. Logic and consistency tests are a constant activity, and the addition or deletion of concepts also requires testing and sometimes changes. Nonetheless, we are proud of this version 2.00 and believe KBpedia to be the cleanest it has ever been.
Statistics on the KBpedia v 2.00 Release
The following table shows the statistics and changes compared to the first open-source release of KBpedia (v 1.60) and the last proprietary release (v 1.51). The comparison to v 1.51 represents the total changes in the move to open source. Please note that the table measures coverage as the larger of: a) the percent of external concepts mapped; or b) the percent of KBpedia RCs mapped to the external source (predominantly unique).
| Structure | Value | % Change from 1.60 | % Change from 1.51 | Coverage |
|---|---|---|---|---|
| No. of RCs | 54,713 | -0.3% | 2.4% | |
| Std RCs w/ definitions | 54,537 | 33.2% | 38.4% | 100% |
| No. of mapped vocabularies | 23 | 0.0% | -14.8% | |
| No. of typologies | 68 | 0.0% | 7.9% | |
| Core entity types | 33 | 0.0% | 0.0% | |
| Other core types | 5 | 0.0% | 0.0% | |
| No. of properties | 4,847 | 0.0% | 92.4% | |
The table shows the significant improvements made to KBpedia since the decision to release it as open source. The property mappings nearly doubled, now with significant mappings to both Wikidata and schema.org properties. The number of mappings to Wikidata entities (Q items) increased nearly eight-fold (8x), with coverage now more than 80 percent to both Wikidata and Wikipedia. The structure is fairly clean and consistent, with all reference concepts now including a definition, and most with a slew of alternative labels to improve matching and retrieval. Through its mapped sources, KBpedia links to more than 30 million entities, almost all with data attributes (Wikidata) and complete articles (Wikipedia). The system is inherently designed for expansion into multiple languages.
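The coverage measure used in the table (the larger of the percent of external concepts mapped and the percent of KBpedia RCs mapped) can be sketched as a small function. This is a minimal illustration, not the code used to produce the table; the function name, parameter names, and sample counts are all assumptions made for the example.

```python
def coverage(external_mapped: int, external_total: int,
             rcs_mapped: int, rcs_total: int) -> float:
    """Return coverage as the larger of two mapping percentages.

    a) percent of the external source's concepts that are mapped
    b) percent of KBpedia reference concepts mapped to that source
    """
    if external_total <= 0 or rcs_total <= 0:
        raise ValueError("totals must be positive")
    pct_external = 100.0 * external_mapped / external_total
    pct_rcs = 100.0 * rcs_mapped / rcs_total
    return max(pct_external, pct_rcs)


# Hypothetical counts, chosen only to show which side wins the comparison.
example = coverage(external_mapped=800, external_total=1_000,
                   rcs_mapped=30_000, rcs_total=50_000)
print(f"coverage = {example:.1f}%")
```

Taking the larger of the two ratios means a source counts as well covered either when most of its own concepts are reached or when most KBpedia RCs link into it.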
Moving Beyond the Baseline
Of course, a knowledge artifact like KBpedia can be bounded in many ways. It is somewhat arbitrary what we define as a proper baseline. Our general image was a clean and computable framework adhering to best practices that maps to at least 80% of both Wikipedia and Wikidata. We have accomplished this baseline in the current release.
But our ambitions for KBpedia do not end there. Here are some of the major areas we will be working on for future versions:
- Still better definitions for many concepts, particularly those with short or limited definitions. A few thousand candidates exist for this attention;
- Adding another 1,000 or so new Wikidata Q items will increase instance coverage to more than 97% and raise class coverage to over 80%;
- Complete the products and services mappings to the UNSPSC (United Nations Standard Products and Services Code) classification scheme, plus the likely split of the Products typology into three distinct branches;
- Improved automatic tests for errors and oversights. We will be documenting our mapping experiences, among other topics, in a new ‘In the Trenches’ blog series I will begin early this year;
- Test marginal overlaps between SuperTypes (typologies) for various reference concepts in order to improve assignments and increase disjointedness assertions even further;
- Cross-check existing mappings from external sources to Wikidata against KBpedia assignments (GeoNames features, for example) and reconcile differences;
- Create various vector files for the KBpedia reference nodes using techniques such as explicit semantic analysis (ESA), word2vec, GloVe, and perhaps others; and
- Open source the build code for KBpedia.
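The overlap testing between SuperTypes mentioned in the list above could look something like the following sketch. It assumes each typology is available as a plain set of reference-concept names; the typology names, member concepts, function names, and the `max_shared` threshold are all hypothetical and are not taken from the KBpedia build code.

```python
from itertools import combinations


def typology_overlaps(members: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Return the shared concepts for every pair of typologies that overlap."""
    overlaps = {}
    for a, b in combinations(sorted(members), 2):
        shared = members[a] & members[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


def marginal_overlaps(members: dict[str, set[str]], max_shared: int = 2):
    """Keep only pairs sharing a few concepts: candidates for reassignment."""
    return {pair: shared
            for pair, shared in typology_overlaps(members).items()
            if len(shared) <= max_shared}


# Hypothetical data: "Bat" is ambiguous between an animal and a product.
EXAMPLE = {
    "Animals": {"Dog", "Cat", "Bat"},
    "Products": {"Camera", "Bat"},
    "Plants": {"Oak"},
}

for (a, b), shared in marginal_overlaps(EXAMPLE).items():
    print(f"{a} / {b} share {sorted(shared)}")
```

Pairs that share only a handful of concepts are the interesting ones: reassigning those few concepts would let the two typologies be asserted as disjoint.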
Quite a while back I estimated that KBpedia might eventually grow to 85 K reference concepts or so in order to provide an equivalent, complete baseline coverage of topics across human knowledge domains. After this most recent detailed review, I think that prior number was an overestimate. After detailed inspection and comparison with Wikipedia and Wikidata, I now suspect a ‘complete’ structure may require only 60 K to 65 K reference concepts. (Of course, the depth or breadth of KBpedia can be expanded to capture virtually any knowledge domain.) This reduced estimate also reflects that the present KBpedia has perhaps 1000 to 2000 unduly specific items (lists of individual species, for example) that probably should be culled to bring the overall structure into balance.
In any case, we welcome suggestions for further enhancements or tackling your own improvements. Please let me know what ideas you may have.
To Get the Goodies
The KBpedia Web site provides a working KBpedia explorer and demo of how the system may be applied to local content for tagging or analysis. KBpedia distinguishes entities from concepts, on the one hand, and, on the other, splits predicates into attributes, external relations, and pointers or indexes, all informed by Charles Peirce’s prescient theories of knowledge representation.
Mappings to all external sources are provided in the linkages to the external resources file in the KBpedia downloads. (A larger inferred version is also available.) The external sources keep their own record files. KBpedia distributions provide the links. However, you can access these entities through the KBpedia explorer on the project’s Web site (see these entity examples for cameras, cakes, and canyons; clicking on any of the individual entity links will bring up the full instance record). Such reachthroughs are straightforward to construct.
Here are the various KBpedia resources that you may download or use for free with attribution:
- The complete KBpedia v 2.00 knowledge graph (8.5 MB, zipped). This download is likely your most useful starting point;
- KBpedia’s upper ontology, KKO (332 KB), which is easily inspected and navigated in an editor;
- The annotated KKO (321 KB). This is NOT an active ontology, but it has the upper concepts annotated to more clearly show the Peircean categories of Firstness (1ns), Secondness (2ns), and Thirdness (3ns);
- The 68 individual KBpedia typologies in N3 format;
- The KBpedia mappings to the seven core knowledge bases and the additional extended knowledge bases in N3 format;
- A version of the full KBpedia knowledge graph extended with linkages to the external resources (10.5 MB, zipped); and
- A version of the full KBpedia knowledge graph extended with inferences and linkages (14.7 MB, zipped).
The last two resources require time and sufficient memory to load. We invite and welcome contributions or commentary on any of these resources.