Wikidata editors last week approved adding a new KBpedia ID property (P8408) to their system, and we have therefore followed up by adding nearly 40,000 mappings to the Wikidata knowledge base. We have another 5000 to 6000 mappings still forthcoming that we will be adding in the coming weeks. Thereafter, we will continue to increase the cross-links, as we partially document below.
This milestone is one I have had in my sights for at least the past few years. We want to both: 1) provide a computable overlay to Wikidata; and 2) increase our dependence on, and use of, Wikidata’s unparalleled structured data resources as we move KBpedia forward. Below I give a brief overview of the status of Wikidata, share some high-level views of our experience in putting forward and then mapping a new Wikidata property, and conclude with some thoughts on where we might go next.
The Status of Wikidata
Wikidata is the structured data initiative of the Wikimedia Foundation, the open-source group that oversees Wikipedia and many other notable Web-wide information resources. Since its founding in 2012, Wikidata’s use and prominence have exceeded expectations. Today, Wikidata is a multilingual resource with structured data for more than 95 million items, characterized by nearly 10,000 properties. Items are being added to Wikidata at a rate of nearly 5 million per month. A rich ecosystem of human editors and bots patrols the knowledge base and its entries to enforce data quality and consistency. The ecosystem includes tools for bulk loading of data with error checks, search (including structured SPARQL queries), and navigation and visualization. Errors and mistakes in the data occur, but the system ensures such problems are removed or corrected as discovered. Thus, as the data have grown, so have their quality and usefulness.
From KBpedia’s standpoint, Wikidata represents the most complete complementary instance data and characterization resource available. As such, it is the driving wheel and stalking horse (to mix eras and technologies) to guide where and how we need to incorporate data and its types. These have been the overall motivators for us to embrace a closer relationship with Wikidata.
As an open governance system, Wikidata has adopted its own data models, policies, and approval and ingest procedures for adopting new data or characterizations (properties). You might find it interesting to review the process and ongoing dialog that accompanied our proposal for a KBpedia ID as a property in Wikidata. As of one week ago, KBpedia ID was assigned Wikidata property P8408. To date, more than 60% of Wikidata properties have been such external identifiers, and IDs are the fastest-growing category of properties. Since most properties that relate to internal entity characteristics have already been identified and adopted, we anticipate mappings to external systems will continue to be a dominant feature of the growth in Wikidata properties to come.
Our Mapping Experience
There are many individuals who spend considerable time monitoring and overseeing Wikidata. I am not one of them. I had never before proposed a new property to Wikidata, and had only proposed one actual Q item (Q is the standard prefix for an entity or concept in Wikidata) for KBpedia prior to proposing our new property.
Like much else in the Wikimedia ecosystem, there are specific templates in place for proposing a new Q item or a new property (see the examples of external identifiers, here). Since there are about 10,000 times more Q items than properties, the path for getting a new property approved is more stringent.
Then, once a new property is granted, specific paths such as QuickStatements need to be followed in order to submit new data items (Q ids) or characterizations (property claims on Q items). I made some newbie mistakes in my first bulk submissions, and fortunately had a knowledgeable administrator (@Epidosis) help guide me through making the fixes. For example, we had to back off about 10,000 updates because I had used the wrong form for referencing a claim. Once corrected, we were able to upload the mappings again.
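To make the mechanics concrete, here is a minimal sketch of generating QuickStatements commands (V1, tab-separated syntax) for KBpedia ID (P8408) claims, each carrying a "retrieved" (P813) reference. The mappings and the retrieval date below are illustrative stand-ins, not the actual KBpedia submission data.

```python
# Sketch: emit QuickStatements V1 commands for KBpedia ID (P8408) claims.
# Format per line: item <TAB> property <TAB> "value", with S-prefixed
# reference properties appended (S813 = "retrieved" date qualifier-as-source).
# The sample mappings here are hypothetical, for illustration only.

def qs_commands(mappings, retrieved="+2020-07-01T00:00:00Z/11"):
    """mappings: dict of Wikidata Q-id -> KBpedia reference concept ID."""
    lines = []
    for qid, kbp_id in sorted(mappings.items()):
        # string values are double-quoted; dates use +ISO/precision notation
        lines.append(f'{qid}\tP8408\t"{kbp_id}"\tS813\t{retrieved}')
    return "\n".join(lines)

sample = {"Q5": "Person", "Q571": "Book"}
print(qs_commands(sample))
```

The resulting text can be pasted into the QuickStatements batch interface, which registers each line as a separate claim.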
As one might imagine, updates and changes are being submitted by multiple human agents and (some) bots at all times into the system. The facilities like QuickStatements are designed to enable batch uploads, and allow re-submissions due to errors. You might want to see what is currently active on the system by checking out this current status.
With multiple inputs and submitters, it takes time for large claim sets to be uploaded. In the case of our 40,000 mappings, we also accompanied each of those with a source and update date characterization, leading to a total upload of more than 120,000 claims. We split our submissions into multiple parts or batches, and then re-submitted any initial claims that errored out (for example, if the base claim had not yet been fully registered, subsequent subsidiary claims might error for lack of a registered subject; on a second pass, the subject would be there and the claim would succeed). We ran our batches at off-peak times for both Europe and North America, but the runs still took about 12 hours in total.
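The two-pass strategy described above can be sketched as a small batching loop. This is an illustrative outline only; `submit_claim` is a hypothetical stand-in for whatever submission call is used, not a real QuickStatements API.

```python
# Sketch of the batch-and-retry upload strategy: split claims into batches,
# submit each claim, and re-run any that failed (e.g. because their subject
# item's base claim had not yet registered) on a later pass.
# submit_claim is a caller-supplied stand-in, not a real API function.

def run_batches(claims, submit_claim, batch_size=500, passes=2):
    """Return the claims that still failed after the given number of passes."""
    pending = list(claims)
    for _ in range(passes):
        failed = []
        for i in range(0, len(pending), batch_size):
            for claim in pending[i:i + batch_size]:
                if not submit_claim(claim):
                    failed.append(claim)  # retry on the next pass
        pending = failed
        if not pending:
            break
    return pending
```

With transient ordering errors, the second pass typically clears claims that failed only because their subject was not yet registered on the first pass.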
Once loaded, the internal quality controls of Wikidata kick in. Both bots and human editors monitor concepts, and both can flag (and revert) the mapping assignments made. After three days of being active on Wikidata, we had a dozen reverts of initially uploaded mappings, representing about 0.03% of our suggested mappings, which is gratifyingly low. Still, we expect to hear of more such errors, and we are committed to fixing all that are identified. But, at this juncture, it appears our initial mappings were of pretty high quality.
We had a rewarding learning experience in uploading mappings to Wikidata and found much good will and assistance from knowledgeable members. Undoubtedly, everything should be checked in advance to ensure quality assertions when preparing uploads to Wikidata. But, if that is done, the system and its editors also appear quite capable of identifying and enforcing quality controls and constraints as encountered. Overall, I found the entire data upload process to be impressive and rewarding. I am quite optimistic that this ecosystem will continue to improve.
The result of our external ID uploads and mappings can be seen in these SPARQL queries regarding the KBpedia ID property on Wikidata:
- A random listing of mappings
- A count of current mappings
- The first 1000 mappings
- The top 100 instance and subclass mappings
- Items with the most other external identifiers (mappings).
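As a flavor of what such queries look like, here is a minimal SPARQL query of the counting variety, shown as a Python string (it is not executed here; running it requires the Wikidata Query Service endpoint). The query form follows standard `wdt:` truthy-statement usage; the variable names are arbitrary.

```python
# A minimal SPARQL query (not executed here) of the kind linked above:
# count the Wikidata items that carry a KBpedia ID (P8408) claim.
# Intended for the Wikidata Query Service; shown only as a string.

COUNT_QUERY = """
SELECT (COUNT(?item) AS ?mappings)
WHERE {
  ?item wdt:P8408 ?kbpediaId .
}
"""

print(COUNT_QUERY.strip())
```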
As of this writing, the KBpedia ID is now about the 500th most prevalent property on Wikidata.
What is Next?
Wikidata is clearly a dynamic data environment. Not only are new items being added by the millions, but existing items are being better characterized and related to external sources. To deal with the immense scales involved requires automated quality checking bots with human editors committed to the data integrity of their domains and items. To engage in a large-scale mapping such as KBpedia also requires a commitment to the Wikidata ecosystem and model.
Initiatives that appear immediately relevant to what we have put in place with Wikidata include:
- Extend the current direct KBpedia mappings to fix initial mis-assignments and to extend coverage to remaining sections of KBpedia
- Add additional cross-mappings that exist in KBpedia but have not yet been asserted in Wikidata (for example, there are nearly 6,000 such UNSPSC IDs)
- Add equivalent class (P1709) and possible superproperties (P2235) and subproperties (P2236) already defined in KBpedia
- Where useful mappings are desirable, add missing Q items used in KBpedia to Wikidata
- Most generally, extend mappings to the 5,000 or so properties shared between Wikidata and KBpedia.
I have been impressed as a user of Wikidata for some years now. This most recent experience also makes me enthusiastic about contributing data and data characterizations directly.
To Learn More
The KBpedia Web site provides a working KBpedia explorer and a demo of how the system may be applied to local content for tagging or analysis. KBpedia distinguishes between entities and concepts on the one hand, and splits predicates into attributes, external relations, and pointers or indexes, all informed by Charles Peirce‘s writings related to knowledge representation. KBpedia was first released in October 2016 with some open-source aspects, and was made fully open in 2018. KBpedia is partially sponsored by Cognonto Corporation. All resources are available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.