Cognonto, our recently announced venture in knowledge-based artificial intelligence (KBAI), has just published three use cases. Two of these use cases are based on extending KBpedia with enterprise or domain data. KBpedia is the KBAI knowledge structure at the heart of Cognonto.
The Cognonto Web site contains longer descriptions of these use cases, with statistics and results where appropriate. We intend to continue to publish more use cases. Notable ones will be broadly announced.
word2vec is an artificial intelligence ‘word embedding’ model that can establish similarities between terms. These similarities can be used to cluster or classify documents by topic, or to characterize them by sentiment, or for recommendations. The rich structure and entity types within Cognonto’s KBpedia knowledge structure can be used, with one or two simple queries, to create relevant domain “slices” of tens of thousands of documents and entities upon which to train word2vec models. This approach eliminates the majority of effort normally associated with word2vec for domain purposes, enabling available effort to be spent on refining the parameters of the model for superior results.
Some key findings are:
- Domain-specific training corpuses work better with less ambiguity than general corpuses for these problems
- Cognonto (through KBpedia) speeds and eases the creation of domain-specific training corpuses for word2vec (and other corpus-based models)
- Other public and private text sources may be readily added to the KBpedia baseline in order to obtain still further domain-relevant models
- Such domain-specific training corpuses can be used to establish similarity between local text documents or HTML web pages
- This method can also be combined with Cognonto’s topics analyzer to first tag text documents using KBpedia reference concepts, and then inform or augment these domain-specific training corpuses, and
- These capabilities enable rapid testing and refinement of different combinations of “seed” concepts to obtain better desired results.
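The workflow above (select a domain slice by seed concepts, build a term model from it, then compare terms) can be sketched in a few lines. This is only an illustration under stated assumptions: the tagged documents and the `Music` and `Finance` concept names are hypothetical, and simple co-occurrence count vectors stand in for word2vec so that the sketch runs with the Python standard library alone. It is not Cognonto's implementation.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical tagged documents: (text, concept tags). In practice the tags
# would come from a knowledge structure such as KBpedia.
DOCUMENTS = [
    ("the guitar and the piano are musical instruments", {"Music"}),
    ("the violin and the piano play in the orchestra", {"Music"}),
    ("the guitar and the violin have strings", {"Music"}),
    ("the bank approved the loan and the mortgage", {"Finance"}),
]


def domain_slice(documents, seed_concepts):
    """Keep only documents tagged with at least one seed concept."""
    return [text.split() for text, tags in documents if tags & seed_concepts]


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it.

    Stand-in for a trained word2vec model (assumption for this sketch).
    """
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Changing the seed concepts changes the training corpus, and so the model.
corpus = domain_slice(DOCUMENTS, {"Music"})
vectors = cooccurrence_vectors(corpus)
similarity = cosine(vectors["guitar"], vectors["violin"])
```

Swapping the seed set (for example `{"Finance"}`) and rebuilding is the kind of quick experiment the last finding refers to; with a real word2vec library, only `cooccurrence_vectors` would need to be replaced.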
KBpedia provides a rich set of 20 million entities in its standard configuration. However, adding relevant entity lists, whether already in the possession of the enterprise or drawn from specialty domain datasets, can significantly improve all of the standard metrics used for entity recognition and tagging. Here is an example of the standard metrics applied by Cognonto in its efforts:
Cognonto’s standard methodology also includes the creation of reference standards, or “gold standards”, for measuring the benefits of adding more data or performing other tweaks on the entity extraction algorithms.
Some key findings from this use case in adding private data to KBpedia include:
- In the example used, adding private enterprise data more than doubles accuracy (a 108% improvement) over the standard, baseline KBpedia for identifying the publishing organization of a Web page
- Some datasets have a more significant impact than others, but each dataset contributes to the overall improvement of the predictions. Generally, adding more data improves results across all measured metrics
- “Gold standards” are an essential component for testing the value of adding specific datasets or refining machine learning parameters
- Approximately 500 training instances are sufficient to build a useful “gold standard” for entity tagging; negative training examples are also advisable
- Even if all specific entities are not identified, flagging a potential “unknown” entity is an important means for targeted next efforts of adding to the current knowledge base
- KBpedia is a very useful structure and starting point for an entity tagging effort, but adding domain data is probably essential to gain the overall accuracy desired for enterprise requirements, and
- This use case is broadly applicable to any entity recognition and tagging initiative.
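To make the role of a “gold standard” concrete, here is a minimal sketch of scoring tagger output against one. The source does not enumerate its metrics, so precision, recall, F1 and accuracy are assumed here; the organization names are hypothetical, `None` marks a negative example (a page with no publisher to find), and a wrong entity is counted as a false positive by convention.

```python
def evaluate(gold, predicted):
    """Score predicted entity tags against gold-standard tags.

    Both arguments are equal-length lists; None means "no entity".
    Convention assumed for this sketch: a wrong entity is a false positive.
    """
    tp = fp = fn = tn = 0
    for expected, found in zip(gold, predicted):
        if found is None and expected is None:
            tn += 1  # correctly tagged nothing (negative example)
        elif found is None:
            fn += 1  # missed an entity that the gold standard contains
        elif found == expected:
            tp += 1  # correct entity
        else:
            fp += 1  # wrong entity, or an entity where none was expected

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(gold) if gold else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical gold standard and tagger output for six Web pages.
gold = ["Acme", "Acme", None, "Globex", None, "Initech"]
predicted = ["Acme", None, None, "Initech", "Acme", "Initech"]
metrics = evaluate(gold, predicted)
```

Running the same evaluation before and after adding a dataset gives a like-for-like view of what that dataset contributes.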
The Cognonto Mapper includes standard baseline capabilities found in other mappers such as string and label comparisons, attribute comparisons, and the like. But, unlike conventional mappers, the Cognonto Mapper is able to leverage both the internal knowledge graph structure and its use of typologies (most of which do not overlap with one another) to add structural comparators as well. These capabilities bring more automation to the front end of generating good, likely mapping candidates, which in turn speeds analysts’ acceptance of the final mappings. This approach is in keeping with Cognonto’s philosophy to emphasize “semi-automatic” mappings that combine fast final assignments with the highest quality. Maintaining mapping quality is the sine qua non of knowledge-based artificial intelligence.
Some key findings from this use case are:
- A capable mapper, including structural considerations, is essential for quickly generating mapping candidates for entities, concepts, vocabularies, schema and ontologies
- A capable mapper, including structural considerations, is essential for generating accurate mapping candidates for entities, concepts, vocabularies, schema and ontologies
- Quality mappings require manual vetting for final acceptance
- The Cognonto Mapper generates quick and accurate candidates for linking entities, concepts, vocabularies, schema or ontologies
- The Cognonto Mapper can generate “gold standards” in 15% of the time of standard approaches
- The Cognonto Mapper scoring can help point to likely mappings even where the candidates are ambiguous
- The Cognonto Mapper can identify likely, but unknown, entities for inspection and commitment to the knowledge base, and
- Other structural aspects of KBpedia, such as aspects, relations or attributes, can inform other comparators and mappers.
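The idea of combining a label comparator with a structural comparator can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the equal weights, the typology names and the use of typology overlap as the structural signal are hypothetical, and this is not the Cognonto Mapper's actual scoring.

```python
from difflib import SequenceMatcher


def label_score(a, b):
    """String comparator: similarity ratio between two labels (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def structural_score(typologies_a, typologies_b):
    """Structural comparator: Jaccard overlap of the typologies of two items.

    Because most typologies do not overlap, a score of 0.0 is strong
    evidence against a match.
    """
    union = typologies_a | typologies_b
    return len(typologies_a & typologies_b) / len(union) if union else 0.0


def rank_candidates(source, targets, label_weight=0.5, structure_weight=0.5):
    """Return (label, score) pairs, best candidate first, for analyst vetting."""
    source_label, source_typologies = source
    scored = []
    for target_label, target_typologies in targets:
        score = (label_weight * label_score(source_label, target_label)
                 + structure_weight
                 * structural_score(source_typologies, target_typologies))
        scored.append((target_label, round(score, 3)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical ambiguous case: which "Mercury" does the source item mean?
source = ("Mercury", {"Chemicals"})
targets = [
    ("Mercury (planet)", {"AstronomicalObjects"}),
    ("Mercury (element)", {"Chemicals"}),
]
ranked = rank_candidates(source, targets)
```

In this example the label comparator alone slightly prefers the planet, while the typology overlap moves the element to the top of the ranking. The ranked list is a set of candidates for an analyst to accept or reject, in keeping with the semi-automatic approach described above.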
See the original use case links for further details, code examples, and results and statistics. As noted, we will announce additional use cases as they are published.