Posted: July 31, 2020

This Standard Ontology Editor/IDE is an Essential Part of Your Toolkit

Though there are commercial alternatives, one essential part of your starting toolkit to work with ontologies (a term we use interchangeably with knowledge graph, though not all researchers do) is the Protégé editor. Protégé is an open-source ontology development framework (IDE) with more than 370,000 users. Protégé comes in two versions: one for the desktop, now in version 5.x, and one that is Web-based. We will be working with the desktop version for the Cooking with Python and KBpedia series.[1][2]

If you already have Protégé installed and are pretty comfortable with it, you may skip this installment. Otherwise, let’s spend about 15-30 min of effort so that you can set up your own local environment to work with KBpedia.

You first need to download and install Protégé. Go to the Protégé download page and follow the instructions for your particular operating system. You should fill out the new user registration (though you can claim you are already registered and still download it directly). The version I installed for this example is version 5.5.0 (though any version from 5.2 forward should be fine as well). The Protégé distribution comes as a zip file, so you should unzip it into a directory of your choice. To complete the set-up you will also need the most recent version of Java installed on your machine; if you do not have it, here are installation instructions.

Next, to start up Protégé, invoke the executable in your Protégé directory. It will take a few seconds for the program to load. Once the main screen appears, go to File and then Open from URL, and then pick, say, http://protege.stanford.edu/ontologies/camera.owl, as shown by (1):

Protégé Open URL Screen
Figure 1: Protégé Open URL Screen

We’ll get into KBpedia in earnest in the next installment, but if you want an early peek, you could also enter either https://github.com/Cognonto/kbpedia/blob/master/versions/2.50/kko.n3 (KBpedia upper ontology) or https://github.com/Cognonto/kbpedia/blob/master/versions/2.50/kbpedia_reference_concepts.zip (the full KBpedia, which you will need to unzip in a Web-accessible location and update this URL) into the dialog box in Figure 1. (Note: you may need to update the version reference to a later version depending on when you read this.) You will note that the next screen shots use the ‘full’ KBpedia example.

Upon entry, you will see the Protégé main screen as shown in Figure 2. Let me briefly cover some of the main conventions of the program. The three key structural aspects of the Protégé program are its main menu, its tab structure, and the views (or panes) shown for each tab where it appears on the standard interface (5). At start-up we always begin at the Active ontology tab, for which I highlight some of its key panes and functionality:

Main Protégé Screen
Figure 2: Main Protégé Screen

The ontology header section (1) is where all of the metadata for the knowledge graph resides. Such material includes title, creators, version notes, and so forth. The metrics for the ontology reside in the second view (2). In this case, for example, this version of KBpedia has about 58,000 classes (reference concepts) and more than 5,000 properties. We also see in the third view (3) that KBpedia requires the SKOS and KKO ontology imports. Also note the search button (4), which we will use frequently, and the tab structure and order (5). We will modify that structure in later installments.

Because Protégé, like many integrated development environments (IDEs), is highly configurable, let’s take a short detour to see how we can modify how our program looks. I am going to delete and add tabs to make the tab structure conform to the remaining screen shots.

To change tabs in Protégé, let’s refer to Figure 3:

Adding Tab Views
Figure 3: Adding Tab Views to Protégé

We adjust the general layout of the system using the Window → Tabs option from the main menu. You delete a tab by clicking on the arrow shown for each tab in the standard interface. You add tabs by selecting one of the options in the Tabs menu (2). Note that active tabs are indicated by a checkmark. New tabs are added to the right of the tab sequence (3). Thus, to change the ordering of tabs, one must delete and then add tabs in the order desired. You can follow these steps if you want the tab ordering to reflect the screen shots below. This same main menu Window option is where you can change the views (panes) for each tab.

When these class tabs are to your liking, we can apply these same conventions and approaches to the properties (relations) for the knowledge graph, as I show in Figure 4. First, note (1) we have split our properties into three groups: object properties, data properties, and annotation properties:

Initial Object Property View
Figure 4: Initial View from the Object Property Tab

These are the standard splits in the OWL language. How we use these splits and their relation to the guidance of Charles Sanders Peirce is described in later installments. In essence, object properties are those that connect to an item (with a URI or IRI) already in the system; data properties are literal strings and descriptions connected to the subject item; and annotation properties are those that describe or point to the item. We’ll just use an object property example here, though the use and navigation applies to the other two property categories as well.

The Object properties tab in Figure 4 also has a search function (2), identical to the one described for classes. We also see a tree structure at the left that works the same as for classes (3). As before, you can use a combination of scrolling, tree expansions, and searching to discover the other properties in your knowledge graph. Do make sure to check out the Data properties and Annotation properties tabs as well.

Throughout this CWPK series we will be using examples from Protégé and comparing them to direct interaction with the code base using Python. These later installments will cover most of the standard use and maintenance cases you will likely encounter with your knowledge graphs.

A Note on Performance and Preferences

You may experience some performance issues with Protégé as it comes out of the box, especially as we begin working with the relatively large KBpedia in earnest. One likely cause is the memory settings in the run.bat file in the main directory where you installed Protégé. As a quick fix, try updating these settings in that file to the following values before the next time you start the application:

-Xmx2500M -Xms2000M

Also note there are many customization options in Protégé. If you get captivated with the tool, I encourage you to explore the plugins available and the ways to modify the application interface. See especially File → Preferences, with the Renderer and Plugin tabs good places to look. Again, we will touch on some of these aspects in later articles.

Some Suggested Protégé Resources

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

Endnotes:

[1] Parts of this article were posted in a previous blog post, Bergman, Michael K. 2019. “First Twitch with KBpedia.” AI3:::Adaptive Information. https://www.mkbergman.com/2202/first-twitch-with-kbpedia/ (April 1, 2020).
[2] The Web-based version is great for collaboration, but does not include all of the features of the desktop version and can not handle very large ontologies, such as KBpedia as fully expressed.

Posted by AI3's author, Mike Bergman Posted on July 31, 2020 at 9:15 am in CWPK, KBpedia, Semantic Web Tools | Comments (1)
The URI link reference to this post is: https://www.mkbergman.com/2331/cwpk-5-overview-and-installation-of-protege/
The URI to trackback this post is: https://www.mkbergman.com/2331/cwpk-5-overview-and-installation-of-protege/trackback/
Posted: July 30, 2020

We’ll Try to be as ‘Pythonic’ as Possible in the Design

In past efforts, we have produced self-contained semantic technology platforms — for one, the Open Semantic Framework, since retired — based on similar objectives to what we have set for this CWPK series. However, with Cooking with Python and KBpedia, our audience is the newbie committed to learn more, not the enterprise. It may be that the approaches presented in this series may be adapted for enterprise use, but in order to maximize the training value of this series we prefer to emphasize off-the-shelf ‘glue-together’ components utilizing a fairly easy to learn and common language, Python. Our objective here is not commercial performance and security, but learnability and understandability.

Our design places the knowledge graph at the center, as shown below, surrounded by Python-based applications shown in yellow. The knowledge graph in our instance, KBpedia, is written in the W3C standard Web ontology language of OWL 2. However, what we are outlining here, including the possible extensions of KBpedia into your own domain of interest, can apply to any knowledge graph using World Wide Web Consortium (W3C) open standards. The language, as we implement it, embraces the other W3C standards of the Resource Description Framework (RDF) and its schema extension (RDFS). We also use SKOS (Simple Knowledge Organization System), an RDF-based W3C vocabulary useful for expressing the hierarchies, classifications, and labels familiar to librarians and information scientists.[1] Note all of these standards are completely independent of Python, or any programming language for that matter. These standards follow description logics and enable logical manipulation and analysis of their knowledge representations (KR).

Historically, many programming languages have been used to manage, store, and manipulate these W3C standard KR languages. For at least the past 15 years, Java has been the dominant programming language for semantic technology applications, most often accounting for more than half of all tools.[2] From an enterprise standpoint, Java-based applications may still be the most defensible choice. But we want our architecture to embrace a single language, Python, that has great connections in some areas, perhaps weak ones in others. Nonetheless, like any language choice, there are trade-offs. Working through those trade-offs for Python is an explicit topic in this CWPK series.

The architecture diagram below reflects these considerations. At the top we have inputs into the Python-based system, based on electronic notebooks, Web templates where user interactions send directives to the system, or direct command line interfaces (CLI). Because they are interactive and can display invoked apps, we will be using the electronic notebook interface for most installments in this series. We include some CLI stuff for quick responses. And, we include Web page examples of how one might drive these Python-based applications based on choices by users in their Web site interactions. This latter input style is very important, since interaction with knowledge graphs should be a distributed activity across normal workflows. Stopping to invoke a separate application space whenever new knowledge is encountered or questioned is unnatural and leads to little or no adoption. If we are to take advantage of these knowledge technologies, we must integrate them into our current work activities.

These possible sources of input would be best served by having a Python interface or API that maps the basic class, instance, property, and value perspectives of the W3C standards into native Python constructs. This will allow us to abstract knowledge graph specifications into natural Python code. We show this unspecified (at this time) ‘OWL API / Mappings’ component in green in the diagram. This pivotal component will receive much attention throughout the ensuing series.

This Python input is geared to access and manage the knowledge graph, shown at the bottom of the diagram. The knowledge graph needs its own storage to be persistent. (We do not spend further time on this component, other than to say that systems should be designed to interface with external storage, not incorporate specific ones. Storage is a commodity component.) Ontologies, or knowledge graphs, already have an excellent open-source integrated design environment (IDE) in the Protégé application, developed by Stanford University.

We can see these major components in the following diagram. The Python components are shown in yellow; the knowledge graph (KBpedia) in gray; and external tools for the knowledge graph in blue. Two split boxes show that both existing, external apps and Python ones are possible for those functions:

CWPK Basic Architecture
Figure 1: CWPK Basic Architecture

The diagram shows that inputs or requests of the knowledge graph may come from specific functional components such as querying (SPARQL), rule-setting (SWRL), or programmatic ones coming from user interfaces or external requests (yellow and orange). Also, in a loosely-coupled manner, we want outputs from our system to be flexible enough to tailor to various file formats or external APIs. This interface point is where using the system to, say, power machine learning or natural language applications, among all external systems, resides. Knowing how to stage and format outputs is a key task of the design.

Protégé plays an integral role in this architecture. It is, firstly, the common denominator for talking about the system, since this tool is ubiquitous in the semantic technology space. Secondly, most users have only manipulated knowledge graphs through this interface. Our Python-based system must duplicate this functionality, plus show how we can move well beyond it. Moreover, there are many ontology or knowledge graph management tasks where Protégé is the go-to choice. Searching, navigating, and visualizing are some of the key strengths of Protégé. The objective is not to replace Protégé, but to complement it. Protégé has an organizational view of knowledge graphs; what we want is a knowledge view of knowledge graphs. We thus use Protégé as a common touchstone as we work through our installments.

Protégé can host reasoners, as can our Python code, which is why that component is shown in dual blue-yellow colors. Another dual component is the build routines. This part of the architecture is deceptively critical, since we need to both: 1) logically test the knowledge graph for coherence and consistency as we add to or build it; and 2) enable round-tripping between build and W3C formats.

Among perhaps others, I see two payoffs to the pursuit of an architecture such as this. One, we can gain a dual programmatic and interactive environment for managing and keeping a knowledge graph current. And, two, we provision an engine for feeding external APIs in areas such as machine learning, natural language understanding, and interoperability.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

Endnotes:

[1] At certain points in this CWPK series we will offer links to learning resources about these W3C languages. However, we assume you know their basics. The emphasis here is on the programming language Python to interoperate with these standards.

Posted by AI3's author, Mike Bergman Posted on July 30, 2020 at 9:51 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2329/cwpk-4-the-baseline-architecture/
Posted: July 29, 2020

Choosing a Language for the CWPK Series

We will be developing many scripts and mini-apps in this series on Cooking with Python and KBpedia. Of course, we already know from the title of this series that we will be using Python, among other tools that I will be discussing in the next installments. But, prior to this point, all of our KBpedia development has been in Clojure, and R has much to recommend it for statistical applications and data analysis as well. Why we picked Python over these two worthy alternatives is the focus of this installment.

Our initial development of KBpedia — indeed, all of our current internal development — uses Clojure as our programming language. Clojure is a modern dialect of Lisp that runs in the Java virtual machine (JVM). It is extremely fast with clean code, and has a distinct functional programming orientation. We have been tremendously productive and pleased with Clojure. I earlier wrote about our experience with the language and the many reasons we initially chose it. We continue to believe it is a top choice for artificial intelligence and machine learning applications. The ties with Java are helpful in that most available code in the semantic technology space is written in Java, and Clojure provides straightforward ways to incorporate those apps into its code bases.

Still, Clojure seems to have leveled off in popularity, even though it is the top-paying language for developers.[1] So, recall from the introductory installment that our target audience is the newbie determined to gain capabilities in this area. If we are going to learn a language to work with knowledge graphs, one question to ask is, What language brings the most benefits? Popularity is one proxy for that answer, since popular tools create more network effects. Below is the ranking of popular scripting and higher-level languages based on a survey of 90,000 developers by Stack Overflow in 2019:[1]

Stack Overflow 2019 Developer Survey
Figure 1. Developer Popularity, 2019 [1]

Aside from the top three spots, which are more related to querying and dynamic Web pages and applications, Python became the most popular higher-level language in 2019, barely beating out Java. Python’s popularity has consistently risen over the past five years. It earlier passed C# in 2018 and PHP in 2017 in popularity.[1]

Of course, popularity is only one criterion for picking a language, and not the most important one. Our reason for learning a new language is to conduct data science with our KBpedia knowledge graph and to undertake other analytic and data integration and interoperability tasks. Further, our target audience is the newbie, dedicated to find solutions but perhaps new to knowledge graphs and languages. For these domains, Clojure is very capable, as our own experience has borne out. But the two most touted languages for data science are Python and R. Both have tremendous open-source code available and passionate and knowledgeable user communities. Graphs and machine learning are strengths in both languages. As Figure 1 shows, Python is the most popular of these languages, about 7x more popular than R and about 30x more popular than Clojure. It would seem, then, that if we are to seek a language with a broader user base than Clojure, we should focus on the relative strengths and weaknesses of Python versus R.

A simple search on ‘data science languages’ or ‘R python’ turns up dozens of useful results. One Stack Exchange entry [2] and a 2006 paper [3] compare multiple relevant dimensions and link to useful tools and approaches. I encourage you to look up and read many of the articles to address your own concerns. I can, however, summarize here what I think the more relevant points may be.

R is a less complete language than Python, but has strong roots in statistics and data visualization. In data visualization, R is more flexible and suitable to charting, though graph (network) rendering may be stronger in Python. It is perhaps stronger than Python in data analysis, though the edge goes to Python for machine learning applications. R is perhaps better characterized as a data science environment rather than a language. Python gets the edge for development work and ‘glueing’ things together.[4]

Python also gets the edge in numbers of useful applications. As of 2017, the official package repository for Python, PyPI, hosted more than 100,000 packages. The R repository, CRAN, hosted more than 10,000 packages.[5] By early 2020, the packages on PyPI had grown to 225,000, while the R packages on CRAN totaled over 15,000. The Python contributions grew about 2.5x faster than the ones for R over the past three years. Many commentators now note that areas of past advantage for R in areas like data analysis and data processing pipelines have been equaled with new Python libraries like NumPy, pandas, SciPy, scikit-learn, etc. One can also use RPy2 to access R functionality through Python.

Performance and scalability are two further considerations. Though Python is an interpreted language, its more modern libraries have greatly improved the language’s performance. R, perhaps, is also not as capable at handling extremely large datasets, another area where add-in libraries have greatly assisted Python. Python was also an earlier innovator in the interactive lab notebook arena with IPython (now Jupyter Notebook). This interactive notebook approach grew out of early examples from the Mathematica computing system, and is now available for multiple languages. Notebooks are a useful documentation and interaction focus when doing data science development with KBpedia. Notebooks are a key theme in many of the CWPK installments to come.

Lastly, from a newbie perspective, most would argue that Python is more readable and easier to learn than R. There is also perhaps less consistency in language and syntax approach across R’s contributed libraries and packages than what one finds with Python. We can also say that R is perhaps more used and popular in academia.[6] While Python is commonly taught in universities, it is also popular within enterprises, another advantage. We can summarize these various dimensions of comparison in Table 1:

                          Python    R
Machine learning            ✓
Production                  ✓
Libraries                   ✓
Development                 ✓
Speed                       ✓
Visualizations                      ✓
Big data                    ✓
Broader applicability       ✓
Easier to learn             ✓
Used in enterprises         ✓
Used in academia                    ✓

Table 1. Summary Data Science Comparison of R and Python (top portion from [2])

Capable developers in any language justifiably argue that if you know what you are doing you can get acceptable performance and sustainable code from any of today’s modern languages. From a newbie perspective, however, Python also has the reputation of getting acceptable performance with comparatively quick development even for new or bad programmers.[2] As your guide in this process, I think I fit that definition.

Another important dimension in evaluating a language choice is, How does it fit with my anticipated environment? The platforms we use? The skills and tools we have?

Our next installments in this series deal with our operating environment and how to set it up. A family of tools is required to effectively use and modify a large and connected knowledge graph like KBpedia. Language choices we make going in may interact well with this family or not. If problems are anticipated for some individual tools, we either need to find substitute tools or change our language choice. In our evaluation of the KBpedia tools family there is one member, the OWL API, that has been a critical linchpin to our work; it is, however, a Java application. My due diligence to date has not identified a Python-based alternative that looks as fully capable. However, there are promising ways of linking Python to Java. Knowing that, we are proceeding forward with Python as our language choice. We shall see whether this poses a small or large speed bump on our path. This is an example of a risk arising from due diligence that can only be resolved by being far enough along in the learning process.

The degree of due diligence is a function of the economic dependence of the choice. In an enterprise environment, I would test and investigate more. I would also like to see R and Python and Clojure capabilities developed simultaneously, though with most people devoted to the production choice. I have also traditionally encouraged developers with recognition and incentives to try out and pick up new languages as part of their professional activities.

Still, considering our newbie target audience and our intent to learn and discover about KBpedia, I have comfort that Python will be a useful choice for our investigations. We’ll be better able to assess this risk as our series moves on.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

Endnotes:

[1] Stack Overflow. (2019, April 9). Developer Survey Results 2019. https://insights.stackoverflow.com/survey/2019
[2] Anon. (2017, January 17). Python vs R for Machine Learning. Data Science Stack Exchange. https://datascience.stackexchange.com/questions/326/python-vs-r-for-machine-learning/339#339
[3] Babik, M., & Hluchy, L. (2006). Deep Integration of Python with Semantic Web Technologies.
[4] Anon. (2015, December). Is Python Better Than R for Data Science? Quora. https://www.quora.com/Is-Python-better-than-R-for-data-science
[5] Brittain, J., Cendon, M., Nizzi, J., & Pleis, J. (2018). Data Scientist’s Analysis Toolbox: Comparison of Python, R, and SAS Performance. SMU Data Science Review, 1(2), 20.
[6] Radcliffe, T. (2016, November 22). Python Versus R for Machine Learning and Data Analysis. Opensource.Com. https://opensource.com/article/16/11/python-vs-r-machine-learning-data-analysis

Posted by AI3's author, Mike Bergman Posted on July 29, 2020 at 10:39 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2328/cwpk-3-clojure-v-r-v-python/
Posted: July 28, 2020

A Peek at Forthcoming Topics in the Cooking with Python and KBpedia Series

Throughout this Cooking with Python and KBpedia series I want to cover some major topics about how to use, maintain, build, and extend knowledge graphs, using our KBpedia knowledge system as the model. KBpedia is a good stalking horse for these exercises — really, more like recipes — because it has broad domain coverage and a modular design useful for modification or extension for your own domain purposes. At appropriate times throughout these exercises you may want to fork your own version of KBpedia and the accompanying code to whack at for your own needs.

Today, as I begin my first articles, I am anticipating scores of individual installments in our CWPK series. Some 55 of these installments and the associated code have already been written. Though the later portion of the entire series gets more complicated, I am hoping that the four months or so it will take to publish the anticipated 75 installments at the rate of one per business day will give me sufficient time to complete the series. We shall see.

My intent is not to provide a universal guide, since I will be documenting steps using a specific knowledge graph, a specific environment (Windows 10 and some AWS Ubuntu), and a specific domain (knowledge management and representation). As the series name indicates, we will only be working with the Python language. We’ve made these choices because of familiarity, and our desire to produce a code base at the series’ conclusion that better enables users to modify and extend KBpedia from scratch. We also are focusing on the exploring newbie more than the operating enterprise. I do not touch on issues of security, code optimization, or scalability. My emphasis is more on simplicity and learning, not performance or efficiency. Those with different interests may want to skip some installments and consult the suggested resources. Still, to my knowledge, there is nothing like this series out there: a multiple-installment, interactive, and didactic environment for learning ‘soup-to-nuts’ practical things about semantic technologies and knowledge graphs.

The first part of the installments deals with the design intent of the series, architecture, and selection and installation of tools and applications. Once we have the baseline system loaded and working, we explore basic use and maintenance tasks for a knowledge graph. We show how those are done with a common ontology development environment, the wonderful Protégé open-source IDE, as well as programmatically through Python. At various times we will interact with the knowledge base using Python programmatically or via the command line, electronic notebooks, or Web page templates. There are times when Protégé is absolutely the first tool of choice. But to extract the most value from a knowledge graph we also need to drive it programmatically, sometimes analyze it as a graph, do natural language tasks, or use it to stage training sets or corpora for supervised or unsupervised machine learning. Further, to best utilize knowledge graphs, we need to embed them in our daily workflows, which means interacting with them in a distributed, multiple-application way, as opposed to a standalone IDE. This is why we must learn tools that go beyond navigation and inspection. The scripts we will learn from these basic use and maintenance tasks will help us get familiar with our new Python surroundings and set a foundation for the next topics.

In our next part, we begin working with Python in earnest. Of particular importance is finding a library or set of APIs to work directly with OWL, the language of the KBpedia knowledge graph. There are many modifications and uses of KBpedia that existing Python tools can aid. Having the proper interface or API to talk directly to the object types within the knowledge graph is essential. There are multiple options for how to approach this question, and no single, ‘true’ answer. Once we have selected and installed this library, we then need to sketch out the exact ways we intend to access, use and modify KBpedia. These actions then set our development agenda for finding and scripting Python tools into our workflow.

There are off-the-shelf Python tools for querying the knowledge graph (SPARQL), adding rules to the graph (SWRL), and visualizing the outputs. Because we are also using Python to manipulate KBpedia, we also need to understand how to write outputs to file, necessary as inputs to other third-party tools and advanced applications. Since this is the first concentrated section involved in finding and modifying existing Python code, we’ll also use a couple of installments to assemble and document useful Python coding fragments and tips.

Armed with this background, and having gotten our feet wet a bit with Python, we are now positioned to begin writing our own Python code to achieve our desired functionality. We begin this process by focusing on some standard knowledge graph modification tasks: adding or changing nodes, properties or labels (annotations) within our knowledge graph. Of course, these capabilities are available directly from Protégé. However, we want to develop our own codes for this process in line with how we build and test these knowledge graphs in the first place. These skills set the foundation for how we can filter and make subset selections as training sets for machine learning. One key installment from this part of the CWPK series involves how we set up comma-separated values (CSV) files in UTF-8 format, the standard we have adopted for staging for use by third-party tools and in KBpedia’s build process. We also discuss the disjoint property in OWL and its relation to the modular typology design used by KBpedia. Via our earlier Python-OWL mappings we will see how we can use the Python language itself to manipulate these OWL constructs.
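
As a foretaste of that staging step, here is a hedged sketch using Python’s built-in csv module (the file name and the two columns are illustrative only, not KBpedia’s actual build format):

```python
import csv

# Illustrative (id, prefLabel) rows such as a staging file might hold;
# UTF-8 encoding lets labels carry accented or non-Latin characters
rows = [
    ["kko:Mammal", "mammal"],
    ["kko:Canid", "canidé"],
]

# Write the staging file in UTF-8; newline="" is the csv-module convention
with open("staging_fragment.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "prefLabel"])
    writer.writerows(rows)

# Round-trip the file back in to confirm the encoding survived
with open("staging_fragment.csv", newline="", encoding="utf-8") as f:
    data = list(csv.reader(f))
```

This read-back check is a habit worth keeping for build inputs: encoding mistakes surface immediately rather than deep inside a build run.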

The installments in this initial part of the series are about setting up and learning about our tools and environment so that we can begin doing new, real work against our knowledge graphs. The first real work to which we will apply these tools is the extraction-and-build routines by which we can produce a knowledge graph from scratch. One of the unusual aspects of KBpedia is that the knowledge graph, as large and as comprehensive as it may be, is built entirely from a slate of flat (CSV, in N3 format) input files. KBpedia, in its present version 2.50, has about five standard input files, plus five optional files for various fixes or overrides, and about 30 external mapping files (which vary, obviously, by the number of external sources we integrate). These files can be easily and rapidly edited and processed in bulk, and then used as the inputs to the build process. The build process also integrates a number of syntax and logical checks to make sure the completed knowledge graph is consistent and satisfiable, with standardized syntax. As errors are surfaced, modifications get made, until the build finally passes its logic tests. Multiple build iterations are necessary for any final public release. One of the reasons we wanted a more direct Python approach in this series was to bootstrap the routines and code necessary to enable this semi-automated build process. The build portion, in particular, has been part of KBpedia’s special sauce from the beginning, but is a capability we have not yet documented nor shared with the world. Since its beginning in 2018, each release of KBpedia has been the result of multiple build cycles to produce a tested, vetted knowledge graph.
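As a toy illustration of this flat-file build idea, the sketch below reads simple (child, parent) rows, runs a basic logic check, and emits N3-style subclass triples only if the check passes. The concept names, the two-column row shape, and the single orphan-parent check are all invented for illustration; KBpedia's actual build code and checks are more involved:

```python
# Flat input rows, as they might be staged from a CSV: (child, parent)
rows = [
    ("kko:Mammal", "kko:Animal"),
    ("kko:Animal", "kko:LivingThing"),
    ("kko:LivingThing", "owl:Thing"),
]

# Every concept declared as a child, plus the OWL root
declared = {child for child, _ in rows} | {"owl:Thing"}

# A minimal logic check: every referenced parent must itself be declared,
# standing in for the build's broader consistency and satisfiability tests
errors = [parent for _, parent in rows if parent not in declared]

# Emit N3-style triples only when the check passes; otherwise the build
# cycle would stop here so the input files could be fixed and re-run
triples = []
if not errors:
    triples = [f"{child} rdfs:subClassOf {parent} ." for child, parent in rows]

print(errors, triples)
```

The check-fix-rebuild loop in the surrounding text is just this pattern repeated: surface `errors`, edit the flat files, and run again until the triples are emitted cleanly.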

In exposing these standard methods, we also needed to add to them a complementary set of extraction routines. Getting familiar with extracting resources from a knowledge graph has many practical applications and is an easier way to start learning Python. This substantive portion of CWPK is new, and gives us the ability to break down a knowledge graph into a similar set of simple, constituent input files. We thus end up with a unique roundtripping environment that, while specific to KBpedia as expressed in this series, can be modified through the Python code that accompanies this series for potentially any knowledge graph. The effort has also produced a more generalized approach than our original internal design supported, one suitable for other knowledge graphs. We now have a re-factored, second-generation set of knowledge graph extract-and-build routines. Some modifications may be needed for other types of starting knowledge graphs, but we hope the steps documented can be readily adapted for those purposes based on how we have described and coded them. Because of these advances, we will witness a new version release of KBpedia along this path.
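The extraction direction of that roundtrip can be sketched just as simply: pull flat (child, parent) rows back out of N3 subclass statements. The triples and prefixes below are illustrative stand-ins, not KBpedia's actual contents, and real N3 parsing would use a proper RDF library rather than string splitting:

```python
# Simple N3-style subclass statements, as might appear in a knowledge graph
n3_lines = [
    "kko:Mammal rdfs:subClassOf kko:Animal .",
    "kko:Animal rdfs:subClassOf kko:LivingThing .",
]

# Extract flat (child, parent) rows, the inverse of the build step
rows = []
for line in n3_lines:
    # Strip the trailing ' .' terminator, then split into S-P-O tokens
    subj, pred, obj = line.rstrip(" .").split()[:3]
    if pred == "rdfs:subClassOf":
        rows.append((subj, obj))

print(rows)
```

Running the build sketch on these extracted rows would reproduce the original statements, which is the roundtripping property the series builds toward.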

The last major portion of installments provides some recipes on how to use and leverage a knowledge graph. A portion of those installments involves creating machine learning training sets and corpora. We will tie into some leading Python applications and libraries to conduct a variety of supervised and unsupervised learning tasks, including categorization, clustering, fine-grained entity recognition, and sentiment and relation extraction. We will touch upon the leading Python toolsets available, and provide recipes for general ways to work with these systems. We will generate and use graph and word embedding models to support those tasks as well, plus new ones like summarization. We will undertake graph analytics and do random walks over the KBpedia graph to probe network concepts like community, influence, and centrality. We will tie into general natural language processing (NLP) toolkits and show the value for these uses that may be extracted from KBpedia. This rather lengthy part of our series includes how to set up an endpoint for our knowledge graph and how to tie into graphing and charting libraries for various visualizations.

Since I am not a professional programmer and certainly am no expert in Python, the code produced and distributed in this series is intended as a starting point. Perhaps others may develop more polished and performant code for these purposes over time. I welcome such input and will do what I can to bring awareness and distribution mechanisms to any such improvements. But, crude and simple as they may be, all of the Python tools we build during this series, plus the instructions in why and how to do so, as demonstrated through interactive Jupyter Notebooks, can help start you on the path to modify, delete, or extend what exists in KBpedia with your own domain graphs and data.

We devote the concluding installments in our CWPK series to how you may leverage these resources to tackle your own needs and domain. We also try to provide additional resources each step of the way to aid your own learning. In the aggregate, these installments cover an awful lot of ground. However, inch by inch, we will make reaching our end goal a cinch.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

Posted by AI3's author, Mike Bergman Posted on July 28, 2020 at 9:39 am in CWPK, KBpedia, Semantic Web Tools | Comments (2)
The URI link reference to this post is: https://www.mkbergman.com/2327/cwpk-2-what-to-expect/
The URI to trackback this post is: https://www.mkbergman.com/2327/cwpk-2-what-to-expect/trackback/
Posted: July 27, 2020

Intro to an Ongoing Series of More than 70 Recipes to Work with KBpedia

We decided to open source the KBpedia knowledge graph in October 2018. KBpedia is a unique knowledge system that intertwines seven ‘core’ public knowledge bases — Wikipedia, Wikidata, schema.org, DBpedia, GeoNames, OpenCyc, and the standard UNSPSC products and services classification. KBpedia’s explicit purpose is to provide a computable scaffolding and design for data interoperability and knowledge-based artificial intelligence (KBAI).

Written primarily in OWL 2, KBpedia includes more than 58,000 reference concepts, mapped linkages to about 40 million entities (most from Wikidata), and 5,000 relations and properties, all organized according to about 70 modular typologies. KBpedia’s upper structure is the KBpedia Knowledge Ontology (KKO). KKO weaves the major concepts from these seven core knowledge bases into an integrated whole based on the universal categories and knowledge representation insights of the great 19th century American logician, polymath and scientist, Charles Sanders Peirce.

We have continued to expand KBpedia’s scope and refine its design since it was first released. Though the entire structure has been available for download through a number of version releases, it is fair to say that only experienced semantic technologists have known how to install and utilize these files to their fullest. Further, we have an innovative ‘build-from-scratch’ design in KBpedia that has not yet been shared. Our objective in this ongoing series is to publish daily ‘recipes’ over a period of about four months on how the general public may learn to use the system and also to build it. With that knowledge, it should be easier to modify KBpedia for other specific purposes.

The mindset we have adopted to undertake this series is that of a focused, needful ‘newbie.’ The individual we have in mind may not know programming and may not know ontologies, but is willing to learn enough about these matters in order to move forward with productive work using KBpedia (or derivatives of it). Perhaps our newbie knows some machine learning, but has not been able to bring multiple approaches and tools together using a consistent text- and structure-rich resource. Perhaps our newbie is a knowledge manager or worker desirous of expanding their professional horizons. This focus leads us to the very beginning of deciding what resources to learn; these early decisions are some of the hardest and most impactful for whether ultimate aims are met. Those with more experience may skip these first installments, but may find some value in a quick scan nonetheless.

The first installments in our series begin with those initial decisions, move on to tools useful throughout the process, and frame how to load and begin understanding the baseline resources of the KBpedia open-source distribution. We then discuss standard knowledge management tasks that may be applied to the resources. One truism about keeping a knowledge system relevant and dynamic is to make sure the effort put into it continues to deliver value. We then begin to conduct work with the system in useful areas that grow in complexity from intelligent retrieval, to entity and relation extractions, natural language understanding, and machine learning. The intermediate part of our series deals with how to build KBpedia from scratch, how to test it logically, and how to modify it for your own purposes. Our estimate going in is that we will offer about 75 installments in this series, to conclude before US Thanksgiving. Aside from a few skipped days on holidays and such, we will post a new installment every business day between now and then.

The ‘P’ in CWPK comes from using the Python language. In our next installment, we discuss why we chose it for this series over KBpedia’s original Clojure language roots. Because Python is a new language for us, throughout this series we document relevant aspects of learning that language as well. Adding new language aspects to the mix is consistent with our mindset to embrace the newbie. Even if one already knows a programming language, extracting maximum advantage from a knowledge graph well populated with entities and text is likely new. As we go through the process of Python earning its role in the CWPK acronym, we will take some side paths and find interesting applications or fun wrinkles to throw into the mix.

We will also be bringing this series to you via the perspective of our existing systems: Windows 10 on desktops and laptops, and Linux Ubuntu on Amazon Web Services (AWS) cloud servers. These are broadly representative systems. Unfortunately, our guidance and series will have less direct applicability to Apple or other Linux implementations.

Look for the next installment tomorrow. As we put out an installment per business day over the next four months, we’ll learn much together through this process. Please let me know how it is going or what you would like to learn. Let the journey begin . . . .

Posted by AI3's author, Mike Bergman Posted on July 27, 2020 at 1:13 pm in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2324/cwpk-1-cooking-with-python-and-kbpedia/
The URI to trackback this post is: https://www.mkbergman.com/2324/cwpk-1-cooking-with-python-and-kbpedia/trackback/