
If you are like me, you like to clear the decks before the start of major new projects. In Structured Dynamics’ case, we actually have multiple new initiatives getting underway, so the deck clearing has been especially focused this time.
As a result, we have updated Sweet Tools, AI3’s listing of semantic Web and -related tools, with the addition of some 30 new tools, updates to others, and deletions of five expired entries. The dataset now lists 835 tools. And, as before, there is also now a new structured data view via conStruct (pick the Sweet Tools dataset).
We have also updated SWEETpedia, a listing of 246 research articles that use Wikipedia in one way or another to do semantic-Web related research. Some 20 new papers were added to this update.
Please use the comments section on this post to suggest new tools or new research articles for inclusion in future updates.

Sweet Tools, AI3’s listing of semantic Web and -related tools, now has a total of 810 tools listed, a significant expansion from the last update. This version adds 93 tools, about 13% of the 736 in the previous version, and retires 19 prior tools, for a net gain of 74.
The Sweet Tools dataset is also now showing the way to a couple of exciting innovations: new generic ontology-driven applications for structured data; and, tools for authoring structured data via spreadsheets.
So, here is the summary of major changes in this new listing:
A completely new structured data view of the listing, courtesy of Structured Dynamics’ structWSF and conStruct open source frameworks. This version can be viewed on the conStruct SCS Web site (pick the Sweet Tools dataset). You can compare this server-side presentation and version to the client-side JavaScript version using Exhibit that has been part of this blog for some time.
To see the major Sweet Tools page for this updated listing in its existing format, filter on ‘New’ under New or Existing? to see the recent additions. Alternatively, you can also see this same filtering using the conStruct structured data view by searching on the Status attribute using the value ‘New’; see example here.
Though still formative, the most exciting change with the Sweet Tools listing is this new presentation via conStruct. It is a structured data Web services framework with a UI, all offered as a set of modules to Drupal. To kick the tires with this new system, you may want to look at:
BTW, there are some helpful documentation pages that show how all of these various tools work and more, such as, for example, Browse. (Also, BTW, as a demo user, you also are not seeing all of the write and update tools, either; again, see the documentation.)
The essential underlying basis to conStruct is the structWSF Web services framework. There are still some aspects to this system that we feel are incomplete and we are working on. Some of these things include dropdown selections (controlled vocabulary selects); easier template creation; and intuitive template re-use. Nonetheless, these additions will come quickly, and what is here is already a great demonstration of how structured data can drive generic tools and interfaces.
The case study of how this system was constructed from a spreadsheet input using the irON vocabulary is described in an earlier post.
The updated Sweet Tools listing now includes nearly 50 different tools categories. The most prevalent categories are browser tools (RDF, OWL), information extraction, parsers or converters, composite application frameworks and general ontology tools. Each accounts for more than 8% of the total (more than 50 tools). This breakdown is as follows:
As for the languages these applications are written in, that has stayed pretty steady, too. Java is still the leading language at about 46%, which has been very slightly trending downward over the past three years or so. PHP has increased a bit as well. The current splits are:
Background on prior listings and earlier statistics may be found on these previous posts:
With interim updates periodically over that period.
In a former life, I had the nickname of ‘Spreadsheet King’ (perhaps among others that I did not care to hear). I had gotten the nickname because of my aggressive use of spreadsheets for financial models, competitor tracking, time series analyses, and the like. However, in all honesty, I have encountered many others in my career much more knowledgeable and capable with spreadsheets than I’ll ever be. So, maybe I was really more like a minor duke or a court jester than true nobility.
Yet, pro or amateur, there are perhaps 1 billion spreadsheet users worldwide [1], making spreadsheets undoubtedly the most prevalent data authoring environment in existence. And, despite moans and wails about how spreadsheets can lead to chaos, spaghetti code, or violations of internal standards, they are here to stay.
Spreadsheets often begin as simple notetaking environments. With the addition of new findings and more analysis, some of these worksheets may evolve to become full-blown datasets. Alternatively, some spreadsheets start from Day One as intended datasets or modeling environments. Whatever the case, clearly there is much accumulated information and data value “locked up” in existing spreadsheets.
How to “unlock” this value for sharing and collaboration was a major stimulus to development of the commON serialization of irON (instance record and Object Notation) [2]. I recently published a case study [3] that describes the reasons and benefits of dataset authoring in a spreadsheet, and provides working examples and code based on Sweet Tools [4] to aid users in understanding and using the commON notation. I summarize portions of that study herein.
The dataset that is the focus of this use case, Sweet Tools, began as an informal tracking spreadsheet about four years ago. I began it as a way to learn about available tools in the semantic Web and -related spaces. I began publishing it and others found it of value so I continued to develop it.
As it grew over time, however, it gained in structure and size. Eventually, it became a reference dataset that many other people wanted to use and interact with. The current version has well over 800 tools listed, characterized by many structured data attributes such as type, programming language, description and so forth. As it has grown, a formal controlled vocabulary has also evolved to bring consistency to the characterization of many of these attributes.
It was natural for me to maintain this listing as a spreadsheet, which was also reinforced when I was one of the first to adopt an Exhibit presentation of the data based on a Google spreadsheet about three years back. Here is a partial view of this spreadsheet as I maintain it locally:
When we began to develop irON in earnest as a simple (“naïve”) dataset authoring framework, it was clear that a comma-separated value, or CSV [5], option should join the other two serializations under consideration, XML and JSON. CSV, though less expressive and capable as a data format than the other serializations, still has an attribute-value pair (also known as key-value pairs and many other variants [6]) orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc.
As a dataset very familiar to us as irON’s editors, and directly relevant to the semantic Web, Sweet Tools provided a perfect prototype case study for helping to guide the development of irON, and specifically what came to be known as the commON serialization for irON. The Sweet Tools dataset is relatively large for a speciality source, has many different types and attributes, and is characterized by text, images, URLs and similar.
The premise was that if Sweet Tools could be specified and represented in commON sufficiently to be parsed and converted to interoperable RDF, then many similar instance-oriented datasets could likely be so as well. Thus, as we tried and refined notation and vocabulary, we tested applicability against the CSV representation of Sweet Tools in addition to other CSV, JSON and XML datasets.
A large portion of the case study describes the many advantages of authoring small datasets within spreadsheets. The useful thing about the CSV format is that these full functional capabilities of the spreadsheet are available during authoring or later updates and modifications, but, when exported, the CSV provides a relatively clean format for processing and parsing.
So, some of the reasons for small dataset authoring in a spreadsheet include:

The next major section of the case study deals with the minor conventions that must be followed in order to stage spreadsheets for commON. Not much of the specific commON vocabulary or notation is discussed below; for details, see [7].
Because you can create multiple worksheets within a spreadsheet, it is not necessary to modify existing worksheets or tabs. Rather, if you are reluctant to change existing information or cannot do so, merely create parallel duplicate sheets of the source information. These duplicate sheets have as their sole purpose export to commON CSV. You can maintain your spreadsheet as is while staging for commON.
To do so, use the simple = formula to create cross-references between the existing source spreadsheet tab and the target commON CSV export tab. (You can also do this for complete, highlighted blocks from source to target sheet.) Then, by adding the few minor conventions of commON, you have now created a staged export tab without modifying your source information in the slightest.
In standard form and for Excel and Open Office, single quotes, double quotes and commas when entered into a spreadsheet cell are automatically ‘escaped’ when issued as CSV. commON allows you to specify your own delimiter for lists (the standard is the pipe ‘|’ character) and what the parser recognizes as the ‘escape’ character (‘\’ is the standard). However, you probably should not change these defaults in most cases.
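To make the list delimiter and escape character concrete, here is a minimal Python sketch of splitting one cell value into list items. It is not the commON parser; the function name `split_list_cell` and its treatment of an escaped pipe as a literal character are assumptions made only for illustration.

```python
def split_list_cell(cell, delimiter="|", escape="\\"):
    """Split one cell value into list items on unescaped delimiters.

    Illustrative only: how the real commON parser treats escapes is
    an assumption here, not something taken from its documentation.
    """
    items, current = [], []
    chars = iter(cell)
    for ch in chars:
        if ch == escape:
            # Keep the character after the escape literally (e.g. an escaped pipe).
            current.append(next(chars, ""))
        elif ch == delimiter:
            items.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    items.append("".join(current).strip())
    return items


languages = split_list_cell("Java|PHP|Python")
```

With the defaults, a cell holding `Java|PHP|Python` yields three items, while a backslash in front of a pipe keeps that pipe inside its item.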
The standard commON parsers and converters are UTF-8 compatible. If your source content has unusual encodings, try to target UTF-8 as your canonical spreadsheet output.
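If an exported CSV file arrives in some other encoding, a small re-encoding step gets it to UTF-8 before import. The sketch below uses only the Python standard library; the helper name `to_utf8` and the choice of Windows-1252 as the source encoding are assumptions for the example, so substitute whatever encoding your spreadsheet application actually wrote.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode exported CSV bytes as UTF-8.

    The default source encoding (Windows-1252) is only an example.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# A description containing a non-ASCII character, as a legacy export might hold it.
legacy_bytes = "café,tool".encode("cp1252")
utf8_bytes = to_utf8(legacy_bytes)
```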
In the irON specification there are a small number of defined modules or processing sections. In commON, these modules are denoted by the double-ampersand character sequence (‘&&’), and apply to lists of instance records (&&recordList), dataset specifications and associated metadata describing the dataset (&&dataset), and mappings of attributes and types to existing schema (&&linkage). Similarly, attributes and types are denoted by a single ampersand prefix (&attributeName).
In commON, any or all of the modules can occur within a single CSV file or in multiple files. In any case, the start of one of these processing modules is signaled by the module keyword and &&keyword convention.
The first spreadsheet figure above shows a Sweet Tools example for the &&recordList module. The module begins with that keyword, indicating one or more instance records will follow. Note that the first line after the &&recordList keyword is devoted to the listing of attributes and types for the instance records (designated by the &attributeName convention in the columns for the first row after the &&recordList keyword is encountered).
The &&recordList format can also include the stacked style (see similar Dataset example below) in addition to the single row style shown above.
At any rate, once a worksheet is ready with its instance records following the straightforward irON and commON conventions, it can then be saved as a CSV file and appropriately named. Here is an example of what this “vanilla” CSV file now looks like when shown again in a spreadsheet:
Alternatively, you could open this same file in a text editor. Here is how this exact same instance record view looks in an editor:
Note that the CSV format separates each column by the comma separator, with escapes shown for the &description attribute when it includes a comma-separated clause. Without word wrap, each record in this format occupies a single row (though, again, for the stacked style, multiple entries are allowed on individual rows so long as a new instance record &id is not encountered in the first column).
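The single row style described above can be read with nothing more than Python's standard `csv` module. The following is a rough sketch under stated assumptions, not the structWSF converter: the sample rows are invented, `&name` is a hypothetical attribute (only `&id` and `&description` are mentioned in this post), and the stacked style is not handled.

```python
import csv
import io

# Invented sample in the single row style; the tool entries are hypothetical.
SAMPLE = '''&&recordList,,
&id,&name,&description
tool_1,Example Tool A,"Parses RDF, OWL and related formats"
tool_2,Example Tool B,A simple converter
'''


def parse_record_list(csv_text):
    """Return the instance records of a &&recordList module as dicts."""
    records = []
    header = None
    in_module = False
    for row in csv.reader(io.StringIO(csv_text)):
        if not any(cell.strip() for cell in row):
            continue  # skip blank rows
        first = row[0].strip()
        if first.startswith("&&"):
            # A module keyword starts a new processing section.
            in_module = first == "&&recordList"
            header = None
            continue
        if not in_module:
            continue
        if header is None:
            # First row after the keyword lists the &attributeName columns.
            header = [cell.strip().lstrip("&") for cell in row]
            continue
        records.append({k: v.strip() for k, v in zip(header, row) if k})
    return records


records = parse_record_list(SAMPLE)
```

The quoted `&description` value keeps its embedded commas because the `csv` reader undoes the quoting that the spreadsheet applied on export.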
The &&dataset module defines the dataset parameters and provides very flexible metadata attributes to describe the dataset [8]. Note the dataset specification is exactly equivalent in form to the instance record (&&recordList) format, and also allows the single row or stacked styles (see these instance record examples), with this one being the stacked style:
The &&linkage module is used to map the structure of the instance records to some structural schema, which can also include external ontologies. The module has a simple, but specific structure.
Either attributes (presented as the &attributeList) or types (presented as the &typeList) are listed sequentially by row until the listing is exhausted [8]. By convention, the second column in the listing is the targeted &mapTo value. Absent a prior &prefixList value, the &mapTo value needs to be a full URL to the corresponding attribute or type in some external schema:

Notice in the case of Sweet Tools that most values are from the actual COSMO mini-ontology underlying the listing. These need to be listed as well, since absent the specifications in commON the system has NO knowledge of linkages and mappings.
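As a hedged illustration of the &&linkage layout described above, the sketch below collects attribute and type mappings into dictionaries. The exact row layout is an assumption (a row starting with &attributeList or &typeList opens a listing, and the second column holds the &mapTo value), and the example.org URLs are placeholders, not the actual Sweet Tools mappings.

```python
import csv
import io

# Assumed layout with placeholder URLs; not the real Sweet Tools linkage file.
LINKAGE_SAMPLE = '''&&linkage,
&attributeList,&mapTo
description,http://example.org/schema#description
&typeList,&mapTo
Tool,http://example.org/schema#Tool
'''


def parse_linkage(csv_text):
    """Map each listed attribute or type name to its &mapTo URL."""
    mappings = {"attributeList": {}, "typeList": {}}
    current = None
    in_module = False
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip blank rows
        first = cells[0]
        if first.startswith("&&"):
            in_module = first == "&&linkage"
            current = None
        elif not in_module:
            continue
        elif first.startswith("&"):
            # &attributeList or &typeList opens a new listing.
            key = first[1:]
            current = key if key in mappings else None
        elif current is not None and len(cells) > 1 and cells[1]:
            mappings[current][first] = cells[1]
    return mappings


linkage = parse_linkage(LINKAGE_SAMPLE)
```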
In its current state of development, commON does not support a spreadsheet-based means for specifying the schema structure (lightweight ontology) governing the datasets [2]. Another irON serialization, irJSON, does. Either via this irJSON specification or via an offline ontology, a link reference is presently used by commON (and, therefore, Sweet Tools for this case study) to establish the governing structure of the input instance record datasets.
A spreadsheet-based schema structure for commON has been designed and tested in prototype form. commON should be enhanced with this capability in the near future [8].
If the modules are spread across more than one worksheet, then each worksheet must be saved as its own CSV file. In the case of Sweet Tools, as exhibited by its current reference spreadsheet, sweet_tools_20091110.xls, three individual CSV files get saved. These files can be named whatever you would like. However, it is essential that the names be remembered for later referencing.
My own naming convention is to use a format of appname_date_modulename.csv because it sorts well in a file manager accommodating multiple versions (dates) and keeps related files clustered. The appname in the case of Sweet Tools is generally swt. The modulename is generally the dataset, records, or linkage convention. I tend to use the date specification in the YYYYMMDD format. Thus, in the case of the records listings for Sweet Tools, its filename could be something like: swt_20091110_records.csv.
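The naming convention is easy to automate. Here is a small Python sketch; the function name `common_csv_name` is made up for this example, and nothing in commON itself requires this convention.

```python
from datetime import date


def common_csv_name(appname, when, modulename):
    """Build an appname_date_modulename.csv filename with a YYYYMMDD date."""
    return f"{appname}_{when:%Y%m%d}_{modulename}.csv"


# The three Sweet Tools files for the 10 November 2009 version.
filenames = [
    common_csv_name("swt", date(2009, 11, 10), module)
    for module in ("dataset", "records", "linkage")
]
```

Because the date comes second and is zero-padded, a plain alphabetical sort groups files by application and then orders them chronologically.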
Once saved, these files are now ready to be imported into a structWSF [9] instance, which is where the CSV parsing and conversion to interoperable RDF occurs [8]. In this case study, we used the Drupal-based conStruct SCS system [10]. conStruct exposes the structWSF Web services via a user interface and a user permission and access system. The actual case study write-up offers more details about the import process.
We are now ready to interact with the Sweet Tools structured dataset using conStruct (assuming you have a Drupal installation with the conStruct modules) [10].
The screen capture below shows a couple of aspects of the system:
One of the absolutely cool things about this framework is that all tools, inferencing, user interfaces and data structure are a direct result of the ontology(ies) underlying the system (plus the irON instance ontology, as well). This means that switching datasets or adding datasets causes the entire system structure to now reflect those changes — without lifting a finger!!
Here are a few sample things you can do with these generic tools driven by the Sweet Tools dataset:
Note, if you access this conStruct instance you will do so as a demo user. Unfortunately, as such, you may not be able to see all of the write and update tools, which in this case are reserved for curators or admins. Recall that structWSF has a comprehensive user access and permissions layer.
Of course, one of the real advantages of the irON and structWSF designs is to enable different formats to be interchanged and to interoperate. Upon submission, the commON format and its datasets can then be exported in these alternate formats and serializations [8]:
As should be obvious, one of the real benefits of the irON notation — in addition to easy dataset authoring — is the ability to more-or-less treat RDF, CSV, XML and JSON as interoperable data formats.
The formal Sweet Tools case study based on commON, with sample download files and PDF, is available from Annex: A commON Case Study using Sweet Tools, Supplementary Documentation [3].
Attribute-values can also be presented as pairs in the form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the “object” is implied, or is introduced as the name of the array.
It has been eight months since the last major update to Sweet Tools, AI3’s listing of semantic Web and -related tools. With today’s release, there are now a total of 810 tools listed, crashing through the sound barrier of 761 tools. The release adds 93 tools, about 13% of the 736 in the previous version, and retires 19 prior tools, for a net gain of 74.
But simply adding to the tools listing is not the cause of this longer than normal period between updates.
This little Sweet Tools dataset is now showing the way to a couple of exciting innovations: new generic ontology-driven applications for structured data; and, tools for authoring structured data via spreadsheets.
We deal with the former in this post. I will deal with the spreadsheet business in a subsequent post.
As I said: More on this in a later post.
Note: Because of comments expirations on prior posts, this entry is now the new location for adding a suggested new tool. Simply provide your information in the comments section, and your tool will be included in the next update.
I have been meaning to write on the semantic enterprise for some time. I have been collecting notes on this topic since the publication by PricewaterhouseCoopers (PWC) of an insightful 58-pp report earlier this year [1]. The PWC folks put their finger squarely on the importance of ontologies and the delivery of semantic information via linked data in that publication.
The recent publication of a special issue of the Cutter IT Journal devoted to the semantic enterprise [2] has prompted me to finally put my notes in order. This Cutter volume has a couple of good articles including its editorial intro [3], but is overall spotty in quality and surprisingly unexciting. I think it gets some topics like the importance of semantics to data integration and business intelligence right, but in other areas is either flat wrong or misses the boat.
The biggest mistakes are statements such as “. . . a revolutionary mindset will be needed in the way we’ve traditionally approached enterprise architecture” or that the “. . . semantic enterprise means rethinking everything.”
This is just plain hooey. From the outset, let’s make one thing clear: No one needs to replace anything in their existing architecture to begin with semantic technologies. Such overheated rhetoric is typical consultant hype and fundamentally mischaracterizes the role and use of semantics in the enterprise. (It also tends to scare CIOs and to close wallets.)
As an advocate for semantics in the enterprise, I can appreciate the attraction of framing the issue as one of revolution, paradigm shifts, and The Next Big Thing. Yes, there are manifest benefits and advantages for the semantic enterprise. And, sure, there will be changes and differences. But these changes can occur incrementally and at low risk while experience is gained.
The real key to the semantic enterprise is to build upon and leverage the assets that already exist. Semantic technologies enable us to do just that.
Think about semantic technologies as a new, adaptive layer in an emerging interoperable stack, and not as a wholesale replacement or substitution for all of the good stuff that has come before. Semantics are helping us to bridge and talk across multiple existing systems and schema. They are asking us to become multi-lingual while still allowing us to retain our native tongues. And, hey! we need not be instantly fluent in these new semantic languages in order to begin to gain immediate benefits.
As I noted in my popular article on the Advantages and Myths of RDF from earlier this year:
That is still a key takeaway message from this piece. But, let’s look and list with a fresh perspective the advantages of moving toward the semantic enterprise [4].
For the interconnected reasons noted below, RDF and semantic technologies are inherently incremental, additive and adaptive. The RDF data model and the vocabularies built upon it allow us to progress in the sophistication of our expressions from pidgin English (simple Dick sees Jane triples or assertions) to elegant and expressive King’s English. Premised on the open world assumption (see below), we also have the freedom to only describe partial domains or problem areas.
From a risk standpoint, this is extremely important. To get started with semantic technologies we neither need to: 1) comprehensively describe or tackle the entire enterprise information space; nor 2) do so initially with precision and full expressiveness. We can be partial and somewhat crude or simplistic in our beginning efforts.
Also extremely important is that we can add expressivity and scope as we go. There is no penalty for starting small or simple and then growing in scope or sophistication. Just like progressing from a kindergarten reader to reading Tolstoy or Dickens, we can write and read schema of whatever complexity our current knowledge and understanding allow.
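A toy sketch can make the additive point tangible. Plain Python tuples stand in for RDF triples here (no RDF library is involved), and apart from the Dick sees Jane assertion every name is invented for the illustration.

```python
# Each assertion is a (subject, predicate, object) tuple standing in for a triple.
first_pass = {("Dick", "sees", "Jane")}

# Later, richer statements are simply added; nothing already stated is reworked.
later_additions = {
    ("Dick", "type", "Person"),
    ("Jane", "type", "Person"),
    ("Jane", "worksIn", "Sales"),
}

graph = first_pass | later_additions


def match(graph, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


about_jane = match(graph, s="Jane")
```

The query that worked against the first, crude description still works after the additions, and the absence of a statement (say, where Dick works) is treated as unknown rather than false.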
Semantic technology does not change or alter the fact that most activities of the enterprise are transactional, communicative or documentary in nature. Structured, relational data systems for transactions or records are proven, performant and understood. Writing and publishing information, sometimes as documents and sometimes as spreadsheets or Web pages, is (and will remain) the major vehicle for communicating within the enterprise and to external constituents.
On its very face, it should be clear that the meaning of these activities — their semantics, if you will — is by nature an augmentation or added layer to how to conduct the activities themselves. Moreover, as we also know, these activities are undertaken for many different purposes and within many different contexts. The inherent meaning of these activities is also therefore contextual and varied.
This simple truth affirms that semantic technologies are not a starting basis, then, for these activities, but a way of expressing and interoperating their outcomes. Sure, some semantic understanding and common vocabularies at the front end can help bring consistency and a common language to an enterprise’s activities. This is good practice, and the more that can be done within reason while not stifling innovation, all the better. But we all know that the budget department and function has its own way of doing things separate from sales or R&D. And that is perfectly OK and natural.
These observations — in combination with semantic technologies — can thus lead to a conceptual architecture for the enterprise that recognizes there are “silo” activities that can still be bridged with the semantic layer:
Under this conceptual architecture, “RDFizers” (similar to the ETL function) or information extractors working upon unstructured or semi-structured documents expose their underlying information assets in RDF-ready form. This RDF is characterized by one or more ontologies (multiples are actually natural and preferred [5]), which then can be queried using the semantic querying language, SPARQL.
We have written at length about proper separation of instance records and data and schema, what is called the ABox and TBox, respectively, in description logics [6], a key logic premise to the semantic Web. Thus, through appropriate architecting of existing information assets, it is possible to leave those systems in place while still gaining the interoperability advantages of the semantic enterprise.
Another aspect of this information re-use is also a commitment to leverage existing schema structures, be they industry standards, XML, MDM, relational schema or corporate taxonomies. The mappings of these structures in the resulting ontologies thus become the means to codify the enterprise’s circumstances into an actionable set of relationships bridging across multiple, existing information assets.
Clearly, then, the first obvious benefit to the semantic enterprise is to federate across existing data silos, as featured prominently in the figure above. Data federation has been the Holy Grail of IT systems and enterprises for more than three decades. Expensive and involved efforts from ETL and MDM and then to enterprise information integration (EII), enterprise application integration (EAI) and business intelligence (BI) have been a major focus.
Frankly, it is surprising that no known vendors in these spaces (aside from our own Structured Dynamics, hehe) premise their offerings on RDF and semantic technologies. (Though some claim so.) This is a major opportunity area. (And we don’t mind giving our competitors useful tips.)
Instance-level records and the ABox work well with relational databases. Their schema are simple and relatively fixed. This is fortunate, because such instance records are the basis of transactional systems where performance and throughput are necessary and valued.
But at the level of the enterprise itself — what its business is, its business environment, what is constantly changing around it — trying to model its world with relational schema has proven frustrating, brittle and inflexible. Though relational and RDF schema share much logically, the physical basis of the relational schema does not lend itself to changes and it lacks the flexibility and malleability of the graph-based RDF conceptual structure.
Knowledge management and business intelligence are by no means new concepts for the enterprise. What is new and exciting, however, is how the emergence of RDF and the semantic enterprise will open new doors and perspectives. Once freed of schema constraints, we should see the emergence of “agile KM” similar to the benefits of agile software development.
Because semantic technologies can operate in a layer apart from the standard data basis for the enterprise, there is also a smaller footprint and risk to experimenting at the KM or conceptual level. More options and more testing and much lower costs and risks will surely translate to more innovation.
Just as semantic technologies are poorly suited for transactional or throughput purposes, we should see the complementary and natural migration of KM to the semantic side of the shop. There are no impediments for this migration to begin today. In the process, as yet unforeseen and manifest benefits in agility, experimentation, inferencing and reasoning, and therefore new insights, will emerge.
The same ontologies that guide the data federation and interoperability layer can also do double-duty as the specifications for data-driven applications. The premise is really quite simple: Once it is realized that the inherent information structure contained within ontologies can guide hierarchies, facets, structured retrievals and inferencing, the logical software design is then to “drive” the application solely based on that structure. And, once that insight is realized, then it becomes important, as a best practice, to add further specifications in order to also carry along the information useful for “driving” user interfaces [7].
Thus, while ontologies are often thought solely to be for the purpose of machine interpretation and communication, this double-duty purpose now tells us that providing useful labels and such for human use and consumption is also an important goal.
When these best practices of structure and useful human labels are made real, it then becomes possible to develop generic software applications, the operations of which vary solely by the nature of the structure and ontologies fed to them. In other words, ontologies now become the application, not custom-written software.
Of course, this does not remove the requirement to develop and write software. But the nature and focus of that development shifts dramatically.
From the outset, data-driven software applications are designed to be responsive to the structure fed them. Granted, specific applications in such areas as search, report writing, analysis, data visualization, import and export, format conversions, and the like, still must be written. But, when done, they require little or no further modification to respond to whatever compliant ontologies are fed to them — irrespective of domain or scope.
It thus becomes possible to see a relatively small number of these generic apps that can respond to any compliant structure.
The shift this represents can be illustrated by two areas that have been traditional choke points for IT within the enterprise: queries to local data stores (in order to get needed information for analysis and decisions) and report writers (necessary to communicate with management and constituents).
It is not unusual to hear of delays of weeks or months in IT groups responding to such requests. It is not that the IT departments are lazy or unresponsive, but that the schema and tools used to fulfill their user demands are not flexible.
It is hard to know just how large the upside is for data-driven apps and generic tools. But this may prove to be of even greater import than overcoming the data federation challenge.
In any event, while potentially disruptive, this prospect of data-driven applications can start small and exist in parallel with all existing ways of doing business. Yes, the upside is huge, but it need not be gained by abandoning what already works.
So, assume, then, a knowledge management (KM) environment supported by these data-driven apps. What perspective arises from this prospect?
One obvious perspective is where the KM effort shifts to become the actual description, nature and relationships of the information environment. In other words, ontologies themselves become the focus of effort and development. The KM problem no longer needs to be abstracted to the IT department or third-party software. The actual concepts, terminology and relations that comprise coherent ontologies now become the foundation of KM activities.
An earlier perspective emphasized how most any existing structure can become a starting basis for ontologies and their vocabularies, from spreadsheets to naïve data structures and lists and taxonomies. So, while producing an operating ontology that meets the best practice thresholds noted herein has certain requirements, kicking off or contributing to this process poses few technical or technology demands.
The skills needed to create these adaptive ontologies are logic, coherent thinking and domain knowledge. That is, any subject matter expert or knowledge worker worth keeping on the payroll has, by definition, the necessary skills to contribute to useful ontology development and refinement.
With adaptive ontologies powering data-driven apps we thus see a shift in roles and responsibilities away from IT to knowledge workers themselves. This shift acts to democratize the knowledge management function and flatten the organization.
Enterprise information systems, particularly relational ones, embody a closed world assumption that holds that any statement that is not known to be true is false. This premise works well where there is complete coverage of the entities within a knowledge base, such as the enumeration of all customers or all products of an enterprise.
Yet, in the real (“open”) world there is no guarantee or likelihood of complete coverage. Thus, under an open world assumption, the absence of a given assertion or fact implies neither that the assertion is true nor that it is false: it simply is not known. An open world assumption is one of the key factors for enabling adaptive ontologies to grow incrementally. It is also the basis for enabling linkage to external (and surely incomplete) datasets.
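The difference between the two assumptions can be sketched in a few lines of Python. This is an illustration only, with hypothetical function names and toy facts; it is not how any particular reasoner or database is implemented. The closed-world query answers true or false, while the open-world query has a third answer, unknown (`None`).

```python
from typing import Optional, Set, Tuple

Statement = Tuple[str, str, str]  # (subject, predicate, object)


def closed_world(known_true: Set[Statement], statement: Statement) -> bool:
    """Closed world: anything not known to be true is taken to be false."""
    return statement in known_true


def open_world(
    known_true: Set[Statement],
    known_false: Set[Statement],
    statement: Statement,
) -> Optional[bool]:
    """Open world: a missing assertion is simply unknown (None)."""
    if statement in known_true:
        return True
    if statement in known_false:
        return False
    return None


# Toy facts, invented for this example.
facts = {("Acme", "isCustomerOf", "OurCompany")}
denials = {("Initech", "isCustomerOf", "OurCompany")}
query = ("Globex", "isCustomerOf", "OurCompany")

print(closed_world(facts, query))         # False: not listed, so assumed false
print(open_world(facts, denials, query))  # None: not listed, so not known
```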
Fortunately, there is no requirement for enterprises to make some philosophical commitment to either closed- or open-world systems or reasoning. It is perfectly acceptable to combine traditional closed-world relational systems with open-world reasoning at the ontology level. It is also not necessary to make any choices or trade-offs about using public v. private data or combinations thereof. All combinations are acceptable and easily accommodated.
As noted, one advantage of open-world reasoning at the ontological level is the ability to readily change and grow the conceptual understanding and coverage of the world, including incorporation of external ontologies and data. Since this can easily co-exist with underlying closed-world data, the semantic enterprise can readily bridge both worlds.
Unfortunately, as a relatively new area there are advantages for some pundits or consultants to present the semantic Web as more complicated and commitment-laden than it need be. Either the proponents of that viewpoint don’t know what they are saying, or are being cynical to the market. The major point underlying the fresh perspectives herein is to reiterate that it is quite possible to start small, and do so with low cost and risk.
While it is true that semantic technologies within the enterprise promise some startling upside potentials and disruptions to the old ways of doing business, the total beauty of RDF and its capabilities and this layered model is that those promises can be realized incrementally and without hard choices. No, it is not for free: a commitment to begin the process and to learn is necessary. But, yes, it can be done so with exciting enterprise-wide benefits at a pace and risk level that is comfortable.
The good news about the dedicated issue of the Cutter IT Journal and the earlier PWC publication is that the importance of semantic technologies to the enterprise is now beginning to receive its just due. But as we ramp up this visibility, let’s be sure that we frame these costs and benefits with the right perspectives.
The semantic enterprise offers some important new benefits not obtainable from prior approaches and technologies. And, the best news is that these advantages can be obtained incrementally and at low risk and cost while leveraging prior investments and information assets.
I have been a participant in an interesting series of discussions recently: Whither goes ‘linked data’?
As I described to someone, I was clearly not a father to the idea of ‘linked data’, but I was handing out cigars pretty close on to the birth. Chris Bizer and Richard Cyganiak were the innovators that first proposed the original project to the W3C [1]. (Thanks guys!)
From that point forward, now a bit over 2-1/2 years ago, we have seen a massive increase in attention and visibility to the idea of ‘linked data.’ I take a small amount of reflected pride that I helped promote the idea in some way with my early writings.
That visibility was well-deserved. After all, here was the concept:
Much other puffery got layered on to those ideas, but I think those premises are the key basis.
My first personal concern with where linked data was going dealt with an absence of context or conceptual structure for how these new datasets related to one another. I will not repeat those arguments here; simply see many of my blog postings from the past two years or so. Exposing millions of “things” was wonderful, but what did all of that mean? How does one “thing” relate to another “thing”? Are some “things” the same as or similar to other things? If nothing else, these concerns stimulated the genesis of the UMBEL subject concept ontology, an outcome for which I need to thank the community.
It would be petty of me to question the basis that attracted millions of data items to get exposed from linked data techniques. In fact, the richness we have today in exposed Web data objects comes solely from this linked data initiative. But, nonetheless, my guess is that even the most ardent linked data advocate would have a hard time finding a logical way to present the current linked data reality in context. We see the big bubble diagram of available datasets, but, frankly, the position and relationships amongst datasets appears somewhat arbitrary. We have lots of bubbles, but little meaning.
The semantic Web was in serious crisis prior to linked data. It had bad perception, little delivery, and unmet hype. Linked data at least began to show how exposed and properly characterized data can begin to become interconnected.
For a couple of years now I have tried in various posts to present linked data in a broader framework of structured and semantic Web data. I first tried to capture this continuum in a diagram from July 2007:
[Diagram, July 2007: Document Web | Structured Web | Semantic Web, with Linked Data]
The point is not whether those earlier characterizations were “correct”, but that linked data be properly seen as merely a natural step in an ongoing transition. IMO, we are progressing nicely along this spectrum.
Linked data is a set of techniques — nothing more — and certainly not a philosophy or meme (whatever the hell that means). We have way too many breathy pontifications about “linked data this” and “linked data that” that frankly are undercutting the usefulness of the practice and making it a caricature of itself.
In the enterprise world we see similar attempts at marketing that need to give everything a three-letter acronym. In this case, we have a bunch of academics and researchers trying to act like market and business gurus. All it is doing is confusing the marketplace and hurting the practice.
The elevation of techniques or best practices into roles clearly beyond their pay grade produces completely the opposite effect: the idea comes under question and ridicule. The logic and rationale for why we should be following these best practices gets lost in the hyperbole. I spend most of my time hitting the delete button on the mailing lists. I fear what others new to these practices — that is, my company’s customers and prospects — perceive when they look into this topic.
Linked data is useful and needed. But come on, folks, these are not tribal or religious matters.
Through the initial project vehicle of DBpedia and then how it nucleated other “linked” data sets, the linked data practice certainly became viral. Today, we have many millions of data items available in linked data form. This is unalloyed goodness.
I will continue to use the phrase ‘linked data’ to refer to those useful techniques noted in the opening. Actually, I think it is best to think of linked data as a set of best practices, but by no means an end unto itself.
Beyond linked data we need context, we need our data to be embedded and related to interoperable ontologies, we need much better user interfaces and attainability, and we need quality in our assertions and use. These are issues that extend well beyond the techniques of linked data and form the next set of challenges in gaining broader acceptance for the semantic Web and the semantic enterprise.
Like most everything else in this world, there are real problems and real needs out there. Thankfully, we have heard mostly the end of the silliness about Web 3.0. Perhaps we can now also broaden our horizons beyond the useful techniques of linked data to tackle the next set of semantic challenges.
So, let me be the first to congratulate the community on a victory well achieved! As for myself and my company, we will now focus our attentions on the next tier of challenges. It is time to deprecate the rhetoric. Huzzah!
Thanks to all who responded to my last update post, More than 200 Semantic Web-related Papers Using Wikipedia, with suggestions for more papers to add to the updated SWEETpedia listing.
Those inputs resulted in another 20 added papers. This listing of semantic Web-related research papers based on Wikipedia contents and structure now numbers some 227 papers. The added entries since the major update last week are now marked as [NEWEST].
Thanks, again, to those who commented or emailed suggestions. I will, of course, continue to stockpile further suggestions for subsequent updates.
The Message Understanding Conferences (MUC) were initiated in 1987 and financed by DARPA to encourage the development of new and better methods of information extraction (IE). It was a seminal series that resulted in basic measures of retrieval and semantic efficacy, recall (R) and precision (P) and the combined F-measure, and other core terminology and constructs used by IE today.
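For readers unfamiliar with these measures, here is a small Python sketch of how they are commonly computed for an extraction task. The function name and the toy entity sets are hypothetical, and the balanced form of the F-measure (equal weight on precision and recall) is assumed here; weighted variants also exist.

```python
def precision_recall_f(extracted, gold):
    """Return (precision, recall, balanced F-measure) for two sets.

    `extracted` is what the system found; `gold` is the reference answer key.
    """
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: the system finds three entities, two of which are correct,
# out of four entities in the answer key.
gold = {"Abraham Lincoln", "Mt. Rushmore", "DARPA", "Northwestern University"}
extracted = {"Abraham Lincoln", "DARPA", "Kermit"}

p, r, f = precision_recall_f(extracted, gold)
print(p, r, f)
```

In this toy case precision is 2/3 (two of three extractions are right), recall is 2/4 (two of four reference entities were found), and the balanced F-measure works out to 4/7.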
By the sixth version in the series (MUC-6), in 1995, the task of recognition of named entities and coreference was added. That initial slate of named entities included the basic building blocks of person (PER), location (LOC), and organization (ORG); to these were added the numeric building blocks of time, percentage or quantity. The very terminology of named entity was coined for this seminal meeting, as was the idea of inline markup [1].
The intuition surrounding “named entity” and nameable “things” was that they were discrete and disjoint. A rock is not a person and is not a chemical or an event. As initially used, all “named entities” were distinct individuals. But, there also emerged the understanding that some classes of things could also be treated as more-or-less distinct nameable “things”: beetles are not the same as frogs and are not the same as rocks. While some of these “things” might be a true individual with a discrete name, such as Kermit the Frog, or The Rock at Northwestern University, most instances of such things are unnamed.
The “nameability” (or logical categorization) of things is perhaps best kept separate from other epistemological issues of distinguishing sets, collections, or classes from individuals, members or instances.
In a closed-world system it is easier to enforce clean distinctions. The Cyc knowledge base, for example, the basis for UMBEL (Upper Mapping and Binding Exchange Layer), makes clear the distinction between individuals and collections. In the semantic Web and RDF, this can become smeared a bit with the favored terminology shifting to instances and classes, and in pragmatic, real-world terms we (as humans) readily distinguish John Smith as distinct from Jane Doe but don’t generally (unless we’re entomologists!) make such distinctions for individual beetles, let alone entire genera or species of beetles.
Under precise conditions, these distinctions are important. The fact that Cyc, for example, is assiduous in its application of these distinctions is a major reason for the overall coherence of its knowledge base. But, for most circumstances, we think it is OK to accept a distinction between “nameable” things such as frogs and beetles, but also to accept that there may be nameable individuals at times in those groupings such as Kermit that are truly an individual in that more refined sense.
This digression sets the background for a natural progression from that first MUC-6 conference. If we could cluster persons or organizations, why not other categories of distinct and disjoint things such as frogs or beetles or rocks?
From the first six entity categories of MUC-6 we begin to see an expansion to broader coverage. Readers of this blog will recall that I have been a fan for quite some time of the expanded coverage of 64 classes of entities proposed by BBN or the 200 proposed by Sekine [2] (as discussed, for example in the April 2008 Subject Concepts and Named Entities article). Again, the intuition was that real things in the real world could be logically categorized into discrete and disjoint categories.
Thus, “named entities” inexorably moved to become a categorization system, where the degree of familiarity and distinction dictated whether it was the individual (with a unique name, such as Abraham Lincoln or Mt. Rushmore) or groupings such as animal or plant species and their common names (such as beetle or oak) that was the standard “handle” for assigning a name to the “nameable thing”.
While many can argue these individual <–> grouping distinctions and whether we are talking about true, unique, named individuals or names of convenience, I think (at least for this blog post and discussion) that doing so misses the real, fundamental point.
The real, fundamental point is that some “things” (whether individuals, instances or classes) are distinct from other “things”. Such disjoint distinctions are a powerful concept that should not be lost sight of by “angels dancing on the head of a pin” epistemological arguments. A frog is not a rock, even though neither is an “individual”, and how can we take advantage of that reality?
Nearly from the outset of our work with UMBEL as a ‘TBox’ [3] — that is, as a set of 20,000 or so common “subject concepts” — the natural question was what the relation or correspondence was of these concepts to the underlying “things” (entities) that they organized. As we probed the disjoint categories within the Sekine 200 entity types, for example, we began to see significant parallels and overlap. Also gnawing at our sense of order was the rather artificial and arbitrary class of concepts in UMBEL that we termed “Abstract Concepts”.
We introduced Abstract Concepts in the first release of UMBEL. When introduced, we defined “Abstract concepts [as] representing abstract or ephemeral notions such as truth, beauty, evil or justice, or [as] thought constructs useful to organizing or categorizing things but are not readily seen in the experiential world.” In pragmatic terms, Abstract Concepts in UMBEL were often pivotal nodes in the UMBEL subject graph necessary to maintain a high degree of concept interconnectivity.
In any world view that attempts to be more-or-less comprehensive, there is a gradation of concepts from the concrete and observable to the abstract and ephemeral. The recognition that some of these concepts may be more abstract, then, was not the issue. The issue was that there was no definable basis for segregating a concrete Subject Concept from the more Abstract Concept. Where was the bright line? What was the actionable distinction?
Off and on we have probed this question for more than a year, and have looked at what might constitute a more natural and logical ordering and segmentation within UMBEL. After many tests and detailed analysis, we are now releasing the first results of our investigations.
For, like nameable entities or things, we can see a logical segmentation of (mostly) disjoint concepts within the UMBEL TBox. Here are the summary percentages of these high-level splits:
| Disjoint Concepts | 90% |
| Attributes | 1% |
| Classifications | 9% |
| TOTAL | 100% |
(Because the analysis is still being refined, exact counts and percentages for the 20,000 concepts in UMBEL are not provided.)
As we dove deeper into these ideas, not only could we see the basis for a logical segmentation within UMBEL’s concepts, but manifest benefits from doing so as well. Remember that UMBEL’s concept structure performs two main roles. It: 1) provides a coherent framework for relating and “mapping” other external ontologies; and 2) provides conceptual binding points for organizing entities and instances [4]. Via logical segmentation, we get benefits for both roles.
Here are some of the broad areas of benefit from a logical UMBEL segmentation that we have identified:
With these benefits in mind, we have undertaken concerted analysis of UMBEL to discern what this “logical segmentation” might be. This investigation has occurred over three concentrated periods over the past year. (Intervening priorities or other work prevented concentrating solely on this task.)
We have now completed our first full iteration of the investigation. In this post, and then in the subsequent release of UMBEL version 0.80 in the coming weeks, the fruits of this effort should be evident. However, it should also be noted that we are still learning much from this new mindset and approach. Refinement of the UMBEL structure is likely to continue for some time to come.
Most things and concepts about them are based on real, observable, physical things in the real world. Because most of these things can not occupy both the same moment in time and the same location in physical space, a useful criterion for looking at these things and concepts is disjointedness.
In a broad sense, then, we can split our concepts of the world between those ideas that are disjoint because they pertain to separable objects or ideas and those that are cross-cutting or organizational or classificatory. Attributes, such as color (pink, for example), are often cross-cutting in that they can be used to describe quite disparate things. Inherent classification schemes such as academic fields of study or library catalog systems — while useful ways to organize the world — are not themselves in-and-of the world or discrete from other ideas. Thus, classificatory or organizational concepts are inherently not disjoint.
With the criterion of disjointedness in hand, then, we began an evaluation process of the UMBEL subject concepts. We looked to organizational schema such as the entity types of Sekine or BBN for some starting guidance. We also kept in mind that we also wanted our categories to inform logical clusterings of possible data presentation, such as media types or locations or time.
For terminology, we adopted the term superType to denote the largest cluster designation upon which this disjointedness may occur. As a way to test the basic coherence of these superTypes, we also collected them into larger groups which we termed dimensions.
Our analysis process began with branch-by-branch testing of the UMBEL concept graph using automated scripts, attempting to find pivotal nodes where child instance members were disjoint from other superTypes. This we term the “top-down” method.
This automated analysis was then supplemented with a complete manual inspection of all unassigned and assigned concepts, with a “bottom up” assignment of concepts or corrections to the automated approach. This inspection then led to new insights and identification of missing concepts that needed to be added into UMBEL.
We are still converging between these two methods. Optimally, we should be able to tease out all UMBEL superTypes with a relatively small number of union, intersection, or complement set operations. In its current form, we are close, but there are still some rough spots.
Nonetheless, this analysis method has led us to identify some 33 superTypes [5], clustered into 9 dimensions. Of these, 29 superTypes and 8 dimensions are mostly disjoint. The one dimension of Classificatory includes the four cross-cutting superTypes of attributes and organizational schema that can apply to any of the 29 disjoint superTypes.
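As a rough illustration of the “top-down” idea, the following Python sketch collects everything below a pivotal node in a concept graph and then subtracts an excluded branch with a set operation. It is a toy under stated assumptions, not the actual UMBEL scripts: the graph, the concept names (including `AnimalDisease`) and both functions are hypothetical.

```python
def descendants(graph, root):
    """Return root plus every concept reachable below it in the graph."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def supertype_members(graph, pivot, excluded=()):
    """Members of a superType: the pivot's branch minus excluded branches."""
    members = descendants(graph, pivot)
    for node in excluded:
        members -= descendants(graph, node)
    return members


# Toy concept graph (parent -> children), invented for this example.
graph = {
    "Animal": ["Mammal", "Insect", "AnimalDisease"],
    "Mammal": ["Dog"],
    "Insect": ["Beetle"],
    "AnimalDisease": ["Rabies"],
}

animals = supertype_members(graph, "Animal", excluded=["AnimalDisease"])
diseases = descendants(graph, "AnimalDisease")

print(sorted(animals))
print(animals.isdisjoint(diseases))
```

The fewer exclusions a pivot node needs, the cleaner the superType; a branch that needs many such subtractions is one of the “rough spots” that manual, bottom-up inspection would have to resolve.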
Here is the schema, with the descriptions of each:
| Dimension | superType | Description/Sub-types |
| Natural World | Natural Phenomena | This superType includes natural phenomena and natural processes such as weather, weathering, erosion, fires, lightning, earthquakes, tectonics, etc. Clouds and weather processes are specifically included. Also includes climate cycles, general natural events (such as hurricanes) that are not specifically named, and biochemical processes and pathways. |
| | Natural Substances | Notable inclusions are minerals, compounds, chemicals, or physical objects that are not the outcome of purposeful human effort, but are found naturally occurring. Other natural objects (such as rock, fossil, etc.) are also found under this superType. |
| | Earthscape | The Earthscape superType consists mostly of the collection of cartographic features that occur on the surface of the Earth. Positive examples include Mountain, Ocean, and Mesa. Artificial features such as canals are excluded. Most instances of these features have a fixed location in space. Underground and underwater are also explicitly contained. This superType is explicitly disjoint with Extraterrestrial (see below). |
| | Extraterrestrial | This superType includes all natural things not specifically terrestrial, including celestial bodies (planets, asteroids, stars, galaxies, etc., that can be located within a sky map). |
| Living Things | Prokaryotes | The Prokaryotes include all prokaryotic organisms, including the Monera, Archaebacteria, Bacteria, and Blue-green algae. Also included in this superType are viruses and prions. |
| | Protists or Fungus | This is the remaining cluster of eukaryotic organisms, specifically including the fungus and the protista (protozoans and slime molds). |
| | Plants | This superType includes all plant types and flora, including flowering plants, algae, non-flowering plants, gymnosperms, cycads, and plant parts and body types. Note that all Plant Parts are also included. |
| | Animals | This large superType includes all animal types, including specific animal types and vertebrates, invertebrates, insects, crustaceans, fish, reptiles, amphibia, birds, mammals, and animal body parts. Animal parts are specifically included. Also, groupings of such animals are included. Humans, as an animal, are included (versus as an individual Person). Diseases are specifically excluded. |
| | Diseases | Diseases are atypical or unusual or unhealthy conditions for (mostly human) living things, generally known as conditions, disorders, infections, diseases or syndromes. Diseases only affect living things and sometimes are caused by living things. This superType also includes impairments, disease vectors, wounds and injuries, and poisoning. |
| | Person Types | The appropriate superType for all named, individual human beings. This superType also includes the assignment of formal, honorific or cultural titles given to specific human individuals. It further includes names given to humans who conduct specific jobs or activities (the latter case is known as an avocation). Examples include steelworker, waitress, lawyer, plumber, artisan. Ethnic groups are specifically included. |
| Human Activities | Organizations | Organization is a broad superType and includes formal collections of humans, sometimes by legal means, charter, agreement or some mode of formal understanding. Examples include geopolitical entities such as nations, municipalities or countries; or companies, institutes, governments, universities, militaries, political parties, game groups, international organizations, trade associations, etc. All institutions, for example, are organizations. Also included are informal collections of humans. Informal or less defined groupings of humans may result from ethnicity or tribes or nationality or from shared interests (such as social networks or mailing lists) or expertise (“communities of practice”). This dimension also includes the notion of identifiable human groups with set members at any given point in time. Examples include music groups, cast members of a play, directors on a corporate Board, TV show members, gangs, mobs, juries, generations, minorities, etc. Finally, Organizations contain the concepts of Industries and Programs and Communities. |
| | Finance & Economy | This superType pertains to all things financial and with respect to the economy, including chartable company performance, stock index entities, money, local currencies, taxes, incomes, accounts and accounting, mortgages and property. |
| | Culture, Issues, Beliefs | This category includes concepts related to political systems, laws, rules or cultural mores governing societal or community behavior, or doctrinal, faith or religious bases or entities (such as gods, angels, totems) governing spiritual human matters. Culture, issues, beliefs and various activisms (most -isms) are included. |
| | Activities | These are ongoing activities that result (mostly) from human effort, often conducted by organizations to assist other organizations or individuals (in which case they are known as services, such as medicine, law, printing, consulting or teaching) or individual or group efforts for leisure, fun, sports, games or personal interests (activities). |
| Human Works | Products | This is the largest superType and includes any instance offered for sale or performed as a commercial service. Often a physical object made by humans that is not a conceptual work or a facility, such as vehicles, cars, trains, aircraft, spaceships, ships, foods, beverages, clothes, drugs, weapons. Products also include the concept of ‘state’ (e.g., on/off). |
| | Food or Drink | This superType is any edible substance grown, made or harvested by humans. The category also specifically includes the concept of cuisines. |
| | Drugs | This superType is any drug, medication or addictive substance. |
| | Facilities | Facilities are physical places or buildings constructed by humans, such as schools, public institutions, markets, museums, amusement parks, worship places, stations, airports, ports, carstops, lines, railroads, roads, waterways, tunnels, bridges, parks, sport facilities, monuments. All can be geospatially located. Facilities also include animal pens and enclosures and general human “activity” areas (golf course, archeology sites, etc.). Importantly, Facilities include infrastructure systems such as roadways and physical networks. Facilities also include the component parts that go into making them (such as foundations, doors, windows, roofs, etc.). |
| Information | Chemistry (n.o.c) | This superType is a residual category (n.o.c., not otherwise categorized) for chemical bonds, chemical composition groupings, and the like. It is formed by what is not a natural substance or living thing (organic) substance. |
| | Audio Info | This superType is for any audio-only human work. Examples include live music performances, record albums, or radio shows or individual radio broadcasts. |
| | Visual Info | This superType includes any still image or picture or streaming video human work, with or without audio. Examples include graphics, pictures, movies, TV shows, individual shows from a TV show, etc. |
| | Written Info | This superType includes any general material written by humans including books, blogs, articles, and manuscripts, or any written information conveyed via text. |
| | Structured Info | This information superType is for all kinds of structured information and datasets, including computer programs, databases, files, Web pages and structured data that can be presented in tabular form. |
| | Notations & References | Akin to conceptual works, these are codified means of human expression. Examples range from human languages themselves, to more domain-specific cases such as chemical symbols, genetic code (A-G-C-T), protocols, and computer languages, mathematical and set notations, etc. Identifiers (numeric or alphanumeric identifiers for objects, often in a highly patterned way, such as phone numbers, URLs, zip and postal codes, SKUs, product codes, etc.), Units (any of the various ways in which measurement, space, volume, weight, speed, intensity, temperature, calories, seismic intensity or other quantitative descriptions of phenomena can be made) and key reference types are also included in this superType. |
| | Numbers | This unique superType is for any abstract representation of numbers and numerics. |
| Human Places | Geopolitical | Named places that have some informal or formal political (authorized) component. Important subcollections include Country, IndependentCountry, State_Geopolitical, City, and Province. |
| | Workplaces, etc. | These are various workplaces and areas of human activities, ranging from single person workstations to large aggregations of people (but which are not formal political entities). |
| Time-related | Events | These are nameable occasions, games, sports events, conferences, natural phenomena, natural disasters, wars, incidents, anniversaries, holidays, or notable moments or periods in time. |
| | Time | This superType is for specific time or date or period (such as eras, or days, weeks, months type intervals) references in various formats. |
| Descriptive | Attributes | This general superType category is for descriptive attributes of all kinds. Think of the specific attributes in Wikipedia “infoboxes” to understand the purpose and coverage of this superType. It includes colors, shapes, sizes, or other descriptive characteristics about an object. |
| Classificatory | Abstract-level | This general superType category is largely composed of former AbstractConcepts, and represents some of the more abstract upper-level nodes for connecting the UMBEL structure together. This superType also includes theories or processes or methods for humans to do stuff or any human technology. |
| | Topics/Categories | This largely subject-oriented superType is a means for using controlled vocabularies and classification schemes for characterizing what content “is about”. The key constituents of this category are Types, Classifications, Concepts, Topics, and controlled vocabularies. |
| | Markets & Industries | This superType is a specialized classificatory system for markets and industries. It could be combined with the superType above, but is kept separate in order to provide a separate, economy-oriented system. |
These may undergo some further refinement prior to release of UMBEL v 0.80, and some of the definitions will be tightened up.
(Note: Some of these superTypes also lend themselves to further splits and analysis. The Product superType, for example, is ripe for such treatment.)
The following diagram shows the distribution of these 20,000 UMBEL concepts across major areas. By far the largest superType is Products, even with further splits into Food and Drinks and Pharmaceuticals. The next largest are the Person, Places and Events superTypes, with Organizations and Animals not far behind:
Even in its generic state, UMBEL provides a very rich vocabulary for describing things or for tying in more detailed external ontologies. There are nearly 5,000 concepts across products of all types, for example.
You may recall that our analysis showed 29 of the superTypes to be “mostly disjoint.” This is because there are some concepts — say, MusicPerformingAgent — that can apply to either a person or a group (band or orchestra, for example). Thus, for this concept alone, we have a bit of overlap between the normally disjoint Person and Organization superTypes.
The following shows the resulting interaction matrix where there may be some overlap between superTypes:
This kind of interaction diagram is also useful for further analyzing the concept graph structure.
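To make the idea of an interaction matrix concrete, here is a minimal Python sketch of how pairwise overlaps between superTypes could be counted. The concept-to-superType assignments are hypothetical and chosen only for illustration (only MusicPerformingAgent comes from the discussion above); this is not the procedure actually used to analyze UMBEL.

```python
from collections import Counter
from itertools import combinations

# Hypothetical assignments, for illustration only: each concept maps to
# the set of superTypes it belongs to.
assignments = {
    "MusicPerformingAgent": {"Person", "Organization"},
    "Musician": {"Person"},
    "Orchestra": {"Organization"},
    "Aspirin": {"Product", "Drug"},
    "Cheese": {"Product", "FoodDrink"},
}


def interaction_matrix(assignments):
    """Count the concepts shared by each unordered pair of superTypes."""
    counts = Counter()
    for supertypes in assignments.values():
        # Sorting gives each pair a single canonical (alphabetical) key.
        for pair in combinations(sorted(supertypes), 2):
            counts[pair] += 1
    return counts


matrix = interaction_matrix(assignments)
for (a, b), n in sorted(matrix.items()):
    print(f"{a} x {b}: {n} shared concept(s)")
```

Pairs that never appear in the result share no concepts, which is what a fully disjoint pair of superTypes would look like in such a matrix.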
Of the 29 “mostly” disjoint superTypes, only relatively few show potential interactions, and then only in minor ways. We can illustrate this (drawn to scale) for the interaction between the Product, Food & Drink and Drug (Pharmaceuticals) superTypes, with the fully disjoint Organization superType thrown in for comparison:

Across all 20,000 concepts, then, fully 85% are disjoint from one another (5% is lost due to overlaps between “mostly” disjoint superTypes). This is a surprisingly high percentage, with even better likelihood to deliver the benefits previously noted.
These are exciting findings that bode well for UMBEL’s ongoing role and usefulness. Also, the very detailed analysis that has led to these interim findings very much reaffirms the wisdom of basing UMBEL on Cyc. Cyc showed itself to be admirably coherent and remarkably complete. (It also appears that the first versions of UMBEL were extracted well in terms of good coverage.)
This approach now gives us an understandable and defensible basis for logical segmentation of UMBEL. It also provides a much-desired alternative to the earlier Abstract Concepts, which will now be dropped entirely as a schema concept.
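As a rough illustration of how such a percentage might be tallied, the sketch below computes the share of concepts that fall into exactly one superType. This is only one possible reading of the measure, and the sample data are hypothetical; the real analysis may well count things differently.

```python
# Hypothetical assignments, for illustration only.
sample = {
    "Musician": {"Person"},
    "Orchestra": {"Organization"},
    "Earthquake": {"Events"},
    "MusicPerformingAgent": {"Person", "Organization"},
}


def single_supertype_share(assignments):
    """Fraction of concepts that belong to exactly one superType."""
    if not assignments:
        return 0.0
    single = sum(1 for types in assignments.values() if len(types) == 1)
    return single / len(assignments)


share = single_supertype_share(sample)
print(f"{share:.0%} of the sample concepts sit in a single superType")
```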
One area deserving further attention is the Attribute superType. We are in the process, for example, of analyzing attributes across Wikipedia and need to look through a slightly different lens at this superType [6]. This area is also important because of its strong interaction with the Instance Record Vocabulary that is accompanying this effort on the entity side.
Another lesson for us has been to back away from the terminology of named entity, introduced at MUC-6. The expansion of that idea into other “nameable” things has caused us to embrace the “instance” nomenclature, as evidenced by our emerging IRV.
It is rewarding to prepare this next iteration release of UMBEL with its new mindset of logical segmentation and disjointedness. But — what is also clear — there are many treasures left to mine still hidden in the inherent structure of UMBEL and its Cyc parent.
Sekine’s extended hierarchy proposed in 2002 is made up of 200 subtypes, with 32 larger clusters within that. Here is the top level of the Sekine type system:
| Name-Other | Title | Timex | Frequency |
| Person | Unit | Periodx | Rank |
| Organization | Vocation | Numex-Other | Age |
| Location | Disease | Money | School Age |
| Facility | God | Stock Index | Latitude Longitude |
| Product | ID Number | Point | Measurement |
| Event | Color | Percent | Countx |
| Natural Object | Time-Other | Multiplication | Ordinal Number |
Though developed separately and for different purposes, the BBN categories, also proposed in 2002, consist of 29 types and 64 subtypes. Here are the BBN types (Note: BBN claims 29 types because there are double entries or considerations for the first five entries):
| Person | Time | Animal |
| NORP (adjectival GPEs) | Percent | Substance |
| Facility | Money | Disease |
| Organization | Quantity | Work of Art |
| GPE (geopolitical places) | Ordinal | Law |
| Location | Cardinal | Language |
| Product | Events | Contact Info |
| Date | Plant | Game |
Of course, other entity extraction systems have similar clusterings and approaches. Though less formal in the sense of a hierarchy or purported complete entity coverage, here for example is the listing of entity types within Calais:
| Anniversary | FaxNumber | NaturalFeature | RadioProgram |
| City | Holiday | OperatingSystem | RadioStation |
| Company | IndustryTerm | Organization | Region |
| Continent | MarketIndex | Person | SportsEvent |
| Country | MedicalCondition | PhoneNumber | SportsGame |
| Currency | Movie | Position | SportsLeague |
| EmailAddress | MusicAlbum | Product | Technology |
| EntertainmentAwardEvent | MusicGroup | ProgrammingLanguage | TVShow |
| Facility | NaturalDisaster | ProvinceOrState | TVStation |
| | | PublishedMedium | URL |
See further the Wikipedia entry on named entity recognition.