
I recently began a series on the structured Web and its role in the continued evolution of the Internet. This next installment in the series probes in greater depth the question of What is structure? in reference to data and Web expressions, with an emphasis on terminology and definitions.
This post ties in with a new best-practices guide published by Chris Bizer, Richard Cyganiak, and Tom Heath — called the Linked Data Publishing Tutorial — that provides definitions and viewpoints from the perspective of the use of RDF (Resource Description Framework) and W3C practices. My initial post in this series and their tutorial occasioned Kingsley Idehen to post his own Linked Data and the Web Information BUS entry, adding the valuable perspective of practices and terminology going back to the early 1990s in object and relational database systems and standards such as ODBC.
All of these efforts share a desire to craft practices, language and terminology to help promote the availability and interoperability of useful data on the Web.
Some problematic terms include information resource, non-information resource, dereferencing, bnode, content negotiation, representation, URL re-writing, and others.
This piece only tackles ‘non-information resource‘ head on. Discussion of other problematic terms awaits another day.
These posts caused Kingsley and me to engage in a prolonged discussion about definitions and terms. I acted as the unofficial scribe, which I attempt to more generally capture and argue below. If you like the ideas below, you may credit both of us; if you don’t, ascribe any errors or omissions to me alone [1].
The basic observation related to the structured Web is that it is a transition phase from the initial document-centric Web to the eventual semantic Web. In this transitional phase, the Web is becoming much more data-centric. The idea of ‘linked data‘ also is a component of this transition, but is more precise in meaning because by definition the data must be expressed as RDF in order to aid interoperability.
The challenge is to convert existing Web pages and data into the structured Web with every resource accessible via an unambiguous URI. Insofar as this conversion also occurs to RDF, it will promote linked data interoperability.
Transitions of such a profound nature — even if short of the full vision of the semantic Web — create the need for new language and terminology to aid understanding and communication. Sometimes, as well, longstanding terms and practices may need to be refined or challenged. In any event, notions of simplistic versioning such as ‘Web 3.0‘ add little to understanding and communication.
Independent of the Internet or the Web, let’s begin our discussion about the nature of structure in its application to data [2,3]. Peter Wood, a professor of computer science at Birkbeck College at the University of London, provides succinct definitions of the “structure” of various types of data [4]:
We can thus view data structure as residing on a spectrum (also shown with “typical” storage and indexing frameworks on the top line):

For the past decades, structured data has been typically managed by database management systems (DBMSs) or spreadsheets, unstructured data by text indexing systems such as used by search engines or unindexed in file systems or repositories.
Semi-structured data models are sometimes called “self-describing” (or schema-less) [5]. The first known definition of semi-structured data dates to 1993 [6] by Peter Schäuble: “We call a data collection semistructured if there exists a database scheme which specifies both normalized attributes (e.g., dates or employee numbers) and non-normalized attributes (e.g., full text or images).” More current usage (see the Wood definition above) also includes the notion of labeled graphs or trees with the data stored at the leaves, with the schema information contained in the edge labels of the graph. Semi-structured representations also lend themselves well to data exchange or the integration of heterogeneous data sources.
HTML tags within Web documents are a prime example of semi-structured data, as are text-based data transfer protocols or serializations [7]. Semi-structured data is also a natural intermediate form when “structure” is desired to be extracted from standard text through techniques generally called ‘information extraction’ (IE). For example, here is possible structure shown in yellow as might be extracted from a death notice or obituary [8]:
John A. Smith of Salem, MA died Friday at Deaconess Medical Center in Boston after a bout with cancer. He was 67. Born in Revere, he was raised and educated in Salem, MA. He was a member of St. Mary’s Church in Salem, MA, and is survived by his wife, Jane N., and two children, John A., Jr., and Lily C., both of Winchester, MA. A memorial service will be held at 10:00 AM at St. Mary’s Church in Salem.
This notice contains a great deal of information including names, places, dates and relationships, which once extracted, can be separately indexed or manipulated. Virtually all text-based documents can have similar structure extracted.
The Web has been a prime source of growth of semi-structured data, most often through text-based data serializations [9] and mark-up languages [10].
Depending on point of view or definition, RDF can either be called semi-structured or structured data. It resides squarely at the transition point between these two categories on this structural continuum.
A couple of Web concepts often cause confusion and difficulty for some users: 1) Uniform Resource Locators (URLs) v Uniform Resource Identifiers (URIs); and 2) the concept of “resources” themselves. As it happens, there was a pretty accessible discussion on URIs that was recently posted; I recommend that discussion on that topic. Instead, we’ll focus here on “resources.”
The concept of a resource is basic to the Web’s architecture and is used in the definition of its fundamental elements, including obviously URL and URI above. The essence of the semantic Web parlance, as well, is built around the notion of abstract resources and their semantic properties. The data model and languages of the Resource Description Framework (RDF) squarely revolve around this “resource” concept.
The first explicit definition of resource related to the Web is found in RFC 2396 in August 1998:
However, in the context of the Internet, not all resources so defined (such as a person or a company) can be retrieved, while electronic resources like an image or Web page can. Thus, a first challenge arises in the concept of resource and its locational address: some are actual and physical, others are abstract or referential.
According to Wikipedia’s discussion of resources:
The need to somehow make this distinction between actual or physical resources vs. abstract or referential resources led to much discussion and controversy sometimes known as the httpRange-14 issue [11], resolved by the W3C’s Technical Architecture Group (TAG) in 2005. The TAG defined a distinction between two resource types:
A successful HTTP request for an information resource results in a 200 message (“OK”, followed by transfer) from the Web server; if the request is for a resource that the Web server recognizes as existing but of the wrong type requested, the publisher can use the 303 redirect response to provide the correct URI [12].
These resource distinctions are very, very unfortunate in their labeling, if not on more fundamental grounds. It is non-sensical to call one category “information” and the other not, when everything is informational. Moreover, the distinctions bring absolutely no clarity. (Important note: actually, the provenance of the term ‘non-information resource‘ appears to be quite recent, as well as wrong and unfortunate [13]).
However, getting standards bodies to change labels is a long and uncertain task. The approach taken below is to stick with the ‘information resource‘ term, but to provide the alias of ‘structured data resource‘ in place of ‘non-informational resource‘ and to add some additional sub-category distinctions [14].
Even though a Web ‘resource‘ has an address scheme and other requirements, these are details and specifics that can mask the fundamental purpose of a resource to act as a “container” for encapsulating information of some sort. This encapsulation is what enables access to its “payload” information (Kingsley’s term) on the Web or the broader Internet via standard protocols (HTTP and TCP/IP). The mechanisms of the encapsulation constitute the details and specifics.
Inherent to the transition from the document Web to the structured Web is the increased importance of that most confusing category: non-information resource. That is because, like fragment identifiers, we are talking about objects more granular (subsidiary) to document-level resources and it is because we are now referencing structure that includes such “abstract” notions as classes, properties, types, namespaces, ontologies, schema, etc.
One paradox in all of this is the very category of resource designed to deal with these data issues is itself called by many a ‘non-informational resource‘. This term is non-sense [13]. We use instead ‘structured data resource.’
There is much that needs to be cleaned up and clarified over time regarding uses and nomenclature regarding resources. However, from the standpoint of the structured Web, we can probably for the time being concentrate on those items shown in bold in this table:
| Information Resource | current standard Web term unstructured and semi-structured data |
||
| Document Resource | text and markup within standard Web pages | ||
| ‘Other’ Resources | non-text resources with a URL (images, streaming media, non-text indexable files) | ||
| Non-information Resource (aka Structured Data Resource) [see text; 13] |
current standard Web term; non-sensical semi-structured and structured data |
||
| Structured Data | all non-RDF data-oriented resources, including non-RDF namespaces, etc. | ||
| Linked Data | all RDF |
Note that the two main resource categories used in practice are maintained. The ‘information resource’ category retains its traditional understanding. From the standpoint of the structured Web, the document resources are the unstructured and semi-structured data content from which information extraction (IE) techniques and software can extract the eventual structure.
The category of document resource likely represents the majority of potentially useful structural content on the Web and is most often overlooked in discussions of linked data or the semantic Web. This content, if subjected to IE and therefore structure creation, then becomes a URI resource better handled as a ‘structured data resource.’
‘Structured data resources‘ (that is, the poorly labeled ‘non-information resources‘) are the building blocks for the structured Web. In all cases, these resources are either semi-structured or structured data. There are two sub-categories of resources in this category, differentiated by whether the structured data is expressed in RDF (or RDF-based languages) or not. All RDF variants are called ‘linked data‘; all other forms are termed ‘structured data.’
Thus, we can see the path to the structured Web taking a number of different branches.
The first branch, and the one necessary for the largest portion of content, is to use a combination of IE techniques to extract entity information from unstructured text or to use structure extraction on the semi-structure of the document [15] to create the structured data resource for that document. Of all variants, this is the longest path, the one least developed, but one with potentially great value.
The second branch is to publish the structured data directly as a resource or to provide access to it through a Web service or API. This is the current basis for most of the structured data resources presently available on the Web. (It is also the outcome of IE for the first branch.)
The third branch is really just a complete variant of the other two — ensuring that the structured data resource is available as interoperable RDF linked data. There are two ways to proceed down this branch. One way is for the publisher to create and post the resource directly in a form of RDF. (Though the actual data can be serialized in a variety of ways such as RDF/XML, N3, Atom or Turtle, conversions between these forms is relatively straightforward.) The other way, less direct, is for the publisher or a third-party to convert non-RDF structured data into RDF with the rich and growing list of available ‘RDFizers’ [16].
This material, plus the earlier introduction to the structured Web, can now be brought together as a picture of the Web in transition. While there are no real beginning and end points, there is a steady progression from a document-centric Web to one that is data-centric, including the mediation of semantics:
![]() |
|||
| Document Web | Structured Web |
Semantic Web
|
|
| Linked Data | |||
|
|
|
|
The basic argument of this series is that we are in the midst of a transition phase — the structured Web — that marks the beginning of the dominance of data on the Web. In its broadest definition, the structured Web has many different data forms and serializations. A subset of the structured Web — namely, linked data — is a direct precursor to the semantic Web with its emphasis on RDF and data interoperability and services.
Another argument of this series is — despite the promise of linked data — that structured data resources in many forms will co-exist and provide alternatives. This diversity is natural. For RDF and linked data advocates, tools are now largely in place to convert these diverse forms. Though the ability to see large-scale availability of RDF data appears clear, the longer-term resolution of mediating heterogeneous semantics remains cloudy.
To re-cap, and to aid language and understanding, here is a brief glossary of the key terms used in this discussion:
[1] You will note a heavy emphasis on Wikipedia definitions in keeping with Web usage.
[2] Of course, the word ‘structure‘ has a broad range of meanings; we are only concerned here about data structure and its specific applicability to Web-related information.
[3] Much of the discussion in this sub-section is derived from an earlier AI3 posting, Semi-structured Data: Happy 10th Birthday!, from November 2005; some of the original information is a bit dated. It is also aided by Kingsley’s Structured Data v. Unstructured Data posting from June 2006. That posting notes the frequent confusion and ambiguity between the terms “structured data” and “unstructured data,” and the importance when speaking of structure to keep separate the structure of the data itself (the focus herein), the structure of the container that hosts the data, and the structure of the access method used to access the data.
[4] Peter Wood, School of Computer Science and Information Systems, Birkbeck College, the University of London. See http://www.dcs.bbk.ac.uk/~ptw/teaching/ssd/toc.html.
[5] The earliest known recorded mention of “semi-structured data” occurred in 1992 from N. J. Belkin and Croft, W. B., “Information filtering and information retrieval: two sides of the same coin?,” in Communications of the ACM: Special Issue on Information Filtering, vol. 35(12), pp. 29 – 38, with the first known definition from [6]. The next two mentions were in 1995 from D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman and J. Widom, “Querying semistructured heterogeneous information,” presented at Deductive and Object-Oriented Databases (DOOD ’95), LNCS, No. 1013, pp. 319-344, Springer, and M. Tresch, N. Palmer, and A. Luniewski, “Type classification of semi-structured data,” in Proceedings of the International Conference on Very Large Data Bases (VLDB). However, the real popularization of the term “semi-structured data” occurred through the seminal 1997 papers from S. Abiteboul, “Querying semi-structured data,” in International Conference on Data Base Theory (ICDT), pp. 1-18, Delphi, Greece, 1997 (http://dbpubs.stanford.edu:8090/pub/1996-19) and P. Buneman, “Semistructured data,” in ACM Symposium on Principles of Database Systems (PODS), pp. 117-121, Tucson, Arizona, May 1997 (http://db.cis.upenn.edu/DL/97/Tutorial-Peter/tutorial-semi-pods.ps.gz). Of course, semi-structured data had existed prior to these early references, only it had not been named as such.
[6] P. Schäuble, “SPIDER: a multiuser information retrieval system for semistructured and dynamic data,” in Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318 – 327, 1993.
[7] Such protocols first received serious computer science study in the late 1970s and early 1980s. In the financial realm, one early standard was electronic data interchange (EDI). In science, there were literally tens of exchange forms proposed with varying degrees of acceptance, notably abstract syntax notation (ASN.1), TeX (a typesetting system created by Donald Knuth and its variants such as LaTeX), hierarchical data format (HDF), CDF (common data format), and the like, as well as commercial formats such as Postscript, PDF (portable document format), RTF (rich text format), and the like. One of these proposed standards was the “standard generalized markup language” (SGML), first published in 1986. SGML was flexible enough to represent either formatting or data exchange. However, with its flexibility came complexity. Only when two simpler forms arose, namely HTML (HyperText Markup Language) for describing Web pages and XML (eXtensible Markup Language) for data exchange, did variants of the SGML form emerge as widely used common standards. The XML standard was first published by the W3C in February 1998, rather after the semi-structured data term had achieved some impact (http://www.w3.org/XML/hist2002) Dan Suciu was the first to publish on the linkage of XML to semi-structured data in 1998 (D. Suciu, “Semistructured data and XML,” in International Conference on Foundations of Data Organization (FODO), Kobe, Japan, November 1998; see PDF option from http://citeseer.ist.psu.edu/suciu98semistructured.html), a reference that remains worth reading to this day.
[8] David Loshin, “Simple semi-structured data,” Business Intelligence Network, October 17, 2005. See http://www.b-eye-network.com/view/1761. This example is actually quite complex and demonstrates the challenges facing IE software. Extracted entities most often relate to the nouns or “things” within a document. Note also, for example, how many of the entities involve internal “co-referencing,” or the relation of subjects such as “he” or clock times such as “10 a.m” to specific dates. A good entity extraction engine helps resolve these so-called “within document co-references.”
[9] Example text-based data serializations and formats used on the Web include Atom, Gdata, JSON, N3, pickle, RDF/XML, RSS, Turtle, XML, YaML.
[10] Example mark-up languages used on the Web include HTML, Wikitext, XHTML, XML.
[11] It was called the httpRange-14 issue by virtue of the agenda label at the TAG’s meetings.
[12] Of course, nothing compels the publisher to provide these instructions, but they are integral to “best practices” and publishers desirous of attracting consumers have incentives to follow them.
Depending on the nature of the information resource, an HTTP GET returns a 200 OK status and sends the resource (for example a request for a Web page or an RDF file) if the request is for the correct type. If the request is for the wrong type, the publisher can include in the header a 303 (See Other) redirect response to send the requester to the appropriate information resource URL. If the request is for an unknown URI or resource, a variety of 4xx error responses may result. (See further [13].)
One advantage of posting structured data resources as RDF- linked data is that this redirect can be sent to a REST-style Web Service that does a best-effort DESCRIBE of the resource using SPARQL. Since SPARQL is a query language, protocol, and results representation scheme, the redirect can come in the form of a URL that can directly query the structured data resource. In this manner, large structured data resource data sets can act as ‘endpoints’ for context-specific information linked anywhere on the Web.
[13] The report of the TAG’s 2005 ‘compromise solution’ was reported by Roy Fielding, with his public notice reproduced in full:
</TAG>
I believe that this solution enables people to name arbitrary
resources using the “http” namespace without any dependence on
fragment vs non-fragment URIs, while at the same time providing
a mechanism whereby information can be supplied via the 303
redirect without leading to ambiguous interpretation of such
information as being a representation of the resource (rather,
the redirection points to a different resource in the same way
as an external link from one resource to the other).
Note that point “a” discusses an “information resource,” with the contrasting treatment in point “b” as potentially “any resource”.
To my knowledge, the first that the unfortunate term ‘non-information resource’ was introduced to cover these point “b” conditions was in a draft (that is, unofficial) TAG finding from two years later, in May 31 of this year, “Dereferencing HTTP URIs.” Besides its unfortunate continuation of the ‘dereferencing’ term (a discussion for another day), it introduces the even-worse ‘non-information resource’ terminology. That draft TAG finding in Sec 2 talks about ‘information resources’ (as does the 2005 TAG finding), and in Sec 3 about ‘other Web resources’ (or the “any” category from the 2005 notice). In Sec 4, however, the document switches from ‘other Web’ to ‘non-information’, which is then continued through the rest of the document.
It is not too late for the community to cease using this term; our replacement suggestion is ‘structured data resource.’
[14] You may also want to see, Leo Sauermann, Richard Cyganiak, and Max Völkel “Cool URIs for the Semantic Web,” or Dan Connolly, “A Pragmatic Theory of Reference for the Web.”
[15] Web documents are typically semi-structured with embedded tag, metadata, and presentation structure in such things as headers, tables, layouts, labels, dropdown lists, etc. For the unstructured text content, traditional information extraction techniques are applied. But for the semi-structure, various scraping or extraction techniques may either be crafted by hand or be semi-automatically or automatically applied through regular expression processing, pattern matching, inspection of the Web page DOM or other techniques. One posting in this structured Web series is to be devoted to this topic.
[16] There is an impressive and growing list of data conversion protocols and tools, most which support multiple input and output forms, including RDF and various serializations of RDF. This list includes Virtuoso Sponger, GRDDL, Babel, RDFizers, general converters, Triplr, etc. One posting in this structured Web series is to be devoted to this topic.
Over the past few months I have increasingly been writing about and referring to the structured Web. I have done so purposefully, but, so far, with little background or explication. With the inauguration of this occasional series, I hope to bring more color and depth to this topic [1].
Literally, over the past year, I have been learning and documenting on AI3 my attempts to understand the basis, concepts and tools of the emerging semantic Web. In that process, I have come to define my own outlines of the Web past, present and future. Within this world view, I see the structured Web as today’s current imperative and reality.
Some Web pundits have embraced a versioning terminology of Web 2.0 and Web 3.0 to describe one such world view. I don’t personally agree with this silly versioning — indeed I poked fun in a tongue-in-cheek posting about Web 98.6 more than a year ago — but such terminology has gotten some traction and serves a purpose. I actually give my own definitions for such “versions” below if for no other reason than to close the gap with alternative world views.
We need not go back to the alternative early protocols of Usenet (and news groups), Gopher and FTP and their search engines of Veronica, WAIS, Jughead or Archie in 1991 [2] when Tim Berners-Lee first publicly announced the World Wide Web and its combination of hypertext with the Internet. More likely, the release of the Mosaic browser and CERN‘s decision to make access to the Web free in 1993 marked the true take-off point for the Web and the continued demise of the competing protocols.
Images and links in Web pages (“documents”) plus the HTML mark-up language to enable the styling and graphical design of those pages were very much in keeping with general trends, paralleling the earlier transition of personal computers to graphical interfaces and away from terminals. Mosaic became the foundation for the Netscape browser, best links compilations became a big hit through sites like Yahoo!, and the Lycos search engine, one of the first profitable Web ventures, indexed a mere 54,000 pages when it was publicly released in 1994 [3].
This initial start to the Web — today now referred to by some as ‘Web 1.0′ — can be squarely timed to 1993-1994. By 1995, the Web was appearing on the covers of major news magazines and by 1996 the phenomenon was at full throttle. But, since these early beginnings, the Web has gone through many different “versions” and transitions, most not fitting with version numbers, as some of these examples show:
Despite these differences in viewpoint, language does matter. Though some may view language as a contest in “branding,” which can legitimately apply in other venues, I think the issue here goes well beyond “branding.” Language is also necessary to aid communication.
As I explain below and elaborate upon more fully throughout this series, I believe one of the correct terms for the current evolutionary state of the Web is the ‘structured Web.’
As noted, portions of these trends and changes are more broadly combining to represent another transitional change in the Web from one solely focused on documents to one that is more object- or data-centric. Evidence of this trend includes such factors as:
One of the most popular series of presentations at this year’s WWW2007 conference in Banff was from the Linked Open Data project of the SWEO interest group. The members of this LOD project — comprised of accomplished advocates, developers and theorists — are providing the awareness, tools and example data that are showing how this emerging version may look. In fact, the group has just announced crossing the threshold of 1 billion ‘triples’ with 180,000 interlinks within its online DBpedia service, via these sources:

The LOD’s term for this effort is ‘linked data‘, and a Web site has been established to promote it. Others, harking back to Tim Berners-Lee’s original definition, refer to current efforts as a ‘Web of data’ or the ‘Semantic Data Web.’ Kingsley Idehen has been promoting the idea of ‘data spaces‘ — personal and collective — that is also a powerful metaphor.
Frankly, I think all of these terms are correct and useful. Yet I prefer the term structured Web because it is both more and less than some of these other terms.
The structured Web is more in that it pertains to any data formalism in use on the Web and includes the notion of extracting structure from uncharacterized content, by far the largest repository of potentially useful information on the Web. Yet the structured Web is also less because its ambition is solely to get that data into an interoperable framework and to forgo the full objectives of the ‘Semantic Web.’ In that regard, my concept of the structured Web is perhaps closest to the idea of linked data, though with less insistence on “correct” RDF and with specific attention to structure extraction from uncharacterized content.
One of today’s realities is that we have accomplished much but still have a long way to go to achieve the grand vision of the ‘Semantic Web’ (capitalized).
More than a year ago I wrote a piece on “Climbing the Data Federation Pyramid” that noted the tremendous progress that has been made in the last twenty years in overcoming many seemingly intractable issues in data interoperability, initially of a physical and hardware nature. The Internet and Web standards have made enormous contributions to that progress.
The diagram I used in that piece is shown below [7]. Reaching the pyramid’s pinnacle could be argued as having achieved the grand vision of the Semantic Web. With the adoption of the Internet and Web protocols, all layers up through data representation have largely been solved. Data representation, data models, schema for different world views, and means for reconciling and mediating those different world views are much of the focus of today’s conceptual challenges.
Note, as we discuss the structured Web that we are largely focusing on the layer dealing with data representation, with some minor portions (principally in disambiguation) dealing with semantics. Getting data into a canonical data representation or model still leaves very crucial challenges in what does the data mean (its semantics), reasoning over that data (inference and pragmatics), and whether the data is authoritative or can be trusted. These are the daunting — and largely remaining challenges — of the Semantic Web.

For example, let’s look solely at the layer of semantics, the immediate challenge after data representation. By semantics, we are referring to whether different statements from different sources indeed refer or not to the same entity or concept; in other words, have the same meaning. Such a determination is pivotal if we are to combine data from multiple sources.
The use of RDF, accurate name spaces and syntactically correct URIs aid this resolution, but do not completely solve it. Ultimately, semantic mediation (such as my “glad” is equivalent to your “happy”) means resolving or mediating potential heterogeneities from on the order of 40 discrete categories of potential mismatches from units of measure, terminology, language, and many others. These sources may derive from structure, domain, data or language, as shown in this table [8]:
| Class | Category | Sub-category |
| STRUCTURAL |
Naming |
Case Sensitivity |
| Synonyms | ||
| Acronyms | ||
| Homonyms | ||
| Generalization / Specialization | ||
| Aggregation | Intra-aggregation | |
| Inter-aggregation | ||
| Internal Path Discrepancy | ||
| Missing Item | Content Discrepancy | |
| Attribute List Discrepancy | ||
| Missing Attribute | ||
| Missing Content | ||
| Element Ordering | ||
| Constraint Mismatch | ||
| Type Mismatch | ||
| DOMAIN | Schematic Discrepancy | Element-value to Element-label Mapping |
| Attribute-value to Element-label Mapping | ||
| Element-value to Attribute-label Mapping | ||
| Attribute-value to Attribute-label Mapping | ||
| Scale or Units | ||
| Precision | ||
| Data Representation | Primitive Data Type | |
| Data Format | ||
| DATA | Naming | Case Sensitivity |
| Synonyms | ||
| Acronyms | ||
| Homonyms | ||
| ID Mismatch or Missing ID | ||
| Missing Data | ||
| Incorrect Spelling | ||
| LANGUAGE | Encoding | Ingest Encoding Mismatch |
| Ingest Encoding Lacking | ||
| Query Encoding Mismatch | ||
| Query Encoding Lacking | ||
| Languages | Script Mismatches | |
| Parsing / Morphological Analysis Errors (many) | ||
| Syntactical Errors (many) | ||
| Semantic Errors (many) | ||
Using the same data model (say, RDF) or the same name spaces (say, Dublin Core or FOAF) helps somewhat to remove some of these sources of heterogeneity, but not all. Undoubtedly, longer term, resolving these heterogeneities will prove tractable. But they are not easily so today.
This observation does not undercut the Semantic Web vision nor negate the massive labors in support of that vision taken to date. But, hopefully, this observation may bring some perspective to the task ahead to obtain that vision.
If nothing else, the reality of the past 15 years shows us that the Web is a “dirty,” chaotic place. If HTML coding can be screwed up, it will. If loopholes in standards and protocols exist, they will be exploited. If there is ambiguity, all interpretations become possible, with many passionately held. Innovation and unintended uses occur everywhere.
This should not be surprising, and experienced Web designers, scientists and technologists should all know this by now. There can be no disconnect between workable standards and approaches and actual use in the “wild.” Nuanced arguments over the subtleties of standards and approaches are bound to fail. Robustness, simplicity and forgiveness must take precedence over elegance and theoretical completeness.
While there has been obvious growth in the sophistication of Web sites and the underlying technologies that support them, we see continued use of obsolete approaches that clearly should have been abandoned long ago (such as Web-safe colors, small displays, older browser versions, Web pages parked on some servers that have not been modified or looked at by their original authors in a decade, etc.). We also see slow uptake for clearly “better” new approaches. And we also sometimes see explosive uptake of approaches and ideas that seemingly come out of nowhere.
We also see that those approaches that enjoy the greatest success — blogging, tagging, microformats, RSS, widgets, for example, come most recently to mind — are those that the “citizen” user can easily and readily embrace. HTML was pretty foreign at first, but now millions of users modify their own code. Millions of users of various CMS systems and Firefox have learned how to install plug-ins and add-ins and modify CSS themes and use administration consoles.
So, my observation and argument is not that we must always choose what is mindless and unchallenging. But my argument is that we must accept real-world diversity and seek simplicity, robustness and clarity for what is new.
After nearly a decade of standards work, the basis for beginning the transition to the semantic Web is in place. But that vision itself sometimes appears too demanding, too intimidating. The vision at times appears all too unreachable.
Of course, this perception is wrong. Measured over many years, perhaps some decades, the vision of the semantic Web is reachable. Much remains to be worked on and understood regarding this vision in terms of mediating and resolving semantic heterogeneities, capturing and expressing world views through formal ontologies, making inferences between these views, and establishing trust and authoritativeness. And those challenges do not yet address the even more-exciting prospects of intelligent and autonomous agents.
Rather, the rationale for the structured Web is to tone down the vision, stay with the here and now, focus on what is achievable today. And what is achievable today is very great.
Though version numbers for the Web are silly, with ‘Web 3.0′ for the semantic Web possibly being the silliest of all, such attempts do speak to the need for providing handles and language for capturing the dynamic change, diversity and complexity of the Web.
Today, right now, and all around us, a fundamental transition is taking place in the Web from a document-centric to a data-centric environment. A confluence of standards, advocacies, and previous trends are fueling this transition. Since the practical building blocks already exist, we will see this structured Web unfold before us at amazing speed.
The concept of the structured Web is thus narrower and less ambitious in scope than the ‘Semantic Web.’ The structured Web is merely a transitional step on the journey to the vision of the semantic Web, albeit one that can be fully realized today with current technologies and current understandings.
The purpose of this new series is thus to give prominence to this transition and to highlight the pragmatic, practical building blocks available to contribute to this transition. By somewhat shifting boundary definitions, the idea of the structured Web also aims to give more prominence to the importance of usability and structure extraction from semi-structured and unstructured content. These, too, are exciting areas with much potential.
So, as a way to provide a touchstone for continued discussion on this matter, here is one working definition of the structured Web:
Some of the tentative topics that I plan to address in this series include discussion of what constitutes ‘structure’ in content, why structure is important, the various existing forms of structure, human v. machine bases for viewing and interpreting structure, the importance of finding “canonical” representation forms while also appreciating real-world diversity, the means to convert data forms and serializations, the means to extract structure from all types of content, transitioning to semantic understandings, and likely others.
Others may be added to this series over time and will be categorized under ‘Structured Web‘ on the AI3 blog.
[1] You will note a heavy emphasis on Wikipedia definitions and histories in this piece, in keeping with the general theme of versions and transitions on the Web.
[2] News groups really did not have a good search engine until the launch of Deja News in 1995.
[3] Chris Sherman, "Happy Birthday, Lycos!," Search Engine Watch, August 14, 2002. See http://searchenginewatch.com/showPage.html?page=2160551.
[4] A fairly good summary of the History of the Web can be found on Wikipedia.
[5] Michael K. Bergman (Aug 2001). “The Deep Web: Surfacing Hidden Value“. The Journal of Electronic Publishing 7 (1). An earlier version of this paper was published by BrightPlanet Corp. in July 2000.
[6] While there are variations, Linux, Apache, MySQL and the scripting languages of either Python, PHP, or Perl are often referred to as ‘LAMP‘, one central basis for much open source software and, more broadly, interoperable open-source application packages.
[7] I would make a few changes today, notably in deprecating XML somewhat.
[8] This table builds on Pluempitiwiriyawej and Hammer's schema by adding the fourth major category of language. See Charnyote Pluempitiwiriyawej and Joachim Hammer, "A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources," Technical Report TR00-004, University of Florida, Gainesville, FL, 36 pp., September 2000. See ftp.dbcenter.cise.ufl.edu/Pub/publications/tr00-004.pdf.
UMBEL is a lightweight way to describe the subject(s) of Web content, akin to the relationship “isAbout”. Its subject reference structure is meant to be simple, universally applicable, and agnostic to the form or schema of source data. UMBEL does not replace formal domain or upper ontologies and has little or no inferential power. It is merely a pool of consensus ‘proxies’ to initially describe what subjects data sets are about.
UMBEL’s design includes binding mechanisms that work with HTML, tagging or other standard practices, including various RDF schema and more formal ontologies. Its reference subject ‘backbone’ is derived from the intersection of common subjects found on popularly used Web sites and other accepted subject references. Access and easy adoption is given preference over inferential or logical elegance.
In addition to its core reference subjects, the UMBEL project will provide look up, query, registration, pinging, and related services. The project is completely open and supported by a community process. All project products are made available without charge under Creative Commons licenses. UMBEL’s development is being backed by a number of leading open data efforts and entities; see the last section for how to get involved.
The UMBEL project stands for the Upper-level Mapping and Binding Exchange Layer. UMBEL is pronounced like “humble” — in keeping with its nature — except without the “h”. The name has the same Latin root as umbrella (umbra for shade, or umbella for parasol), meant to convey the umbrella-like nature of UMBEL’s subject bindings.
With dozens of protocols and hundreds of thousands of potentially useful data sets, there are many challenges to getting Web data to interoperate. Two of these problems are foundational.
First, there are dozens of formalisms, schema, models and serializations for characterizing and communicating data and data content on the Web, ranging from the simplest Web page to the most formal OWL ontologies. A universal mechanism is lacking for how these variations can describe or publish to one other what they are about. This mechanism must be simple, neutral, broadly applicable and widely accepted.
Second, even if this publication mechanism existed, there is no accepted set of subjects for referencing what this diverse content is about. No attempt to date to provide a reference subject structure has been widely accepted.
Combined, these twin problems mean there are few road signs and poor road maps for how to find relevant data sets on the Web. UMBEL provides simple — but necessary — first steps to address these basic problems.
Advocates and users of various models and formalisms on the Web have their real-world reasons for embracing each form. Domain experts and various communities have their own world views, represented by their own vocabularies and structure. Only by understanding and respecting those differences can means to bridge them become widely accepted.
There is, of course, no such thing as complete objectivity or neutrality. But, from the standpoint of UMBEL and its purpose, keeping its approach simple with a minimum of structure poses the least challenge to the world views of existing publishers and data sets on the Web — and therefore the best likelihood of wide acceptance. Where choices are necessary, such as the selection of the reference subjects themselves, building from accepted Web practices and norms helps minimize bias and arbitrariness.
Thus, by necessity, UMBEL must be simple with limited ambitions. Its reference structure is merely a ‘bag of subjects’, with each subject reference only acting as a ‘proxy’ to a set of concepts that specific users may describe and refer to in their own ways. UMBEL’s core structure is completely flat, with no implied hierarchy or structure amongst its reference subjects. UMBEL’s reference subjects are simply that, proxy references and no more.
UMBEL thus has no or minimal inference power (though some disambiguation is possible). Inferencing, usefulness and authoritativeness are the responsibility of others. UMBEL is meant only to be a map to possible subjects, not whether those destinations are worthwhile or, indeed, even correct.
The selection of the actual subject proxies within the UMBEL core are to be based on consensus use. The subjects of existing and popular Web subject portals such as Wikipedia and the Open Directory Project (among others) will be intersected with other widely accepted subject reference systems such as WordNet and library classification systems (among others) in order to derive the candidate pool of UMBEL subject proxies. The actual methodology and sources of this process are still being determined (see further the project specification).
The objective, in any case, is to provide a simple and transparent method for subject selection that reflects current use and consensus to the maximum extent possible. The anticipation is that the first subject candidate pool will number in the many hundreds to the low thousands of proxies.
UMBEL as a general subject ‘backbone’ is meant to be useful as a reference by more specific domains or ontologies, but not fully descriptive for any of them. The core, internal UMBEL ontology is to be based on RDF and written in the RDF Schema vocabulary of SKOS (Simple Knowledge Organization System).
Very simple binding mechanisms will be developed and extended to the most widely employed approaches on the Web. UMBEL will, at minimum, support Atom, microformats, OPML, OWL, RDF, RDFa, RDF Schema, RSS, tags (via Tag Commons), and topic maps in its first release. The simplicity of the ontology and approach will enable other formats to be easily added.
Ping, update and registration protocols will also be provided for these formats. Existing project sponsors already possess a variety of ping, update, conversion and translation utilities for such purposes.
Besides the core structure, the UMBEL project will also develop a second ‘unofficial’ structure of hierarchical and interlinked subject relationships. This ‘unofficial’ structure will be used solely for look up and browsing functions, and will reside external to the core UMBEL subject and binding structure. Indeed, we anticipate that many such look-up structures from other parties may evolve over time for specific purposes and viewpoints.
Finally, besides development of the UMBEL ontology, the project will also be providing a data set registration service, information and collaboration Web site, tools clearinghouse, and support for language translations and some tools development.
The initial project site is at http://www.umbel.org, including this project introduction, the draft project specification (http://www.umbel.org/proposal.xhtml), and other helpful background information. A more interactive Web site is currently under development and will be announced shortly.
A mailing list you can monitor or join to become part of the project is at http://groups.google.com/group/umbel-ontology.
There has been much recent excitement surrounding RDF, its linked data, “RDFizers” and GRDDL to convert existing structured information to RDF. The belief is that 2007 is the breakout year for the semantic Web.
The why of the semantic Web is now clear. The how of the semantic Web is now appearing clear, built around RDF as the canonical data model and the availability and maturation of a growing set of tools. And the what is also becoming clear, with the massive new datastores of DBpedia, Wikipedia3, the HCLS demo, Musicbrainz, Freebase, and the Encyclopedia of Life only being some of the most recent and visible exemplars.
Yet the where aspect seems to be largely missing.
By where I mean: Where do we look for our objects or data? If we have new objects or data, where do we plug into the system? Amongst all possible information and domains, where do we fit?
These questions seem simplistic or elemental in their basic nature. It is almost incomprehensible to wonder where all of this data now emerging in RDF relates to each other — what the overall frame of reference is — but my investigations seem to point to such a gap. In other words, a key piece of the emerging semantic Web infrastructure — where is this stuff — seems to be missing. This gap is across domains and across standard ontologies.
What are the specific components of this missing where, this missing infrastructure? I believe them to be:
I discuss these missing components in a bit more detail below, concluding with some preliminary thoughts on how the problem of this critical infrastructure can be redressed. The good news, I believe, is that these potholes on the road to the semantic Web can be relatively easily and quickly filled.
I think it is fair to say that structure on the current Web is a jumbled mess.
As my recent Intrepid Guide to Ontologies pointed out, there are at least 40 different approaches (or types of ontologies, loosely defined) extant on the Web for organizing information. These approaches embrace every conceivable domain and subject. The individual data sets using these approaches span many, many orders of magnitude in size and range of scope. Diversity and chaos we have aplenty, as the illustrative diagram of this jumbled structural mess shows below.
Mind you, we are not yet even talking about whether one dot is equivalent or can be related to another dot and in what way (namely, connecting the dots via real semantics), but rather at a more fundamental level. Does one entire data set have a relation to any other data set?
Unfortunately, the essential precondition of getting data into the canonical RDF data model — a challenge in its own right — does little to guide us as to where these data sets exist or how they may relate. Even in RDF form, all of this wonderful RDF data exists as isolated and independent data sets, bouncing off of one another in some gross parody of Brownian motion.
What this means, of course, is that useful data that could be of benefit is overlooked or not known. As with problems of data silos everywhere, that blindness leads to unnecessary waste, incomplete analysis, inadequate understanding, and duplicated effort [1].
These gaps were easy to overlook when the focus of attention was on the why, what and how of the semantic Web. But, now that we are seeing RDF data sets emerge in meaningful numbers, the time is ripe to install the road signs and print up the maps. It is time to figure out where we want to go.
As I discussed in an earlier posting, there’s not yet enough backbone to the structured Web. I believe this structure should firstly be built around a lightweight subject- or topic-oriented reference layer.
| An umbrella subject reference becomes the “super-structure” to which other specific ontologies can place themselves in an “info-spatial” context. |
Unlike traditional upper-level ontologies (see the Intrepid Guide), this backbone is not meant to be comprised of abstract concepts or a logical completeness of the “nature of knowledge”. Rather, it is meant to be only the thinnest veneer of (mostly) hierarchically organized subjects and topic references (see more below).
This subject or topic vocabulary (at least for the backbone) is meant to be quite small, likely more than a few hundred reference subjects, but likely less than many thousands. (There may be considerably more terms in the overall controlled vocabulary to assist context and disambiguation.)
This “umbrella” subject structure could be thought of as the reference subject “super-structure” to which other specific ontologies could place themselves in a sort of locational or “info-spatial” context.
One way to think of these subject reference nodes is as the major destinations — the key cities, locations or interchanges — on the broader structured Web highway system. A properly constructed subject structure could also help disambiguate many common misplacements by virtue of the context of actual subject mappings.
For example, an ambiguous term such as “driver” becomes unambiguous once it is properly mapped to one of its possible topics such as golf, printers, automobiles, screws, NASCAR, or whatever. In this manner, context is also provided for other terms in that contributing domain. (For example, we would now know how to disambiguate “cart” as a term for that domain.)
A high-level and lightweight subject mapping layer does not warrant difficult (and potentially contentious) specificity. The point is not to comprehensively define the scope of all knowledge, but to provide the fewest choices necessary for what subject or subjects a given domain ontology may appropriately reference. We want a listing of the major destinations, not every town and parish in existence.
(That is not to say that more specific subject references won’t emerge or be appropriate for specific domains. Indeed, the hope is that an “umbrella” reference subject structure might be a tie-in point for such specific maps. The more salient issue addressed here is to create such an “umbrella” backbone in the first place.)
This subject reference “super-structure” would in no way impose any limits on what a specific community might do itself with respect to its own ontology scope, definition, format, schema or approach. Moreover, there would be no limit to a community mapping its ontology to multiple subject references (or “destinations”, if you will).
The reason for this high-level subject structure, then, is simply to provide a reference map for where we might want to go — no more, no less. Such a reference structure would greatly aid finding, viewing and querying actual content ontologies — of whatever scope and approach — wherever that content may exist on the Web.
This is not a new idea. About the year 2000 the topic map community was active with published subject indicators (PSIs) [2] and other attempts at topic or subject landmarks. For example, that report stated:
From that earlier community, Bernard Vatant has subsequently spoken of the need and use of “hubjects” as organizing and binding points, as has Jack Park and Patrick Durusau using the related concept of “subject maps” [3]. An effort that has some overlap with a subject structure is also the Metadata Registry being maintained by the National Science Digital Library (NSDL).
However, while these efforts support the idea of subjects as partial binding or mapping targets, none of them actually proposed a reference subject structure. Actual subject structures may be a bit of a “third rail” in ontology topics due to the historical artifact of wanting to avoid the pitfalls of older library classification systems such as the Dewey Decimal Classification or the Library of Congress Subject Headings.
Be that as it may. I now think the timing is right for us to close this subject gap.
This mapping layer lends itself to a three-tiered general conceptual model. The first tier is the subject structure, the conceptualization embracing all possible subject content. This referential layer is the lookup point that provides guidance for where to search and find “stuff.”
The second layer is the representation layer, made up of informal to “formal” ontologies. Depending on the formalism, the ontology provides more or less understanding about the subject matter it represents, but at minimum binds to the major subject concepts in the top subject mapping layer.
The third layer are the data sets and “data spaces” [4] that provide the actual content instantiations of these subjects and their ontology representations. This data space layer is the actual source for getting the target information.
Here is a diagram of this general conceptual model:
The layers in this general conceptual model progress from the more abstract and conceptual at the upper level, useful for directing where traffic needs to go, to concrete information and data at the lower level, the real object of manipulation and analysis.
The data spaces and ontologies of various formalisms in the lower two tiers exist in part today. The upper mapping layer does not.
By its nature the general upper mapping layer, in its role as a universal reference “backbone,” must be somewhat general in the scope of its reference subjects. As this mapping infrastructure begins to be fleshed out, it is also therefore likely that additional intermediate mapping layers will emerge for specific domains, which will have more specific scopes and terminology for more accurate and complete understanding with their contributing specific data spaces.
So, let’s beg the question for a moment of what an actual reference subject structure might be. Let’s assume one exists. What should we do with this structure? How should we bind to it?
How this might look in a representative subject binding to two candidate ontology data sets is shown below:

This diagram shows two contributing data sets and their respective ontologies (no matter how “formal”) binding to a given “subject.” This subject proxy may not be the only one bound by a given data set and its ontology. Also note the “subject” is completely arbitrary and, in fact, is only a proxy for the binding to the potential topic.
Clearly, this approach does not provide a powerful inferential structure.
But, what it does provide is a quite powerful organizational structure. Access and the where for the sake of simplicity and adoption are given preference over inferential elegance. Thus, assuming now, again, a subject structure backbone, matched with these principles and the same jumbled structures as first noted above, we can now see an organizational order emerge from the chaos:
The assumptions to get to this point are not heroic. Simple binding mechanisms matched with a high-level subject “backbone” are all that is required mechanically for such an approach to emerge. All we have done to achieve this road map is to follow the trails above.
So, there is nothing really revolutionary in any of the discussion to this point. Indeed, many have cited the importance of reference structures previously. Why hasn’t such a subject structure yet been developed?
One explanation is that no one accepts any global “external authority” for such subject identifications and organization. The very nature of the Web is participatory and democratic (with a small “d”). Everyone’s voice is equal, and any structure that suggests otherwise will not be accepted.
It is not unusual, therefore, that some of the most accepted venues on the Web are the ones where everyone has an equal chance to participate and contribute: Wikipedia, Flickr, eBay, Amazon, Facebook, YouTube, etc. Indeed, figuring out what generates such self-sustaining “magic” is the focus of many wannabe ventures.
While that is not our purpose here, our purpose is to set the preconditions for what would constitute a referential subject structure that can achieve broad acceptance on the Web. And in the paragraph above, we already have insight into the answer: Build a subject structure from already accepted sources rich in subject content.
A suitable subject structure must be adaptable and self-defining. These criteria reflect expressions of actual social usage and practice, which of course changes over time as knowledge increases and technologies evolve.
One obvious foundation to building a subject structure is thus Wikipedia. That is because the starting basis of Wikipedia information has been built entirely from the bottom up — namely, what is a deserving topic. This has served Wikipedia and the world extremely well, with now nearly 1.8 million articles online in English alone (versions exist for about 100 different languages) [6]. There is also a wealth of internal structure within Wikipedia’s “infobox” templates, structure that has been utilized by DBpedia (among others) to actually transform Wikipedia into an RDF database (as I described in an earlier article). As socially-driven and -evolving, I foresee Wikipedia to continue to be the substantive core at the center of a knowledge organizational framework for some time to come.
But Wikipedia was never designed with an organizing, high-level subject structure in mind. For my arguments herein, creating such an organizing (yes, in part, hierarchical) structure is pivotal.
One innovative approach to provide a more hierarchical structural underpinning to Wikipedia has been YAGO (“yet another great ontology”), an effort from the Max-Planck-Institute Saarbrücken [7]. YAGO matches key nouns between Wikipedia and WordNet, and then uses WordNet’s well-defined taxonomy of synsets to superimpose the hierarchical class structure. The match is more than 95% accurate; YAGO is also designed for extensibility with other quality data sets.
I believe YAGO or similar efforts show how the foundational basis of Wikipedia can be supplemented with other accepted lexicons to derive a suitable subject structure with appropriate high-level “binding” attributes. In any case, however constructed, I believe that a high-level reference subject structure must evolve from the global community of practice, as has Wikipedia and WordNet.
I have previously described this formula as W + W + S + ? (for Wikipedia + WordNet + SKOS + other?). There indeed may need to be “other” contributing sources to construct this high-level reference subject structure. Other such potential data sets could be analyzed for subject hierarchies and relationships using fairly well accepted ontology learning methods. Additional techniques will also be necessary for multiple language versions. Those are important details to be discussed and worked out.
The real point, however, is that existing and accepted information systems already exist on the Web that can inform and guide the construction of a high-level subject map. As the contributing sources evolve over time, so could periodic updates and new versions of this subject structure be generated.
Though the choice of the contributing data sets from which this subject structure could be built will never be unanimous, using sources that have already been largely selected through survival of the “fittest” by large portions of the Web-using public will go a long ways to establishing authoritativeness. Moreover, since the subject structure is only intended as a lightweight reference structure — and not a complete closed-world definition — we are also setting realistic thresholds for acceptance.
The specific topic of this piece has been on a subject reference mapping and binding layer that is lightweight, extensible, and reflects current societal practice (broadly defined). In the discussion, there has been recognition of existing schema in such areas as people (FOAF), projects (DOAP), communities (SIOC) and geographical places (Geonames) that might also contribute to the overall binding structure. There very well may need to be some additional expansions in other dimensions such as time and events, organizations, products or whatever. I hope that a consensus view on appropriate high-level dimensions emerges soon.
There are a number of individuals presently working on a draft proposal for an open process to create this subject structure. What we are working quickly to draft and share with the broader community is a proposal related to:
We believe this to be an exciting and worthwhile endeavor. Prior to the unveiling of our public Web site and project, I encourage any of you with interest in helping to further this cause to contact me directly at mike at mkbergman dot com [8].
[1] I attempted to quantify this problem in a white paper from about two years ago, Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents, BrightPlanet Corporation, 42 pp., July 20, 2005. Some reasons for how such waste occurs were documented in a four-part series on this AI3 blog, Why Are $800 Billion in Document Assets Wasted Annually?, beginning in October 2005 through parts two, three and four concluding in November 2005.
[2] See Published Subjects: Introduction and Basic Requirements (OASIS Published Subjects Technical Committee Recommendation, 2003-06-24.
[3] See especially Park and Durusau, Avoiding Hobson’s Choice In Choosing An Ontology and Towards Subject-centric Merging of Ontologies.
[4] The concept of “data spaces” has been well-articulated by Kingsley Idehen of OpenLink Software and Frédérick Giasson of Zitgist LLC. A “data space” can be personal, collective or topical, and is a virtual “container” for related information irrespective of storage location, schema or structure.
[5] If the publisher gets it wrong, and users through the reference structure don’t access their desired content, there will be sufficient motivation to correct the mapping.
[6] See Wikipedia’s statistics sections.
[7] Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum, “Yago – A Core of Semantic Knowledge” (also in bib or ppt). Presented at the 16th International World Wide Web Conference (WWW 2007) in Banff, Alberta, on May 8-12, 2007. YAGO contains over 900,000 entities (like persons, organizations, cities, etc.) and 6 million facts about these entities, organized under a hierarchical schema. YAGO is available for download (400Mb) and converters are available for XML, RDFS, MySQL, Oracle and Postgres. The YAGO data set may also be queried directly online.
[8] I’d especially like to thank Frédérick Giasson and Bernard Vatant of Mondeca for their reviews of a draft of this posting. Fred was also instrumental in suggesting the ideas behind the figure on the general conceptual model.
Ontology is one of the more daunting terms for those exposed for the first time to the semantic Web. Not only is the word long and without many common antecedents, but it is also a term that has widely divergent use and understanding within the community. It can be argued that this not-so-little word is one of the barriers to mainstream understanding of the semantic Web.
The root of the term is the Greek ontos, or being or the nature of things. Literally — and in classical philosophy — ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined ontology as a “formal specification of a conceptualization.”
While there have been attempts to strap on more or less formal understandings or machinery around ontology, it still has very much the sense of a world view, a means of viewing and organizing and conceptualizing and defining a domain of interest. As is made clear below, I personally prefer a loose and embracing understanding of the term (consistent with Deborah McGuinness’s 2003 paper, Ontologies Come of Age [1]).
There has been a resurgence of interest in ontologies of late. Two reasons have been the emergence of Web 2.0 and tagging and folksonomies, as well as the nascent emergence of the structured Web. In fact, on April 23-24 one of the noted communities of practice around ontologies, Ontolog, sponsored the Ontology Summit 2007 ,“Ontology, Taxonomy, Folksonomy: Understanding the Distinctions.”
These events have sparked my preparing this guide to ontologies. I have to admit this is a somewhat intrepid endeavor given the wealth of material and diversity of opinions.
Of course, a fancy name is not sufficient alone to warrant an interest in ontologies. There are reasons why understanding, using and manipulating ontologies can bring practical benefit:
Both structure and formalism are dimensions for classifying ontologies, which combined are often referred to as an ontology’s “expressiveness.” How one describes this structure and formality differs. One recent attempt is this figure from the Ontology Summit 2007‘s wrap-up communique:

Note the bridging role that an ontology plays between a domain and its content. (By its nature, every ontology attempts to “define” and bound a domain.) Also note that the Summit’s 50 or so participants were focused on the trade-off between semantics v. pragmatic considerations. This was a result of the ongoing attempts within the community to understand, embrace and (possibly) legitimize “less formal” Web 2.0 efforts such as tagging and the folksonomies that can result from them.
There is an M.C. Escher-like recursion of the lizard eating its tail when one observes ontologists creating an ontology to describe the ontological domain. The above diagram, which itself would be different with a slight change in Summit participation or editorship, is, of course, but one representative view of the world. Indeed, a tremendous variety of scientific and research disciplines concern themselves with classifying and organizing the “nature of things.” Those disciplines go by such names as logicians, taxonomists, philosophers, information architects, computer scientists, librarians, operations researchers, systematicists, statisticians, historians, and so forth. (In short, given our ontos, every area of human endeavor has the urge to classify, to organize.) In each of these areas not only do their domains differ, but so do the adopted structures and classification schemes often used.
There are at least 40 terms or concepts across these various disciplines, most related to Web and general knowledge content, that have organizational or classificatory aspects that — loosely defined — could be called an “ontology” framework or approach:
Actual domains or subject coverage are then mostly orthogonal to these approaches.
Loosely defined, the number of possible ontologies is therefore close to infinite: domain X perspective X schema. (Just kidding — sort of! In fact, UMBC’s Swoogle ontology search service claims 10,000 ontologies presently on the Web; the actual data from August 2006 ranges from about 16,000 to 92,000 ontologies, depending on how “formal” the definition. These counts are also limited to OWL-based ontologies.)
Many have misunderstood the semantic Web because of this diversity and the slipperiness of the concept of an ontology. This misunderstanding becomes flat wrong when people claim the semantic Web implies one single grand ontology or organizational schema, One Ring to Rule Them All. Human and domain diversities makes this viewpoint patently false.
The choice of an ontological approach to organize Web and structured content can be contentious. Publishers and authors perhaps have too many choices: from straight Atom or RSS feeds and feeds with tags to informal folksonomies and then Outline Processor Markup Language or microformats. From there, the formalism increases further to include the standard RDF ontologies such as SIOC (Semantically-Interlinked Online Communities), SKOS (Simple Knowledge Organizing System), DOAP (Description of a Project), and FOAF (Friend of a Friend) and the still greater formalism of OWL’s various dialects.
Arguing which of these is the theoretical best method is doomed to failure, except possibly in a bounded enterprise environment. We live in the real world, where multiple options will always have their advocates and their applications. All of us should welcome whatever structure we can add to our information base, no matter where it comes from or how it’s done. The sooner we can embrace content in any of these formats and convert it to a canonical form, we can then move on to needed developments in semantic mediation, the threshold condition for the semantic Web.
| There are at least 40 concepts — loosely defined — that could be called an “ontology” framework or approach. |
So, diversity is inevitable and should be accepted. But that observation need not also embrace chaos.
In my early training in biological systematics, Ernst Haeckel’s recapitulation theory that “ontogeny recapitulates phylogeny” (note the same ontos root, the difference from ontology being growth v. study) was losing favor fast. The theory was that the development of an organism through its embryological phases mirrors its evolutionary history. Today, modern biologists recognize numerous connections between ontogeny and phylogeny, explain them using evolutionary theory, or view them as supporting evidence for that theory.
Yet, like the construction of phylogenetic trees, systematicists strive for their classifications of the relatedness of organisms to be “natural”, to reflect the true nature of the relationship. Thus, over time, that understanding of a “natural” system has progressed from appearance → embryology → embryology + detailed morphology → species and interbreeding → DNA. While details continue to be worked out, the degree of genetic relatedness is now widely accepted by biologists as a “natural” basis for organizing the Tree of Life.
It is not unrealistic to also seek “naturalness” in the organization of other knowledge domains, to seek “naturalness” in the organization of their underlying ontologies. Like natural systems in biology, this naturalness should emerge from the shared understandings and perceptions of the domain’s participants. While subject matter expertise and general and domain knowledge are essential to this development, they are not the only factors. As tagging systems on the Web are showing, common usage and broad acceptance by the community at hand is important as well.
While it may appear that a domain such as the biological relatedness of organisms is more empirical than the concepts and ambiguous words in most domains of human endeavor, these attempts at naturalness are still not foolish. The phylogeny example shows that understanding changes over time as knowledge is gained. We now accept DNA over the recapitulation theory.
As the formal SKOS organizational schema for knowledge systems recognizes (see below), the ideas of narrower and broader concepts can be readily embraced, as well as concepts of relatedness and aliases (synonyms). These simple constructs, I would argue, plus the application of knowledge being gained in related domains, will enable tomorrow’s understandings to be more “natural” than today’s, no matter the particular domain at hand.
So, in seeking a “naturalness” within our organizational schema, we can also see that change is a constant. We also see that the tools and ideas underlying the seemingly abstract cause of merging and relating existing ontologies to one another will further a greater “naturalness” within our organizations of the world.
According to the Summit, expressiveness is the extent and ease by which an ontology can describe domain semantics. Structure they define as the degree of organization or hierarchical extent of the ontology. They further define granularity as the level of detail in the ontology. And, as the diagram above alludes, they define other dimensions of use, logical basis, purpose and so forth of an ontology.
The over fifty respondents from 42 communities submitted some 70 different ontologies under about 40 terms to a survey that was used by the Summit to construct their diagram. These submissions included:
. . . formal ontologies (e.g., BFO, DOLCE, SUMO), biomedical ontologies (e.g., Gene Ontology, SNOMED CT, UMLS, ICD), thesauri (e.g., MeSH, National Agricultural Library Thesaurus), folksonomies (e.g., Social bookmarking tags), general ontologies (WordNet, OpenCyc) and specific ontologies (e.g., Process Specification Language). The list also includes markup languages (e.g., NeuroML), representation formalisms (e.g., Entity-Relation model, OWL, WSDL-S) and various ISO standards (e.g., ISO 11179). This [Ontolog] sample clearly illustrates the diversity of artifacts collected under “ontology”.
I think the simplest spectrum for such distinctions is the formalism of the ontology and its approach (and language or syntax, not further discussed here). More formal ontologies have greater expressiveness and structure and inferential power, less formal ones the opposite. Constructing more formal ontologies is more demanding, and takes more effort and rigor, resulting in an approach that is more powerful but also more rigid and less flexible. Like anything else, there are always trade-offs.
Based on work by Leo Obrst of Mitre as interpreted by Dan McCreary, we can view this as a trade-off as one of semantic clarity v. the time and money required to construct the formalism [12, 13]:
Note this diagram reflects the more conventional, practitioner’s view of the “formal” ontology, which does not include taxonomies or controlled vocabularies (for example) in the definition. This represents the more “closely defined” end of the ontology (semantic) spectrum.
However, since we are speaking here of ontologies and the structured Web or the semantic Web, I believe we need to embrace a concept of ontology aligned to actual practice. Not all content providers can or want to employ ontology engineers to enable formal inferencing of their content. Yet, on the other hand, their content in its various forms does have some meaningful structure, some organization. The trick is to extract this structure for more meaningful use such as data exchange or data merging.
Under such “loosely defined” bases we can thus see a spectrum of ontology approaches on the Web, proceeding from less structure and formalism to more so:
| Type or Schema | Examples | Comments on Structure and Formalism | |
| Standard Web Page | entire Web | General metadata fields in the and internal HTML codes and tags provide minimal, but useful sources of structure; other HTTP and retrieval data can also contribute | |
| Blog / Wiki Page | examples from Technorati, Bloglines, Wikipedia | Provides still greater formalism for the organization and characterization of content (subjects, categories, posts, comments, date/time stamps, etc.). Importantly, with the addition of plug-ins, some of the basic software may also provide other structured characterizations or output (SIOC, FOAF, etc.; highly varied and site-specific given the diversity of publishing systems and plug-ins) | |
| RSS / Atom feeds | most blogs and most news feeds | RSS extends basic XML schema for more robust syndication of content with a tightly controlled vocabulary for feed concepts and their relationships. Because of its ubiquity, this is becoming a useful baseline of structure and formalism; also, the nature of adoption shows much about how ontological structure is an artifact, not driver, for use | |
| RSS / Atom feeds with tags or OPML | Grazr, most newsfeed aggregators can import and export OPML lists of RSS feeds | The OPML specification defines an outline as a hierarchical, ordered list of arbitrary elements. The specification is fairly open which makes it suitable for many types of list data. See also OML and XOXO | |
| Hierarchical Faceted Metadata | XFML, Flamenco | These and related efforts from the information architecture (IA) community are geared more to library science. However, they directly contribute to faceted browsing, which is one of the first practical instantiations of the semantic Web | |
| Folksonomies | Flickr, del.icio.us | Based on user-generated tags and informal organizations of the same; not linked to any “standard” Web protocols. Both tags and hierarchical structure are arbitrary, but some researchers now believe over large enough participant sets that structural consensus and value does emerge | |
| Microformats | Example formats include hAtom, hCalendar, hCard, hReview, hResume, rel-directory, xFolk, XFN and XOXO | A microformat is HTML mark up to express semantics with strictly controlled vocabularies. This markup is embedded using specific HTML attributes such as class, rel, and rev. This method is easy to implement and understand, but is not free-form | |
| Embedded RDF | RDFa, eRDF | An embedded format, like microformats, but free-form, and not subject to the approval strictures associated with microformats | |
| Topic Maps | Infoloom, Topic Maps Search Engine | A topic map can represent information using topics (representing any concept, from people, countries, and organizations to software modules, individual files, and events), associations (which represent the relationships between them), and occurrences (which represent relationships between topics and information resources relevant to them) | |
| RDF | Many; DBpedia, etc. | RDF has become the canonical data model since it represents a “universal” conversion format | |
| RDF Schema | SKOS, SIOC, DOAP, FOAF | RDFS or RDF Schema is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources. This becomes the canonical ontology common meeting ground | |
| OWL Lite | Here are some existing OWL ontologies; also see Swoogle for OWL search facilities | The Web Ontology Language (OWL) is a language for defining and instantiating Web ontologies. An OWL ontology may include descriptions of classes, along with their related properties and instances. OWL is designed for use by applications that need to process the content of information instead of just presenting information to humans. It facilitates greater machine interpretability of Web content than that supported by XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. The three language versions are in order of increasing expressiveness | |
| OWL DL | |||
| OWL Full | |||
| Higher-order “formal” and “upper-level” ontologies | SUMO, DOLCE, PROTON, BFO, Cyc, OpenCyc | These provide comprehensive ontologies and often related knowledge bases, with the goal of enabling AI applications to perform human-like reasoning. Their reasoning languages often use higher-order logics |
As a rule of thumb, items that are less “formal” can be converted to a more formal expression, but the most formal forms can generally not be expressed in less formal forms.
As latter sections elaborate, I see RDF as the universal data model for representing this structure into a common, canonical format, with RDF Schema (specifically SKOS, but also supplemented by FOAF, DOAP and SIOC) as the organizing ontology knowledge representation language (KRL).
This is not to say that the various dialects of OWL should be neglected. In bounded environments, they can provide superior reasoning power and are warranted if they can be sufficiently mandated or enforced. But the RDF and RDF-S systems represent the most tractable “meeting place” or “middle ground,” IMHO.
As if the formalism dimension were not complicated enough, there is also the practice within the ontology community to characterize ontologies by “levels”, specifically upper, middle and lower levels. For example, chances are that you have heard particularly of “upper-level” ontologies.
The following figure helps illustrate this “level” dimension. This diagram is also from Leo Obrst of Mitre [12], and was also used in another 2006 talk by Jack Park and Patrick Durusau (discussed further below for other reasons):

Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus — that is, real ontos stuff) than to “generic common knowledge.” Most all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak [2].
The above diagram conveys a sense of how multiple ontologies can relate to one another both in terms of narrower and broader topic matter and at the same “levels” of generalization. Such “meta-structure” (if you will) can provide a reference structure for relating multiple ontologies to one another.
| The relationships and mappings amongst ontologies is a critical infrastructure component of the semantic Web. |
It resides exactly in such bindings or relationships that we can foresee the promise of querying and relating multiple endpoints on the Web with accurate semantics in order to connect dots and combine knowledge bases. Thus, the understanding of the relationships and mappings amongst ontologies becomes a critical infrastructural component of the semantic Web.
We can better understand these mapping and inter-relationship concepts by using a concrete example with a formal ontology. We’ll choose to use the Suggested Upper Merged Ontology simply because it is one of the best known. We could have also selected another upper-level system such as PROTON [3] or Cyc [4] or one of the many with narrower concept or subject coverage.
SUMO is one of the formal ontologies that has been mapped to the WordNet lexicon, which adds to its semantic richness. SUMO is written in the SUO-KIF language. SUMO is free and owned by the IEEE. The ontologies that extend SUMO are available under GNU General Public License.
The abstract, conceptual organization of SUMO is shown by this diagram, which also points to its related MILO (MId-Level Ontology), which is being developed as a bridge between the abstract content of the SUMO and the richer detail of various domain ontologies:

At this level, the structure is quite abstract. But one can easily browse the SUMO structure. A nifty tool to do so is the KSMSA (Knowledge Support for Modeling and Simulation) ontology browser. Using a hierarchical tree representation, you can navigate through SUMO, MILO, WordNet, and (with the locally installed version) Wikipedia.
The figure below shows the upper-level entity concept on the left; the right-hand panel shows a drill-down into the example atom entity:
These views may be a bit misleading because the actual underlying structure, while it has hierarchical aspects as shown here, really is in the form of a directed acyclic graph (showing other relatedness options, not just hierarchical ones). So, alternate visualizations include traditional network graphs.
The other thing to note is that the “things” covered in the ontology, the entities, are also fairly abstract. That is because the intention of a standard “upper-level” ontology is to cover all relevant knowledge aspects of each entity’s domain. This approach results in a subject and topic coverage that feels less “concrete” than the coverage in, say, an encyclopedia, directory or card catalog.
According to Park and Durusau, upper ontologies are diverse, middle ontologies are even more diverse, and lower ontologies are more diverse still. A key observation is that ontological diversity is a given and increases as we approach real user interaction levels. Moreover, because of the “loose” nature of ontologies on the Web (now and into the future), diversity of approach is a further key factor.
Recall the initial discussion on the role and objectives of ontologies. About half of those roles involve effectively accessing or querying more than one ontology. The objective of “upper-level” ontologies, many with their own binding layers, is also expressly geared to ontology integration or federation. So, what are the possible mechanisms for such binding or integration?
A fundamental distinction within mechanisms to combine ontologies is whether it is a unified or centralized approach (often imposed or required by some party) or whether it is a schema mapping or binding approach. We can term this distinction centralized v. federated.
Centralized approaches can take a number of forms. At the most extreme, adherence to a centralized approach can be contractual. At the other end are reference models or standards. For example, illustrative reference models include:
Though I have argued that One Ring to Rule them All is not appropriate to the general Web, there may be cases within certain enterprises or where through funding clout (such as government contracts), some form of centralized approach could be imposed [5]. And, frankly, even where compliance can not be assured, there are advantages in economy, efficiency and interoperability to attempt central ontologies. Certain industries — notably pharmaceuticals and petrochemicals — and certain disciplines — such as some areas of biology among others — have through trade associations or community consensus done admirable jobs in adopting centralized approaches.
However, combining ontologies in the context of the broader Internet is more likely through federated approaches. (Though federated approaches can also be improved when there are consensual standards within specific communities.) The key aspect of a federated approach is to acknowledge that multiple schema need to be brought together, and that each contributing data set and its schema will not be altered directly and will likely remain in place.
Thus, the key distinctions within this category are the mechanisms by which those linkages may take place An important goal in any federated approach is to achieve interoperability at the data or instance level without unacceptable loss of information or corruption of the semantics. Numerous specific approaches are possible, but three example areas in RDF-topic map interoperability, the use of “subject maps”, and binding layers can illustrate some of the issues at hand.
In 2006 the W3C set up a working group to look at the issue of RDF and topic maps interoperability. Topic maps have been embraced by the library and information architecture community for some time, and have standards that have been adopted under ISO. Somewhat later but also in parallel was the development of the RDF standard by W3C. The interesting thing was that the conceptual underpinnings and objectives between these two efforts were quite similar. Also, because of the substantive thrust of topic maps and the substantive needs of its community, quite a few topic maps had been developed and implemented.
One of the first efforts of the W3C work group was to evaluate and compare five or six extant proposals for how to relate RDF and topic maps [6]. That report is very interesting reading for any one desirous of learning more about specific issues in combining ontologies and their interoperability. The result of that evaluation then led to some guidelines for best practices in how to complete this mapping [7]. Evaluations such as these provide confidence that interoperability can be achieved between relatively formal schema definitions without unacceptable loss in meaning.
A different, “looser” approach, but one which also grew out of the topic map community, is the idea of “subject maps.” This effort, backed by Park and Durusau noted above, but also with the support of other topic map experts such as Steve Newcomb and Robert Barta via their proposed Topic Maps Reference Model (ISO 13250-5), seems to be one of the best attempts I’ve seen that both respects the reality of the actual Web while proposing a workable, effective scheme for federation.
The basic idea of a subject map is built around a set of subject “proxies.” A subject proxy is a computer representation of a subject that can be implemented as an object, must have an identity, and must be addressable (this point provides the URI connector to RDF). Each contributing schema thus defines its own subjects, with the mappings becoming meta-objects. These, in turn, would benefit from having some accepted subject reference schema (not specifically addressed by the proponents) to reduce the breadth of the ultimate mapped proxy “space.”
I don’t have the expertise to judge further the specifics, but I find the presentation and papers by Park and Durusau, Avoiding Hobson’s Choice In Choosing An Ontology and Towards Subject-centric Merging of Ontologies to be worthwhile reading in any case. I highly recommend these papers for further background and clarity.
As the third example, “binding layers” are a comparatively newer concept. Leading upper-level ontologies such as SUMO or PROTON propose their own binding protocols to their “lower” domains, but that approach takes place within the construct of the parent upper ontology and language. Such designs are not yet generalized solutions. By far the most promising generalized binding solution is the SKOS (Simple Knowledge Organization System). Because of its importance, the next section is devoted to it.
Finally, with respect to federated approaches, there are quite a few software tools that have been developed to aid or promote some of these specific methods. For, example, about twenty of the software applications in my Sweet Tools listing of 500+ semantic Web and -related tools could be interpreted as aiding ontology mapping or creation. You may want to check out some of these specific tools depending on your preferred approach [8].
SKOS, or the Simple Knowledge Organization System, is a formal language and schema designed to represent such structured information domains as thesauri, classification schemes, taxonomies, subject-heading systems, controlled vocabularies, or others; in short, most all of the “loosely defined” ontology approaches discussed herein. It is a W3C initiative more fully defined in its SKOS Core Guide [9].
SKOS is built upon the RDF data model of the subject-predicate-object “triple.” The subjects and objects are akin to nouns, the predicate a verb, in a simple Dick-sees-Jane sentence. Subjects and predicates by convention are related to a URI that provides the definitive reference to the item. Objects may be either a URI resource or a literal (in which case it might be some indexed text, an actual image, number to be used in a calculation, etc.).
Being an RDF Schema simply means that SKOS adds some language and defined relationships to this RDF baseline. This is a bit of recursive understanding, since RDFS is itself defined in RDF by virtue of adding some controlled vocabulary and relations. The power, though, is that these schema additions are also easily expressed and referenced.
This RDFS combination can thus be shown as a standard RDF triple graph, but with the addition of the extended vocabulary and relations:

The power of the approach arises from the ability of the triple to express virtually any concept, further extended via the RDFS language defined for SKOS. SKOS includes concepts such as “broader” and “narrower”, which enable hierarchical relations to be modeled, as well as “related” and “member” to support networks and arrays, respectively [9].
We can visualize this transforming power by looking at how an “ontology” in a totally foreign scheme can be related to the canonical SKOS scheme. In the figure below the left-hand portion shows the native hierarchical taxonomy structure of the UK Archival Thesaurus (UKAT), next as converted to SKOS on the right (with the overlap of categories shown in dark purple). Note the hierarchical relationships visualize better via a taxonomy, but that the RDF graph model used by SKOS allows a richer set of additional relationships including related and alternative names:
SKOS also has a rich set of annotation and labeling properties to enhance human readability of schema developed in it [9]. There is also a useful draft schema that the W3C’s SWEO (Semantic Web Education and Outreach) group is developing to organize semantic Web-related information [10].
Combined, these constructs provide powerful mechanisms for giving contributory ontologies a common conceptualization. When added to other sibling RDF schema such as FOAF or SIOC or DOAP, still additional concepts can be collated.
While not addressed directly in this piece, it is obviously of first importance to have content with structure before the questions of connecting that information can even arise. Then, that structure must also be available in a form suitable for merging or connection.
At that point, the subjects of this posting come into play.
| We are stubbing our toes on the rocks while we gaze at the heavens. |
We see that the daily Web has a diversity of schema or ontologies “loosely defined” for representing the structure of the content. These representations can be transferred to more complex schema, but not in the opposite direction. Moreover, the semantic basis for how to make these mappings also needs some common referents.
RDF provides the canonical data model for the data transfers and representations. RDFS, especially in the form of SKOS, appears to form one basis for the syntax and language for these transformations. And SKOS, with other schema, also appears to offer much of the appropriate “middle ground” for data relationships mapping.
However, lacking in this story is a referential structure for subject relationships [11]. (Also lacking are the ultimately critical domain specifics required for actual implementation.)
Abstract concepts of interest to philosophers and deep thinkers have been given much attention. Sadly, to date, concrete subject structures in which tangible things and tangible actions can be shared, is still very, very weak. We are stubbing our toes on the rocks while we gaze at the heavens.
Yet, despite this, simple and powerful infrastructures are well in-hand to address all foreseeable syntactic and semantic issues. There appear to be no substantive limits to needed next steps.
Lastly, many valuable resources for further reading and learning may be found within the Ontolog Community, W3C, TagCommons and Topics Maps groups. Enjoy! And be wary of ontology no longer.
[1] Deborah L. McGuinness. “Ontologies Come of Age”. In Dieter Fensel, Jim Hendler, Henry Lieberman, and Wolfgang Wahlster, editors. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, 2003. See http://www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-(with-citation).htm
[2] I think it would be much clearer to refer to “upper level” ontologies as abstract or conceptual, “mid levels” as mapping or binding, and “lower levels” as domain (without any hierarchical distinctions such as lower or lowest or sub-domain), but current practice is probably too entrenched to change now.
[3] There are many aspects that make PROTON one of the more attractive reference ontologies. The PROTON ontology (PROTo ONtology), developed within the scope of the SEKT project, is attractive because of its understandability, relatively small size, modular architecture and a simple subsumption hierarchy. It is available in an OWL Lite form and is easy to adopt and extend. On the face of it, the Topic class within PROTON, which is meant to serve as a bridge between different ontologies, may also provide a binding layer to specific subject topics as sub-classes or class instances.
[4] See my earlier post on Cyc.
[5] Even with such clout, it is questionable to get rather complete adherence, as Ada showed within the Federal government. However, where circumstances allow it, central schema and ontologies may be worth pursuing because of improved interoperability and lower costs, even where some portions do not adhere and are more chaotic like the standard Web.
[6] See, A Survey of RDF/Topic Maps Interoperability Proposals, W3C Working Group Note 10 February 2006, Pepper, Vitali, Garshol, Gessa, Presutti (eds.)
[7] See, Guidelines for RDF/Topic Maps Interoperability, W3C Editor’s Draft 30 June 2006, Pepper, Presutti, Garshol, Vitali (eds.)
[8] Here are some Sweet Tools that may have a usefulness to ontology federation and creation:
[10] The SWEO classification ontology is still under active development and has these draft classes. Note, however, the relative lack of actual subject or topic matter:
Classes are currently defined as:
If the page describes an organization, it can be tagged as:
If the page is a person’s home page or blog or similar, it could be:
The type of audience can also be tagged, for example:
[11] The OASIS Topic Maps Published Subjects Technical Committee was formed a number of years back to promote Topic Maps interoperability through the use of Published Subjects Indicators (PSIs). Their resulting report was a very interesting effort that unfortunately did not lead to wide adoption, perhaps because the effort was a bit ahead of its time or it was in advance of the broader acceptance of RDF. This general topic is the subject of a later posting by me.
[12] See further, Leo Obrst, “The Semantic Spectrum & Semantic Models,” a Powerpoint presentation (http://ontolog.cim3.net/file/resource/presentation/LeoObrst_20060112/OntologySpectrumSemanticModels–LeoObrst_20060112.ppt)
made as part of an Ontolog Forum (http://ontolog.cim3.net/) presentation in two parts, “What is an Ontology? – A Briefing on the Range of Semantic Models” (see http://ontolog.cim3.net/cgi-bin/wiki.pl?ConferenceCall_2006_01_12), in January 2006. Leo Obrst is a principal artificial intelligence scientist at MITRE’s (http://www.mitre.org) Center for Innovative Computing and Informatics and a co-convener of the Ontolog Forum. His presentation is a rich source of practical overview information on ontologies.
[13] The actual diagram is an unattributed modification by Dan McCreary (see http://www.danmccreary.com/presentations/sem_int/sem_int.ppt) based on Obrst’s material in [12].