Posted: June 7, 2007

Can this Popular Blogging Software Mature to World Class?

I have just spent more time than I care to admit upgrading to version 2.2 of WordPress. I did this partly because the WP developers have signaled their intent to drop support for 2.1 as they move to a more formal upgrade and release schedule. I also did it because I was beginning, with site and user growth, to experience some noticeable performance issues. And I wanted to give the site its second-year anniversary tune-up.

Since I had upgraded about four previous times in the past two years (my earlier and popular Professional Blogging Guide was written for version 1.5 of WP), I pretty much assumed this next upgrade would also be generally smooth. Boy, was I wrong!

This whole experience, though not completely disillusioning, does raise some real questions and concerns. Likely some of it has to do with my own stupidity or neglect, or the fact I am not a system admin. But those same factors apply to the majority of WordPress users who administer their own sites. If you fit in this category, watch out! You may be sailing unawares into a perfect storm.

Standard Upgrade Instructions

Just for reference, I have a virtual dedicated (private) server (VPS) with 256 MB of RAM, burstable to 1 GB. I'm running Linux CentOS 4, Apache 2.0.52, MySQL 4.1.20 and PHP 4.3.9. In other words, my VPS is a fairly typical LAMP installation. My VPS plan provides no support, but I do have a Simple Control Panel (Turbopanel), though most of my management occurs through SSH. Disk space and bandwidth are not issues, though I sometimes suspect those sharing the server are resource hogs (doesn't everyone with a VPS suspect so?); at least it is not a standard shared hosting arrangement!

As with past upgrades, the instructions for upgrading WordPress have become fairly standard. You can download WordPress version 2.2 from here and the WordPress folks offer pretty clear initial installation and upgrade guides. Read these and follow them.

If upgrading, it is critical that you backup your database and backup your relevant site changes (including theme changes and plug-ins) before attempting any installation.

(Though I personally did not have to resort to a full restore when I encountered the problems noted below, it was close. In any event, if you can put up with some downtime, even if everything goes to hell in a handbasket, you can still get back to your original configuration. Back up!)
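For those who manage their own servers, here is a minimal sketch of what the database half of a backup amounts to. It assumes mysqldump is on the PATH, and the credentials, database name and output path shown are placeholders to be replaced with your own; it is illustrative only, not a substitute for a proper backup plug-in or routine.

    <?php
    // Minimal database backup sketch: dump the WordPress database to a
    // compressed file before touching any upgrade files. The credentials,
    // database name and output path below are placeholders.
    $file = '/tmp/wp-backup-' . date('Ymd') . '.sql.gz';
    $cmd  = sprintf('mysqldump --user=%s --password=%s %s | gzip > %s',
        escapeshellarg('wp_user'),
        escapeshellarg('wp_password'),
        escapeshellarg('wordpress'),
        escapeshellarg($file));
    shell_exec($cmd);
    echo "Database dumped to $file\n";
    ?>

Copying the wp-content directory (themes, plug-ins, uploads) by whatever means you prefer covers the "relevant site changes" half of the advice.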

The actual process of upgrading went pretty smoothly for me. Physically copying files and getting the site to come back up wasn't the problem. The problems arose in how the site (mis)behaved after the upgrade.

Problems and Inadequate Answers

For many reasons (related in part to a perhaps too-long list of plug-ins), I had no end of problems with this WordPress v. 2.2 upgrade. Here are the key ones I discovered:

  1. While WP v. 2.2 may be more efficient overall, with 200 bugs fixed (claims I frankly have not been able to verify), its initial resource use has gone way up. I suspect this is because of the more aggressive use and addition of JavaScript, but there may be other explanations as well
  2. One net upshot is that if you have been a user of many plug-ins in the past, you may find you have inadequate resources for the upgrade
  3. You may therefore see hard-to-diagnose behavior due to too many plug-ins. Tested individually, the plug-ins may look fine; in combination and in weird ways, you may see problems (for me, it showed up worst on my Chronological Listing page, which uses the wp_get_archives call). White screens or partial page renderings are a telltale sign
  4. Many plug-ins are simply no longer working (this may have occurred in version 2.1, but I had not upgraded since version 2.0.5)
  5. Watch out for plug-ins that add tables to your MySQL database, such as stat packages or spam blockers. Not only do these insidiously cause your database to grow, but they introduce another possible point of failure and synchronization issues with caches
  6. The visual editor (TinyMCE) as implemented in WP causes DIVs to be deleted without warning when switching back and forth to code view; nasty! There are other editor weaknesses as well (see below)
  7. Caching, whether through individual options or, more especially, in combinations, works horribly! I had numerous Apache race conditions where load averages headed through the roof, bringing the virtual server to a complete standstill (including the inability to log in with SSH!)
  8. Many prior guides regarding performance tuning no longer seem useful or applicable, likely again due to the JavaScript additions and other fundamental changes to the internal "loop"
  9. With or without caching, I was seeing anomalous and episodic loading delays and race conditions with Apache/MySQL when using the internal Feedburner scripts
  10. Various other documentation and general project problems (also see below)
  11. A general lack of understanding of what to look for, or where, regarding performance, configuration or (frankly) virtually any of the operative aspects of my actual blog site, and
  12. Being attentive and investigating these matters has made me a better administrator by becoming familiar with existing reporting, monitoring and testing utilities already extant on my system or easily added to it.

Some of these points are expanded upon below. A very helpful compilation of WordPress v. 2.2 issues, tips and resources is provided by the Blog Herald.

Now, I somewhat violated one of my personal cardinal rules for WordPress by upgrading at a bare new sub-version release (2.2) rather than waiting for its first point fixes. (I actually don't understand or track the rationale for WordPress version numbering; 2.2 was huge, as was 2.1, seemingly as great as the transition from 1.x to 2.x!) Somehow, however, I had the impression that 2.2 was a simple increment over 2.1, which did have a number of sub-sub releases. I did, however, wait a couple of weeks after the initial release.

I suspect there will be a 2.2.1 release very soon, because the scope of the issues that are emerging is pretty big. One source of new fixes will be hosting providers; I'm sure they are seeing some pretty bad stuff recently. I'd hate to own a hosting service with many WP users at the moment, having to both field their calls and wonder what the heck to do with my current server infrastructure to get all of my users back up to snuff!

If you don't have to upgrade immediately, my advice would be to wait. There is quite a bit that needs to settle down, and I believe some of the plug-in developers have big efforts ahead of them getting their stuff stable again.

Specific Areas Warranting Further Discussion

When all hell breaks loose and panic takes hold, it is easy to lose perspective. Plugging holes in the dike, while potentially effective, can also divert and exhaust you, keeping you from identifying and then diagnosing real root causes.

Truthfully, I do not understand why, with all the documentation, code forums and hundreds of thousands of WordPress Google results, it is so difficult to get authoritative answers to many questions and problems. Moreover, hosted sites versus self-administered ones, and dedicated v. virtual v. shared servers, all have different needs and configuration options.

So, let me add to the general background noise . . . .

Non-working Plug-ins

There is a page on the WordPress site that lists known working and non-working plug-ins for version 2.2. While useful, this list is incomplete.

I followed the general advice of individually re-activating and testing each plug-in after the upgrade. However, because of caching issues and general load problems due to the internal Feedburner scripts, some of my testing may have indicated a plug-in was not working when it was really these interaction effects that were the problem. With a starting roster of about 25 plug-ins (see figure), the potential interactive and combinatorial effects are huge!

Currently Installed WP Plug-ins

In my testing, I believe these plug-ins to not be compatible with WP v. 2.2: Advanced WYSIWYG, COinS Metadata Exposer, Customizable Post Listings, EzStatic, Kimili Flash Embed, Most Popular posts, and Smart Archives.

I strongly suspect these plug-ins are not working properly, and have de-activated them in any case: ImageManager, Popularity Contest, Post Teaser, Spam Karma 2 (the WP site indicates it is OK, but see below), StatTraq, and wp-cache.

More than half of my initial plug-ins thus showed likely problems. As the figure above shows, I am now using only eight plug-ins, one of which is a new one that I updated myself (the Advanced TinyMCE Editor, discussed in an upcoming posting).

Of course, I could be wrong on any of these plug-ins due to the testing and combinatorial challenges noted above. I do not mean to impugn any individual plug-in. Indeed, I installed them initially because I perceived them to provide value. So, if I have erred in my assessment, I apologize in advance. But no matter how measured, compatibility is a real concern.

Too Many Plugins?

There has been discussion for some time regarding the impact of the number of plug-ins on WordPress performance. For example, a listing in early 2007 from Alex King’s blog seems to have the largest nucleus of discussion. The earlier consensus was that too many plug-ins may be an issue, but 30 or so were OK. (Naturally, some plug-ins are more resource-intensive than others.)

Frankly, up until now, I really had not given the question much thought. I would encounter what looked to be a useful functional addition, install and activate the plug-in, and go on my merry way.

The tip-off that something might not be right came when my chronological listing of posts broke after the upgrade. My first iteration of this page used the wp_get_archives function, often used for archive listings in the sidebars of WP blogs. (I later used the Customizable Post Listings plug-in, which is a very nice one, but it is broken in 2.2.)

What I stumbled upon late in my upgrade testing was the observation that as more plug-ins were added to my system, the number of listed archive postings would decrease (also losing the footer). Under conditions of too many plug-ins, the entire page would go blank.

Seeing this behavior was total serendipity, and it came only after much hair pulling and gnashing of teeth over why seemingly OK plug-ins worked fine individually or in small combinations, but when all were invoked worked to trash the site.

I suspect some big bug resides somewhere in the system related to resource use. My temporary fix has been to set an absolute limit on the 'postbypost' listing in that function call, well below my actual total post count. Go figure.
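For anyone whose chronological listing page calls wp_get_archives directly in the template, the workaround amounts to a single capped call; the limit of 150 below is an arbitrary placeholder, chosen only to stay under my roughly 200 posts:

    <?php
    // Workaround sketch: cap the number of entries the archive listing
    // returns so the page renders instead of going blank when many
    // plug-ins are active. The limit value is a placeholder.
    wp_get_archives('type=postbypost&limit=150&format=html');
    ?>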

Stat and Spam Packages Not Worth the Overhead?

When everything appears to be broken no stone goes unquestioned or unturned. Because of their complexity, I spent considerable time looking at my statistics and spam plug-ins.

One possible cause of degrading site performance was a rapidly growing database. I had been getting complaints from users about slow load times over the past couple of months. I really did not want to bite the bullet of MySQL optimization and tuning, but I could avoid it no longer.

The first thing I noticed was that my MySQL database had more than doubled in size in six months, to 368 MB (43.5 MB zipped)! I knew I was writing some long posts, but that seemed ridiculous.

So, while I had been faithfully backing up my site, I had tended to avoid probing into MySQL. Faced with the unavoidable, I happened to notice that 280 MB, or 75%(!), of my database size was due to storage of StatTraq statistics. (I know, you idiot, why didn't I see this sooner?) I was blown away by how this silent accumulator had become the eggplant eating MySQL due to my inattentiveness.
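If you are facing a similarly bloated database, a quick way to see where the space is going is to list table sizes directly. Here is a minimal diagnostic sketch; the connection details are placeholders, so substitute your own wp-config values:

    <?php
    // Diagnostic sketch: list each table in the WordPress database with its
    // combined data and index size, so silent accumulators such as stats or
    // spam tables stand out immediately.
    $link = mysql_connect('localhost', 'wp_user', 'wp_password');
    mysql_select_db('wordpress', $link);
    $result = mysql_query('SHOW TABLE STATUS', $link);
    while ($row = mysql_fetch_assoc($result)) {
        $mb = ($row['Data_length'] + $row['Index_length']) / 1048576;
        printf("%-40s %8.1f MB\n", $row['Name'], $mb);
    }
    mysql_close($link);
    ?>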

While I had earlier looked into StatTraq speedups thinking that might be a cause of some of my performance issues, I frankly was shocked at the resource demands of this application. In fact, the only reason I had installed the application in the first place was in order to extract the minute amount of data for providing a popularity listing of prior posts in my sidebar. This was the result of an early decision when I first set up my blog and one I had not questioned or re-considered since.

That raises the broader question of why embed stats in your blog at all? You can get the same basic tracking and statistics from Google, Feedburner, Sitemeter, Stat Counter, your own internal Webalizer on most virtual servers, etc. With experience and reflection, it now seems absolutely crazy to burden WP and MySQL further when such services can so easily and freely be obtained elsewhere.

I still liked the idea of a popular posts listing, and tried a couple of alternatives, including Alex King’s Popularity Contest. However, once I saw that these systems, too, added tables to MySQL, I decided to swear off the sweets and let go of being enamored with popularity listings.

So, with the removal of internal stats, my database shrank to a more manageable 80 MB (7.5 MB zipped).

Now on the warpath for insidious MySQL sneaks, I next looked into Spam Karma 2. Actually, this plug-in has performed great for me, but again, it is complicated and writes considerable data to MySQL. In fact, even with the newly trimmed database, SK2 was occupying about 25% of total database space!

I presently have SK2 disabled, and am getting surprisingly good spam control with the simple use of the Comment Timeout plug-in. It seems to take some time for spammers to harvest new postings, so, if yours is not a comment-heavy site (as is the case for this AI3 blog), then timing out comments on older postings looks to be quite effective with far less overhead.
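To be clear, the snippet below is not the Comment Timeout plug-in's own code, just a sketch of the underlying idea: close comments on anything older than a cutoff so that harvested URLs stop attracting spam. The 60-day cutoff, credentials and 'wp_' table prefix are all placeholders.

    <?php
    // Sketch only: flip comment_status to 'closed' for published posts older
    // than 60 days. Run by hand (or from a cron job) against the WP database.
    $link = mysql_connect('localhost', 'wp_user', 'wp_password');
    mysql_select_db('wordpress', $link);
    mysql_query("UPDATE wp_posts
                    SET comment_status = 'closed'
                  WHERE post_status = 'publish'
                    AND post_date < DATE_SUB(NOW(), INTERVAL 60 DAY)", $link);
    mysql_close($link);
    ?>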

I haven’t yet removed SK2 tables, and may choose to re-activate it again in the future. But I am definitely now in the mood to question the operating profiles of statistics and spam plug-ins.

Caching

In the past week, I’ve researched and tested Apache server settings, MySQL queue caching, WP internal caching, external caching plug-ins (WP-Cache 2) and PHP opcode caches, alone and in combination.

Truth be told, despite coming across a few guides that appeared authoritative, I'm not sure that all of these systems work well with WP v. 2.2; some definitely have issues (WP-Cache 2), combinations can produce race conditions, and, in general, nothing feels stable enough at present to use. I will likely need to await a later 2.2.x release, with specific compatibility announcements from the specific cache mechanisms, before I try using them again.
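For reference, the only internal WordPress caching switch I am aware of is a single, sparsely documented constant in wp-config.php. Here is a minimal sketch of turning it on (and the line to comment back out if pages go stale or the race conditions return):

    // wp-config.php excerpt (sketch): enable WordPress's built-in object
    // cache. Remove or comment out the line to disable it again.
    define('ENABLE_CACHE', true);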

Server Resources

Of course, the 256 MB of RAM mentioned above should have caught most everyone's attention. I know my current RAM is at the bare minimum, and perhaps most of the issues I've flogged in this post could be solved in a RAM-rich environment.

I will look into this, but I'm also cheap, cheap. So RAM, support and a host of other considerations will go into any future hosting provider change. Who knows, I might even choose to host the site myself?

(Actually, I need to get back to work. After the experience of the past week, I’m itching to let my blog site run as configured for quite a while before I attempt any major new changes, let alone migrating to another provider!)

Visual (Rich-text) TinyMCE Editor

TinyMCE is, IMHO, a fantastic JavaScript-based WYSIWYG editor that has matured nicely over time, with a good feature set and stable operation. However, the philosophy and implementation behind the WordPress team's adoption of this editor simply suck! I repeat, suck!

A posting later this week will address this topic further.

Longer-term Cracks in the Foundation?

I still (painfully) remember my transition from WordPerfect to MS Word, necessitated by market and technology shifts. But I loved WordPerfect and knew how to make it sing. I’m sure I’m not as proficient with WordPress, but I have come to understand and work around its quirks sufficient to be pretty productive. Yet, based on current trends, I’m fearful those days may be coming to an end.

There is an overarching set of issues behind the WP project that is starting to become apparent to me. Though I have used and recommended the system for two years, and have written one of the popular guides about how to roll your own professional WP blog site, here are those troubling trends and gaps I am seeing:

  • I'm actually pretty disappointed with both the pace and the bugginess of WordPress upgrades. As my site has gotten more complicated, so have these upgrades and the unwelcome downtime and diversion of working through (often) poorly tested releases. I have been getting in the habit of skipping releases and then only upgrading after a point release has come out (2.2.1 or 2.2.2 versus the initial 2.2, for example). Unfortunately, in my current case, I was having performance problems that required tuning and other caching adjustments. Why can't WP have upgrade approaches with the ease and professionalism of Mozilla and Firefox?
  • The WP plug-in system, if one can call it that, is a nightmare. Though there is guidance for how to write a plug-in, there is nothing that might be called a formal API. Registration procedures are non-existent; all one need do is enter a few lines of text in the head of a PHP file (see the sketch after this list). And the resources for listing available plug-ins are woefully poor (WordPress has tried its own plug-ins directory and is making slow improvements, but the Weblogs Tool Collection is still better, though incomplete.) Again, where is the elegance and togetherness shown by the Firefox/Mozilla add-on system and registry?
  • While I may be off on its technical feasibility, it would be attractive to be able to better configure WP internally and control its resource use. There are virtually no configuration parameters that can be set in WP, other than poorly documented options such as ENABLE_CACHE; what it does, and why or how, is unclear, let alone other such possible settings. The WP user must go into multiple external apps including the OS, MySQL and Apache (or another Web server; what are the supported options??) and attempt performance tuning via trial and error
  • Resource guides and documentation are unnecessarily poor, especially regarding system configuration, performance and tuning, and
  • Obsolete code documentation is kept online well after updates, and professional APIs are non-existent.
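To illustrate the point about registration, here is roughly all it takes to "register" a plug-in with WordPress: a comment block at the top of a PHP file dropped into wp-content/plugins, plus a hook. Everything below (the names, URL and hook behavior) is a made-up example, not a real plug-in.

    <?php
    /*
    Plugin Name: Example Plugin
    Plugin URI: http://example.com/example-plugin
    Description: Illustrative only; the entire "registration" is this comment block.
    Version: 0.1
    Author: Jane Blogger
    */

    // A hypothetical hook: append a hidden note to every post's content.
    function example_plugin_footer_note($content) {
        return $content . "\n<!-- example plugin was here -->";
    }
    add_filter('the_content', 'example_plugin_footer_note');
    ?>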

Granted, perhaps it is unfair to compare WP with Mozilla and its installed base of perhaps 100-fold more users. Yet, on the other hand, WP has been around for quite some time and has millions of its own users.

Without a doubt, most would observe (and I would agree) that some of the upgrade difficulties covered in this article were of my own doing. Add a plug-in here, one there, keep blogging, and only occasionally dig out those reference sheets with the hidden steps, logins and passwords to speak with such things as MySQL, phpMyAdmin, Apache, “root” Linux access and the rest. Guilty as charged.

Yet, on the other hand, writing and maintaining a blog is a means — not an end — for me. Sure, I can learn this stuff, spend the time, and even blog about it on occasion. But that is not my daily focus or passion.

It is one thing to set up a blog, try a few posts, and then move on. Literally millions of people have done so.

Yet there are also many hundreds of thousands of content-rich sites that have become more-or-less permanent and often destinations in their own right. As the space matures, so must the platforms upon which it is based. I hope that WordPress will continue to be that choice for me, but some key structural work on its foundation is well in order.

Posted by AI3's author, Mike Bergman Posted on June 7, 2007 at 2:35 pm in Blogs and Blogging, Site-related, Software Development | Comments (7)
The URI link reference to this post is: http://www.mkbergman.com/382/wordpress-v-22-upgrade-exposes-cracks-in-the-foundation/
The URI to trackback this post is: http://www.mkbergman.com/382/wordpress-v-22-upgrade-exposes-cracks-in-the-foundation/trackback/

An annual birthday is a time to take some stock, and to do some tuning up. This week it was my blog's turn. A subsequent post will report on that (painful) experience. I gave a general spit polish to the ol' site to say thanks! and Happy 2nd Birthday!

The actual birth date of this site is May 27, 2005. In the two ensuing years, I have posted 208 articles, most quite long. Over the past 12 months, I posted 100 articles, unveiled my SweetSearch Google custom search engine for the semantic Web, and unveiled and grew Sweet Tools, my listing of 500+ semantic Web and related tools. I added multiple-language translation (and then removed it when Google announced its own service!) and began testing Google ads.

My site popularity has continued to climb. AI3 is now ranked about 45,000 on Technorati (it was about 100,000 on the site's first birthday), which is pretty good for a site mostly focused on technical issues of Internet content and the structured Web, and I get good commentary both on my blog and around the Web.

Thanks, everyone, for your kind words and support!

So, all in all, this blog's toddling year was a good one. Now that we're walking, perhaps some running and jumping in the coming year will be in order!

Posted by AI3's author, Mike Bergman Posted on June 7, 2007 at 1:27 pm in Site-related | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/379/happy-second-birthday-ai3/
The URI to trackback this post is: http://www.mkbergman.com/379/happy-second-birthday-ai3/trackback/
Posted: June 3, 2007

To All,

You may encounter poor site performance and other weird behavior for the next day or so. I’m upgrading the site to WordPress 2.2, as well as adding caching to improve site load times and other changes. Plug-ins are breaking left and right, and the going is pretty rough.

Pardon the interruption, and thanks for your patience!

Posted by AI3's author, Mike Bergman Posted on June 3, 2007 at 12:23 pm in Site-related | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/378/pardon-the-interruption/
The URI to trackback this post is: http://www.mkbergman.com/378/pardon-the-interruption/trackback/
Posted: June 1, 2007

A remarkable 7-min video demonstration from a recent TED talk by Blaise Agüera y Arcas has just been posted online. If any of you have wondered what benefits a linked Web of data might bring, this is the talk to see. Simply stated, all I can say is, Wow!:

Photosynth TED Presentation

The actual TED talk link for Blaise’s presentation is found here.

The first remarkable piece of technology in this demo is from Seadragon, founded by Blaise and acquired last year by Microsoft. This part of the technology allows visual information to be smoothly browsed, panned or zoomed regardless of the amount of data involved or the bandwidth of the network. Watch as you zoom from all the pages of an entire book to an individual letter! This part of the technology has obvious applications to maps, but frankly, to any large data space.

The second portion of the technology, and perhaps an even more remarkable part of the demo, marries the Seadragon capability with a means to aggregate multiple images into an interactive, immersive, 3-D whole. This technology, originally developed at the University of Washington, is called Photosynth.

The Photosynth software takes a large collection of photos of a place or an object, analyzes them for similarities, and then displays the photos in a reconstructed three-dimensional space, showing you how each one relates to the next. This software is capable of assembling static photos — in the case of the demo, hundreds of photos of the Notre Dame cathedral publicly available on Flickr — into a synergy of zoomable, navigable, interactive and “immersive” spaces.

This example shows how individual data “objects” sharing a similar tag can be aggregated and put to use as a new “whole” that is much, much greater than the sum of its parts. It is this kind of emergent quality that gives the promise of the structured Web and the semantic Web its power: much more can be done with linked data than with individual documents.

To better understand the amazing technology behind all of this magic, I recommend the geeks in the crowd check out this earlier and longer (37 min) video on Blaise and his work at Microsoft Live Labs; it is found at the Channel 9 MSDN site.

Thanks to Christian Long and his think:lab blog for first posting about this TED talk.

Posted by AI3's author, Mike Bergman Posted on June 1, 2007 at 11:41 am in Adaptive Information, Semantic Web | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/377/semantically-linked-interactive-imagery-wow-the-emergent-web-in-action/
The URI to trackback this post is: http://www.mkbergman.com/377/semantically-linked-interactive-imagery-wow-the-emergent-web-in-action/trackback/
Posted: May 29, 2007

The Why, How and What of the Semantic Web are Becoming Clear — Yet the Maps and Road Signs to Guide Our Way are Largely Missing

There has been much recent excitement surrounding RDF, its linked data, “RDFizers” and GRDDL to convert existing structured information to RDF. The belief is that 2007 is the breakout year for the semantic Web.

The why of the semantic Web is now clear. The how of the semantic Web is now appearing clear, built around RDF as the canonical data model and the availability and maturation of a growing set of tools. And the what is also becoming clear, with the massive new datastores of DBpedia, Wikipedia3, the HCLS demo, Musicbrainz, Freebase, and the Encyclopedia of Life only being some of the most recent and visible exemplars.

Yet the where aspect seems to be largely missing.

By where I mean: Where do we look for our objects or data? If we have new objects or data, where do we plug into the system? Amongst all possible information and domains, where do we fit?

These questions seem simplistic or elemental in nature. It is almost incomprehensible that we should have to wonder how all of this data now emerging in RDF relates to one another — what the overall frame of reference is — but my investigations seem to point to such a gap. In other words, a key piece of the emerging semantic Web infrastructure — where is this stuff? — seems to be missing. This gap exists across domains and across standard ontologies.

What are the specific components of this missing where, this missing infrastructure? I believe them to be:

  • Lack of a central look-up point for where to find this RDF data in reference to desired subject matter
  • Lack of a reference subject context for where this relevant RDF data fits; where can we place this data in a contextual frame of reference — animal, mineral, vegetable?
  • Lack of an open means by which any content structure — from formal ontologies to “RDFized” documents to complete RDF data sets — can “bind” or “map” to other data sets relevant to its subject domain
  • Lack of a registration or publication mechanism for data sets that do become properly placed, the where of finding SPARQL or similar query endpoints, and
  • In filling these gaps, the need for a broad community process to give these essential infrastructure components legitimacy.

I discuss these missing components in a bit more detail below, concluding with some preliminary thoughts on how the problem of this critical infrastructure can be redressed. The good news, I believe, is that these potholes on the road to the semantic Web can be relatively easily and quickly filled.

The Lack of Road Signs Causes Collisions and Missed Turns

I think it is fair to say that structure on the current Web is a jumbled mess.

As my recent Intrepid Guide to Ontologies pointed out, there are at least 40 different approaches (or types of ontologies, loosely defined) extant on the Web for organizing information. These approaches embrace every conceivable domain and subject. The individual data sets using these approaches span many, many orders of magnitude in size and range of scope. Diversity and chaos we have aplenty, as the illustrative diagram of this jumbled structural mess shows below.

Jumbled and Diverse Formalisms

Mind you, we are not yet even talking about whether one dot is equivalent or can be related to another dot and in what way (namely, connecting the dots via real semantics), but rather at a more fundamental level. Does one entire data set have a relation to any other data set?

Unfortunately, the essential precondition of getting data into the canonical RDF data model — a challenge in its own right — does little to guide us as to where these data sets exist or how they may relate. Even in RDF form, all of this wonderful RDF data exists as isolated and independent data sets, bouncing off of one another in some gross parody of Brownian motion.

What this means, of course, is that useful data that could be of benefit is overlooked or not known. As with problems of data silos everywhere, that blindness leads to unnecessary waste, incomplete analysis, inadequate understanding, and duplicated effort [1].

These gaps were easy to overlook when the focus of attention was on the why, what and how of the semantic Web. But, now that we are seeing RDF data sets emerge in meaningful numbers, the time is ripe to install the road signs and print up the maps. It is time to figure out where we want to go.

The Need for a Lightweight Subject Mapping Layer

As I discussed in an earlier posting, there’s not yet enough backbone to the structured Web. I believe this structure should firstly be built around a lightweight subject- or topic-oriented reference layer.

Unlike traditional upper-level ontologies (see the Intrepid Guide), this backbone is not meant to be comprised of abstract concepts or a logical completeness of the “nature of knowledge”. Rather, it is meant to be only the thinnest veneer of (mostly) hierarchically organized subjects and topic references (see more below).

This subject or topic vocabulary (at least for the backbone) is meant to be quite small: likely more than a few hundred reference subjects, but fewer than many thousands. (There may be considerably more terms in the overall controlled vocabulary to assist context and disambiguation.)

This “umbrella” subject structure could be thought of as the reference subject “super-structure” to which other specific ontologies could place themselves in a sort of locational or “info-spatial” context.

One way to think of these subject reference nodes is as the major destinations — the key cities, locations or interchanges — on the broader structured Web highway system. A properly constructed subject structure could also help disambiguate many common misplacements by virtue of the context of actual subject mappings.

For example, an ambiguous term such as “driver” becomes unambiguous once it is properly mapped to one of its possible topics such as golf, printers, automobiles, screws, NASCAR, or whatever. In this manner, context is also provided for other terms in that contributing domain. (For example, we would now know how to disambiguate “cart” as a term for that domain.)

A high-level and lightweight subject mapping layer does not warrant difficult (and potentially contentious) specificity. The point is not to comprehensively define the scope of all knowledge, but to provide the fewest choices necessary for what subject or subjects a given domain ontology may appropriately reference. We want a listing of the major destinations, not every town and parish in existence.

(That is not to say that more specific subject references won’t emerge or be appropriate for specific domains. Indeed, the hope is that an “umbrella” reference subject structure might be a tie-in point for such specific maps. The more salient issue addressed here is to create such an “umbrella” backbone in the first place.)

This subject reference “super-structure” would in no way impose any limits on what a specific community might do itself with respect to its own ontology scope, definition, format, schema or approach. Moreover, there would be no limit to a community mapping its ontology to multiple subject references (or “destinations”, if you will).

The reason for this high-level subject structure, then, is simply to provide a reference map for where we might want to go — no more, no less. Such a reference structure would greatly aid finding, viewing and querying actual content ontologies — of whatever scope and approach — wherever that content may exist on the Web.

This is not a new idea. About the year 2000 the topic map community was active with published subject indicators (PSIs) [2] and other attempts at topic or subject landmarks. For example, that report stated:

The goal of any application which aggregates information, be it a simple back-of-book index, a library classification system, a topic map or some other kind of application, is to achieve the “collocation objective;” that is, to provide binding points from which everything that is known about a given subject can be reached. In topic maps, binding points take the form of topics; for a topic map application to fully achieve the collocation objective there must be an exact one-to-one correspondence between subjects and topics: Every topic must represent exactly one subject and every subject must be represented by exactly one topic.
When aggregating information (for example, when merging topic maps), comparing ontologies, or matching vocabularies, it is crucially important to know when two topics represent the same subject, in order to be able to combine them into a single topic. To achieve this, the correspondence between a topic and the subject that it represents needs to be made clear. This in turn requires subjects to be identified in a non-ambiguous manner.

The identification of subjects is not only critical to individual topic map applications and to interoperability between topic map applications; it is also critical to interoperability between topic map applications and other applications that make explicit use of abstract representations of subjects, such as RDF.

From that earlier community, Bernard Vatant has subsequently spoken of the need for and use of “hubjects” as organizing and binding points, as have Jack Park and Patrick Durusau using the related concept of “subject maps” [3]. An effort that has some overlap with a subject structure is the Metadata Registry being maintained by the National Science Digital Library (NSDL).

However, while these efforts support the idea of subjects as partial binding or mapping targets, none of them actually proposed a reference subject structure. Actual subject structures may be a bit of a “third rail” in ontology circles, a historical artifact of wanting to avoid the pitfalls of older library classification systems such as the Dewey Decimal Classification or the Library of Congress Subject Headings.

Be that as it may. I now think the timing is right for us to close this subject gap.

A General Conceptual Model

This mapping layer lends itself to a three-tiered general conceptual model. The first tier is the subject structure, the conceptualization embracing all possible subject content. This referential layer is the lookup point that provides guidance for where to search and find “stuff.”

The second layer is the representation layer, made up of informal to “formal” ontologies. Depending on the formalism, the ontology provides more or less understanding about the subject matter it represents, but at minimum binds to the major subject concepts in the top subject mapping layer.

The third layer is the data sets and “data spaces” [4] that provide the actual content instantiations of these subjects and their ontology representations. This data space layer is the actual source for getting the target information.

Here is a diagram of this general conceptual model:

Three-tiered Conceptual Model

The layers in this general conceptual model progress from the more abstract and conceptual at the upper level, useful for directing where traffic needs to go, to concrete information and data at the lower level, the real object of manipulation and analysis.

The data spaces and ontologies of various formalisms in the lower two tiers exist in part today. The upper mapping layer does not.

By its nature, the upper mapping layer, in its role as a universal reference “backbone,” must be somewhat general in the scope of its reference subjects. As this mapping infrastructure begins to be fleshed out, it is also therefore likely that additional intermediate mapping layers will emerge for specific domains, which will have more specific scopes and terminology for more accurate and complete understanding with their contributing specific data spaces.

Six Principles for a Possible Lightweight Binding Mechanism

So, let’s set aside for a moment the question of what an actual reference subject structure might be. Let’s assume one exists. What should we do with this structure? How should we bind to it?

  • First, we need to assume that the various ontologies that might bind to this structure reside in the real world, and have a broad diversity of domains, topics and formality of structure. Therefore, we should: a) provide a binding mechanism responsive to the real-world range of formalisms (that is, make no grand assumptions or requirements of structure; each subject structure will be provided as is); and b) thus place the responsibility to register or bind the subject mapping assignment(s) on the publisher of the contributing content ontology [5].
  • Second, we can assume that the reference subject structure (light green below) and its binding ontology basis (dark green) are independent actors. As with many other RESTful services, this needs to work in a peer-to-peer (P2P) manner.
  • Third, as the Intrepid Guide argues, RDF and its emergent referential schema provide the natural data model and characterization “middle ground” across all Web ontology formalisms. This observation leads to SKOS as the organizing schema for ontology integration, supplemented by the related RDF schema of DOAP, SIOC, FOAF and Geonames for the concepts of projects, communities, people and places, respectively. Other standard referents may emerge, which should also be able to be incorporated.
  • Fourth, the actual binding points of “subjects” are themselves only that: binding points. In other words, this makes them representative “proxies” not to be confused with the actual subjects themselves. This implies no further semantics than a possible binding, and no assertion about the accuracy, relevance or completeness of the binding. How such negotiation may resolve itself needs to remain outside the scope of a simple mapping and binding reference layer (see again [3]).
  • Fifth, the binding structure and its subject structure needs to have community acceptance; no “wise guys” here, just well-intentioned bozos.
  • And, sixth, keep it simple. While not all publishers of Web sites need comply — and the critical threshold is to get some of the major initial publishers to comply — the truth, like everything else on the Web, is that the network effect makes things go. By keeping it simple, individual publishers and tool makers are more likely to use and contribute to the system.

How this might look in a representative subject binding to two candidate ontology data sets is shown below:

Example Binding Structure

This diagram shows two contributing data sets and their respective ontologies (no matter how “formal”) binding to a given “subject.” This subject proxy may not be the only one bound by a given data set and its ontology. Also note the “subject” is completely arbitrary and, in fact, is only a proxy for the binding to the potential topic.
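To make the picture a bit more concrete, here is an illustrative sketch of a subject “proxy” as a simple lookup record with two bound data sets. Everything in it (the URIs, labels and endpoints) is hypothetical, and a real binding layer would of course be published as SKOS/RDF rather than PHP; the point is only that resolving the where of a topic reduces to looking up its bindings.

    <?php
    // Illustrative sketch only: a subject proxy as a lookup record. All URIs,
    // labels and endpoints are hypothetical placeholders.
    $subject_proxy = array(
        'proxy'     => 'http://example.org/subjects/golf',
        'prefLabel' => 'Golf',
        'bindings'  => array(
            array('ontology' => 'http://example.org/ontologies/golf-equipment',
                  'endpoint' => 'http://example.org/sparql/equipment'),
            array('ontology' => 'http://example.org/ontologies/golf-courses',
                  'endpoint' => 'http://example.org/sparql/courses'),
        ),
    );

    // "Where can I find data about this subject?" becomes a simple lookup.
    foreach ($subject_proxy['bindings'] as $binding) {
        echo $binding['ontology'] . ' -> ' . $binding['endpoint'] . "\n";
    }
    ?>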

One Possible Road Map

Clearly, this approach does not provide a powerful inferential structure.

But what it does provide is a quite powerful organizational structure. Access and the where are, for the sake of simplicity and adoption, given preference over inferential elegance. Thus, assuming again a subject structure backbone, matched with these principles and the same jumbled structures first noted above, we can now see an organizational order emerge from the chaos:

Lightweight Binding to an Upper Subject Structure Can Bring Order

The assumptions to get to this point are not heroic. Simple binding mechanisms matched with a high-level subject “backbone” are all that is required mechanically for such an approach to emerge. All we have done to achieve this road map is to follow the trails above.

Use Existing Consensus to Achieve Authority

So, there is nothing really revolutionary in any of the discussion to this point. Indeed, many have cited the importance of reference structures previously. Why hasn’t such a subject structure yet been developed?

One explanation is that no one accepts any global “external authority” for such subject identifications and organization. The very nature of the Web is participatory and democratic (with a small “d”). Everyone’s voice is equal, and any structure that suggests otherwise will not be accepted.

It is not unusual, therefore, that some of the most accepted venues on the Web are the ones where everyone has an equal chance to participate and contribute: Wikipedia, Flickr, eBay, Amazon, Facebook, YouTube, etc. Indeed, figuring out what generates such self-sustaining “magic” is the focus of many wannabe ventures.

That is not our purpose here. Our purpose is to set the preconditions for what would constitute a referential subject structure that can achieve broad acceptance on the Web. And in the paragraph above we already have insight into the answer: build a subject structure from already accepted sources rich in subject content.

A suitable subject structure must be adaptable and self-defining. These criteria reflect expressions of actual social usage and practice, which of course change over time as knowledge increases and technologies evolve.

One obvious foundation for building a subject structure is thus Wikipedia. That is because the starting basis of Wikipedia information has been built entirely from the bottom up — namely, around what is a deserving topic. This has served Wikipedia and the world extremely well, with now nearly 1.8 million articles online in English alone (versions exist for about 100 different languages) [6]. There is also a wealth of internal structure within Wikipedia’s “infobox” templates, structure that has been utilized by DBpedia (among others) to actually transform Wikipedia into an RDF database (as I described in an earlier article). Being socially driven and evolving, Wikipedia will, I foresee, continue to be the substantive core at the center of a knowledge organizational framework for some time to come.

But Wikipedia was never designed with an organizing, high-level subject structure in mind. For my arguments herein, creating such an organizing (yes, in part, hierarchical) structure is pivotal.

One innovative approach to provide a more hierarchical structural underpinning to Wikipedia has been YAGO (“yet another great ontology”), an effort from the Max-Planck-Institute Saarbrücken [7]. YAGO matches key nouns between Wikipedia and WordNet, and then uses WordNet’s well-defined taxonomy of synsets to superimpose the hierarchical class structure. The match is more than 95% accurate; YAGO is also designed for extensibility with other quality data sets.

I believe YAGO or similar efforts show how the foundational basis of Wikipedia can be supplemented with other accepted lexicons to derive a suitable subject structure with appropriate high-level “binding” attributes. In any case, however constructed, I believe that a high-level reference subject structure must evolve from the global community of practice, as has Wikipedia and WordNet.

I have previously described this formula as W + W + S + ? (for Wikipedia + WordNet + SKOS + other?). There indeed may need to be “other” contributing sources to construct this high-level reference subject structure. Other such potential data sets could be analyzed for subject hierarchies and relationships using fairly well accepted ontology learning methods. Additional techniques will also be necessary for multiple language versions. Those are important details to be discussed and worked out.

The real point, however, is that existing and accepted information systems already exist on the Web that can inform and guide the construction of a high-level subject map. As the contributing sources evolve over time, so could periodic updates and new versions of this subject structure be generated.

Though the choice of the contributing data sets from which this subject structure could be built will never be unanimous, using sources that have already been largely selected through survival of the “fittest” by large portions of the Web-using public will go a long way toward establishing authoritativeness. Moreover, since the subject structure is only intended as a lightweight reference structure — and not a complete closed-world definition — we are also setting realistic thresholds for acceptance.

Conclusion and Next Steps

The specific topic of this piece has been on a subject reference mapping and binding layer that is lightweight, extensible, and reflects current societal practice (broadly defined). In the discussion, there has been recognition of existing schema in such areas as people (FOAF), projects (DOAP), communities (SIOC) and geographical places (Geonames) that might also contribute to the overall binding structure. There very well may need to be some additional expansions in other dimensions such as time and events, organizations, products or whatever. I hope that a consensus view on appropriate high-level dimensions emerges soon.

There are a number of individuals presently working on a draft proposal for an open process to create this subject structure. What we are working quickly to draft and share with the broader community is a proposal related to:

  1. A reference umbrella subject binding ontology, with its own high-level subject structure
  2. Lightweight mechanisms for binding subject-specific community ontologies to this structure
  3. Identification of existing data sets for high-level subject extraction
  4. Codification of high-level subject structure extraction techniques
  5. Identification and collation of tools to work with this subject structure, and
  6. A public Web site for related information, collaboration and project coordination.

We believe this to be an exciting and worthwhile endeavor. Prior to the unveiling of our public Web site and project, I encourage any of you with interest in helping to further this cause to contact me directly at mike at mkbergman dot com [8].


[1] I attempted to quantify this problem in a white paper from about two years ago, Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents, BrightPlanet Corporation, 42 pp., July 20, 2005. Some reasons for how such waste occurs were documented in a four-part series on this AI3 blog, Why Are $800 Billion in Document Assets Wasted Annually?, beginning in October 2005 through parts two, three and four concluding in November 2005.

[2] See Published Subjects: Introduction and Basic Requirements (OASIS Published Subjects Technical Committee Recommendation, 2003-06-24).

[3] See especially Park and Durusau, Avoiding Hobson’s Choice In Choosing An Ontology and Towards Subject-centric Merging of Ontologies.

[4] The concept of “data spaces” has been well-articulated by Kingsley Idehen of OpenLink Software and Frédérick Giasson of Zitgist LLC. A “data space” can be personal, collective or topical, and is a virtual “container” for related information irrespective of storage location, schema or structure.

[5] If the publisher gets it wrong, and users through the reference structure don’t access their desired content, there will be sufficient motivation to correct the mapping.

[6] See Wikipedia’s statistics sections.

[7] Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum, “Yago – A Core of Semantic Knowledge” (also in bib or ppt). Presented at the 16th International World Wide Web Conference (WWW 2007) in Banff, Alberta, on May 8-12, 2007. YAGO contains over 900,000 entities (like persons, organizations, cities, etc.) and 6 million facts about these entities, organized under a hierarchical schema. YAGO is available for download (400Mb) and converters are available for XML, RDFS, MySQL, Oracle and Postgres. The YAGO data set may also be queried directly online.

[8] I’d especially like to thank Frédérick Giasson and Bernard Vatant of Mondeca for their reviews of a draft of this posting. Fred was also instrumental in suggesting the ideas behind the figure on the general conceptual model.

Posted by AI3's author, Mike Bergman Posted on May 29, 2007 at 5:14 pm in Adaptive Information, Semantic Web, Structured Web | Comments (3)
The URI link reference to this post is: http://www.mkbergman.com/375/where-are-the-road-signs-for-the-structured-web/
The URI to trackback this post is: http://www.mkbergman.com/375/where-are-the-road-signs-for-the-structured-web/trackback/