Posted: February 6, 2009

Automatic Date Search with Google on Firefox

A Few Tricks can Give Your Google Searches More Firepower

This post tells you how to easily modify your Firefox browser so that every Google search can be filtered by date. Many users do not know that this powerful and useful feature is even available on the Google engine, and those who do will find it is not easily used from within Google. BTW, if you want to cut to the chase about what I recommend, see Option 3 below.

I’m a research hound and do much monitoring of Web sites and queries. I more often than not have too many tabs open in my Firefox browser and I am an aggressive user of Firefox’s integrated Web search, the Search Bar, which is found to the right of the location bar (see image to left).

In monitoring mode, I often want to see what is new or recently updated. I also find it helpful to filter results by date stamp in order to find the freshest stuff or to find again that paper I know got published a couple of years back.

For reasons only known to Google, the only way you can restrict searches by date is to go to the advanced search form page, expand a Javascript section, and then pick your restrictions. Too many clicks! And not reusable!

The image below shows this date search option on the advanced Google search form. (Advanced search is always available as a link option to the right of the standard Google search box.) After expanding the ‘Date, usage rights, numeric range, and more‘ section and then picking a Date option from the dropdown, we see:

The dropdown presents only a few date-filtering options, though helpful ones: anytime, past 24 hours, past week, past month and past year.

(BTW, the Web is notorious for not having good date stamp conventions or uniform application of same. As best as I can tell, Google generally date stamps specific Web pages (URLs) when they are first indexed in its system. Later minor updates apparently do not result in a date stamp update. However, some highly visited and dynamic sites, such as CNN, do seem to have date stamps updated as of the time of the most recent harvest. This seems to be more of an issue when restricting searches to, say, the past 24 hours for news or high-traffic sites. Date stamps seem to work pretty well for “normal” Web pages and sites.)

One obvious problem is the number of clicks it takes to invoke this date option. Another is the limited set of choices available once you get there.

Can’t we bring a little bit of automation and control to this process?

And, indeed we can, as the next three options discuss.

Option 1: Direct Manual Changes

Search geeks learn to study the tips and search conventions of a search engine, or the “codes” embedded in the search string URL, to discover other “hidden” search goodies. Though they are sometimes overlooked at the bottom of its search page, Google has some helpful search features and advanced search tips that are worthwhile for the serious searcher to study.

The search query string submitted via URL is also worth inspecting. For example, for the screen capture above, which is searching on my standard handle of ‘mkbergman‘, this is what gets issued to Google:

http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&hs=rsw&as_q=mkbergman&as_epq=&as_oq=&as_eq=&num=100&lr=&as_filetype=&ft=i&as_sitesearch=&as_qdr=y&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images

Quick inspection of this search string shows us a couple of things.

First, in this case, different variables are declared and separated with the ampersand (‘&‘) character. If there is a parameter assignment for that variable, it occurs after the equals (‘=‘) sign. The actual query issued is denoted by ‘as_q‘ in this advanced search string (a standard search uses ‘q‘). The number of results I asked for is defined by ‘num‘ as 100, and so on.

From the standpoint of the date stamp, the variable we are interested in is ‘as_qdr‘ (which stands for, I think, advanced search, query date range, though Google does not publicly define this as far as I know). In this case, we have set our parameter to ‘y‘, equivalent to what our label above indicates as past year.
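To make the inspection above concrete, here is a minimal sketch in Python that splits the example search string into its variables and parameters. It uses only the standard library; the variable names `params` and `assigned` are simply illustrative choices.

```python
from urllib.parse import urlsplit, parse_qs

# The advanced-search URL from the example, written without display spaces.
url = (
    "http://www.google.com/search?hl=en&client=firefox-a"
    "&rls=org.mozilla%3Aen-US%3Aofficial&hs=rsw"
    "&as_q=mkbergman&as_epq=&as_oq=&as_eq=&num=100&lr=&as_filetype=&ft=i"
    "&as_sitesearch=&as_qdr=y&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images"
)

# keep_blank_values=True keeps variables that have no assigned parameter.
params = parse_qs(urlsplit(url).query, keep_blank_values=True)

for name, values in params.items():
    print(f"{name} = {values[0]!r}")

# Only the variables that actually carry a value (query, result count, date range, ...).
assigned = {name: values[0] for name, values in params.items() if values[0]}
```

Looking at `assigned` shows at a glance which variables are doing the work, including `as_qdr`.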

The easiest way to use this command is to simply append it (such as ‘&as_qdr=y‘) to the end of the query URL in your location bar; here is an example for my recently started company looking for pages in the past year:

http://www.google.com/search?q="structured dynamics"&as_qdr=y

These search variable codes and what the site accepts are sometimes defined, but generally on the search engine’s Web site. For this Google variable, I first came across an explanation from Matt Cutts (who was picking up on a tip from Alex Chitu, though apparently the command goes back to at least 2003), which explained that you can:

Change the value of as_qdr to custom intervals. Here are some of the possible values of the as_qdr parameter:

d[number] – past number of days (e.g.: d10)
w[number] – past number of weeks
y[number] – past number of years

So, to restrict the search to the past 5 days, change the parameter to =d5, or use =w10 for ten weeks or =y3 for 3 years! A nice aspect is that the dropdown date search form remains on the page for subsequent searches during the current session.
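If you build these URLs often, a small helper saves typing. The sketch below is only an illustration: the function name `google_date_search_url` is made up for this example, and it covers just the d, w and y units listed above.

```python
from urllib.parse import urlencode

def google_date_search_url(query, unit="y", count=None):
    """Build a Google search URL restricted with the as_qdr parameter.

    unit is 'd', 'w' or 'y'; count is the optional number of units
    (None gives the plain 'd', 'w' or 'y' form).
    """
    if unit not in ("d", "w", "y"):
        raise ValueError("unit must be 'd', 'w' or 'y'")
    if count is not None and count < 1:
        raise ValueError("count must be a positive integer")
    as_qdr = unit if count is None else f"{unit}{count}"
    # urlencode takes care of escaping quotes and spaces in the query.
    return "http://www.google.com/search?" + urlencode({"q": query, "as_qdr": as_qdr})

print(google_date_search_url('"structured dynamics"', "y"))
print(google_date_search_url("mkbergman", "d", 5))
```

Note that the helper percent-encodes the quotation marks and spaces, which the browser otherwise does for you when you type the URL by hand.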

Option 2: Automatic Location Bar

Well, if you are like me and this kind of functionality would be of constant benefit, why not make it part of your standard search profile?

The first consideration is that I want the date search box to always appear, but I don’t know in advance the period I want to specify. My first thought was to somehow specify the anytime parameter, but I could not figure it out. The next option I tested was to raise the year period to an extremely long time so that all results on the service were shown by default. That way, I could winnow back from there.

After testing I found that the system would not accept 20 years, but does accept 19! That seems to cover the Web history nicely, so I now had a standard parameter to add to get the date search box to appear: &as_qdr=y19

The next consideration was where to specify this standard. My first attempt was to replace the standard search that takes place in the location bar. To do so, there is an internal Firefox convention for getting at internal settings, about:config, which you enter in the location bar. Once entered, you also get a scary (and amusing) warning message, as this screen shot shows:

If you proceed on from there you get a massive listing of internal settings, listed alphabetically. The one we are interested in is keyword.URL, which can be found by entering it in the Filter box or by scrolling down the long list. Once found, right click on the listing and choose Modify:

That enables you to change the default string value. So, enter the standard &as_qdr=y19 right before the query (‘&q=‘) variable at the end of the string:

Of course, you can specify any parameter of your own choosing if you want the default behavior to be different.
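The string edit itself is simple enough to sketch in Python. Both the helper name `add_before_query` and the sample keyword.URL value below are assumptions for illustration only; check your own about:config for the actual string.

```python
def add_before_query(keyword_url, extra="as_qdr=y19"):
    """Insert an extra parameter right before a trailing 'q=' in a search URL prefix."""
    # Only accept strings where q is a whole variable at the very end.
    if not keyword_url.endswith(("?q=", "&q=")):
        raise ValueError("expected the string to end with '?q=' or '&q='")
    head = keyword_url[:-2]  # everything up to and including the '?' or '&'
    return f"{head}{extra}&q="

# Hypothetical keyword.URL value, used only for illustration.
example = "http://www.google.com/search?ie=UTF-8&oe=UTF-8&q="
print(add_before_query(example))
```

Keeping ‘q=‘ at the end matters because the terms you type in the location bar are appended directly to this string.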

Option 3: Search Engines Toolbar

This location bar option generally works well, but for many searches initiated from the location bar Google may take you to only a single result based on its ‘I’m Feeling Lucky’ option. That may not be the behavior you want; it wasn’t for me.

My preferred approach is to treat date searching just as another specialty search engine. Thus, that suggests changing the search engine parameters that occur in the Firefox Search Bar (see first image on this page).

Default search engines are stored in the “searchplugins” folder of the Firefox installation directory. (Newly installed search engines are stored in the “searchplugins” folder of the Firefox profile folder, so you can have search engine files in both locations.) Search engines are stored as single *.xml files (e.g., google.xml in our specific case) in Firefox 2 and above.

So, to add our new date search filter, you will need to add this line (or whatever your specific variant might be) to your google.xml file:

After saving, quit and re-start Firefox.
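For those who would rather script this kind of edit than do it by hand, here is a rough sketch in Python. It assumes the plugin file lists its query parameters as `<Param name="..." value="..."/>` elements under a `<Url>` element; the XML below is a simplified, hypothetical stand-in rather than a copy of the real google.xml (which has more elements and may use XML namespaces), and the function name `add_param` is likewise made up.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for a search plugin file.
plugin_xml = """<SearchPlugin>
  <ShortName>Google</ShortName>
  <Url type="text/html" method="GET" template="http://www.google.com/search">
    <Param name="q" value="{searchTerms}"/>
  </Url>
</SearchPlugin>"""

def add_param(xml_text, name, value):
    """Return the XML with a <Param name=... value=.../> added to the first <Url>."""
    root = ET.fromstring(xml_text)
    url = root.find("Url")
    if url is None:
        raise ValueError("no <Url> element found")
    # Skip the insert if the parameter is already present.
    if not any(p.get("name") == name for p in url.findall("Param")):
        ET.SubElement(url, "Param", {"name": name, "value": value})
    return ET.tostring(root, encoding="unicode")

updated = add_param(plugin_xml, "as_qdr", "y19")
print(updated)
```

Work on a copy of the file first, and keep the original around in case Firefox refuses to load the edited version.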

And then, voila! We now have a standard date filter on each and every one of our Google searches initiated from the Search Bar!

😉

Some Closing Notes

These same basic techniques can be used for other common search patterns. For example, my own google.xml file also includes the num=100 statement because I like long result sets.
You can duplicate these files for variants. For example, I often search for filetype:pdf because PDFs are often the higher quality published papers. I duplicate the entire xml file, give it a different name, and then have it available in my search engine dropdown list box when that specialty search is warranted.
In fact, we tend to have a monocultural diet when it comes to search engines. But despite the gorilla Google, there are many specialty search options out there, some with their own options for various restrictions and filters. There are scores of engines suitable to the Firefox search dropdown that can be enhanced in similar manner.
I know these techniques apply equally well to Yahoo!, but did not spend the time duplicating it as an example. I suspect these techniques in fact can apply to any major search engine.
I recommend, both implicitly and explicitly, that all serious searchers spend time learning the syntax and options of their favorite search engine. And, then, in keeping with the spirit of this posting, I recommend fairly focused and ongoing monitoring of Web developments in these respects.

The important message, of course, to all of this is the admonition: Know Thy Tools!

Posted: February 2, 2009

Sweet Tools Updated to 736 Tools

Most Recent Update Adds 5% to this Listing of Semantic Web Tools

AI3's listing of semantic Web and -related tools has now gotten its first major update in six months, increasing its listing by 33 new entries (a couple have been around for a bit but were only recently discovered) to a new total of 736. See the major Sweet Tools page for this updated listing; filter on ‘New’ under New or Existing? to see the recent additions.

A parallel listing is maintained by the Semantic Web Company with a very attractive presentation. They have also been aiding greatly in the general maintenance of the list.

Background on prior listings and earlier statistics may be found on these previous posts:

Sweet Tools Listing Now Exceeds 700 Tools (July 5, 2008)
Sweet Tools Updated, Opened for Collaboration (Mar. 31, 2008)
Sweet Tools Updated to 650 Tools (Nov. 18, 2007)
New Release: 578 Semantic Web and -related Tools (Sept. 16, 2007)
542 Semantic Web and -related Tools (Jun. 19, 2007)
Listing of 500 Semantic Web and Related Tools (Mar. 11, 2007)
Sweet Tools Updated to 420 Tools (Feb. 7, 2007)
Converting 'Sweet Tools' to an Exhibit (Jan. 22, 2007)
Permanent Sweet Tools Listing — 400+ Tools and Counting! (Jan. 5, 2007)
Comprehensive Listing of 250 Semantic Web Tools (updated) (Oct. 4, 2006)
Comprehensive Listing of 175 Semantic Web Tools (Sep. 22, 2006)
Current Listing of 70 Semantic Web Tools (Aug. 12, 2006)

Interim updates were also made periodically over that period.

Please use ‘Comments’ on this post for suggestions or additions to the listing.

Posted: January 22, 2009

‘Structs’: Naïve Data Formats and the ABox

Writing and Sharing Data Can be Lightened Up

Ever since I first started to learn in earnest about ontology, something has been gnawing at me. The term seemed to be (shall I say?) an obtuse one whose obscurity was not the result of subtle precision or technicality, but rather one of fuzziness. As I introduced my Intrepid Guide to Ontology two years ago, I noted:

The root of the [ontology] term is the Greek ontos, or being or the nature of things. Literally and in classical philosophy, ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined ontology as a “formal specification of a conceptualization.”

Since then, I have continued to find ontology one of the hardest concepts to communicate to clients and quite a muddled mess even as used by practitioners. I have come to the conclusion that this problem is not because I have failed to grasp some ephemeral nuance, but because the term as used in practice is indeed fuzzy and imprecise.

What Isn’t an Ontology?

Even two years ago, I noted more than 40 different types of information structure that have at one time or another been labelled as an example of an “ontology”:

Since then, I could add even more terms to this list.

Lack of precision as to what ontology means has meant that it has been sloppily defined. As I have harped upon many times regarding semantic Web terminology, this is a sad state of affairs for the semWeb endeavor that has meaning at the core of its purpose.

I’m pretty sure that the original intent in embracing the concept of ontology within the realm of knowledge representation was not to see this term so broadly misused or mis-applied. I suspect, as well, that if we could sharpen up our understanding and remove some of the fuzziness that we could improve communications with the lay public across many levels of the semWeb enterprise.

The Useful Distinction of the TBox and ABox

Recently, I have been looking to the semantic Web’s roots in description logics. One of my writings, Thinking ‘Inside the Box’ with Description Logics, looked at the conceptual distinctions between the so-called ‘TBox‘ and ‘ABox‘. That is, a knowledge base is a logical schema of roles and concepts and the relationships between them (the TBox), which is populated by the actual data (instances) asserting memberships and attributes (“facts”) (the ABox).

By analogy, in a conventional relational database system, the database or logical schema would correspond to the TBox; the actual data records or tables would correspond to the ABox. Often, the term ontology is used to cover both ABox and TBox statements (which, I argue, only makes the understanding of the ‘ontology’ concept more difficult).
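To make the analogy tangible, here is a toy sketch in Python that keeps a schema-like TBox and a records-like ABox in separate structures. Everything in it (the concept names, attributes and helper functions) is invented for illustration; it is not a description logic reasoner, just the separation of concerns in miniature.

```python
# TBox-like part: concepts, their parent concepts, and the attributes each allows.
tbox = {
    "concepts": {"Person": None, "Researcher": "Person"},  # concept -> parent
    "attributes": {"Person": {"name", "born"}, "Researcher": {"field"}},
}

# ABox-like part: instance records asserting membership and attribute values.
abox = [
    {"id": "jane", "type": "Researcher", "name": "Jane", "field": "ontology"},
    {"id": "dick", "type": "Person", "name": "Dick", "born": "California"},
]

def ancestors(concept):
    """Return the concept and all of its parents, most specific first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = tbox["concepts"][concept]
    return chain

def allowed_attributes(concept):
    """Collect the attributes of a concept, including inherited ones."""
    allowed = set()
    for c in ancestors(concept):
        allowed |= tbox["attributes"].get(c, set())
    return allowed

def is_valid(record):
    """Check an ABox record against the TBox without changing either."""
    keys = set(record) - {"id", "type"}
    return record["type"] in tbox["concepts"] and keys <= allowed_attributes(record["type"])

def instances_of(concept):
    """Instances whose asserted type is the concept or one of its sub-concepts."""
    return [r["id"] for r in abox if concept in ancestors(r["type"])]

print(instances_of("Person"))
```

The point of the split is that new records can be added to `abox` without touching `tbox`, and the schema can be refined without rewriting the records.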

My recent writing, Back to the Future with Description Logics, discussed at some length the advantages of keeping the TBox and ABox separate. This current article now expands on those thoughts, particularly with respect to the definition and understanding of ontology.

The starting point for this new mindset is to return to the ideas of data records or data tables v. the logical schema that is prevalent in relational databases.

So Many Structs, So Little Time

The last time I took a census, about a year ago, there were more than 100 converters of various record and data structure types to RDF [2]. These converters — also sometimes known as translators or ‘RDFizers’ — generally take some input data records with varying formats or serializations and convert them to a form of RDF serialization (such as RDF/XML or N3), often with some ontology matching or characterizations. That last census listed these converters:

RDF
- Serialization formats:
  - RDF/XML
  - N3
  - Turtle
- Automatically recognized ontologies:
  - SIOC
  - SKOS
  - FOAF
  - AtomOWL
  - Annotea
  - Music Ontology
  - Bibliographic Ontology
  - EXIF
  - vCard
  - Others
(X)HTML pages
HTML header metadata
- Dublin Core
Embedded microformats
- eRDF
- RDFa
- hCard
- hCalendar
- XFN
- xFolk
Syndication Formats:
- RSS 2.0
- Atom
- OPML
- OCS
- XBEL (for bookmarks)
GRDDL [1]

REST-style Web service APIs:
- Google Base
- Flickr
- Del.icio.us
- Ning
- Amazon
- eBay
- Freebase
- Facebook
- raw HTTP
- Etc.
Files (multitude of file formats and MIME types, including):
- MS Office
- OpenOffice
- Open Document Format
- images
- audio
- video
- Etc.
Web services:
- BPEL
- WSDL
- XBRL
- XBEL
Data exchange formats
- iCalendar
- vCard
Virtuoso VADs
OpenLink license files
Third party metadata extraction frameworks:
- Aperture
- Spotlight
- SIMILE RDFizers

Note that MIT’s SIMILE RDFizers also recognize these formats:

JPEG → RDF
MARC/MODS → RDF
OAI-PMH → RDF
OCW → RDF
EMail → RDF
BibTeX → RDF
POM → RDF
DEB → RDF
CRW → RDF
Flat → RDF
Weather → RDF
Java → RDF
Javadoc → RDF
Jira → RDF
Subversion → RDF
Random → RDF

There is a growing list of third-party RDFizers as well:

LDIF → RDF
iCal → RDF
Palm → RDF
Outlook → RDF
RFC822 → RDF
Garmin → RDF
Fink → RDF
D2RQ
D2RMAP
XLS → RDF
CSV → RDF
RDF123
XSD → OWL
XML → RDF
MPEG-7/CS → OWL
XMP → RDF
BitTorrent → RDF
RPM → RDF
BibTeX → RDF

This wealth of formats shows the robustness of the RDF data model to capture structure and data relationships from virtually any input form. This is what makes RDF so exciting as a canonical target for getting data to interoperate.

Let’s Make this Elementary, Dr. Watson

However — and this is crucial — standard users for decades have preferred simple, text-based and human readable formats for writing and transferring their structured data.

These various forms, sometimes well specified with APIs and sometimes almost ad hoc as in spreadsheet listings, are what we call ‘structs‘. Structs can all be displayed as text and have, at minimum, explicit or inferrable key-value pairs to convey data relationships and attributes, with data types and values often noted by various white space, delimiter or other text conventions.

There is no doubt that the vast majority of extant data is found in such formats, including the results of data or information extraction from unstructured text. Indeed, even HTML and many markup languages with their angle bracket-delimited fields fall into this category.

There have literally been hundreds of various formats proposed over decades for conveying lightweight data structures. Most have been proprietary or limited to specific domains or users. Some, such as fielded text, structured text, simple declarative language (SDL), or more recently YAML or its simpler cousin JSON, have become more widely adopted and supported by formal specifications, tools or APIs. JSON, especially, is a preferred form for Web 2.0 applications.

Some, like microformats or this example BibTeX record below (with some non-standard extensions), rely less on syntax conventions and may use reserved keywords (such as AUTHOR or TITLE as shown) to signal the key type for the key-value pair:

ID_LOCAL arXiv:0711.3808
AUTHOR <a href="#Schramm_O">Oded Schramm</a>
BIBTYPE ARTICLE
ID arXiv:0711.3808
JOURNAL Electron. Res. Announc. Math. Sci.
PAGES 17--23
SUBJECTS geom
TITLE Hyperfinite graph limits
URL http://www.aimsciences.org/journals/doIpChk.jsp?paperID=3117&mode=full
URL http://www.aimsciences.org/journals/displayPapers0.jsp?comments=&pubID=221&journID=14&pubString_num=Volume: 15, 2008 Journal Issue
VOLUME 15
YEAR 2008
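A record like this takes very little code to read. The sketch below, in Python, parses a few lines of the record into key-value pairs. It assumes that each field starts with an upper-case keyword followed by a space, and that a line without such a keyword continues the previous value; those are assumptions drawn from this one sample, not a specification of the format, and the name `parse_struct` is made up.

```python
import re

# A keyword is taken to be upper-case letters and underscores only.
KEY = re.compile(r"[A-Z][A-Z_]*")

record_text = """\
ID arXiv:0711.3808
BIBTYPE ARTICLE
TITLE Hyperfinite graph limits
URL http://www.aimsciences.org/journals/doIpChk.jsp?paperID=3117&mode=full
URL http://www.aimsciences.org/journals/displayPapers0.jsp?comments=&pubID=221&journID=14&pubString_num=Volume:
15, 2008 Journal Issue
VOLUME 15
YEAR 2008
"""

def parse_struct(text):
    """Parse 'KEY value' lines into a dict mapping each key to a list of values."""
    fields = {}
    last_key = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        if KEY.fullmatch(key) and value:
            # Keys such as URL may repeat, so every key holds a list.
            fields.setdefault(key, []).append(value)
            last_key = key
        elif last_key is not None:
            # No keyword at the start: treat the line as a wrapped value.
            fields[last_key][-1] += " " + line
    return fields

record = parse_struct(record_text)
print(record["TITLE"])
print(len(record["URL"]))
```

That a dozen lines suffice is exactly why such formats are attractive to write and share, and also why they say nothing about how TITLE here relates to a title field in someone else's data.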

Some of these simple formats have been more successful than others, though none have achieved market dominance. There also appear to be few universal principles that have emerged as to syntax or format. Nonetheless, any of these various struct forms are easy for casual readers to understand and easy for domain experts to write.

For modeling and interoperability purposes, many of these forms are patently inadequate. That is why many of these simpler forms might be called “naïve”: they achieve their immediate purpose of simple relationships and communication, but require understood or explicit context in order to be meaningfully (semantically) related to other forms or data.

Yet, if we have learned nothing else with the phenomenal success of the Web it is this: simplicity trumps elegance or expressivity.

RDF and the Skinny ABox

The RDF (Resource Description Framework) data model is expressed as simple subject–predicate–object “triple” statements. That sounds fancy, but just substitute verb for predicate and noun for subject and object. In other words: Dick sees Jane; or, the ball is round. It may sound like a kindergarten reader, but it is how data can be easily represented and built up into more complex structures and stories.
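As a bare-bones illustration in Python (plain tuples, not an RDF library), the two statements above plus one invented statement can be stored and queried like this. The `match` helper and the third triple are assumptions added for the example.

```python
# Each statement is a (subject, predicate, object) tuple.
triples = [
    ("Dick", "sees", "Jane"),
    ("ball", "is", "round"),
    ("Jane", "sees", "ball"),  # invented for illustration
]

def match(pattern, store):
    """Return the triples matching a pattern; None in a position is a wildcard."""
    return [
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]

print(match((None, "sees", None), triples))  # everything anyone sees
print(match(("Dick", None, None), triples))  # everything said about Dick
```

Because the object of one statement can be the subject of another, a handful of such triples already forms a small graph.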

RDF triples can be applied equally to all structured, semi-structured and unstructured content. RDF is clearly a most capable data model that — through its ability to be extended with further concepts and relationships (vocabulary) — can create elegant and logical structures to represent comprehensive domains and knowledge bases. Finding such a model has been a quest in my professional life; I believe we finally have a winner to facilitate data interoperability using RDF.

But RDF has not achieved the market acceptance that its suitability as a data representation model might suggest. I think there are three reasons for this:

First, RDF was first presented and “sold” as an XML serialization. This failing has been well understood for some time. This unfortunate early linkage caused confusion between the RDF data model and the XML syntax. The rather simple and incremental building blocks of RDF triple statements, when presented in the nested XML syntax, led to lengthy and hard-to-read specifications (for easier reading and use, see either the N3 or Turtle syntaxes).
Second, triples by definition are 50% more complicated than a key-value pair. While the basic RDF statement might be simple like a Dick-and-Jane reader, as a data specification format it is still more complex than my personal attributes of sex:Male and hair:Red and born:California. Those three “facts” cannot be said nearly so quickly in RDF. And, if we also adhere to linked data, each one of these items requires a URI unique identifier to boot! It is important not to ignore the desire for simple and human readable data-specification formats.
Third, as this entry began and as we will conclude, RDF and its fuzzy relationship to ontology has led to over-specification of what needs to be included in the data record. What could simply be a record specification of an object and its attributes presented as simple key-value pairs has become burdened with “ontology” and “conceptual” relationships.
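The gap between the two forms described in the second point is easy to see in a few lines of Python. In this sketch the function name and the example.org URIs are placeholders invented for illustration; the key-value record is the one from the text.

```python
def struct_to_triples(subject, record, base="http://example.org/vocab#"):
    """Turn a flat key-value record into (subject, predicate, object) triples."""
    # Each key needs a predicate identifier, and the record needs a subject.
    return [(subject, base + key, value) for key, value in record.items()]

# The naive key-value form: three short pairs, no identifiers required.
me = {"sex": "Male", "hair": "Red", "born": "California"}

# The triple form: the same facts, now each carrying a subject and predicate URI.
triples = struct_to_triples("http://example.org/people/me", me)
for t in triples:
    print(t)
```

The conversion is mechanical once someone supplies the subject and the vocabulary, which suggests that this burden can sit with whoever does the conversion rather than with whoever writes the record.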

Canonical forms embody all of the specification that the canon guiding them requires. What we may have failed to see in embracing RDF, however, is that getting useful data into the system need not carry all of this burden.

Lightening Up and Shifting Work to the TBox

So, what does all of this have to do with my starting diatribe about the term ontology?

Whether a single database or the federation across all information known to humankind, we have data records (structs of instances) and a logical schema (ontology of concepts and relationships) by which we try to relate this information. This is a natural and meaningful split: structure and relationships v. the instances that populate that structure.

Stated this way, particularly for anyone with a relational database background, the split between schema and data is clear and obvious. Yet, the RDF, semantic Web and linked data communities have done an abysmal job of recognizing this fundamental separation of concerns.

We create “ontologies” that mix instances and schema. We insist on simple data record conversions that are burdened with relationship specifications as well. We tout a “linked data” infrastructure that is based solely on the same identity of instances without respect or attention to structure or conceptual relationships. We dismiss communities that work to express their data with useful local structures. We insist on standards and practices up and down the data staging and preparation chain that turn off the general market and make us seem arrogant and dismissive. Frankly, in so many ways, we just don’t get it [3].

What has struck me personally over the past few months as these realizations have unfolded has been how much our own mindsets and language may be trapping us.

Does existing structured data need to be expressed as RDF in order to be useful and integrated?
Exposing linked and instance data is great, but to what end; what are the conceptual or structural schema?
Why is our standards process so inward looking and parochial (often petty)? What purpose, or whom, does this serve?

At least for this diatribe, my essential conclusion is that we need to shift the burden of the schema and conceptual relations and (yes) world views to the TBox. We need to skinny down the ABox and make it a warm and welcoming environment by which any structured data (including the most naïve) can join.

So, ultimately, the bottom line is this: the burden of the semantic Web rests on us, not the providers of structured data.

It is time to streamline the ABox to smooth data contributions, assume as publishers the responsibility for the TBox, and keep those concerns separate. As for instance-related stuff, I now intend to refer to them as structs governed by a controlled vocabulary (at most). I intend to reserve ontology as a means to describe a given world view, a TBox, the schema and its relations of the domain at hand. And, frankly, this definition of ontology brings it back in balance with its roots in ontos and the nature of the world.

It’s a good time to lighten up!

[1] GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a W3C markup format for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT. GRDDL accommodates a wide variety of dialects (see one listing) and can be combined with arbitrary transformation mechanisms (though currently mostly based on XSLTs).

[2] Also see the listing of “dynamic” RDFizers at http://esw.w3.org/topic/DynamicRDFizers.

[3] I don’t mean to imply that there are not those in the community interested in lightweight data structures or their conversion, just that they have been more of a minority to date. For example, the 5th Workshop on Scripting and Development for the Semantic Web is coming up this summer in conjunction with the 6th European Semantic Web Conference in Crete, Greece; this year’s organizers are Gunnar Aastrand Grimnes (DFKI Knowledge Management Lab), Chris Bizer (Freie Universität Berlin) and Sören Auer (Universität Leipzig). As other examples focusing on JSON, there are a couple of efforts to define representation conventions from Talis and GBV for serializing RDF; Jim Ley, Kanzaki Masahide and Dave Beckett (likely among others) have written simple and straightforward RDF and Turtle parsers and converters; there was a floated idea for an RDF version of JSON called RDFON that has now evolved into the TURF approach; and JDIL (JSON data integration layer) instructs how to add namespaces to JSON to enable encoding RDF. Still further examples are Beckett’s Triplr and Auer’s AKSW Triplify lightweight conversion services involving many different formats. These are all laudable efforts with good relevance to a lighter ABox approach, I think.

Posted: January 18, 2009

Back to the Future with Description Logics

Have Linked Data, Microformats Stumbled into an Adaptive Design?; Benefits from Keeping the TBox and ABox Separate

I was glad to see Kendall Clark pick up on parts of my earlier piece on Thinking ‘Inside the Box’ with Description Logics. He took one point of view in his posting (one that I mostly agree with), but I’d also like to reinforce some other thoughts. Those thoughts are: description logics (DL) provide earlier lessons and insights that our current zeal for linked data should not overlook, and the lessons we can gain from DL are really fundamental and architectural.

For those of you who have not read Kendall’s piece (which I heartily recommend), let me give you my CliffsNotes summary: there are those within the semantic Web community who want to capture the conceptual relationships within knowledge and domains, the Maximum Fidelity tribe; those who want to link and describe as many things as possible, the Maximum Scalability tribe; and those (like Kendall’s firm, Clark & Parsia) residing in the middle and following the precepts of DL. The theme is that extremes exist and need to be bridged. [1]

Posing these contrasts is an effective way to describe different ideas and approaches, but, like all straw men, perhaps it hides nuances and complexity. And, as I note below, it may also pose the wrong straw man dichotomy.

We Are All Tribes of One

Jim Hendler, for one, took exception to Kendall’s characterization to make the obvious point that different use cases demand different approaches. What was interesting, however, in these interchanges was that a nerve was seemingly struck about differences in viewpoints and approaches. Indeed, the very reference to “tribes” seemed to bring out the (ahem) tribal response.

So, just so we are clear, in what I say below I take on the position of a tribe of one; that is, my own opinion. Of course, this is what all of us do. By positing tribes and viewpoints we simplify what is nuanced and subtly convey that opinions are cultural (“tribal”) and not subject to learning and change. Perhaps within the temporal viewpoint of whatever may be today’s trends and “memes” such thinking may hold, but I fundamentally disagree with such a static view of collective understanding and communities over more meaningful periods of years or decades. But, I digress. . . .

At the risk of being simplistic, I think we can say that there was a rich academic and intellectual history behind description logics going back to the early 1990s [2]. Then, with the seminal semantic Web paper built from thinking in the late 90s by Berners-Lee and published by him and Hendler and Lassila in 2001 in Scientific American [3], a real marker was put down for machine-readable and -actionable data (via “agents”) accessible on the Web. Many have been disappointed at the slow pace of the semWeb’s unfolding and some have blamed and rejected AI and “big” ontologies for this slowness. As usable standards finally emerged, a newer set of acolytes pushed “just getting data out there” and RDF linked data began to assume prominence from about 2006 onward, spearheaded by DBpedia and the linked open data community.

In so many ways we are coming full circle, coming back to the future, in seeing how our new linked data techniques can again benefit from this earlier DL thinking. Rather than poles and spectrums, I think we are experiencing the need to revisit our intellectual past now that workable publishing mechanisms and scalability and organization assume real prominence. Though clearly not intentional, the linked data community (and, in a related way, microformats) may just have stumbled upon a very cool architectural design that can leverage DL precepts.

Some Terminology Revisited

Some of this DL and semWeb terminology can be off-putting. But it is helpful to know the lingo if one wants to look into the technical literature. Though most of this stuff can be described without resorting to such terms and can be readily grasped on an intuitive basis, here are some important grounding terms:

TBox — according to [2], a TBox “contains intensional knowledge in the form of a terminology (hence the term ‘TBox,’ but ‘taxonomy’ could be used as well) and is built through declarations that describe general properties of concepts. Because of the nature of the subsumption relationships among the concepts that constitute the terminology, TBoxes are usually thought of as having a lattice-like structure; this mathematical structure is entailed by the subsumption relationship — it has nothing to do with any implementation.” A TBox uses a controlled vocabulary to define the concepts and roles of a domain of interest and the relations or properties amongst them
ABox — according to [2], an ABox “contains extensional knowledge — also called assertional knowledge (hence the term ‘ABox’) — that is specific to the individuals of the domain of discourse.” An ABox provides the concept and role membership assertions for instance data, as well as assertions or “facts” about the attributes of those instances using the same controlled vocabulary as defined in the TBox
First-order logic (FOL) — is a formal deductive system with unambiguous logic and mathematical structures for declaring, testing and inferring propositions (statements) and predicates (relations). A first-order theory consists of a set of axioms (usually finite or recursive) and the statements deducible from them based on FOL’s base logical axioms (such as the operators found in classical set theory, which itself is built on FOL)
Description logics (DL) — are any of a family of knowledge representation languages that can be translated and characterized according to first-order logic. A DL language has a syntax that consists of unary predicate symbols to denote concepts, binary relations to denote roles, and recursion. DL semantics define concepts as sets of individuals and roles as sets of pairs of individuals. The expressivity of a DL language is a function of the logical operators the language supports (shown with representations such as $\mathcal{SROIQ}^{(\mathcal{D})}$, the expressiveness of OWL 2). DL languages can be translated into other DL languages that support the same expressivity, regardless of syntax, but more expressive languages cannot be equivalently represented by less expressive ones. The current OWL dialects of OWL Lite and OWL DL are DL languages
Axiom — in traditional logic or FOL, an axiom (also called a ‘postulate’) is a proposition that is not proved or demonstrated but considered to be either self-evident or consistent with the base logic of the system. As such, its truth is taken for granted and the axiom serves as a starting point for deducing and inferring other (theory-dependent) truths
Intensional — is a form of set membership that is based on the propositions and concepts that define the set; there may be many possible members that remain unenumerated so long as they meet the conditions for membership. The intensional principle judges objects to be members based on the properties or conditions they must have
Extensional — is a form of set membership that arises from (“extends”) its listed set members. The extensional principle judges objects to be a member if they have the same external characteristics (whether as explicitly defined properties or not)
Ontology — as used in knowledge representation or information science, this term is most often defined using Tom Gruber’s “explicit specification of a conceptualization” [4]. In practice on the semantic Web, it is any defined schema or data record structure including the most lightweight controlled vocabularies and structures (such as microformats). In DL, both ABox and TBox specifications and statements are lumped under the term
Vocabulary — also ‘controlled vocabulary,’ is an organized, variously structured set of terms used for information retrieval or characterization. In its simplest form, a controlled vocabulary is merely a list for checking possible matches for set membership or not; at its more complex, it is the set of terms contained within a detailed and specified ontology or schema with formalized (axiomatized) relationships
Knowledge base — in the DL community, a knowledge base is simply defined as TBox + ABox. In other words, a knowledge base is a logical schema of roles and concepts and the relationships between them (the TBox) as populated by the actual data (instances) asserting memberships and attributes (“facts”) (the ABox).
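To make the intensional and extensional distinction a bit more concrete, here is a minimal Python sketch. The concept names, the population threshold and the "Exampleville" record are all made up for illustration; nothing here comes from the DL literature itself.

```python
# Extensional: the concept is defined by listing its members.
CAPITALS = {"Paris", "Berlin", "Madrid"}


def is_capital(name: str) -> bool:
    """Membership by enumeration: you are in only if you are on the list."""
    return name in CAPITALS


# Intensional: the concept is defined by a condition. Any object that meets
# the condition is a member, whether or not anyone has ever enumerated it.
# The threshold below is an arbitrary, hypothetical choice.
def is_large_city(record: dict) -> bool:
    return record.get("type") == "City" and record.get("population", 0) >= 1_000_000


# A made-up record that appears on no list anywhere.
unlisted = {"name": "Exampleville", "type": "City", "population": 2_500_000}

print(is_capital("Paris"))
print(is_capital("Exampleville"))
print(is_large_city(unlisted))
```

The extensional test can only ever say yes to what has been listed; the intensional test happily accepts a record nobody anticipated.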

TBox v ABox: Different Purposes and Roles

Within description logics and for our purposes herein, the two concepts we will most focus upon are the ABox and the TBox. As the definitions above suggest, the TBox is more structural and reflects the logical and conceptual relationships within a domain; that is, the role, concept and class relationships. The ABox provides the data (instance) records and characterizations within that schema; that is, the instance and fact assertions. By analogy, in a conventional relational database system, the database or logical schema would correspond to the TBox; the actual data records or tables would correspond to the ABox.

These distinctions suggest very different purposes and roles, then, for the TBox and the ABox:

TBox

Definitions of the concepts and properties (relationships) of the controlled vocabulary
Declarations of concept axioms or roles
Inferencing of relationships, be they transitive, symmetric, functional or inverse to another property
Equivalence testing as to whether two classes or properties are equivalent to one another
Subsumption, which is checking whether one concept is more general than another
Satisfiability, which is the problem of checking whether a concept has been defined (is not an empty concept)
Classification, which places a new concept in the proper place in a taxonomic hierarchy of concepts
Logical implication, which is whether a generic relationship is a logical consequence of the declarations in the TBox

ABox

Membership assertions, either as concepts or as roles
Attributes assertions
Consistency checking of instances
Entailments, which are whether other propositions are implied by the stated condition
Satisfiability checks, which are that the conditions of instance membership are met
Infer property assertions implicit through the transitive property
Instance checking, which verifies whether a given individual is an instance of (belongs to) a specified concept
Knowledge base consistency, which is to verify whether all concepts admit at least one individual
Realization, which is to find the most specific concept for an individual object
Retrieval, which is to find the individuals that are instances of a given concept

While certainly many of the ABox tests and checks require TBox structure, there is a pretty clear separation of purpose and role. Moreover: 1) the scale of the information in each “box” is vastly different (perhaps a few to hundreds to at most thousands of concepts in the TBox in contrast to potentially millions or more instances in ABoxes); and 2) ABox dataset repositories may also be (indeed, often are!) numerous, spatially distributed and semantically heterogeneous.
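A few of the tasks listed above (subsumption on the TBox side; instance checking, retrieval and realization on the ABox side) can be sketched in a few lines of Python. This is only a toy under loud assumptions: the concept names and individuals are hypothetical, the TBox holds nothing but simple subclass statements, and a real DL reasoner does far more than walk a hierarchy.

```python
# Hypothetical toy TBox: concept -> its direct super-concepts.
TBOX = {
    "Capital": {"City"},
    "City": {"Place"},
    "Place": set(),
    "Person": set(),
}

# Hypothetical toy ABox: individual -> concepts asserted for it.
ABOX = {
    "Paris": {"Capital"},
    "Lyon": {"City"},
    "Alice": {"Person"},
}


def ancestors(concept):
    """All concepts that subsume `concept`, including the concept itself."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(TBOX.get(current, ()))
    return seen


def subsumes(general, specific):
    """TBox task: is `general` at least as general as `specific`?"""
    return general in ancestors(specific)


def instance_of(individual, concept):
    """ABox task (instance checking), using the TBox hierarchy."""
    return any(subsumes(concept, c) for c in ABOX.get(individual, ()))


def retrieve(concept):
    """ABox task (retrieval): all individuals that are instances of `concept`."""
    return {i for i in ABOX if instance_of(i, concept)}


def realize(individual):
    """ABox task (realization): the most specific concepts for an individual."""
    concepts = set()
    for c in ABOX.get(individual, ()):
        concepts |= ancestors(c)
    # Drop any concept that strictly subsumes another concept in the set.
    return {c for c in concepts
            if not any(d != c and subsumes(c, d) for d in concepts)}


print(subsumes("Place", "Capital"))
print(sorted(retrieve("City")))
print(realize("Paris"))
```

Note that the three ABox functions all lean on the TBox hierarchy, which is the point made above: the ABox tests need TBox structure, yet the two boxes hold quite different kinds of things.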

The Wisdom of Separating Concerns

DL and semantic Web stuff in general are data and logic models, not architectural guidance. So, rarely does one see discussion of the architectural imperatives that some of these logical underpinnings provide. We see knowledge bases and ontologies both used as umbrella terms encompassing both the ABox and the TBox.

However, our own deployment experiences and the literature suggest there are manifest advantages to keeping the TBox and ABox separate:

Advantages of Keeping the TBox and ABox Separate

Better performance by keeping inference and reasoning purposes separate
Better scalability through separation of function
Use of tailored reasoners and rules engines based on purpose [5]
More modular design, including keeping attribute information separate from structural and conceptual relationships
Faster, global instance checking using summary ABox tests [6]
Assignment of named entities (instances) to distinct and disjoint super types [7] that can bring significant tableaux benefits to ABox reasoning
Easier partitioning of ABoxes [8]
Easy swapping in and mix-and-matching of varied, multiple and private or public named entity dictionaries (ABoxes)
Integration with extant relational (RDBMS) data structures and data stores for instance (ABox) data [9]
Integration with other lightweight structures (microformats, other) for instance (ABox) data
Faster retrievals via TBox routing to appropriate ABoxes
Simpler ABox vocabularies that are easier to understand and extend (including continued reliance on RDF and RDFS)
More capable TBox ontologies, including integration of rules systems
Relatively easy extension of the TBox schema ontology into specific domains
Easy ABox data entry and updating via wiki or semantic wiki
Ability to triangulate between separate concept (TBox) and instance (ABox) disambiguation approaches to improve overall precision and recall.

It would be useful to refrain from lumping the very different purposes of ABoxes and TBoxes under the umbrella rubric of ‘ontology’. It would also be useful for designers and vocabulary authors to be more explicit in their own minds as to purpose and content when formulating new ontologies. Smushing all of these concepts into one bubbling mess is likely to lead to neither clarity nor good performance.

A Simple Schematic of Best Practice

Taking these basic ideas we can visualize a general schematic for best practice splits within the ontology or knowledge base:

The TBox is clearly focused on the domain at hand, but also includes links and equivalents to external ontologies. The TBox level should be entirely free of instance data, though all attributes, properties and concepts that might be found at the ABox level are also defined with their relationships at the TBox level. Like any semWeb ontology, this TBox level should also re-use common Web ontologies such as FOAF, SIOC, UMBEL, etc.

It is also the case that because of the reasoning needs at the TBox layer, the semantic Web language used should likely be a dialect of OWL (see below).

(BTW, for my own practice, I will try to limit my use of the ‘ontology’ term to the concepts and classes at this TBox level.)

The ABox level, in contrast, may consist of multiple datasets and namespaces. These structures are most appropriately seen as lightweight controlled vocabularies with limited structure; if written from scratch, they are perhaps limited to RDF or RDFS (the schema variant). This layer, however, can also remain in non-semWeb native form (such as RDBMS data tables, microformats or other formats) that is wrapperized for interoperability through one or more ‘RDFizers’ or GRDDL.

These structures should likely not make many external assertions, if any, and if they do, perhaps in a separate mapping or linkage file that can be processed and analyzed independently. It is important, however, to make sure that all attributes at this ABox layer have a counterpart with relationships and structure defined at the TBox layer.

This architectural design enables complete independence of the instance datasets from the inferencing logic or federation that might be applied to them.
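As a rough illustration of this split, here is a minimal Python sketch with one TBox and two independent ABox datasets. Everything in it is an assumption made for illustration: the dataset names, the record layout with "id" and "type" keys, the attribute names and the placeholder values are all hypothetical, and a real deployment would use RDF stores and a reasoner rather than dictionaries.

```python
# TBox layer: concepts with their direct super-concepts, plus every attribute
# that is allowed to appear at the ABox level. No instance data lives here.
TBOX_CONCEPTS = {
    "Capital": {"City"},
    "City": {"Place"},
    "Place": set(),
    "Person": set(),
}
TBOX_ATTRIBUTES = {"population", "birthDate", "label"}

# ABox layer: independent datasets of lightweight records (placeholder values).
ABOX_DATASETS = {
    "geo": [{"id": "Paris", "type": "Capital", "population": 1000}],
    "people": [{"id": "Alice", "type": "Person", "birthDate": "1980-01-01"}],
}


def undefined_attributes(dataset):
    """Attributes used in a dataset that have no counterpart in the TBox."""
    used = {key for record in dataset for key in record
            if key not in ("id", "type")}
    return used - TBOX_ATTRIBUTES


def subconcepts(concept):
    """The concept plus everything it subsumes, per the TBox hierarchy."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for name, supers in TBOX_CONCEPTS.items():
            if name not in result and supers & result:
                result.add(name)
                changed = True
    return result


def route(concept):
    """Use the TBox to pick which ABox datasets are worth querying."""
    wanted = subconcepts(concept)
    return sorted(name for name, dataset in ABOX_DATASETS.items()
                  if any(record["type"] in wanted for record in dataset))


print(route("Place"))
print(undefined_attributes(ABOX_DATASETS["geo"]))
```

The datasets know nothing about the hierarchy, and the hierarchy holds no instances; a new ABox dataset can be dropped into the dictionary without touching the TBox, so long as its attributes pass the check.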

The Relevance to Linked Data Instances

Since it first took off in 2006, linked data and the various datasets now shown in the ‘LOD cloud‘ have been dominated by instance data. There are perhaps 10 million to 20 million instance objects available as linked data, many of which are derived from Wikipedia (via DBpedia) with attributes or structure coming from the Wikipedia infoboxes.

A similarly fast explosion has taken place with structured records via microformats and other simple data structures. For example, some earlier estimates suggest there are perhaps more than 2 billion pages that include microformats [10].

I have at times recently made comments about the dominance of instance data within the linked data community and the need for organizing structure. While this observation, I believe, remains true and provides a rationale for UMBEL as an organizing subject structure (or any other organizing structure, for that matter), perhaps I have been missing a more fundamental point: linked data (at least as practiced to date) is really about exposing ABoxes with simple structure. Perhaps, by serendipity, linked data (and other light structures like microformats) are showing the way to a distributed, mixed ABox-TBox structure for the Web.

With this altered viewpoint, a number of new observations emerge:

Linked data instance structures perhaps need to be consciously designed as such, with lightweight structure and limited external (OWL-based or TBox-oriented) class structure
The recent interest shown in so-called VoCamps (for lightweight vocabulary development) and voiD (vocabulary of interlinked datasets) might be usefully viewed specifically through an ABox “lens” with perhaps best practices for structures and vocabulary syntaxes to emerge
TBox class and relationship structures should remain apart from the instance data and can be used to operate independently of the datasets
Architectural and design changes might also emerge through clear separation of ABox and TBox that can benefit semantic Web scalability.

Linked data and microformats and other lightweight structures are now giving us the exposed instance data to begin reasoning and showing differences due to inferencing and other logic advantages for the semantic Web. Now that the ABox is being proven, let’s move on and stress-test the TBox!

OWL 2 and Query Rewriting

Since the first version of OWL there has been confusion and some limitations with the dialects of the language. Only OWL Full allowed classes to be treated both as instances and classes (so-called metamodeling), and was therefore used as the basis for mapping UMBEL, for example, to RDF and RDFS vocabularies and to Cyc. This design was necessary, but left UMBEL undecidable using standard DL reasoners; only the two dialects of OWL DL and OWL Lite met description logics requirements.

Indeed, it was even hard to determine what dialect an OWL file represented, among many other problems and issues. The technical committee behind OWL 2, in fact, has written an excellent critique of issues with this first version of OWL [11].

For nearly two years the next version of OWL, OWL 2, has been undergoing development, with the last draft now published and available for last comment before January 23 [12]. Lessons and refinements to the use of DL have also occurred. Some have criticized this effort and have criticized the need for OWL 2’s growing expressiveness and vocabulary [see 1, for example]. I believe these criticisms to be unfair and to miss many of the thoughtful improvements in this new version.

Version control and expressiveness are two of these benefits. A broader benefit, though, has been the keen attention the developers have given to compliance with description logics and the ability to formulate fragments (called “profiles”) that only present subsets of DL useful for computational considerations [13]. For example, one profile, OWL 2 QL, appears well suited to the ABox; another, EL, appears well suited to the TBox. Users and tools builders may define other subsets of OWL 2 to deal with different use cases.

What is emerging are possible design patterns that would have comprehensive TBox guidance and inference structures that first receive a query, then do query rewriting for less capable OWL dialects and mapping to distributed ABox datasets, some of which might be kept in native relational DB or other structural forms [9, 14, 15]. Other approaches and designs, such as overviewed for DLDB2, KAON2, OWLIM, BigOWLIM and Minerva, are testing other architectural and DL combinations [see 15]. And, at the level of the specific triplestore, other optimizations are being made such as owl:sameAs or query rewrite with Virtuoso [16]. This new version of OWL and its profiles have adapted to past lessons and can be matched well to the emerging hardware and architectural designs.
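The general flavor of query rewriting can be shown with a small Python sketch: the TBox expands a query for one concept into a query over that concept and its sub-concepts, and the expanded query then runs against instance data kept in an ordinary relational table with no reasoner in the loop. This is a deliberate simplification and not the algorithm from any of the cited papers; the table name, columns and data are hypothetical, and only simple subclass statements are handled.

```python
import sqlite3

# Hypothetical TBox, kept apart from the data: sub-concept -> super-concept.
TBOX = {"Capital": "City", "City": "Place"}


def rewrite(concept):
    """Expand a concept into itself plus all of its sub-concepts."""
    expanded = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for sub, sup in TBOX.items():
            if sup == current and sub not in expanded:
                expanded.add(sub)
                frontier.append(sub)
    return sorted(expanded)


# Hypothetical ABox held in native relational form.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO instances VALUES (?, ?)",
    [("Paris", "Capital"), ("Lyon", "City"), ("Alice", "Person")],
)


def query(concept):
    """Answer 'all instances of concept' by rewriting, then plain SQL."""
    types = rewrite(concept)
    placeholders = ", ".join("?" for _ in types)
    rows = conn.execute(
        f"SELECT id FROM instances WHERE type IN ({placeholders}) ORDER BY id",
        types,
    )
    return [row[0] for row in rows]


print(rewrite("Place"))
print(query("Place"))
```

The database only ever sees a flat list of type names, so the instance store can stay as simple as it likes while the inference knowledge lives entirely in the TBox.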

These changes appear to now provide the option for various dialects of OWL to be matched with reasoners and architectural designs in order to optimize for different purposes. Rather than a spectrum, we appear to be learning and maturing. Hopefully, getting back to the architectural implications of the TBox – ABox split can show us there need not be a trade-off between expressiveness and scalability. Proper design, language and dialect choice, and architecture can readily achieve both — while maintaining independence of scope or purpose.

Thanks, OWL 2! You have fulfilled your commitment to description logics. It is now our turn to figure out the best practices for working with these tools.

[1] For those with a spare 90 minutes or so, you may also want to view this panel session and debate that took place on “An OWL 2 Far?” at ISWC ’08 in Karlsruhe, Germany, on October 28, 2008. The panel was chaired by Peter F. Patel-Schneider (Bell Labs, Alcatel-Lucent) with the panel members of Stefan Decker (DERI Galway), Michel Dumontier (Carleton University), Tim Finin (University of Maryland) and Ian Horrocks (University of Oxford), with much audience participation. See http://videolectures.net/iswc08_panel_schneider_owl/.

[2] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003. See Chapter 1. Sample chapters may be viewed from Enrico Franconi’s Description Logics course notes and tutorial at http://www.inf.unibz.it/~franconi/dl/course/, which is an excellent starting reference point on the subject.

[3] Tim Berners-Lee, James Hendler and Ora Lassila, 2001. “The Semantic Web,” in Scientific American, 284(5):34-43, May 2001; see http://www.sciam.com/article.cfm?id=the-semantic-web.

[4] Thomas R. Gruber, 1993. “A Translation Approach to Portable Ontology Specifications,” in Knowledge Acquisition 5(2): 199-220; see http://tomgruber.org/writing/ontolingua-kaj-1993.pdf.

[5] Georgios Meditskos and Nick Bassiliades, 2008. “Combining a DL Reasoner and a Rule Engine for Improving Entailment-based OWL Reasoning,” presented at the 7th International Semantic Web Conference (ISWC2008); see http://lpis.csd.auth.gr/publications/med-iswc08.pdf.

[6] Achille Fokoue, Aaron Kershenbaum, Li Ma, Edith Schonberg, and Kavitha Srinivas, 2006. “The Summary Abox: Cutting Ontologies Down to Size,” presented at the 5th International Semantic Web Conference, Athens, GA, USA, November 5-9, 2006; see http://iswc2006.semanticweb.org/items/Kershenbaum2006qo.pdf.

[7] These are akin to the lexicographer supersenses that have been applied in WordNet for nouns and verbs (though only nouns are used here). See Massimiliano Ciaramita and Mark Johnson, 2003. “Supersense Tagging of Unknown Nouns in WordNet,” in Proceedings of the Conf. on Empirical Methods in Natural Language Processing, pp. 168-173, 2003. See http://www.aclweb.org/anthology-new/W/W03/W03-1022.pdf.

[8] Yuanbo Guo and Jeff Heflin, 2006. “A Scalable Approach for Partitioning OWL Knowledge Bases,” in Proceedings of the 2nd International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2006), Athens, Georgia, USA, November, 2006; see http://swat.cse.lehigh.edu/pubs/guo06c.pdf.

[9] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini and Riccardo Rosati, 2007. “Tractable Reasoning and Efficient Query Answering in Description Logics: The DL-Lite Family,” in Journal of Automated Reasoning, 39 (3), October 2007; see http://www.dis.uniroma1.it/~degiacom/papers/2007/calv-etal-JAR-2007.pdf.

[10] The original citation could not be found, but it is referenced on the Microformats mailing list on Oct. 1, 2008; see http://microformats.org/discuss/mail/microformats-discuss/2008-October/012550.html.

[11] Bernardo Cuenca Grau, Ian Horrocks, Boris Motik, Bijan Parsia, Peter Patel-Schneider and Ulrike Sattler, 2008. “OWL2: The Next Step for OWL,” in Journal of Web Semantics, 6(4): 309-322, November 2008; see http://www.comlab.ox.ac.uk/people/ian.horrocks/Publications/download/2008/CHMP+08.pdf.

[12] Boris Motik, Peter F. Patel-Schneider, Bijan Parsia, eds., 2008. OWL 2 Web Ontology Language: Structural Specification and Functional-Style Syntax, W3C Working Draft, December 2, 2008. See http://www.w3.org/TR/owl2-syntax/.

[13] Boris Motik, Bernardo Cuenca Grau, Ian Horrocks, Zhe Wu, Achille Fokoue and Carsten Lutz, eds., 2008. OWL 2 Web Ontology Language: Profiles, W3C Working Draft, December 2, 2008. See http://www.w3.org/TR/owl2-profiles/.

[14] Luciano Serafini and Andrei Tamilin, 2007. “Instance Migration in Heterogeneous Ontology Environments,” in Proceedings of 6th International Semantic Web Conference / 2nd Asian Semantic Web Conference (ISWC/ASWC 2007), pages 452-465, 2007. See http://iswc2007.semanticweb.org/papers/449.pdf.

[15] Zhengxiang Pan, Xingjian Zhang and Jeff Heflin, 2008. “DLDB2: A Scalable Multi-Perspective Semantic Web Repository,” in WI 08: Proceedings of the International Conference on Web Intelligence, IEEE Computer Society Press, pp. 489-495; see http://swat.cse.lehigh.edu/pubs/pan08a.pdf.

[16] Orri Erling and Ivan Mikhailov, 2007. “RDF Support in the Virtuoso DBMS,” in Proceedings of the 1st Conference on Social Semantic Web, Leipzig, Germany, Sep 26-28, 2007; see http://aksw.org/cssw07/paper/5_erling.pdf.

Posted:January 14, 2009

Dead Media Walking

This is the post that sums up the transition that is 2009 to come:

http://www.shore.com/commentary/weblogs/2009/01/dead-business-models-walking-will-major.html

So, this brings to mind a couple of thoughts:

Linked Open Data community — stop talking only with yourselves, and get active helping to spread the word about benefits and approaches to the broader market; see if you can do so without mentioning any technical terms
Media and content enterprises of all stripes — extracting structure from your content is where it is at; if you don’t do it, someone else will.

We are now being buffeted by the biggest tsunami in creative destruction that has occurred at least within the past 50 years. The Web provides the key to the emerging opportunities. So, too, does structured data and the ability to aggregate it.

Oh, and by the way: all of you start-ups hoping to make it through an ad-based revenue model? Bend over and kiss it goodbye.

I actually kind of like this environment.

Author: Mike Bergman