To even the most casual Web searcher, it must now be evident that Google is constantly introducing new structure into its search results. This past week, three world-class computer scientists, all now research directors or scientists at Google (Alon Halevy, Peter Norvig and Fernando Pereira), published an opinion piece in the March/April 2009 issue of IEEE Intelligent Systems titled ‘The Unreasonable Effectiveness of Data.’ It provides important framing and hints for what may next emerge in semantics from the Google search engine.
I had earlier covered Halevy and Google’s work on the deep Web. In this new piece, the authors describe the use of simple models working over very large amounts of data as a means to trump fancier and more complicated algorithms.
Some of the research they cite is related to WebTables and similar efforts to extract structure from Web-scale data. The authors describe the use of such systems to create ‘schemata’ of attributes related to various types of instance records — in essence, figuring out the structure of ABoxes for leading instance types such as companies or automobiles.
These observations, which they call the semantic interpretation problem and contrast with the Semantic Web, they generalize as being amenable to a kind of simple, brute-force, Web-scale analysis: “Relying on overt statistics of words and word co-occurrences has the further advantage that we can estimate models in an amount of time proportional to available data and can often parallelize them easily. So, learning from the Web becomes naturally scalable.”
Google had earlier posted its 1-terabyte database of n-grams, and I tend to agree that such large-scale incidence mining can lead to tremendous insights and advantages. The authors also helpfully point out that certain scale thresholds occur for doing such analysis, such that researchers need not have access to indexes at the scale of Google’s to do meaningful work or to make meaningful advances. (Good news for the rest of us!)
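The ‘overt statistics of words and word co-occurrences’ the authors describe are conceptually simple to compute; here is a toy sketch (the sample text is obviously illustrative, and real systems operate over trillions of tokens):

```python
from collections import Counter

def bigram_counts(text):
    """Count word bigrams, the kind of overt co-occurrence statistic
    that, at Web scale, the authors argue can trump fancier models."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

counts = bigram_counts("the cat sat on the mat the cat ran")
```

Because each document’s counts are independent, the tallies sum trivially across machines, which is exactly the easy parallelization the quoted passage points to.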
As the authors challenge:
My very strong suspicion is that we will see much more structured data for instance types (the ‘ABox’) rapidly emerge from Google in the coming weeks. They have the insights and approaches down, and clearly they have the data to drive the analysis! I also suspect many of these structured additions will simply show up in the results listings to little fanfare.
The structured Web is growing all around us like stalagmites in a cave!
This post tells you how to easily modify your Firefox browser to enable constant date-filtered searching in Google. This powerful and useful feature is not widely known to exist on the Google engine, and even when known it is not easy to use from within Google. BTW, if you want to cut to the chase about what I recommend, see Option 3 below.
I’m a research hound and do much monitoring of Web sites and queries. I more often than not have too many tabs open in my Firefox browser and I am an aggressive user of Firefox’s integrated Web search, the Search Bar, which is found to the right of the location bar (see image to left).
In monitoring mode, I often want to see what is new or recently updated. I also find it helpful to filter results by date stamp in order to find the freshest stuff or to find again that paper I know got published a couple of years back.
The image below shows this date search option on the advanced Google search form. (Advanced search is always available as a link option to the right of the standard Google search box.) After expanding the ‘Date, usage rights, numeric range, and more’ section and then picking a Date option from the dropdown, we see:
The dropdown presents only a few options for date filtering, though helpful ones: anytime, past 24 hours, past week, past month and past year.
(BTW, the Web is notorious for not having good date stamp conventions or uniform application of same. As best as I can tell, Google generally date stamps its specific Web pages — URLs — when first indexed in its system. Later minor updates do not apparently result in a date stamp update. However, some highly visited and dynamic sites, such as for example CNN, do seem to have date stamps updated as of time of most recent harvest. This seems to be an issue more for restricting searches to, say, the past 24 hours, involving news or high-traffic sites. Date stamps seem to work pretty well for “normal” Web pages and sites.)
One obvious problem is the number of clicks it takes to invoke this date option. Another is the limited range of choices available once you do.
Can’t we bring a little bit of automation and control to this process?
And, indeed we can, as the next three options discuss.
Search geeks learn to look to the tips or search conventions for a search engine or to learn the “codes” embedded in the search string URL to discover other “hidden” search goodies. While sometimes overlooked at the bottom of its search page, Google has some helpful search features and advanced search tips that are worthwhile for the serious searcher to study.
The search query string submitted via URL is also worth inspecting. For example, for the screen capture above, which is searching on my standard handle of ‘mkbergman‘, this is what gets issued to Google:
Quick inspection of this search string shows us a couple of things.
First, in this case, different variables are declared and separated with the ampersand (‘&’) character. If there is a parameter assignment for that variable, it occurs after the equals (‘=’) sign. The actual query issued is denoted by ‘q’. The number of results I asked for is defined by ‘num’ as 100, and so on.
From the standpoint of the date stamp, the variable we are interested in is ‘as_qdr‘ (which stands for, I think, advanced search, query date range, though Google does not publicly define this as far as I know). In this case, we have set our parameter to ‘y‘, equivalent to what our label above indicates as past year.
The easiest way to use this command is to simply append it (such as ‘&as_qdr=y’) to the end of the query URL in your location bar; here is an example for my recently started company looking for pages in the past year:
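For anyone who tires of hand-editing the location bar, the same append can be scripted; a minimal sketch (the parameter names follow the discussion above, but a real Google query URL carries additional parameters not shown here):

```python
from urllib.parse import urlencode

def google_date_url(query, qdr="y", num=100):
    """Build a Google search URL with the as_qdr date-range
    parameter appended to the query string."""
    params = {"q": query, "num": num, "as_qdr": qdr}
    return "http://www.google.com/search?" + urlencode(params)

url = google_date_url("mkbergman", qdr="y")
```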
These search variable codes, and what the site accepts, are sometimes documented, but generally only on the search engine’s own Web site. For this Google variable, I first came across this date variable explication from Matt Cutts (who was picking up on a tip from Alex Chitu, though apparently the command goes back to at least 2003), which explained you can use:
d[number] – past number of days (e.g.: d10)
w[number] – past number of weeks
y[number] – past number of years
So, to change the search to the past 5 days, change the =d to =d5; similarly, =w10 gives ten weeks and =y3 gives 3 years! A nice aspect is that the dropdown date search form remains on the page for subsequent searches during the current session.
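A tiny helper makes those codes hard to get wrong (a sketch; only the d/w/y units listed above are assumed valid):

```python
def qdr_code(unit, count=None):
    """Build an as_qdr value: 'd', 'w' or 'y' plus an optional count,
    e.g. ('d', 5) -> 'd5' for the past five days."""
    if unit not in ("d", "w", "y"):
        raise ValueError("unit must be 'd', 'w' or 'y'")
    return unit if count is None else f"{unit}{count}"

codes = [qdr_code("d", 5), qdr_code("w", 10), qdr_code("y", 3), qdr_code("y")]
```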
Well, if you are like me and this kind of functionality would be of constant benefit, why not make it part of your standard search profile?
The first consideration is that I want the date search box to always appear, but I don’t know in advance the period I want to specify. My first thought was to somehow specify the anytime parameter, but I could not figure it out. The next option I tested was to extend the year period to so long a span that essentially all results on the service were shown. That way, I could winnow back from there.
After testing I found that the system would not accept 20 years, but does accept 19! That seems to cover the Web history nicely, so I now had a standard parameter to add to get the date search box to appear: &as_qdr=y19
The next consideration was where to specify this standard. My first attempt was to replace the standard search that takes place in the location bar. To do so, there is an internal Firefox convention for getting at internal settings, about:config, which you enter in the location bar. Once entered, you also get a scary (and amusing) warning message, as this screen shot shows:
If you proceed on from there, you get a massive listing of internal settings, listed alphabetically. The one we are interested in is keyword.URL, which can be found by entering it in the Filter box or by scrolling down the long list. Once found, right-click on the listing and choose Modify:
That enables you to change the default string value. So, enter the standard &as_qdr=y19 right before the query (‘&q=‘) variable at the end of the string:
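For illustration only, since the default keyword.URL string varies by Firefox version and locale (the parameters shown here other than as_qdr are assumptions, not a verbatim default), the modified value would look something like:

```
http://www.google.com/search?ie=UTF-8&oe=UTF-8&sourceid=navclient&as_qdr=y19&q=
```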
Of course, you can specify any parameter of your own choosing if you want the default behavior to be different.
This location bar option generally works well, but for many searches initiated from the location bar Google may take you to only a single result based on its ‘I’m Feeling Lucky’ option. That may not be the behavior you want; it wasn’t for me.
My preferred approach is to treat date searching just as another specialty search engine. Thus, that suggests changing the search engine parameters that occur in the Firefox Search Bar (see first image on this page).
Default search engines are stored in the Firefox installation directory’s “searchplugins” folder. (Newly-installed search engines are stored in the Firefox profile folder “searchplugins” folder, so you can have search engine files in both locations.) Search engines are stored as single *.xml files (e.g., google.xml in our specific case) in Firefox 2 and above.
So, to add our new date search filter, you will need to add this line (or whatever your specific variant might be) to your google.xml file:
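The exact line appeared as an image in the original post; as a sketch in the Firefox MozSearch plugin format (the placement inside the <Url> element, and the surrounding markup of your particular google.xml, are assumptions), it is a parameter along these lines:

```xml
<Param name="as_qdr" value="y19"/>
```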
After saving, quit and re-start Firefox.
And then, voilà! We now have a standard date filter on each and every one of our Google searches initiated from the Search Bar!:
The important message, of course, to all of this is the admonition: Know Thy Tools!
A bit over a year ago I spotlighted Collex, a set of tools for COLLecting and EXhibiting information in the humanities. Collex was developed for the NINES project (which stands for the Networked Infrastructure for Nineteenth-century Electronic Scholarship, a trans-Atlantic federation of scholars). Collex has now spawned Blacklight, a library faceted browser and discovery tool.
Blacklight is intended as a general faceted browser with keyword inclusion for use by libraries and digital collections. As with Collex, Blacklight is based on the Lucene/Solr facet-capable full-text engine. The name Blacklight is based on the combination of Solr + UV(a).
Blacklight is being prototyped on UVa’s Digital Collections Repository. It was first shown at the 2007 code4lib meeting, but has recently been unveiled on the Web and released as an open source project. More on this aspect can be found at the Project Blacklight Web site.
Blacklight was developed by Erik Hatcher, the lead developer of Flare and Collex, with help from library staff Bess Sadler, Bethany Nowviskie, Erin Stalberg, and Chris Hoebeke. You can experiment yourself with Blacklight at: http://blacklight.betech.virginia.edu/.
The figure below shows a typical output. Various pre-defined facets, such as media type, source, library held, etc., can be combined with standard keyword searches.
Many others have pursued facets, and the ones in this prototype are not uniquely interesting. What is interesting, however, is the interface design and the relative ease of adding, removing or altering the various facets or queries to drive results:
An extension of this effort, BlacklightDL, provides image and other digital media support to the basic browser engine. This instance, drawn from a separate experiment at UVa, shows a basic search of ‘Monticello’ when viewed through the Image Gallery:
Like the main Blacklight browser, flexible facet selections and modification are offered. With the current DL prototype, using similar constructs from Collex, there are also pie chart graphics to show the filtering effects of these various dimensions (in this case, drilling down on ‘Monticello’ by searching for ‘furniture’):
BlacklightDL is also working in conjunction with OpenSource Connections (a resource worth reviewing in its own right).
Blacklight has just been released as an open source OPAC (online public access catalog). That means libraries (or anyone else) can use it to allow people to search and browse their collections online. Blacklight uses Solr to index and search, and uses Ruby on Rails for front-end configuration. Currently, Blacklight can index, search, and provide faceted browsing for MaRC records and several kinds of XML documents, including TEI, EAD, and GDMS; the code is available for downloading here.
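To give a flavor of the mechanics underneath (the host, core and field names here are hypothetical, not Blacklight’s actual configuration), a faceted Solr search is just a handful of URL parameters, with fq filter queries doing the facet drill-down:

```python
from urllib.parse import urlencode

def solr_facet_url(base, query, facet_fields, filters=None):
    """Compose a Solr select URL with faceting enabled and optional
    fq filter queries for drilling down on chosen facet values."""
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", f) for f in facet_fields]
    for fq in (filters or []):
        params.append(("fq", fq))
    return base + "/select?" + urlencode(params)

url = solr_facet_url("http://localhost:8983/solr", "monticello",
                     ["media_type", "source"], ["media_type:image"])
```

Each facet a user clicks simply becomes another fq parameter, which is why adding, removing or altering facets is so cheap in this architecture.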
There is a rich and relatively long history of faceted browsing in the humanities and library science community. Notable examples range from Flamenco, one of the earliest (dating from 2001 and still active), to MIT’s SIMILE Exhibit, which I have written of numerous times. Another online example is Footnote, a repository of nearly 30 million historical images; it has a nice interface and an especially nifty way of using a faceted timeline. Also see Solr in Libraries from Ryan Eby.
In fact, faceted browsing and search, especially as it adapts to more free-form structure, will likely be one of the important visualization paradigms for the structured Web. (It is probably time for me to do a major review of the area.)
The library and digital media and exhibits communities (such as museums) are working hard at the intersection of the Web, search, display and metadata and semantics. For example, we also have recently seen the public release of the Omeka exhibits framework from the same developers of Zotero, one of my favorite Firefox plug-ins. And Talis continues to be a leader in bringing the semantic Web to the library community.
The humanities and library/museum communities have clearly joined the biology community as key innovators of essential infrastructure to the semantic Web. Thanks, community. The rest of us should be giving more than a cursory wave to these developments.
* * *
BTW, I’d very much like to thank Mark Baltzegar for bringing many of these initiatives to my attention.
If Everyone Could Find These Tools, We’d All be Better Off
About a month ago I announced my Jewels & Doubloons awards for innovative software tools and developments, most often ones that may be somewhat obscure. In the announcement, I noted that our modern open-source software environment is:
“… literally strewn with jewels, pearls and doubloons — tremendous riches based on work that has come before — and all we have to do is take the time to look, bend over, investigate and pocket those riches.”
That entry raised the obvious question of why this value is so often overlooked in the first place. If we know it exists, why do we continue to miss it?
The answers to this simple question are surprisingly complex. It is a question I have given much thought to, since the benefits from building off of existing foundations are manifest. I think the reasons we so often miss these valuable, and often obscure, tools of our profession range from habit and culture to weaknesses in today’s search. I will take this blog’s Sweet Tools listing of 500 semantic Web and related tools as an illustrative endpoint of what a complete listing of such tools in a given domain might look like (not all of which are jewels, of course!), including the difficulties of finding and assembling such listings. Here are some reasons:
Search Sucks — A Clarion Call for Semantic Search
I recall late in 1998 when I abandoned my then-favorite search engine, AltaVista, for the new Google kid on the block and its powerful PageRank innovation. But that was tens of billions of documents ago, and I now find all the major search engines to again be suffering from poor results and information overload.
Using the context of Sweet Tools, let’s pose some keyword searches in an attempt to find one of the specific annotation tools in that listing, Annozilla. We’ll also assume we don’t know the name of the product (otherwise, why search?). We’ll also use multiple search terms, and since we know there are multiple tool types in our category, we will also search by sub-categories.
In a first attempt using annotation software mozilla, we do not find Annozilla in the first 100 results. We try adding more terms, such as annotation software mozilla “semantic web”, and again it is not in the first 100 results.
Of course, this is a common problem with keyword searches when specific terms or terms of art may not be known or when there are many variants. However, even if we happened to stumble upon one specific phrase used to describe Annozilla, “web annotation tool”, while we do now get a Google result at about position #70, it is also not for the specific home site of interest:
Now, we could have entered annozilla as our search term, assuming somehow we now knew it as a product name, which does result in getting the target home page as result #1. But, because of automatic summarization and choices made by the home site, even that description is also a bit unclear as to whether this is a tool or not:
Alternatively, had we known more, we could have searched on Annotea Mozilla and gotten pretty good results, since that is what Annozilla is, but that presumes a knowledge of the product we lack.
Standard search engines actually now work pretty well in helping to find stuff for which you already know a lot, such as the product or company name. It is when you don’t know these things that the weaknesses of conventional search are so evident.
Frankly, were our content to be specified by very minor amounts of structure (often referred to as “facets”) such as product and category, we could cut through this clutter quickly and get to the results we wanted. Better still, if we could also specify only listings added since some prior date, we could also limit our inspections to new tools since our last investigation. It is this type of structure that characterizes the lightweight Exhibit database and publication framework underlying Sweet Tools itself, as its listing for Annozilla shows:
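The point is easy to demonstrate in miniature (the records and field names below are invented for illustration, not the actual Sweet Tools data): with just two structured fields, the wanted tool drops right out:

```python
tools = [
    {"name": "Annozilla", "category": "annotation", "added": "2007-01-15"},
    {"name": "OtherTool", "category": "browser", "added": "2006-11-02"},
    {"name": "ThirdTool", "category": "annotation", "added": "2005-03-20"},
]

def facet_filter(records, category=None, added_since=None):
    """Keep only records matching a category facet and/or added
    on or after a given ISO date string."""
    hits = records
    if category is not None:
        hits = [r for r in hits if r["category"] == category]
    if added_since is not None:
        hits = [r for r in hits if r["added"] >= added_since]
    return hits

recent_annotation = facet_filter(tools, category="annotation",
                                 added_since="2006-01-01")
```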
The limitations of current unstructured search grow daily as Internet content volumes grow.
We Don’t Know Where to Look
The lack of semantic search also relates to the problem of not knowing where to look, and derives from the losing trade-offs of keywords v. semantics and popularity v. authoritativeness. If, for example, you look for Sweet Tools on Google using “semantic web” tools, you will find that the Sweet Tools listing only appears at position #11 with a dated listing, even though it is arguably the most authoritative listing available. This is because there are more popular sites than the AI3 site; because Google tends to cluster multiple results from a given site under its most popular, and therefore generally older (250 v. 500 tools in this case!), page; and because the blog title is used in preference to the posting title:
Semantics are another issue. It is important, in part, because you might enter the search term product or products or software or applications, rather than ‘tools’, which is the standard description for the Sweet Tools site. The current state of keyword search is to sometimes allow plural and singular variants, but not synonyms or semantic variants. The searcher must thus frame multiple queries to cover all reasonable prospects. (If this general problem is framed as one of the semantics for all possible keywords and all possible content, it appears quite large. But remember, with facets and structure it is really those dimensions that most need semantic relationships — a more tractable problem than the entire content.)
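The burden of framing multiple queries can itself be mechanized; a sketch (the synonym list is illustrative, and real query expansion would draw on a thesaurus or ontology rather than a hard-coded dictionary):

```python
SYNONYMS = {"tools": ["tools", "software", "applications", "products"]}

def expand_query(query):
    """Generate the keyword variants a searcher must otherwise
    try by hand, one per known synonym."""
    variants = [query]
    for term, alts in SYNONYMS.items():
        if term in query.split():
            variants = [query.replace(term, alt) for alt in alts]
    return variants

queries = expand_query('"semantic web" tools')
```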
We Don’t Have Time
Faced with these standard search limits, it is easy to claim that repeated searches and the time involved are not worth the effort. And, even if somehow we could find those obscure candidate tools that may help us better do our jobs, we still need to evaluate them and modify them for our specific purposes. So, as many claim, these efforts are not worth our time. Just give me a clean piece of paper and let me design what we need from scratch. But this argument is total bullpucky.
Yes, search is not as efficient as it should be, but our jobs involve information, and finding it is one of our essential work skills. Learn how to search effectively.
The time spent in evaluating leading candidates is also time well spent. Studying code is one way to discern a programming answer. Absent such evaluation, how does one even craft a coded solution? No matter how you define it, anything but the most routine coding tasks requires study and evaluation. Why not use existing projects as the learning basis, in addition to books and Internet postings? If, in the process, an existing capability is found upon which to base needed efforts, so much the better.
The excuse of not enough time to look for alternatives is, in my experience, one of laziness and attitude, not a measured evaluation of the most effective use of time.
Concern Over the Viral Effects of Certain Open Source Licenses
Enterprises, in particular, have legitimate concerns about the potential “viral” effects of mixing certain open-source licenses, such as the GPL, with licensed proprietary software or internally developed code. Enterprise developers have a professional responsibility to understand such issues.
That being said, my own impression is that many open-source projects understand these concerns and are moving to more enlightened mix-and-match licenses such as Eclipse, Mozilla or Apache. In virtually any given application area, there is also a choice of open-source tools with a diversity of licensing terms. And, finally, even for licenses with commercial restrictions, many tools can still be valuable for internal, non-embedded applications or as sources for code inspection and learning.
Though the license issue is real when it comes to formal deployment and requires understanding of the issues, the fact that some open source projects may have some use limitations is no excuse to not become familiar with the current tools environment.
We Don’t Live in the Right Part of the World
Actually, I used to pooh-pooh the idea that one needed to be in one of the centers of software action — say, Silicon Valley, Boston, Austin, Seattle, Chicago, etc. — in order to be effective and on the cutting edge. But I have come to embrace a more nuanced take on this. There is more action and more innovation taking place in certain places on the globe. It is highly useful for developers to be a part of this culture. General exposure, at work and the watering hole, is a great way to keep abreast of trends and tools.
However, even if you do not work in one of these hotbeds, there are still means to keep current; you just have to work at it a bit harder. First, you can attend relevant meetings. If you live outside of the action, that likely means travel on occasion. Second, you should become involved in relevant open source projects or other dynamic forums. You will find that any time you need to research a new application or coding area, the greater familiarity you have with the general domain, the easier it will be to get current quickly.
We Have Not Been Empowered to Look
Dilbert, cubes and big bureaucracies aside, while it may be true that some supervisors are clueless and may not do anything active to support tools research, that is no excuse. Workers may wait until they are “empowered” to take initiative; professionals, in the true sense of the word, take initiative naturally.
Granted, it is easier when an employer provides the time, tools, incentives and rewards for its developers to stay current. Such enlightened management is a hallmark of adaptive and innovative organizations. And it is also the case that if your organization is not supporting research aims, it may be time to get that résumé up to date and circulated.
But knowledge workers today should also recognize that responsibility for professional development and advancement rests with them. It is likely all of us will work for many employers, perhaps even ourselves, during our careers. It is really not that difficult to find occasional time in the evenings or the weekend to do research and keep current.
If It’s Important, Become an Expert
One of the attractions of software development is the constantly advancing nature of its technology, which is truer than ever today. Technology generations are measured in the range of five to ten years, meaning that throughout an expected professional lifetime of, say, 50 years, you will likely need to remake yourself many times.
The “experts” of each generation generally start from a clean slate and also re-make themselves. How do they do so and become such? Well, they embrace the concept of lifelong learning and understand that expertise is solely a function of commitment and time.
Each transition in a professional career — not to mention needs that arise in-between — requires getting familiar with the tools and techniques of the field. Even if search tools were perfect and some “expert” out there had assembled the best listing of tools available, they can all be better characterized and understood.
It’s Mindset, Melinda!
Actually, look at all of the reasons above. They all are based on the premise that we have completely within our own lights the ability and responsibility to take control of our careers.
In my professional life, which I don’t think is all that unusual, I have been involved in a wide diversity of scientific and technical fields and pursuits, most often at some point being considered an “expert” in a part of that field. The actual excitement comes from the learning and the challenges. If you are committed to what is new and exciting, there is much room for open-field running.
The real malaise to avoid in any given career is to fall into the trap of “not enough time” or “not part of my job.” The real limiter to your profession is not time, it is mindset. And, fortunately, that is totally within your control.
Gathering in the Riches
Since each new generation builds on prior ones, your time spent learning and becoming familiar with the current tools in your field will establish the basis for that next change. If more of us had this attitude, the ability for each of us to leverage whatever already exists would be greatly improved. The riches and rewards are waiting to be gathered.
We Are All Groping for Better and More Authoritative Information
NOTE: After my release of SweetSearch, a Google custom search engine (CSE) devoted to the semantic Web, I have been having numerous offline conversations about why one would build a CSE and what its benefits are. Reproduced below is one of the threads from these email conversations.
Hi Mike -
Forgive my ignorance, but I’m trying to understand the CSE concept. I get that building one enables you to have Google search specific sites for you, but given the intended scope of a general Google search, isn’t the result of a CSE a more limited one, given that it only looks at what you specify? Also, how would a CSE stay current – other than through the manual addition of sites to include?
Thanks in advance.
Thanks for your question.
In my opinion, there are three compelling reasons for a CSE:
So, yes, while a CSE has fewer results than a standard Google search, it is smaller by eliminating spurious results that don’t meet those conditions above. Correct and smaller is always better than larger and indiscriminate, isn’t it?
Yes. Correct and smaller IS better than bigger and more indiscriminate. Thanks. So it seems that a CSE is more a shortcut for looking where you know than it is a tool for discovery (in places you don’t know).
It would be interesting if there were a way to draw “clusters” from CSE results, and search within those “clusters” of a general (non CSE) search – either one-off or as a way to fine-tune or update the sites to include in a CSE.
Hi Michael -
I hope you had a chance to think about the brainstorm above. My partner and I have a couple more thoughts – in particular, the way it is now, the scalability relies on the crowd participating to fine tune and grow it. Introducing automation (like what I described above) to the manual part could produce results that are greater than those that either one alone could achieve – don’t you think? Could this be a way of extending/propagating the Semantic Web?
You’re obviously poking at the same issues that got me experimenting with CSEs in the first place. Let me offer some further thoughts.
First, in one of your earlier comments you mentioned that you did not see CSEs as a “discovery” venue. That is likely true for the cognoscenti in a given field, but if authoritatively constructed, then “outsiders” could certainly gain useful discovery. For example, assume a CSE managed by the astrophysics community. If I was interested in black holes, that is a great place for me to go and discover.
Second, the theme of discovery also requires some STRUCTURE. Sometimes this can be a taxonomic or directory structure. Who knew that pulsars were in fact related to black holes? Another structure is the categorization or classification of experts that is the subject of ontologies and controlled vocabularies within the semantic Web.
Third, these things cannot be done manually given the massive scale of the Internet, yet automation without operators at the controls is also of poor quality and ambiguous. The trick, as I argue in a few different posts on my blog (but for which I offer no truly compelling techniques), is to find human-mediated, semi-automated methods that can scale AND produce quality.
(As for automated clustering, two free examples are Clusty and Carrot2; NLP deserves pages of discussion in its own right.)
Fourth, if THAT is done, then multiples of these expert- or interest-driven communities can be aggregated to produce a more meaningful across-the-board resource for information.
Frankly, I think all of this is eminently doable and is happening today in disparate, disconnected ways. The tools for doing this are right at hand. Venue and packaging are lacking.
I find CSEs to be one technique, among truly many, that can contribute to this vision (though, truthfully, CSEs also have some early growing pains — that is likely why you personally are monitoring the Google Group). But, improvements keep coming and it is easily foreseeable that CSEs, plus MANY other techniques, are nearly at hand to do so, so much.
Information is now universal. Collaboration is now doable and demanded. Authoritativeness remains a challenge, but things like OpenID, OpenURL, labels and certificates (not to mention the efforts of existing “authorities” such as professional societies) will create the new social structures to replace the publisher hegemony and peer review methods of prior generations. In the end, society WILL figure out how to bring authoritativeness to the chaotic, distributed, undisciplined Web.