I remember from some of my first jobs in the restaurant industry how surprised I was at the intensity of feeling that employees could have toward one another: waiters screaming at other waiters, kitchen staff dismissive of the front-of-house staff, everyone sharpening knives over pompous managers, and the like.
Strangely, this past week had many similar flashbacks for me.
If you have been around a bit (not necessarily all the way back to the tulip frenzy in Holland), you have seen hype and frenzy screw things up. This whole idea of the “last fool” is pretty disgusting and has a real effect on real people. Speculators pushing up house prices 20% per year in Vegas and Miami are only the latest and most extreme examples.
Tim Berners-Lee does not blog frequently, but, when he does, it always seems to be at a moment of import.
In his post tonight, I think he is, with grace, trying to say some things to us. He talks about buzz and hype; he tries to push silly notions about “killer apps” into the background; he emphasizes the real challenge of how a democratized knowledge environment needs to find new measures of trust; and, again, he talks about the importance of data and linkages.
The real stimulus, I sense, is that in the current frenzy about “semantic Web” stuff his real points are being misunderstood and misquoted.
In all this Semantic Web news, though, the proof of the pudding is in the eating. The benefit of the Semantic Web is that data may be re-used in ways unexpected by the original publisher. That is the value added. So when a Semantic Web start-up either feeds data to others who reuse it in interesting ways, or itself uses data produced by others, then we start to see the value of each bit increased through the network effect.
So if you are a VC funder or a journalist and some project is being sold to you as a Semantic Web project, ask how it gets extra re-use of data, by people who would not normally have access to it, or in ways for which it was not originally designed. Does it use standards? Is it available in RDF? Is there a SPARQL server?
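To make that “extra re-use” test concrete, here is a toy sketch, with plain Python sets standing in for an RDF store and made-up `ex:` URIs, of why shared identifiers let a data consumer answer a question that neither publisher answered alone:

```python
# Two independent publishers emit triples about the same URI.
publisher_a = {("ex:Saab_900", "rdf:type", "ex:Automobile")}
publisher_b = {("ex:Saab_900", "ex:madeBy", "ex:Saab_AB")}

def merge(*graphs):
    """Union of triple sets; trivial because both sides use shared URIs."""
    merged = set()
    for g in graphs:
        merged |= g
    return merged

combined = merge(publisher_a, publisher_b)

# A join neither source could support alone: the makers of automobiles.
makers_of_automobiles = {
    maker
    for (s, p, o) in combined
    if p == "rdf:type" and o == "ex:Automobile"
    for (s2, p2, maker) in combined
    if s2 == s and p2 == "ex:madeBy"
}
```

The merge step is the whole point: because both publishers used the same identifier for the same thing, the join costs nothing. In a real deployment the sets would be RDF graphs and the join a SPARQL query, but the network effect is the same.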
For those of us who have been there before, we fear the hype and cynicism that brought us the dot-com era.
If you feel, as I and other laborers in the garden of the semantic Web do, that you are truly part of a historic transition point, then you can sense this same potential for an important effort to be hijacked.
The smell of money is in the air; the hype machine is in full swing; VCs are breathy and attentive. Podcasts and blogs are full of baloney. Media excesses are now observable.
One view is that perspective will tell us that all of this is natural: we are now in the midst of some expected phase of a Moore chasm or some other predicted cycle of technology hype and development. But let me ask this: how many times must we play the greater fool in order to be a lesser one? We have seen this before, and the promise of the semantic Web to do more deserves better.
I wish I had Tim Berners-Lee’s grace; I do not. But all of us can look around and gain perspective. And that perspective is: look for substance and value. For everything else, hold on to your wallet.
When I first saw the advance blurb for Glut: Mastering Information through the Ages by Alex Wright, I thought, “Wow, here is the book I have been looking for or wanting to write myself.” As the book jacket explains:
Spanning disciplines from evolutionary theory and cultural anthropology to the history of books, libraries and computer science, Wright weaves an intriguing narrative that connects such seemingly far-flung topics as insect colonies, Stone Age jewelry, medieval monasteries, Renaissance encyclopedias, early computer networks, and the World Wide Web. Finally, he pulls these threads together to reach a surprising conclusion, suggesting that the future of the information age may lie deep in our cultural past.
Wham, bang! The PR snaps with promise and scope!
These are themes that have been my passion for decades, and I ordered the book as soon as it was announced. It was therefore with great anticipation that I cracked open the cover as soon as I received it. (BTW, the actual posting date of this review is much later only because I left it in draft for some months; that delay is itself an indication of how, unfortunately, I lost interest in the book.)
The best aspect of Glut is the attention it brings to Paul Otlet, quite likely one of the most original and overlooked innovators in information science in the 20th century. Frankly, I had only an inkling of who Otlet was prior to this book, and Wright provides a real service by bringing more attention to this forgotten hero.
(I have since gone on to learn more about Otlet and his pioneering work in faceted classification, as carried on more notably by S. R. Ranganathan with the Colon classification system, and about his ideas behind the creation of the Mundaneum in Brussels in 1910. The Mundaneum and Otlet’s ideas were arguably a forerunner of some aspects of the Internet, Wikipedia and the semantic Web. Unfortunately, the Mundaneum and its 14 million ‘permanent encyclopedia’ items were taken over by German troops in World War II. The facility was ravaged and sank into obscurity, as did Otlet’s reputation; he died in 1944, before the war ended. It was not until Boyd Rayward translated many of Otlet’s seminal works into English in the late 1980s that Otlet was rediscovered.)
Alex Wright’s own Google Tech Talk from Oct. 23, 2007, talks much about Otlet, and is a good summary of some of the other topics in Glut.
The real disappointment in Glut is the lack of depth and scholarship. The basic technique seems to have been: find a prominent book on a given topic, summarize it in a popularized tone, sprinkle in a couple of extra references from that chapter’s source book to lend a patina of scholarship, and move on to the next chapter. Then add a few silly appendices to pad the book’s length.
So we see, for example, a heavy dependence on relatively few sources for the arguments and points made. Rather than enumerate them here, those interested can simply peruse the expanded bibliography on Wright’s Glut Web site. That listing is actually quite a good basis for starting your own collection.
It seems that today, with blogging and digital content flying everywhere, a higher standard should be met before creating a book and asking the buying public to actually pay for it. That standard is effort and diligence in researching the topic at hand.
I feel that Glut belongs with similar efforts where not enough homework was done. For example, see Walter Underwood, who in his review of the Everything is Miscellaneous (not!) book chastises author David Weinberger on similar grounds. (A conclusion I had also reached after viewing this Weinberger video cast.)
In summary, I give Wright an A for scope and a C or D for execution and depth. I realize that is a pretty harsh review, but it is one occasioned by my substantially unmet high hopes and expectations.
The means by which information and document growth has come to be organized, classified and managed have been major factors in humanity’s progress and skyrocketing wealth. Glut’s skimpy hors d’œuvre merely whets the appetite: the full historical repast has yet to be served.
We are proceeding apace with the first release of the UMBEL (Upper-level Mapping and Binding Exchange Layer) lightweight subject concept ontology. The internal working version presently has 21,580 subject nodes, though further review will certainly change that number before public release of the first draft.
UMBEL defines “subject concepts” as a distinct subset of the broader notion of concept used in the SKOS controlled vocabulary, in formal concept analysis, or in the very general concepts common to some upper ontologies. Subject concepts are a special kind of concept: ones that are concrete, subject-related and non-abstract. We further contrast these with named entities, which are the real things or instances in the world that are members of these subject concept classes.
Thus, in UMBEL parlance, there are abstract concepts, subject concepts and named entities.
The “backbone” of UMBEL is its set of these reference (“canonical,” if you will) subject concepts, which are being derived from the OpenCyc version of the Cyc knowledge base. The resulting 22 K nodes of this subject structure are related via the subClassOf and type predicates; these are the graph’s edges. The graph pictures herein offer the first glimpse of this UMBEL backbone structure.
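To make the backbone idea concrete, here is a minimal sketch, with made-up node names rather than actual UMBEL concepts, of a structure linked by subClassOf (class to class) and type (instance to class) edges, plus a helper that walks upward through the hierarchy:

```python
# Toy backbone: (subject, predicate, object) edges. Names are
# illustrative only, not real UMBEL or OpenCyc identifiers.
EDGES = [
    ("Sedan", "subClassOf", "Automobile"),
    ("Automobile", "subClassOf", "Vehicle"),
    ("saab_900", "type", "Sedan"),
]

def superclasses(node, edges=EDGES):
    """Return every class reachable by following the edges upward."""
    found = set()
    frontier = {node}
    while frontier:
        nxt = set()
        for subj, pred, obj in edges:
            if subj in frontier and obj not in found:
                found.add(obj)
                nxt.add(obj)
        frontier = nxt
    return found
```

With a structure like this, asking “what is a saab_900?” walks the graph upward through Sedan, Automobile and Vehicle, which is exactly the kind of traversal the backbone is meant to support at 22 K-node scale.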
We can take the full network graph and do a bit of simulation of diving deep into its structure, as the following figures show.
So, here is the big graph, with all nodes and edges (blue) displayed. This is just about at the limit of our graphing program, Cytoscape, which we estimate is limited to about 30 K nodes:
Through manipulation of the topological coefficient, a relative measure of the extent to which a node shares neighbors with other nodes, we can zoom in on the top 750 (actually, 759!) gateway or hub nodes. There are other ways to evaluate key nodes in a network, but this one approximates the upper structure, or hierarchy, within the graph fairly nicely:
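For the curious, the coefficient itself is easy to compute. The sketch below reflects my reading of the NetworkAnalyzer definition: for a node n of degree k, T(n) averages J(n, m) / k over all nodes m that share at least one neighbor with n, where J(n, m) counts the shared neighbors, plus one if n and m are directly linked:

```python
def topological_coefficient(adj, n):
    """Topological coefficient of node n in an undirected graph.

    adj maps each node to the set of its neighbors."""
    nbrs = adj[n]
    k = len(nbrs)
    if k == 0:
        return 0.0
    scores = []
    for m in adj:
        if m == n:
            continue
        shared = len(nbrs & adj[m])
        if shared:
            # +1 if m is also a direct neighbor of n
            scores.append(shared + (1 if m in nbrs else 0))
    return sum(scores) / (len(scores) * k) if scores else 0.0
```

Low-coefficient nodes share few neighbors with anyone else, which is what makes them behave like gateways between otherwise separate regions of the graph.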
By tightening the coefficient further, we can get a view of the top 350 (actually, the top 336). Were the system live and not a captured JPEG, we could zoom in and read the actual node labels.
The real value from a graph structure, of course, is that we can now make selections based on relationships, neighbors and distances for various reasoning, inference or relatedness purposes. For this diagram, I input “saab” as my car concept and then retrieved all nodes within two links:
Alternatively, for the same “saab” car concept, I asked for all directly related links (in yellow) and did some pruning of car types to make the subgraph more readable and interesting:
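The “all nodes within two links” selection is just a breadth-first search bounded by radius. Here is a minimal sketch over an undirected adjacency dict (node names illustrative, not the actual UMBEL labels):

```python
from collections import deque

def within_links(adj, start, radius=2):
    """Return all nodes reachable from start in at most `radius` hops."""
    seen = {start: 0}          # node -> distance from start
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == radius:
            continue           # at the boundary; do not expand further
        for nbr in adj.get(node, ()):
            if nbr not in seen:
                seen[nbr] = seen[node] + 1
                queue.append(nbr)
    return set(seen) - {start}
```

The pruning step I mention above is then just removing chosen nodes from the returned set (or from adj) before rendering the subgraph.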
This ability to manipulate and navigate this large subject backbone at will should bring immense benefits. And, because of its common sense grounding, the early explorations of this first-glimpse UMBEL structure look very logical and clean.
Once we complete the next packaging and draft release steps, anyone will be able to play with and manipulate this UMBEL structure at will. The ontology and the tools we are using to manipulate it are all open source.
Our next steps on UMBEL will have us publishing the technical report (TR) of how we screened and vetted the subject concepts from the Cyc knowledge base, using an updated OpenCyc version. That document will hopefully gain some broader review and scrutiny for the canonical listing of subject concepts.
Of course, all of that is merely leading up to the Release 0 of the published ontology. We are working diligently to get that posted as well in the very near future.
These graphs were built using the super Cytoscape large-graph visualization framework, which I previously reviewed with glowing praise. The subgraph extractions were greatly aided by a fantastic add-in called NetworkAnalyzer from the Max-Planck-Institut für Informatik. I will be writing more about this add-in at a later time, including some guidance on how to use it for meaningful ontology analysis. But, in the meantime, do check this add-in out. Mucho cool, and another winner!
Since about 2005 — and at an accelerating pace — Wikipedia has emerged as the leading online knowledge base for conducting semantic Web and related research. The system is being tapped for both data and structure. Wikipedia has arguably replaced WordNet as the leading lexicon for concepts and relations. Because of its scope and popularity, many argue that Wikipedia is emerging as the de facto structure for classifying and organizing knowledge in the 21st century.
Our work on the UMBEL lightweight reference subject concept structure has stated since the project’s announcement in July 2007 that Wikipedia is a key intended resource for identifying subject concepts and entities. For the past few months I have been scouring the globe for every drop of research I could find on the use of Wikipedia for the semantic Web, information extraction, categorization and related issues.
Thus, I’m pleased to offer up herein the most comprehensive such listing available anywhere: more than 99 resources and counting! (I say “more than” because some entries below have multiple resources; I just liked the sound of 99 as a round number!)
Wikipedia itself maintains a listing of academic studies using Wikipedia as a resource; fewer than one-third of the listings below are on that list (which itself may be an indication of the current state of completeness within Wikipedia). Some bloggers and other sources around the Web also maintain listings in lesser degrees of completeness.
The tremendous growth of content and topics within Wikipedia is well documented (see, as examples, the W1, W2, W3, W4, W5, W6 and W7 internal Wikipedia sources for gory details); as of early 2008, there are about 2.25 million articles in English and versions in 256 languages and variants.
Download access to the full knowledge base has enabled the development of notable core references for the Linked Data aspects of the semantic Web, such as DBpedia [5,6] and YAGO [72,73]. Entire research teams, such as Ponzetto and Strube [61-65] (and others as well; see below), are moving toward creating full-blown ontologies or structured knowledge bases useful for semantic Web purposes based on Wikipedia. So one of the first and principal uses of Wikipedia to date has been as a data source of concepts, entities and relations.
But much broader data and text mining and analysis is being conducted against Wikipedia, work that is currently defining the state of the art in these areas, too:
These objectives, in turn, are met by mining and extracting various kinds of structure within Wikipedia:
These are some of the specific uses that are included in the 99 resources listed below.
This is an exciting (and, for most all of us just a few years back, unanticipated) use of the Web for socially relevant and contextual knowledge and research. I’m sure such a listing one year from now will be double in size or larger!
BTW, suggestions for new or overlooked entries are very much welcomed!
The pending UMBEL subject concept “backbone” ontology will involve literally thousands of concepts. In order to manage and view such a large structure, a concerted effort to find suitable graph visualization software was mounted. This post presents the candidate listing, as well as some useful starting resources and background information.
A subsequent post will present the surprise winner of our evaluation.
For grins, you may also like to see various example visualizations, most with a large-graph bent:
Here is the listing of 26 candidate graph visualization programs assembled to date:
IVC Software Framework – the InfoVis Cyberinfrastructure (IVC) software framework is a set of libraries that provides a simple, uniform programming interface to algorithms and a user interface for end users, leveraging the power of the Eclipse Rich Client Platform (RCP)