Posted: March 27, 2008

I remember from some of my first jobs in the restaurant industry how surprised I was at the depth of feeling employees could hold toward one another: waiters screaming at other waiters; kitchen staff dismissive of the front-of-house staff; everyone sharpening knives for pompous managers; and the like.

Strangely, this past week had many similar flashbacks for me.

If you have been around a bit (not necessarily all the way back to the tulip frenzy in Holland), you have seen hype and frenzy screw things up. This whole idea of the “last fool” is pretty disgusting and has a real effect on real people. Speculators pushing up house prices 20% per year in Vegas and Miami are only the latest, most extreme examples.

Tim Berners-Lee does not blog frequently, but, when he does, it always seems to be at a moment of import.

In his post tonight, I think he is, with grace, trying to say some things to us. He talks about buzz and hype; he tries to push silly notions about “killer apps” into the background; he emphasizes the real challenge of how a democratized knowledge environment needs to find new measures of trust; and, again, he talks about the importance of data and linkages.

The real stimulus, I sense, is that in the current frenzy about “semantic Web” stuff his real points are being misunderstood and misquoted.

Listen carefully:

In all this Semantic Web news, though, the proof of the pudding is in the eating. The benefit of the Semantic Web is that data may be re-used in ways unexpected by the original publisher. That is the value added. So when a Semantic Web start-up either feeds data to others who reuse it in interesting ways, or itself uses data produced by others, then we start to see the value of each bit increased through the network effect.

So if you are a VC funder or a journalist and some project is being sold to you as a Semantic Web project, ask how it gets extra re-use of data, by people who would not normally have access to it, or in ways for which it was not originally designed. Does it use standards? Is it available in RDF? Is there a SPARQL server?

Those of us who have been there before fear the hype and cynicism that brought us the dot-com era.

If you feel that you are truly part of a historical transition point — as I and those who have been laborers in the garden of the semantic Web do — then you sense this same potential for an important effort to be hijacked.

The smell of money is in the air; the hype machine is in full swing; VCs are breathy and attentive. Podcasts and blogs are full of baloney. Media excesses are now observable.

One view might be that perspective itself will tell us all of this is natural: we are simply in the midst of some expected phase of a Moore chasm or some other predicted evolution of technology hype and development. But let me ask this: how many times must we play the greater fool to someone else’s lesser fool? We’ve seen this before, and the promise of the semantic Web to do more deserves more.

I wish I had Tim Berners-Lee’s grace; I do not. But all of us can look around and gain perspective. And that perspective is: look for substance and value. For everything else, hold on to your wallet.

Posted by AI3's author, Mike Bergman Posted on March 27, 2008 at 10:06 pm in Adaptive Information, Semantic Web | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/429/grace-and-perspective/
The URI to trackback this post is: http://www.mkbergman.com/429/grace-and-perspective/trackback/
Posted: March 2, 2008

Glut: Mastering Information Through The Ages

Wright’s Book Has Strong Scope, Disappointing Delivery

When I first saw the advance blurb for Glut: Mastering Information through the Ages by Alex Wright, I thought, “Wow, here is the book I have been looking for or wanting to write myself.” As the book jacket explains:

Spanning disciplines from evolutionary theory and cultural anthropology to the history of books, libraries and computer science, Wright weaves an intriguing narrative that connects such seemingly far-flung topics as insect colonies, Stone Age jewelry, medieval monasteries, Renaissance encyclopedias, early computer networks, and the World Wide Web. Finally, he pulls these threads together to reach a surprising conclusion, suggesting that the future of the information age may lie deep in our cultural past.

Wham, bang! The PR snaps with promise and scope!

These are themes that have been my passion for decades, and I ordered the book as soon as it was announced. It was therefore with great anticipation that I cracked open the cover as soon as I received it. (BTW, the actual date of posting for this review is much later only because I left this review in draft for some months; itself an indication of how, unfortunately, I lost interest in it. :( ).

Otlet is a Gem

The best aspect of Glut is the attention it brings to Paul Otlet, quite likely one of the most original and overlooked innovators in information science in the 20th century. Frankly, I had only an inkling of who Otlet was prior to this book, and Wright provides a real service by bringing more attention to this forgotten hero.

(I have since gone on to try to learn more about Otlet and his pioneering work in faceted classification — as carried on more notably by S. R. Ranganathan with the Colon classification system — and his ideas behind the creation of the Mundaneum in Brussels in 1910. The Mundaneum and Otlet’s ideas were arguably a forerunner to some aspects of the Internet, Wikipedia and the semantic Web. Unfortunately, the Mundaneum and its 14 million ‘permanent encyclopedia’ items were taken over by German troops in World War II. The facility was ravaged and sank into obscurity, as did the reputation of Otlet, who died in 1944 before the war ended. It was not until Boyd Rayward translated many of Otlet’s seminal works into English in the late 1980s that he was rediscovered.)

Alex Wright’s own Google Tech Talk from Oct. 23, 2007, talks much about Otlet, and is a good summary of some of the other topics in Glut.

Stapled Book Reviews

The real disappointment in Glut is the lack of depth and scholarship. The basic technique seemed to be: find a prominent book on a given topic, summarize it in a popularized tone, sprinkle in a couple of extra references from the source book relied on for that chapter to lend a patina of scholarship, and move on to the next chapter. Then, add a few silly appendices to pad the book’s length.

So we see, for example, key dependence on relatively few sources for the arguments and points made. Rather than enumerate them here, the interested reader can simply peruse the expanded bibliography on Wright’s Glut Web site. That listing is actually quite a good basis for beginning your own collection.

Books are Different

It seems that today, with blogging and digital content flying everywhere, a greater standard should be set for creating a book and asking the buying public to actually pay for it. That greater standard should be the effort and diligence to research the topic at hand.

Glut strikes me as akin to similar efforts where not enough homework was done. For example, see Walter Underwood, who in his review of the Everything is Miscellaneous (not!) book chastises author David Weinberger on similar grounds. (A conclusion I had also reached after viewing this Weinberger video cast.)

In summary, I give Wright an A for scope and a C or D in execution and depth. I realize that is a pretty harsh review; but it is one occasioned by my substantially unmet high hopes and expectations.

The means by which information and document growth have come to be organized, classified and managed have been major factors in humanity’s progress and skyrocketing wealth. Glut‘s skimpy hors d’œuvres merely whet the appetite: the full historical repast has yet to be served.

Posted: February 20, 2008


Here’s a Sneak Peek at Some UMBEL Subject Graphs

We are proceeding apace with the first release of the UMBEL (Upper-level Mapping and Binding Exchange Layer) lightweight subject concept ontology. The internal working version presently has 21,580 subject nodes, though further review will certainly change that number before public release of the first draft.

UMBEL defines “subject concepts” as a distinct subset of the more broadly understood notion of a concept, such as is used in the SKOS RDFS controlled vocabulary, in formal concept analysis, or in the very general concepts common to some upper ontologies. Subject concepts are a special kind of concept: ones that are concrete, subject-related and non-abstract. We further contrast these with named entities, which are the real things or instances in the world that are members of these subject concept classes.

Thus, in UMBEL parlance, there are abstract concepts, subject concepts and named entities.

The “backbone” of UMBEL is its set of these reference (“canonical,” if you will) subject concepts. These subject concepts are being derived from the OpenCyc version of the Cyc knowledge base. The resulting 22 K nodes of this subject structure are related via subClassOf and type predicates; these are the graph’s edges. The graph pictures herein are a first glimpse of this UMBEL backbone structure.
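To make the backbone structure concrete, here is a minimal sketch of such a graph. The node names and edges are illustrative toys, not actual UMBEL subject concepts, and networkx is simply a convenient stand-in for whatever graph library you prefer:

```python
import networkx as nx

# A toy fragment of a subject concept backbone. Node names and edges are
# illustrative only; they are not actual UMBEL subject concepts.
G = nx.DiGraph()
G.add_edge("Saab", "Automobile", predicate="type")  # named entity -> subject concept
G.add_edge("Automobile", "WheeledVehicle", predicate="subClassOf")
G.add_edge("SportsCar", "Automobile", predicate="subClassOf")
G.add_edge("WheeledVehicle", "TransportationDevice", predicate="subClassOf")

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```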

The Deep Dive

We can take the full network graph and simulate a deep dive into its structure, as the following figures show.

The Big Graph

So, here is the big graph, with all nodes and edges (blue) displayed. This is just about at the limit of our graphing program, Cytoscape, which we estimate is limited to about 30 K nodes:

UMBEL Big Graph

The Top 750

Through manipulation of the topological coefficient, a relative measure of the extent to which a node shares neighbors with other nodes, we can zoom in on the top 750 (actually, 759!) node gateways or hubs. There are other ways to evaluate key nodes in a network, but this one fairly nicely approximates the upper structure or hierarchy within the graph:

UMBEL Top 750
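For those who want to experiment with this kind of filtering outside of Cytoscape, here is a rough sketch of the topological coefficient as I understand the NetworkAnalyzer measure (treat the exact formula as an assumption and check the plugin’s own documentation), applied to the toy graph sketched above:

```python
def topological_coefficient(G, n):
    # Average number of neighbors that n shares with each node m it has at
    # least one shared neighbor with (plus 1 if n and m are directly linked),
    # normalized by n's degree. This is my reading of the NetworkAnalyzer
    # measure, not a definitive reimplementation.
    Gu = G.to_undirected(as_view=True)
    neighbors = set(Gu[n])
    if len(neighbors) < 2:
        return 0.0
    partners = {m for nb in neighbors for m in Gu[nb]} - {n}
    if not partners:
        return 0.0
    shared = sum(len(neighbors & set(Gu[m])) + (1 if m in neighbors else 0)
                 for m in partners)
    return shared / (len(partners) * len(neighbors))

# Rank all nodes by the coefficient; where to cut the ranking, and from
# which end, is a judgment call that depends on the particular graph.
ranked = sorted(G.nodes, key=lambda n: topological_coefficient(G, n), reverse=True)
print(ranked[:3])
```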

The Top 350

By tightening the coefficient further, we can get a view of the Top 350 (actually, the top 336). Were the system live and not a captured jpeg, we could zoom in and read the actual node labels.

UMBEL Top 350

Two Degrees of Separation: Saab Example

The real value of a graph structure, of course, is that we can now make selections based on relationships, neighbors and distances for various reasoning, inference or relatedness purposes. This diagram begins by inputting “saab” as my car concept and then getting all nodes within two links:

UMBEL Big Saab
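Extracting such a two-degree neighborhood is a one-liner in most graph libraries. A minimal sketch against the toy graph above (using networkx’s ego_graph, with the concept name cased as in that toy graph):

```python
# All nodes within two links of the "Saab" node, ignoring edge direction.
saab_hood = nx.ego_graph(G, "Saab", radius=2, undirected=True)
print(sorted(saab_hood.nodes))
```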

The Saab Neighborhood

Alternatively, for the same “saab” car concept, I asked for all directly related links (in yellow) and did some pruning of car types to make the subgraph more readable and interesting:

UMBEL Saab Neighborhood

This ability to manipulate and navigate this large subject backbone at will should bring immense benefits. And, because of its common sense grounding, the early explorations of this first-glimpse UMBEL structure look very logical and clean.

UMBEL Status

Once we complete the next packaging and draft release steps, anyone will be able to play with and manipulate this UMBEL structure at will. The ontology and the tools we are using to manipulate it are all open source.

Our next steps on UMBEL are to publish the technical report (TR) describing how we screened and vetted the subject concepts from the Cyc knowledge base, using an updated OpenCyc version. That document will hopefully gain some broader review and scrutiny of the canonical listing of subject concepts.

Of course, all of that is merely leading up to the Release 0 of the published ontology. We are working diligently to get that posted as well in the very near future.

A Note on the Graphs

These graphs were built using the super Cytoscape large-graph visualization framework, which I previously reviewed with glowing praise. The subgraph extractions were greatly aided by a fantastic add-in called NetworkAnalyzer from the Max-Planck-Institut für Informatik. I will be writing more about this add-in at a later time, including some guidance on how to use it for meaningful ontology analysis. But, in the meantime, do check this add-in out. Mucho cool, and another winner!

Posted by AI3's author, Mike Bergman Posted on February 20, 2008 at 4:21 pm in Adaptive Information, Semantic Web, Structured Web, UMBEL | Comments (1)
The URI link reference to this post is: http://www.mkbergman.com/419/so-what-might-the-webs-subject-backbone-look-like/
The URI to trackback this post is: http://www.mkbergman.com/419/so-what-might-the-webs-subject-backbone-look-like/trackback/
Posted: February 18, 2008


Most Comprehensive Reference List Available Shows Impressive Depth, Breadth

Since about 2005 — and at an accelerating pace — Wikipedia has emerged as the leading online knowledge base for conducting semantic Web and related research. The system is being tapped for both data and structure. Wikipedia has arguably replaced WordNet as the leading lexicon for concepts and relations. Because of its scope and popularity, many argue that Wikipedia is emerging as the de facto structure for classifying and organizing knowledge in the 21st century.

Our work on the UMBEL lightweight reference subject concept structure has stated since the project’s announcement in July 2007 that Wikipedia is a key intended resource for identifying subject concepts and entities. For the past few months I have been scouring the globe, attempting to find every drop of research on the use of Wikipedia for the semantic Web, information extraction, categorization and related issues.

Thus, I’m pleased to offer up herein the most comprehensive such listing available anywhere: more than 99 resources and counting! (I say “more than” because some entries below have multiple resources; I just liked the sound of 99 as a round number!)

Wikipedia itself maintains a listing of academic studies using Wikipedia as a resource; fewer than one-third of the listings below appear on that list (which itself may be an indication of the current state of completeness within Wikipedia). Some bloggers and other sources around the Web also maintain listings of varying completeness.

The tremendous growth of content and topics within Wikipedia is well documented (see, as examples, the W1, W2, W3, W4, W5, W6 and W7 internal Wikipedia sources for gory details), with about 2.25 million articles in English as of early 2008 and versions in 256 languages and variants.

Download access to the full knowledge base has enabled the development of notable core references for the Linked Data aspects of the semantic Web, such as DBpedia [5,6] and YAGO [72,73]. Entire research teams, such as Ponzetto and Strube [61-65] (and others as well; see below), are moving toward creating full-blown ontologies or structured knowledge bases useful for semantic Web purposes based on Wikipedia. So, one of the first and principal uses of Wikipedia to date has been as a data source of concepts, entities and relations.

But much broader data mining and text mining and analysis are being conducted against Wikipedia, work that is currently defining the state of the art in these areas, too:

  • Ontology development and categorization
  • Word sense disambiguation
  • Named entity recognition
  • Named entity disambiguation
  • Semantic relatedness and relations.

Meeting these objectives, in turn, involves mining and extracting various kinds of structure from Wikipedia, which are put to the following uses:

  • Articles
    • First paragraph — Definitions
    • Full text — Description of meaning; related terms; translations
    • Redirects — Synonymy; spelling variations, misspellings; abbreviations
    • Title — Named entities; domain specific terms or senses
    • Subject — Category suggestion (phrase marked in bold or in first paragraph)
    • Section heading — Category suggestions
  • Article links
    • Context — Related terms; co-occurrences
    • Label — Synonyms; spelling variations; related terms
    • Target — Link graph; related terms
    • LinksTo — Category suggestion
    • LinkedBy — Category suggestion
  • Categories
    • Category — Category suggestion
    • Contained articles — Semantically related terms (siblings)
    • Hierarchy — Hyponymic and meronymic relations between terms
  • Disambiguation pages
    • Article links — Sense inventory
  • Infobox Templates
    • Name –
    • Item — Category suggestion; entity suggestion
  • Lists
    • Hyponyms

These are some of the specific uses that are included in the 99 resources listed below.
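As a concrete illustration of just one of these structure types, mining redirects for synonymy, here is a rough sketch of scanning a Wikipedia pages-articles dump for redirect targets. The file name is illustrative, and the line-oriented scan is a shortcut; a production pipeline would use a proper XML parser or purpose-built tools such as WikiPrep or Parse::MediaWikiDump, both listed below:

```python
import bz2
import re

TITLE = re.compile(r"<title>(.*?)</title>")
REDIRECT = re.compile(r"#REDIRECT\s*\[\[([^\]\|#]+)", re.IGNORECASE)

# target article -> set of redirect titles (synonyms, variants, misspellings)
synonyms = {}
title = None

# File name is illustrative; point this at a local pages-articles dump.
with bz2.open("enwiki-pages-articles.xml.bz2", "rt", encoding="utf-8") as dump:
    for line in dump:
        m = TITLE.search(line)
        if m:
            title = m.group(1)
            continue
        m = REDIRECT.search(line)
        if m and title is not None:
            synonyms.setdefault(m.group(1).strip(), set()).add(title)

print(len(synonyms), "articles with redirect-derived synonyms")
```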

This is an exciting (and, for almost all of us just a few years back, unanticipated) use of the Web in socially relevant and contextual knowledge and research. I’m sure such a listing a year from now will be double in size or larger!

BTW, suggestions for new or overlooked entries are very much welcomed! :)

A – B

  1. Sisay Fissaha Adafre and Maarten de Rijke, 2006. Finding Similar Sentences across Multiple Languages in Wikipedia, in EACL 2006 Workshop on New Text–Wikis and Blogs and Other Dynamic Text Sources, April 2006. See http://www.science.uva.nl/~mdr/Publications/Files/eacl2006-similarsentences.pdf.
  2. Sisay Fissaha Adafre, V. Jijkoun and M. de Rijke. Fact Discovery in Wikipedia, in 2007 IEEE/WIC/ACM International Conference on Web Intelligence. See http://staff.science.uva.nl/~mdr/Publications/Files/wi2007.pdf.
  3. Sisay Fissaha Adafre and Maarten de Rijke, 2005. Discovering Missing Links in Wikipedia, in LinkKDD 2005, August 21, 2005, Chicago, IL. See http://data.isi.edu/conferences/linkkdd-05/Download/Papers/linkkdd05-13.pdf.
  4. David Ahn, Valentin Jijkoun, Gilad Mishne, Karin Müller, Maarten de Rijke, and Stefan Schlobach. 2004. Using Wikipedia at the TREC QA Track, in Proceedings of TREC 2004.
  5. Sören Auer, Chris Bizer, Jens Lehmann, Georgi Kobilarov, Richard Cyganiak and Zachary Ives, 2007. DBpedia: A nucleus for a web of open data, in Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC2007), Busan, South Korea, volume 4825 of LNCS, pages 715–728, November 2007. See http://iswc2007.semanticweb.org/papers/ISWC2007_IU_Auer.pdf.
  6. Sören Auer and Jens Lehmann, 2007. What Have Innsbruck and Leipzig in Common? Extracting Semantics from Wiki Content, in The Semantic Web: Research and Applications, pages 503-517, 2007. See http://www.eswc2007.org/pdf/eswc07-auer.pdf.
  7. Somnath Banerjee, 2007. Boosting Inductive Transfer for Text Classification Using Wikipedia, at the Sixth International Conference on Machine Learning and Applications (ICMLA). See http://portal.acm.org/citation.cfm?id=1336953.1337115.
  8. Somnath Banerjee, Krishnan Ramanathan, Ajay Gupta, 2007. Clustering Short Texts using Wikipedia, poster presented at the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, pp. 787-788.
  9. F. Bellomi and R. Bonato, 2005. Lexical Authorities in an Encyclopedic Corpus: A Case Study with Wikipedia, online reference not found.
  10. F. Bellomi and R. Bonato, 2005. Network Analysis for Wikipedia, presented at Wikimania 2005; see http://www.fran.it/articles/wikimania_bellomi_bonato.pdf
  11. Bibauw (2005) analyzed the lexicographical structure of Wiktionary (in French).
  12. Abhijit Bhole, Blaž Fortuna, Marko Grobelnik and Dunja Mladenić, 2007. Mining Wikipedia and Relating Named Entities over Time. See http://www.cse.iitb.ac.in/~abhijit.bhole/SiKDD2007ExtractingWikipedia.pdf.
  13. Razvan Bunescu and Marius Pasca. 2006. Using Encyclopedic Knowledge for Named Entity Disambiguation, in Proceedings of the 11th Conference of the EACL, pages 9-16, Trento, Italy.
  14. Razvan Bunescu, 2007. Learning for Information Extraction: From Named Entity Recognition and Disambiguation To Relation Extraction, Ph.D. thesis for the University of Texas, August 2007, 168 pp. See http://oucsace.cs.ohiou.edu/~razvan/papers/thesis-white.pdf.
  15. Luciana Buriol, Carlos Castillo, Debora Donato, Stefano Leonardi, and Stefano Millozzi. 2006. Temporal Analysis of the Wikigraph, in Proceedings of Web Intelligence, Hong Kong.
  16. Davide Buscaldi and Paolo Rosso, 2007. A Comparison of Methods for the Automatic Identification of Locations in Wikipedia, in Proceedings of GIR’07, November 9, 2007, Lisbon, Portugal, pp 89-91. See http://www.dsic.upv.es/~prosso/resources/BuscaldiRosso_GIR07.pdf.

C – F

  17. Ruiz-Casado, M., Alfonseca, E., and Castells, P. 2005. Automatic Assignment of Wikipedia Encyclopedic Entries to WordNet Synsets, in AWIC, pages 380-386. See http://nets.ii.uam.es/publications/nlp/awic05.pdf.
  18. Maria Ruiz-Casado, Enrique Alfonseca and Pablo Castells, 2006. From Wikipedia to Semantic Relationships: a Semi-automated Annotation Approach, in ESWC2006.
  19. Maria Ruiz-Casado, Enrique Alfonseca and Pablo Castells, 2007. Automatising the Learning of Lexical Patterns: an Application to the Enrichment of WordNet by Extracting Semantic Relationships from Wikipedia. See http://nets.ii.uam.es/publications/nlp/dke07.pdf.
  20. Sergey Chernov, Tereza Iofciu, Wolfgang Nejdl, and Xuan Zhou, 2006. Extracting Semantic Relationships between Wikipedia Categories, from SEMWIKI 2006.
  21. Aron Culotta, Andrew McCallum and Jonathan Betz, 2006. Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text, in Proceedings of HLT-NAACL-2006. See http://www.cs.umass.edu/~culotta/pubs/culotta06integrating.pdf.
  22. Silviu Cucerzan, 2007. Large-Scale Named Entity Disambiguation Based on Wikipedia Data, in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). See http://www.aclweb.org/anthology-new/D/D07/D07-1074.pdf.
  23. Cyclopedia, from the Cyc Foundation, an online version of Wikipedia that enables browsing the encyclopedia by concepts. See http://www.cycfoundation.org/blog/?page_id=15.
  24. Wisam Dakka and Silviu Cucerzan, 2008. Augmenting Wikipedia with Named Entity Tags, to be published in IJCNLP. See http://research.microsoft.com/users/silviu/Papers/np-ijcnlp08.pdf.
  25. Wisam Dakka and Silviu Cucerzan, 2008. Also, see the online tool http://wikinet.stern.nyu.edu/ (not yet implemented).
  26. Turdakov Denis, 2007. Recommender System Based on User-generated Content. See http://syrcodis.citforum.ru/2007/9.pdf.
  27. EachWiki, online system from Fu et al.
  28. Linyun Fu, Haofen Wang, Haiping Zhu, Huajie Zhang, Yang Wang and Yong Yu, 2007. Making More Wikipedians: Facilitating Semantics Reuse for Wikipedia Authoring, at ISWC 2007. See http://iswc2007.semanticweb.org/papers/ISWC2007_RT_Fu.pdf.

G – H

  29. Evgeniy Gabrilovich and Shaul Markovitch. 2006. Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge, in AAAI, pages 1301-1306, Boston, MA.
  30. Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007.
  31. Evgeniy Gabrilovich, 2006. Feature Generation for Textual Information Using World Knowledge, Ph.D. Thesis for The Technion – Israel Institute of Technology, Haifa, Israel, December 2006, 218 pp. See http://www.cs.technion.ac.il/~gabr/papers/phd-thesis.pdf. (There is also an informative video at http://www.researchchannel.org/prog/displayevent.aspx?rID=4915.)
  32. See also Gabrilovich’s Perl Wikipedia tool, WikiPrep.
  33. Rudiger Gleim, Alexander Mehler and Matthias Dehmer, 2007. Web Corpus Mining by Instance of Wikipedia, in Proc. 2nd Web as Corpus Workshop at EACL 2006. See http://acl.ldc.upenn.edu/W/W06/W06-1710.pdf.
  34. Andrew Gregorowicz and Mark A. Kramer, 2006. Mining a Large-Scale Term-Concept Network from Wikipedia, Mitre Technical Report, October 2006. See http://www.mitre.org/work/tech_papers/tech_papers_06/06_1028/06_1028.pdf.
  35. Andreas Harth, Hannes Gassert, Ina O’Murchu, John Breslin and Stefan Decker, 2005. WikiOnt: An Ontology for Describing and Exchanging Wiki Articles, presented at Wikimania, Frankfurt, 5th August 2005. See http://sw.deri.org/~jbreslin/presentations/20050805a.pdf.
  36. Martin Hepp, Daniel Bachlechner and Katharina Siorpaes, 2006. Harvesting Wiki Consensus – Using Wikipedia Entries as Ontology Elements. See http://www.heppnetz.de/files/SemWiki2006-Harvesting%20Wiki%20Consensus-LNCS-final.pdf.
  37. Martin Hepp, Katharina Siorpaes, Daniel Bachlechner, 2007. Harvesting Wiki Consensus: Using Wikipedia Entries as Vocabulary for Knowledge Management, in IEEE Internet Computing, Vol. 11, No. 5, pp. 54-65, Sept-Oct 2007. See http://www.heppnetz.de/files/hepp-siorpaes-bachlechner-harvesting%20wikipedia%20w5054.pdf. Also, sample data at http://www.heppnetz.de/harvesting-wikipedia/.
  38. A. Herbelot and Ann Copestake, 2006. Acquiring Ontological Relationships from Wikipedia Using RMRS, in Proc. International Semantic Web Conference 2006 Workshop, Web Content Mining with Human Language Technologies, Athens, GA, 2006. See http://orestes.ii.uam.es/workshop/12.pdf.
  39. Ryuichiro Higashinaka, Kohji Dohsaka and Hideki Isozaki, 2007. Learning to Rank Definitions to Generate Quizzes for Interactive Information Presentation, in Companion Volume to the Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics; see pages 117-120 within http://acl.ldc.upenn.edu/P/P07/P07-2.pdf.
  40. Todd Holloway, Miran Bozicevic, and Katy Börner. 2005. Analyzing and Visualizing the Semantic Coverage of Wikipedia and Its Authors. ArXiv Computer Science e-prints, cs/0512085.
  41. Wei Che Huang, Andrew Trotman, and Shlomo Geva, 2007. Collaborative Knowledge Management: Evaluation of Automated Link Discovery in the Wikipedia, in SIGIR 2007 Workshop on Focused Retrieval, July 27, 2007, Amsterdam, The Netherlands. See http://www.cs.otago.ac.nz/sigirfocus/paper_15.pdf.

I – K

  42. Jonathan Isbell and Mark H. Butler, 2007. Extracting and Re-using Structured Data from Wikis, Hewlett-Packard Technical Report HPL-2007-182, 14th November, 2007, 22 pp. See http://www.hpl.hp.com/techreports/2007/HPL-2007-182.pdf.
  43. Maciej Janik and Krys Kochut, 2007. Wikipedia in Action: Ontological Knowledge in Text Categorization, University of Georgia, Computer Science Department Technical Report no. UGA-CS-TR-07-001. See http://lsdis.cs.uga.edu/~mjanik/UGA-CS-TR-07-001.pdf.
  44. Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath and Gerhard Weikum, 2007. NAGA: Searching and Ranking Knowledge, Technical Report, Max-Planck-Institut für Informatik, MPI-I-2007-5-001, March 2007, 42 pp. See http://www.mpi-inf.mpg.de/~kasneci/naga/report.pdf.
  45. Jun’ichi Kazama and Kentaro Torisawa, 2007. Exploiting Wikipedia as External Knowledge for Named Entity Recognition, in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 698-707, Prague, June 2007. See http://acl.ldc.upenn.edu/D/D07/D07-1073.pdf.
  46. Daniel Kinzler, 2005. WikiSense Mining the Wiki, V 1.1, presented at Wikimania 2005, 10 pp. See http://brightbyte.de/repos/papers/2005/WikiSense-Presentation.pdf.
  47. A. Krizhanovsky, 2006. Synonym Search in Wikipedia: Synarcher, in 11th International Conference “Speech and Computer” SPECOM’2006, Russia, St. Petersburg, June 25-29, 2006, pp. 474-477. See http://arxiv.org/abs/cs/0606097 (also PDF).
  48. Natalia Kozlova, 2005. Automatic Ontology Extraction for Document Classification, a Master’s Thesis for Saarland University, February 2005, 90 pp. See http://domino.mpi-inf.mpg.de/imprs/imprspubl.nsf/80255f02006a559a80255ef20056fc02/3b864d86612739b0c1256fb70042547a/$FILE/Masterarbeit-Kozlova-Nat-2005.pdf.
  49. M. Krötzsch, D. Vrandečić and M. Völkel, 2005. Wikipedia and the Semantic Web: the Missing Links, at Wikimania 2005. See http://citeseer.ist.psu.edu/cache/papers/cs2/143/http:zSzzSzwww.aifb.uni-karlsruhe.dezSzWBSzSzmakzSzpubzSzwikimania.pdf/krotzsch05wikipedia.pdf.

L – N

  50. Rada Mihalcea and Andras Csomai, 2007. Wikify! Linking Documents to Encyclopedic Knowledge, in Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM ’07), November 6-8, 2007, pp. 233-241. See the ACM portal (http://portal.acm.org/citation.cfm?id=1321475).
  51. Rada Mihalcea, 2007. Using Wikipedia for Automatic Word Sense Disambiguation, in Proceedings of NAACL HLT 2007, pages 196-203, April 2007. See http://www.cs.unt.edu/~rada/papers/mihalcea.naacl07.pdf.
  52. David Milne, 2007. Computing Semantic Relatedness using Wikipedia Link Structure; see also the Wikipedia Miner Toolkit (http://sourceforge.net/projects/wikipedia-miner/) provided by the author.
  53. D. Milne, O. Medelyan and I. H. Witten, 2006. Mining Domain-Specific Thesauri from Wikipedia: A Case Study, in Proceedings of the International Conference on Web Intelligence (IEEE/WIC/ACM WI’2006), Hong Kong. Also see Olena Medelyan’s home page, http://www.cs.waikato.ac.nz/~olena/.
  54. David Milne, Ian H. Witten and David M. Nichols, 2007. A Knowledge-Based Search Engine Powered by Wikipedia, at CIKM ’07.
  55. Lev Muchnik, Royi Itzhack, Sorin Solomon and Yoram Louzoun, 2007. Self-emergence of Knowledge Trees: Extraction of the Wikipedia Hierarchies, in Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), Vol. 76, No. 1.
  56. Nadeau, D., Turney, P., Matwin, S., 2006. Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity, at the 19th Canadian Conference on Artificial Intelligence, Québec City, Québec, Canada, June 7, 2006. See http://www.iit-iti.nrc-cnrc.gc.ca/iit-publications-iti/docs/NRC-48727.pdf. Doesn’t specifically use Wikipedia, but the techniques are applicable.
  57. Kotaro Nakayama, Takahiro Hara and Shojiro Nishio, 2007. Wikipedia Mining for an Association Web Thesaurus Construction, in Web Information Systems Engineering – WISE 2007, Vol. 4831 (2007), pp. 322-334.
  58. Dat P.T. Nguyen, Yutaka Matsuo and Mitsuru Ishizuka, 2007. Relation Extraction from Wikipedia Using Subtree Mining, from AAAI ’07.

O – P

  59. Yann Ollivier and Pierre Senellart, 2007. Finding Related Pages Using Green Measures: An Illustration with Wikipedia, in Proceedings of the AAAI-07 Conference. See http://pierre.senellart.com/publications/ollivier2006finding.pdf. See also http://pierre.senellart.com/publications/ollivier2006finding/ for tools and data.
  60. Simon Overell and Stefan Ruger, 2006. Identifying and Grounding Descriptions of Places, in SIGIR Workshop on Geographic Information Retrieval, pages 14-16, 2006. See http://mmis.doc.ic.ac.uk/www-pub/sigir06-GIR.pdf.
  61. Simone Paolo Ponzetto and Michael Strube, 2006. Exploiting Semantic Role Labeling, WordNet and Wikipedia for Coreference Resolution, in NAACL 2006. See http://www.eml-research.de/english/homes/strube/papers/naacl06.pdf.
  62. Simone Paolo Ponzetto and Michael Strube, 2007a. Deriving a Large Scale Taxonomy from Wikipedia, in Association for the Advancement of Artificial Intelligence (AAAI 2007).
  63. Simone Paolo Ponzetto and Michael Strube, 2007b. Knowledge Derived From Wikipedia For Computing Semantic Relatedness, in Journal of Artificial Intelligence Research 30 (2007) 181-212. See also these PPT slides (in PDF format): Part I, Part II, and References.
  64. Simone Paolo Ponzetto and Michael Strube, 2007c. An API for Measuring the Relatedness of Words in Wikipedia, in Companion Volume to the Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics; see pages 49-52 within http://acl.ldc.upenn.edu/P/P07/P07-2.pdf.
  65. Simone Paolo Ponzetto, 2007. Creating a Knowledge Base from a Collaboratively Generated Encyclopedia, in Proceedings of the NAACL-HLT 2007 Doctoral Consortium, pp. 9-12, Rochester, NY, April 2007. See http://www.aclweb.org/anthology-new/N/N07/N07-3003.pdf.

R – S

  66. Tyler Riddle, 2006. Parse::MediaWikiDump. See http://search.cpan.org/~triddle/Parse-MediaWikiDump-0.40/.
  67. Ralf Schenkel, Fabian Suchanek and Gjergji Kasneci, 2007. YAWN: A Semantically Annotated Wikipedia XML Corpus, in BTW2007.
  68. Péter Schönhofen, 2006. Identifying Document Topics Using the Wikipedia Category Network, in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 456-462, 2006. See http://amber.exp.sis.pitt.edu/gale/paper/Identifying%20document%20topics%20using%20the%20Wikipedia%20category%20network.ppt.
  69. Börkur Sigurbjörnsson, Jaap Kamps, and Maarten de Rijke, 2006. Focused Access to Wikipedia, in Proceedings DIR-2006.
  70. Suzette Kruger Stoutenburg, 2007. Research Proposal: Acquiring the Semantic Relationships of Links between Wikipedia Articles, class research proposal. See http://www.coloradostoutenburg.com/cs589%20stoutenburg%20project%20proposal%20v1.doc.
  71. Strube, M. and Ponzetto, S. P., 2006. WikiRelate! Computing Semantic Relatedness Using Wikipedia, in AAAI, pages 1419-1424, Boston, MA.
  72. Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum, 2007. Yago – A Core of Semantic Knowledge, in 16th International World Wide Web Conference (WWW 2007), Banff, Alberta. See http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/.
  73. Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum, 2007. Yago: A Large Ontology from Wikipedia and WordNet, Technical Report, submitted to the Elsevier Journal of Web Semantics, 67 pp. See http://www.mpi-inf.mpg.de/~suchanek/publications/yagotr.pdf.
  74. Fabian M. Suchanek, Georgiana Ifrim and Gerhard Weikum, 2006. Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents, in Knowledge Discovery and Data Mining (KDD 2006). See http://www.mpi-inf.mpg.de/~ifrim/publications/kdd2006.pdf.
  75. S. Suh, H. Halpin and E. Klein, 2006. Extracting Common Sense Knowledge from Wikipedia, in Proc. International Semantic Web Conference 2006 Workshop, Web Content Mining with Human Language Technologies, Athens, GA, 2006. See http://orestes.ii.uam.es/workshop/22.pdf.
  76. Zareen Syed, Tim Finin, and Anupam Joshi, 2008. Wikipedia as an Ontology for Describing Documents, from Proceedings of the Second International Conference on Weblogs and Social Media, AAAI, March 31, 2008. See http://ebiquity.umbc.edu/paper/html/id/383/Wikipedia-as-an-Ontology-for-Describing-Documents.

T – V

  77. J. A. Thom, J. Pehcevski and A. M. Vercoustre, 2007. Use of Wikipedia Categories in Entity Ranking, in Proceedings of the 12th Australasian Document Computing Symposium, Melbourne, Australia (2007). See http://arxiv.org/PS_cache/arxiv/pdf/0711/0711.2917v1.pdf.
  78. Antonio Toral and Rafael Muñoz, 2007. Towards a Named Entity Wordnet (NEWN), in Proceedings of the 6th International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets (Bulgaria), pp. 604-608, September 2007. See http://www.dlsi.ua.es/~atoral/publications/2007_ranlp_newn_poster.pdf.
  79. Antonio Toral and Rafael Muñoz, 2006. A Proposal to Automatically Build and Maintain Gazetteers for Named Entity Recognition by using Wikipedia, in Workshop on New Text, 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento (Italy), April 2006. See http://www.dlsi.ua.es/~atoral/publications/2006_eacl-newtext_wiki-ner_paper.pdf.
  80. Anne-Marie Vercoustre, Jovan Pehcevski and James A. Thom, 2007. Using Wikipedia Categories and Links in Entity Ranking, in Pre-proceedings of the Sixth International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2007), Dec 17, 2007. See http://hal.inria.fr/docs/00/19/24/89/PDF/inex07.pdf.
  81. Anne-Marie Vercoustre, James A. Thom and Jovan Pehcevski, 2008. Entity Ranking in Wikipedia, in SAC’08, March 16-20, 2008, Fortaleza, Ceara, Brazil. See http://arxiv.org/PS_cache/arxiv/pdf/0711/0711.3128v1.pdf.
  82. Max Völkel, Markus Krötzsch, Denny Vrandecic, Heiko Haller and Rudi Studer, 2006. Semantic Wikipedia, in Proceedings of WWW2006, pp. 585-594. See http://www.aifb.uni-karlsruhe.de/WBS/hha/papers/SemanticWikipedia.pdf.
  83. Jakob Voss, 2006. Collaborative Thesaurus Tagging the Wikipedia Way. ArXiv Computer Science e-prints, cs/0604036. See http://arxiv.org/abs/cs.IR/0604036.
  84. Denny Vrandecic, Markus Krötzsch and Max Völkel, 2007. Wikipedia and the Semantic Web, Part II, in Phoebe Ayers and Nicholas Boalch, Proceedings of Wikimania 2006 – The Second International Wikimedia Conference, Wikimedia Foundation, Cambridge, MA, USA, August 2007. See http://wikimania2006.wikimedia.org/wiki/Proceedings:DV1.

W – Z

  85. Y. Wang, H. Wang, H. Zhu and Y. Yu, 2007. Exploit Semantic Information for Category Annotation Recommendation in Wikipedia, in Natural Language Processing and Information Systems (2007), pp. 48-60.
  86. Gang Wang, Yong Yu and Haiping Zhu, 2007. PORE: Positive-Only Relation Extraction from Wikipedia Text, at ISWC 2007. See http://iswc2007.semanticweb.org/papers/ISWC2007_RT_Wang(1).pdf.
  87. Yotaro Watanabe, Masayuki Asahara and Yuji Matsumoto, 2007. A Graph-based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields, in EMNLP-CoNLL 2007, 29th June 2007, Prague, Czech Republic. Has a useful CRF overview. See http://www.aclweb.org/anthology-new/D/D07/D07-1068.pdf and http://cl.naist.jp/~yotaro-w/papers/2007/emnlp2007.ppt.
  88. Timothy Weale, 2006. Utilizing Wikipedia Categories for Document Classification.
  89. Nicolas Weber and Paul Buitelaar, 2006. Web-based Ontology Learning with ISOLDE. See http://orestes.ii.uam.es/workshop/4.pdf.
  90. Wikify!, online service to automatically turn selected keywords into Wikipedia-style links. See http://www.wikifyer.com/.
  91. Wikimedia Foundation, 2006. Wikipedia. See http://en.wikipedia.org/wiki/Wikipedia:Searching.
  92. Wikipedia, full database. See http://download.wikimedia.org.
  93. Wikipedia, general references. See especially (though these are only a small subset) the entries for the Semantic Web, RDF, RDF Schema, OWL, SPARQL, GRDDL, W3C and Linked Data; the many, many ontologies and controlled vocabularies such as FOAF, SKOS, SIOC and Dublin Core; and description logic areas (such as FOL).
  94. Wikipedia, (as a) research source; see http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_in_academic_studies.
  95. Fei Wu and Daniel S. Weld, 2007. Autonomously Semantifying Wikipedia.
  96. Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita and Giuseppe Attardi, 2007. Ranking Very Many Typed Entities on Wikipedia, in CIKM ’07: Proceedings of the Sixteenth ACM International Conference on Information and Knowledge Management. See http://grupoweb.upf.es/hugoz/pdf/zaragoza_CIKM07.pdf.
  97. Torsten Zesch, Iryna Gurevych, Max Mühlhäuser, 2007. Analyzing and Accessing Wikipedia as a Lexical Semantic Resource, and the longer technical report. See http://www.ukp.tu-darmstadt.de/software/JWPL.
  98. Torsten Zesch and Iryna Gurevych, 2007. Analysis of the Wikipedia Category Graph for NLP Applications, in Proceedings of the TextGraphs-2 Workshop (NAACL-HLT).
  99. Vinko Zlatic, Miran Bozicevic, Hrvoje Stefancic, and Mladen Domazet, 2006. Wikipedias: Collaborative Web-based Encyclopedias as Complex Networks, in Physical Review E, 74:016115.
Posted: January 28, 2008


AI3 Assembles 26 Candidate Tools

The pending UMBEL subject concept “backbone” ontology will involve literally thousands of concepts. In order to manage and view such a large structure, we mounted a concerted effort to find suitable graph visualization software. This post presents the candidate listing, as well as some useful starting resources and background information.

A subsequent post will present the surprise winner of our evaluation.

Starting Resources


Various Example Visualizations

For grins, you may also like to see various example visualizations, most with a large-graph bent:

Software Options

Here is the listing of 26 candidate graph visualization programs assembled to date:

  • Cytoscape – this tool, based on GINY and Piccolo (see below), is in active use by the bioinformatics community and highly recommended by Bio2RDF.org
  • GINY implements a very innovative system for sub-graphing and allows for stunning visuals. GINY is open source, provides a number of layout algorithms, and is designed to be a very intuitive API. Uses Piccolo
  • graphviz – graphviz is a set of graph drawing tools and libraries. It supports hierarchical and mass-spring drawings; although the tools are scalable, their emphasis is on making very good drawings of reasonably-sized graphs. Package components include batch layout filters and interactive editors
  • HyperGraph is an open source project that provides Java code to work with hyperbolic geometry and especially with hyperbolic trees. It provides a very extensible API to visualize hyperbolic geometry, to handle graphs and to layout hyperbolic trees
  • Hypertree is an open source project very similar to the HyperGraph project. As the name implies, Hypertree is restricted to hyperbolic trees
  • The InfoVis Toolkit – is an interactive graphics toolkit written in Java to ease the development of Information Visualization applications and components
  • IsaViz – IsaViz is a visual environment for browsing and authoring Resource Description Framework (RDF) models represented as graphs
  • IVC Software Framework – the InfoVis Cyberinfrastructure (IVC) software framework is a set of libraries that provide a simple and uniform programming-interface to algorithms and user-interface to end-users by leveraging the power of the Eclipse Rich Client Platform (RCP)

  • JGraph – according to its developers, this is the most powerful, easy-to-use, feature-rich and standards-compliant open source graph component available for Java. Many implementation options shown on this screenshots page, including possibly JGraphT
  • JUNG — the Java Universal Network/Graph Framework is a software library that provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network. It is written in Java, which allows JUNG-based applications to make use of the extensive built-in capabilities of the Java API, as well as those of other existing third-party Java libraries. RDF Gravity uses JUNG
  • LGL – LGL (Large Graph Library) is a compendium of applications for making the visualization of large networks and trees tractable. LGL was specifically motivated by the need to make the visualization and exploration of large biological networks more accessible
  • LibSea – LibSea is both a file format and a Java library for representing large directed graphs on disk and in memory. Scalability to graphs with as many as one million nodes has been the primary goal. Additional goals have been expressiveness, compactness, and support for application-specific conventions and policies
  • Mondrian – is a general purpose statistical data-visualization system written in Java. It features outstanding visualization techniques for categorical data, geographical data and large datasets
  • OpenDX – OpenDX is a uniquely powerful, full-featured software package for the visualization of scientific, engineering and analytical data. OpenDX is the open source software version of IBM’s Visualization Data Explorer Product. The last release of Data Explorer from IBM was 3.1.4B; the open source version is based on this version
  • Otter – Otter is a historical CAIDA tool used for visualizing arbitrary network data that can be expressed as a set of nodes, links or paths. Otter was developed to handle visualization tasks for a wide variety of Internet data, including data sets on topology, workload, performance, and routing. Otter is in maintenance rather than development mode
  • Pajek – Pajek (the Slovene word for spider) is a Windows program for the analysis and visualization of large networks. It is freely available for noncommercial use, and has been called by others the “best available”. See also the PDF reference manual for Pajek
  • Piccolo – is a toolkit that supports the development of 2D structured graphics programs, in general, and Zoomable User Interfaces (ZUIs), in particular. It is used to develop full-featured graphical applications in Java and C#, with visual effects such as zooming, animation and multiple representations
  • Prefuse – Prefuse is a user interface toolkit for building highly interactive visualizations of structured and unstructured data. This includes any form of data that can be represented as a set of entities (or nodes) possibly connected by any number of relations (or edges). Examples of data supported by prefuse include hierarchies (organization charts, taxonomies, file systems), networks (computer networks, social networks, web site linkage) and even non-connected collections of data (timelines, scatterplots). See also Jeff Heer, the author of Prefuse (http://jheer.org/)
  • RDF Gravity – RDF Gravity is a tool for visualising RDF/OWL Graphs/ ontologies. It is implemented by using the JUNG Graph API and Jena semantic web toolkit
  • SemaSpace – is a fast and easy to use graph editor for large knowledge networks
  • TouchGraph is a set of interfaces for Graph Visualization using spring-layout and focus+context techniques. Current applications include a utility for organizing links, a visual Wiki Browser, and a Google Graph Browser which uses the Google API; see also the commercial site at http://www.touchgraph.com/
  • TreeMap is a tool for treemap visualisation
  • Tulip – Tulip is a software system dedicated to the visualization of huge graphs. It manages graphs with up to 1 M elements (node and edges) on a personal computer. Its SuperGraph technology architecture provides the following features: 3D visualizations, 3D modifications, plugin support, support for clusters and navigation, automatic graph drawing, automatic clustering of graphs, automatic selection of elements, and automatic coloring of elements according to a metric
  • Visual Browser is a Java application that can visualise the data in RDF schemes
  • Walrus – Walrus is a tool for interactively visualizing large directed graphs in three-dimensional space. By employing a fisheye-like distortion, it provides a display that simultaneously shows local detail and the global context. It is technically possible to display graphs containing a million nodes or more, but visual clutter, occlusion, and other factors can diminish the effectiveness of Walrus as the number of nodes, or the degree of their connectivity, increases. Thus, in practice, Walrus is best suited to visualizing moderately sized graphs that are nearly trees. A graph with a few hundred thousand nodes and only a slightly greater number of links is likely the best target size
  • xSiteable is a complete small-to-medium-size site development kit created in XSLT, with a PHP administration package.
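Whichever tool wins out, most of these packages ingest simple edge-list interchange formats, so getting a subject structure into them is easy. For instance, Cytoscape reads a plain-text SIF file of source, relation, target triples. Here is a minimal sketch; the edge data is an illustrative toy, not actual UMBEL content:

```python
# Write an edge list to Cytoscape's simple interaction format (SIF):
# one "source<TAB>relation<TAB>target" triple per line.
edges = [
    ("Saab", "type", "Automobile"),  # toy placeholder data
    ("Automobile", "subClassOf", "WheeledVehicle"),
    ("WheeledVehicle", "subClassOf", "TransportationDevice"),
]

with open("backbone.sif", "w", encoding="utf-8") as f:
    for source, relation, target in edges:
        f.write(f"{source}\t{relation}\t{target}\n")
```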