Posted: February 18, 2008


Most Comprehensive Reference List Available Shows Impressive Depth, Breadth

Since about 2005 — and at an accelerating pace — Wikipedia has emerged as the leading online knowledge base for conducting semantic Web and related research. The system is being tapped for both data and structure. Wikipedia has arguably replaced WordNet as the leading lexicon for concepts and relations. Because of its scope and popularity, many argue that Wikipedia is emerging as the de facto structure for classifying and organizing knowledge in the 21st century.

Since its announcement in July 2007, our work on UMBEL, a lightweight reference structure of subject concepts, has identified Wikipedia as a key intended resource for identifying subject concepts and entities. For the past few months I have been scouring the globe, attempting to find every drop of research I could on the use of Wikipedia for the semantic Web, information extraction, categorization and related issues.

Thus, I’m pleased to offer up herein the most comprehensive such listing available anywhere: more than 99 resources and counting! (I say “more than” because some entries below have multiple resources; I just liked the sound of 99 as a round number!)

Wikipedia itself maintains a listing of academic studies using Wikipedia as a resource; fewer than one-third of the listings below are on that list (which itself may be an indication of the current state of completeness within Wikipedia). Some bloggers and other sources around the Web also maintain listings in lesser degrees of completeness.

The tremendous growth of content and topics within Wikipedia is well documented (see, as examples, the W1, W2, W3, W4, W5, W6 and W7 internal Wikipedia sources for the gory details); as of early 2008 there are about 2.25 million articles in English, with versions in 256 languages and variants.

Download access to the full knowledge base has enabled the development of notable core references to the Linked Data aspects of the semantic Web such as DBpedia [5,6] and YAGO [72,73]. Entire research teams, such as Ponzetto and Strube [61-65] (and others as well; see below), are moving toward creating full-blown ontologies or structured knowledge bases useful for semantic Web purposes based on Wikipedia. So, one of the first and principal uses of Wikipedia to date has been as a data source of concepts, entities and relations.

But much broader data mining and text mining and analysis is also being conducted against Wikipedia, work that is currently defining the state of the art in these areas, too:

  • Ontology development and categorization
  • Word sense disambiguation
  • Named entity recognition
  • Named entity disambiguation
  • Semantic relatedness and relations.

These objectives, in turn, mine and extract the following kinds of structure within Wikipedia, for the purposes noted (see the sketch below the list):

  • Articles
    • First paragraph — Definitions
    • Full text — Description of meaning; related terms; translations
    • Redirects — Synonymy; spelling variations, misspellings; abbreviations
    • Title — Named entities; domain specific terms or senses
    • Subject — Category suggestion (phrase marked in bold or in first paragraph)
    • Section heading — Category suggestions
  • Article links
    • Context — Related terms; co-occurrences
    • Label — Synonyms; spelling variations; related terms
    • Target — Link graph; related terms
    • LinksTo — Category suggestion
    • LinkedBy — Category suggestion
  • Categories
    • Category — Category suggestion
    • Contained articles — Semantically related terms (siblings)
    • Hierarchy — Hyponymic and meronymic relations between terms
  • Disambiguation pages
    • Article links — Sense inventory
  • Infobox Templates
    • Name —
    • Item — Category suggestion; entity suggestion
  • Lists
    • Hyponyms

These are some of the specific uses that are included in the 99 resources listed below.
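To make this structure mining concrete, here is a minimal, hand-rolled sketch (not drawn from any of the papers below) that harvests two of the structure types above from a Wikipedia XML dump: redirects (synonymy and spelling variants) and category links (category suggestions). The dump path and the export namespace version are assumptions; adjust them for your snapshot.

```python
# Hedged sketch: mine redirects and category links from a Wikipedia XML dump.
import re
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.3/}"  # assumed; check your dump's version
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)", re.IGNORECASE)

def mine_dump(path):
    redirects, categories = {}, {}
    for _, elem in ET.iterparse(path):
        if elem.tag != NS + "page":
            continue
        title = elem.findtext(NS + "title")
        redirect = elem.find(NS + "redirect")
        if redirect is not None:
            # Redirect source -> target: a synonym or spelling variation
            redirects[title] = redirect.get("title")
        else:
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            # [[Category:...]] links: category suggestions for the article
            categories[title] = CATEGORY_RE.findall(text)
        elem.clear()  # keep memory bounded on multi-gigabyte dumps
    return redirects, categories

redirects, categories = mine_dump("enwiki-pages-articles.xml")  # illustrative path
```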

This is an exciting (and, for most all of us just a few years back, unanticipated) use of the Web in socially relevant and contextual knowledge and research. I’m sure such a listing one year from now will be double in size or larger!

BTW, suggestions for new or overlooked entries are very much welcomed! 🙂

A – B

  1. Sisay Fissaha Adafre and Maarten de Rijke, 2006. Finding Similar Sentences across Multiple Languages in Wikipedia, in EACL 2006 Workshop on New Text–Wikis and Blogs and Other Dynamic Text Sources, April 2006. See http://www.science.uva.nl/~mdr/Publications/Files/eacl2006-similarsentences.pdf.
  2. Sisay Fissaha Adafre, V. Jijkoun and M. de Rijke. Fact Discovery in Wikipedia, in 2007 IEEE/WIC/ACM International Conference on Web Intelligence. See http://staff.science.uva.nl/~mdr/Publications/Files/wi2007.pdf.
  3. Sisay Fissaha Adafre and Maarten de Rijke, 2005. Discovering Missing Links in Wikipedia, in LinkKDD 2005, August 21, 2005, Chicago, IL. See http://data.isi.edu/conferences/linkkdd-05/Download/Papers/linkkdd05-13.pdf.
  4. David Ahn, Valentin Jijkoun, Gilad Mishne, Karin Müller, Maarten de Rijke, and Stefan Schlobach. 2004. Using Wikipedia at the TREC QA Track, in Proceedings of TREC 2004.
  5. Sören Auer, Chris Bizer, Jens Lehmann, Georgi Kobilarov, Richard Cyganiak and Zachary Ives, 2007. DBpedia: A nucleus for a web of open data, in Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC2007), Busan, South Korea, volume 4825 of LNCS, pages 715–728, November 2007. See http://iswc2007.semanticweb.org/papers/ISWC2007_IU_Auer.pdf.
  6. Sören Auer and Jens Lehmann, 2007. What Have Innsbruck and Leipzig in Common? Extracting Semantics from Wiki Content, in The Semantic Web: Research and Applications, pages 503-517, 2007. See http://www.eswc2007.org/pdf/eswc07-auer.pdf.
  7. Somnath Banerjee, 2007. Boosting Inductive Transfer for Text Classification Using Wikipedia, at the Sixth International Conference on Machine Learning and Applications (ICMLA). See http://portal.acm.org/citation.cfm?id=1336953.1337115.
  8. Somnath Banerjee, Krishnan Ramanathan, Ajay Gupta, 2007. Clustering Short Texts using Wikipedia, poster presented at Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, pp. 787-788.
  9. F. Bellomi and R. Bonato, 2005. Lexical Authorities in an Encyclopedic Corpus: A Case Study with Wikipedia, online reference not found.
  10. F. Bellomi and R. Bonato, 2005. Network Analysis for Wikipedia, presented at Wikimania 2005; see http://www.fran.it/articles/wikimania_bellomi_bonato.pdf
  11. Bibauw (2005) analyzed the lexicographical structure of Wiktionary (in French).
  12. Abhijit Bhole, Blaž Fortuna, Marko Grobelnik and Dunja Mladenić, 2007. Mining Wikipedia and Relating Named Entities over Time, see http://www.cse.iitb.ac.in/~abhijit.bhole/SiKDD2007ExtractingWikipedia.pdf.
  13. Razvan Bunescu and Marius Pasca. 2006. Using Encyclopedic Knowledge for Named Entity Disambiguation, in Proceedings of the 11th Conference of the EACL, pages 9-16, Trento, Italy.
  14. Razvan Bunescu, 2007. Learning for Information Extraction: From Named Entity Recognition and Disambiguation To Relation Extraction, Ph.D. thesis for the University of Texas, August 2007, 168 pp. See http://oucsace.cs.ohiou.edu/~razvan/papers/thesis-white.pdf.
  15. Luciana Buriol, Carlos Castillo, Debora Donato, Stefano Leonardi, and Stefano Millozzi. 2006. Temporal Analysis of the Wikigraph, in Proceedings of Web Intelligence, Hong Kong.
  16. Davide Buscaldi and Paolo Rosso, 2007. A Comparison of Methods for the Automatic Identification of Locations in Wikipedia, in Proceedings of GIR’07, November 9, 2007, Lisbon, Portugal, pp 89-91. See http://www.dsic.upv.es/~prosso/resources/BuscaldiRosso_GIR07.pdf.

C – F

  1. Ruiz-Casado, M., Alfonseca, E., and Castells, P. 2005. Automatic Assignment of Wikipedia Encyclopedic Entries to WordNet Synsets, in AWIC, pages 380-386. See http://nets.ii.uam.es/publications/nlp/awic05.pdf.
  2. Maria Ruiz-Casado, Enrique Alfonseca and Pablo Castells, 2006. From Wikipedia to Semantic Relationships: a Semi-automated Annotation Approach, in ESWC2006.
  3. Maria Ruiz-Casado, Enrique Alfonseca and Pablo Castells, 2007. Automatising the Learning of Lexical Patterns: an Application to the Enrichment of WordNet by Extracting Semantic Relationships from Wikipedia. See http://nets.ii.uam.es/publications/nlp/dke07.pdf.
  4. Sergey Chernov, Tereza Iofciu, Wolfgang Nejdl, and Xuan Zhou, 2006. Extracting Semantic Relationships between Wikipedia Categories, from SEMWIKI 2006.
  5. Aron Culotta, Andrew McCallum and Jonathan Betz, 2006. Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text, in Proceedings of HLT-NAACL-2006. See http://www.cs.umass.edu/~culotta/pubs/culotta06integrating.pdf
  6. Silviu Cucerzan, 2007. Large-Scale Named Entity Disambiguation Based on Wikipedia Data, in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). See http://www.aclweb.org/anthology-new/D/D07/D07-1074.pdf.
  7. Cyclopedia, from the Cyc Foundation, an online version of Wikipedia that enables browsing the encyclopedia by concepts. See http://www.cycfoundation.org/blog/?page_id=15.
  8. Wisam Dakka and Silviu Cucerzan, 2008. Augmenting Wikipedia with Named Entity Tags, to be published in IJCNLP. See http://research.microsoft.com/users/silviu/Papers/np-ijcnlp08.pdf.
  9. Wisam Dakka and Silviu Cucerzan, 2008. Also, see the online tool http://wikinet.stern.nyu.edu/ (not yet implemented).
  10. Turdakov Denis, 2007. Recommender System Based on User-generated Content. See http://syrcodis.citforum.ru/2007/9.pdf.
  11. EachWiki, online system from Fu et al.
  12. Linyun Fu, Haofen Wang, Haiping Zhu, Huajie Zhang, Yang Wang and Yong Yu, 2007. Making More Wikipedians: Facilitating Semantics Reuse for Wikipedia Authoring, at ISWC 2007. See http://iswc2007.semanticweb.org/papers/ISWC2007_RT_Fu.pdf.

G – H

  1. Evgeniy Gabrilovich and Shaul Markovitch. 2006. Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge, in AAAI, pages 1301-1306, Boston, MA.
  2. Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007.
  3. Evgeniy Gabrilovich, 2006. Feature Generation for Textual Information Using World Knowledge, Ph.D. Thesis for The Technion – Israel Institute of Technology, Haifa, Israel, December 2006, 218 pp. See http://www.cs.technion.ac.il/~gabr/papers/phd-thesis.pdf. (There is also an informative video at http://www.researchchannel.org/prog/displayevent.aspx?rID=4915).)
  4. See also Gabrilovich’s Perl Wikipedia tool, WikiPrep.
  5. Rudiger Gleim, Alexander Mehler and Matthias Dehmer, 2007. Web Corpus Mining by Instance of Wikipedia, in Proc. 2nd Web as Corpus Workshop at EACL 2006. See http://acl.ldc.upenn.edu/W/W06/W06-1710.pdf.
  6. Andrew Gregorowicz and Mark A. Kramer, 2006. Mining a Large-Scale Term-Concept Network from Wikipedia, Mitre Technical Report, October 2006. See http://www.mitre.org/work/tech_papers/tech_papers_06/06_1028/06_1028.pdf.
  7. Andreas Harth, Hannes Gassert, Ina O’Murchu, John Breslin and Stefan Decker, 2005. WikiOnt: An Ontology for Describing and Exchanging Wiki Articles, presented at Wikimania, Frankfurt, 5th August 2005. See http://sw.deri.org/~jbreslin/presentations/20050805a.pdf.
  8. Martin Hepp, Daniel Bachlechner and Katharina Siorpaes, 2006. Harvesting Wiki Consensus – Using Wikipedia Entries as Ontology Elements, See http://www.heppnetz.de/files/SemWiki2006-Harvesting%20Wiki%20Consensus-LNCS-final.pdf.
  9. Martin Hepp, Katharina Siorpaes, Daniel Bachlechner, 2007. Harvesting Wiki Consensus: Using Wikipedia Entries as Vocabulary for Knowledge Management, in IEEE Internet Computing, Vol. 11, No. 5, pp. 54-65, Sept-Oct 2007. See http://www.heppnetz.de/files/hepp-siorpaes-bachlechner-harvesting%20wikipedia%20w5054.pdf. Also, sample data at http://www.heppnetz.de/harvesting-wikipedia/.
  10. A. Herbelot and Ann Copestake, 2006. Acquiring Ontological Relationships from Wikipedia Using RMRS, in Proc. International Semantic Web Conference 2006 Workshop, Web Content Mining with Human Language Technologies, Athens, GA, 2006. See http://orestes.ii.uam.es/workshop/12.pdf.
  11. Ryuichiro Higashinaka, Kohji Dohsaka and Hideki Isozaki, 2007. Learning to Rank Definitions to Generate Quizzes for Interactive Information Presentation, in Companion Volume to the Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics; see pages 117-120 within http://acl.ldc.upenn.edu/P/P07/P07-2.pdf
  12. Todd Holloway, Miran Bozicevic, and Katy Börner. 2005. Analyzing and Visualizing the Semantic Coverage of Wikipedia and Its Authors. ArXiv Computer Science e-prints, cs/0512085.
  13. Wei Che Huang, Andrew Trotman, and Shlomo Geva, 2007. Collaborative Knowledge Management: Evaluation of Automated Link Discovery in the Wikipedia, in SIGIR 2007 Workshop on Focused Retrieval, July 27, 2007, Amsterdam, The Netherlands. See http://www.cs.otago.ac.nz/sigirfocus/paper_15.pdf.

I – K

  1. Jonathan Isbell and Mark H. Butler, 2007. Extracting and Re-using Structured Data from Wikis, Hewlett-Packard Technical Report HPL-2007-182, 14th November, 2007, 22 pp. See http://www.hpl.hp.com/techreports/2007/HPL-2007-182.pdf.
  2. Maciej Janik and Krys Kochut, 2007. Wikipedia in Action: Ontological Knowledge in Text Categorization, University of Georgia, Computer Science Department Technical Report no. UGA-CS-TR-07-001. See http://lsdis.cs.uga.edu/~mjanik/UGA-CS-TR-07-001.pdf.
  3. Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath and Gerhard Weikum, 2007. NAGA: Searching and Ranking Knowledge, Technical Report, Max-Planck-Institut für Informatik, MPI–I–2007–5–001, March 2007, 42 pp. See http://www.mpi-inf.mpg.de/~kasneci/naga/report.pdf.
  4. Jun’ichi Kazama and Kentaro Torisawa, 2007. Exploiting Wikipedia as External Knowledge for Named Entity Recognition, in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 698–707, Prague, June 2007. See http://acl.ldc.upenn.edu/D/D07/D07-1073.pdf.
  5. Daniel Kinzler, 2005. WikiSense Mining the Wiki, V 1.1, presented at Wikimania 2005, 10 pp. See http://brightbyte.de/repos/papers/2005/WikiSense-Presentation.pdf.
  6. A. Krizhanovsky, 2006. Synonym Search in Wikipedia: Synarcher, in 11th International Conference “Speech and Computer” SPECOM’2006. Russia, St. Petersburg, June 25-29, 2006, pp. 474-477. See http://arxiv.org/abs/cs/0606097 (also PDF).
  7. Natalia Kozlova, 2005. Automatic Ontology Extraction for Document Classification, a Master’s Thesis for Saarland University, February 2005, 90 pp. See http://domino.mpi-inf.mpg.de/imprs/imprspubl.nsf/80255f02006a559a80255ef20056fc02/3b864d86612739b0c1256fb70042547a/$FILE/Masterarbeit-Kozlova-Nat-2005.pdf.
  8. M. Krötzsch, D. Vrandečić and M. Völkel, 2005. Wikipedia and the Semantic Web: the Missing Links, at Wikimania 2005. See http://citeseer.ist.psu.edu/cache/papers/cs2/143/http:zSzzSzwww.aifb.uni-karlsruhe.dezSzWBSzSzmakzSzpubzSzwikimania.pdf/krotzsch05wikipedia.pdf.

L – N

  1. Rada Mihalcea and Andras Csomai, 2007. Wikify! Linking Documents to Encyclopedic Knowledge, in Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM ’07), November 6-8, 2007, pp. 233-241. See http://portal.acm.org/citation.cfm?id=1321475.
  2. Rada Mihalcea, 2007. Using Wikipedia for Automatic Word Sense Disambiguation, in Proceedings of NAACL HLT 2007, pages 196–203, April 2007. See http://www.cs.unt.edu/~rada/papers/mihalcea.naacl07.pdf.
  3. David Milne, 2007. Computing Semantic Relatedness using Wikipedia Link Structure; see also the Wikipedia Miner Toolkit (http://sourceforge.net/projects/wikipedia-miner/) provided by the author
  4. D. Milne, O. Medelyan and I. H. Witten, 2006. Mining Domain-Specific Thesauri from Wikipedia: A Case Study, in Proceedings of the International Conference on Web Intelligence (IEEE/WIC/ACM WI’2006), Hong Kong. Also Olena Medelyan’s home page, http://www.cs.waikato.ac.nz/~olena/.
  5. David Milne, Ian H. Witten and David M. Nichols, 2007. A Knowledge-Based Search Engine Powered by Wikipedia, at CIKM ’07.
  6. Lev Muchnik, Royi Itzhack, Sorin Solomon and Yoram Louzoun, 2007. Self-emergence of Knowledge Trees: Extraction of the Wikipedia Hierarchies, in Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), Vol. 76, No. 1.
  7. Nadeau, D., Turney, P., Matwin, S., 2006. Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity, at the 19th Canadian Conference on Artificial Intelligence, Québec City, Québec, Canada, June 7, 2006. See http://www.iit-iti.nrc-cnrc.gc.ca/iit-publications-iti/docs/NRC-48727.pdf. Doesn’t specifically use Wikipedia, but its techniques are applicable.
  8. Kotaro Nakayama, Takahiro Hara and Shojiro Nishio, 2007. Wikipedia Mining for an Association Web Thesaurus Construction, in Web Information Systems Engineering – WISE 2007, Vol. 4831 (2007), pp. 322-334.
  9. Dat P.T. Nguyen, Yutaka Matsuo and Mitsuru Ishizuka, 2007. Relation Extraction from Wikipedia Using Subtree Mining, from AAAI ’07.

O – P

  1. Yann Ollivier and Pierre Senellart, 2007. Finding Related Pages Using Green Measures: An Illustration with Wikipedia, in Proceedings of the AAAI-07 Conference. See http://pierre.senellart.com/publications/ollivier2006finding.pdf. See also http://pierre.senellart.com/publications/ollivier2006finding/ for tools and data.
  2. Simon Overell and Stefan Ruger, 2006. Identifying and Grounding Descriptions of Places, in SIGIR Workshop on Geographic Information Retrieval, pages 14–16, 2006. See http://mmis.doc.ic.ac.uk/www-pub/sigir06-GIR.pdf.
  3. Simone Paolo Ponzetto and Michael Strube, 2006. Exploiting Semantic Role Labeling, WordNet and Wikipedia for Coreference Resolution, in NAACL 2006. See http://www.eml-research.de/english/homes/strube/papers/naacl06.pdf.
  4. Simone Paolo Ponzetto and Michael Strube, 2007a. Deriving a Large Scale Taxonomy from Wikipedia, in Association for the Advancement of Artificial Intelligence (AAAI2007).
  5. Simone Paolo Ponzetto and Michael Strube, 2007b. Knowledge Derived From Wikipedia For Computing Semantic Relatedness, in Journal of Artificial Intelligence Research 30 (2007) 181-212. See also these PPT slides (in PDF format): Part I, Part II, and References.
  6. Simone Paolo Ponzetto and Michael Strube, 2007c. An API for Measuring the Relatedness of Words in Wikipedia, in Companion Volume to the Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics; see pages 49-52 within http://acl.ldc.upenn.edu/P/P07/P07-2.pdf.
  7. Simone Paolo Ponzetto, 2007. Creating a Knowledge Base from a Collaboratively Generated Encyclopedia, in Proceedings of the NAACL-HLT 2007 Doctoral Consortium, pp 9-12, Rochester, NY, April 2007. See http://www.aclweb.org/anthology-new/N/N07/N07-3003.pdf.

R – S

  1. Tyler Riddle, 2006. Parse::MediaWikiDump. URL http://search.cpan.org/~triddle/Parse-MediaWikiDump-0.40/.
  2. Ralf Schenkel, Fabian Suchanek and Gjergji Kasneci, 2007. YAWN: A Semantically Annotated Wikipedia XML Corpus, in BTW2007.
  3. Péter Schönhofen, 2006. Identifying Document Topics Using the Wikipedia Category Network, in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 456-462, 2006. See http://amber.exp.sis.pitt.edu/gale/paper/Identifying%20document%20topics%20using%20the%20Wikipedia%20category%20network.ppt.
  4. Börkur Sigurbjörnsson, Jaap Kamps, and Maarten de Rijke. 2006. Focused Access to Wikipedia. In Proceedings DIR-2006.
  5. Suzette Kruger Stoutenburg, 2007. Research Proposal: Acquiring the Semantic Relationships of Links between Wikipedia Articles, class proposal; see http://www.coloradostoutenburg.com/cs589%20stoutenburg%20project%20proposal%20v1.doc.
  6. Strube, M. and Ponzetto, S. P. 2006. WikiRelate! Computing Semantic Relatedness Using Wikipedia, in AAAI, pages 1419-1424, Boston, MA.
  7. Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum, 2007. Yago – A Core of Semantic Knowledge, in the 16th International World Wide Web Conference (WWW 2007), Banff, Alberta. See http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/.
  8. Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum, 2007. Yago: A Large Ontology from Wikipedia and WordNet, Technical Report, submitted to the Elsevier Journal of Web Semantics, 67 pp. See http://www.mpi-inf.mpg.de/~suchanek/publications/yagotr.pdf.
  9. Fabian M. Suchanek, Georgiana Ifrim and Gerhard Weikum, 2006. Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents, in Knowledge Discovery and Data Mining (KDD 2006). See http://www.mpi-inf.mpg.de/~ifrim/publications/kdd2006.pdf.
  10. S. Suh, H. Halpin and E. Klein, 2006. Extracting Common Sense Knowledge from Wikipedia, in Proc. International Semantic Web Conference 2006 Workshop, Web Content Mining with Human Language Technologies, Athens, GA, 2006. See http://orestes.ii.uam.es/workshop/22.pdf.
  11. Zareen Syed, Tim Finin, and Anupam Joshi, 2008. Wikipedia as an Ontology for Describing Documents, from Proceedings of the Second International Conference on Weblogs and Social Media, AAAI, March 31, 2008. See http://ebiquity.umbc.edu/paper/html/id/383/Wikipedia-as-an-Ontology-for-Describing-Documents.

T – V

  1. J. A. Thom, J. Pehcevski and A. M. Vercoustre, 2007. Use of Wikipedia Categories in Entity Ranking, in Proceedings of the 12th Australasian Document Computing Symposium, Melbourne, Australia (2007). See http://arxiv.org/PS_cache/arxiv/pdf/0711/0711.2917v1.pdf.
  2. Antonio Toral and Rafael Muñoz, 2007. Towards a Named Entity Wordnet (NEWN), in Proceedings of the 6th International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets (Bulgaria), pp. 604-608, September 2007. See http://www.dlsi.ua.es/~atoral/publications/2007_ranlp_newn_poster.pdf.
  3. Antonio Toral and Rafael Muñoz, 2006. A Proposal to Automatically Build and Maintain Gazetteers for Named Entity Recognition by using Wikipedia, in Workshop on New Text, 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento (Italy), April 2006. See http://www.dlsi.ua.es/~atoral/publications/2006_eacl-newtext_wiki-ner_paper.pdf.
  4. Anne-Marie Vercoustre, Jovan Pehcevski and James A. Thom, 2007. Using Wikipedia Categories and Links in Entity Ranking, in Pre-proceedings of the Sixth International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2007), Dec 17, 2007. See http://hal.inria.fr/docs/00/19/24/89/PDF/inex07.pdf.
  5. Anne-Marie Vercoustre, James A. Thom and Jovan Pehcevski, 2008. Entity Ranking in Wikipedia, in SAC’08 March 16-20, 2008, Fortaleza, Ceara, Brazil. See http://arxiv.org/PS_cache/arxiv/pdf/0711/0711.3128v1.pdf.
  6. Max Völkel, Markus Krötzsch, Denny Vrandecic, Heiko Haller and Rudi Studer, 2006. Semantic Wikipedia, in Proceedings of WWW2006, pp. 585-594. See http://www.aifb.uni-karlsruhe.de/WBS/hha/papers/SemanticWikipedia.pdf.
  7. Jakob Voss. 2006. Collaborative Thesaurus Tagging the Wikipedia Way. ArXiv Computer Science e-prints, cs/0604036. See http://arxiv.org/abs/cs.IR/0604036
  8. Denny Vrandecic, Markus Krötzsch and Max Völkel, 2007. Wikipedia and the Semantic Web, Part II, in Phoebe Ayers and Nicholas Boalch, Proceedings of Wikimania 2006 – The Second International Wikimedia Conference, Wikimedia Foundation, Cambridge, MA, USA, August 2007. See http://wikimania2006.wikimedia.org/wiki/Proceedings:DV1.

W – Z

  1. Wang Y , Wang H , Zhu H , Yu Y, 2007. Exploit Semantic Information for Category Annotation Recommendation in Wikipedia, in Natural Language Processing and Information Systems (2007), pp. 48-60.
  2. Gang Wang, Yong Yu and Haiping Zhu, 2007. PORE: Positive-Only Relation Extraction from Wikipedia Text, at ISWC 2007. See http://iswc2007.semanticweb.org/papers/ISWC2007_RT_Wang(1).pdf.
  3. Yotaro Watanabe, Masayuki Asahara and Yuji Matsumoto, 2007. A Graph-based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields, in EMNLP-CoNLL 2007, 29th June 2007, Prague, Czech Republic. Has a useful CRF overview. See http://www.aclweb.org/anthology-new/D/D07/D07-1068.pdf and http://cl.naist.jp/~yotaro-w/papers/2007/emnlp2007.ppt.
  4. Timothy Weale, 2006. Utilizing Wikipedia Categories for Document Classification.
  5. Nicolas Weber and Paul Buitelaar, 2006. Web-based Ontology Learning with ISOLDE. See http://orestes.ii.uam.es/workshop/4.pdf.
  6. Wikify!, online service to automatically turn selected keywords into Wikipedia-style links. See http://www.wikifyer.com/.
  7. Wikimedia Foundation. 2006. Wikipedia. URL http://en.wikipedia.org/wiki/Wikipedia:Searching.
  8. Wikipedia, full database. See http://download.wikimedia.org.
  9. Wikipedia, general references. See especially (though this is only a small subset) the entries for the Semantic Web, RDF, RDF Schema, OWL, SPARQL, GRDDL, W3C, Linked Data, the many ontologies and controlled vocabularies such as FOAF, SKOS, SIOC and Dublin Core, and description logic areas (such as FOL).
  10. Wikipedia, (as a) research source; see http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_in_academic_studies.
  11. Fei Wu and Daniel S. Weld, 2007. Autonomously Semantifying Wikipedia.
  12. Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita & Giuseppe Attardi, 2007. Ranking Very Many Typed Entities on Wikipedia, in CIKM ’07: Proceedings of the Sixteenth ACM International Conference on Information and Knowledge Management. See http://grupoweb.upf.es/hugoz/pdf/zaragoza_CIKM07.pdf.
  13. Torsten Zesch, Iryna Gurevych, Max Mühlhäuser, 2007. Analyzing and Accessing Wikipedia as a Lexical Semantic Resource, and the longer technical report. See http://www.ukp.tu-darmstadt.de/software/JWPL.
  14. Torsten Zesch and Iryna Gurevych, 2007. Analysis of the Wikipedia Category Graph for NLP Applications, in Proceedings of the TextGraphs-2 Workshop (NAACL-HLT).
  15. Vinko Zlatic, Miran Bozicevic, Hrvoje Stefancic, and Mladen Domazet. 2006. Wikipedias: Collaborative Web-based Encyclopedias as Complex Networks, in Physical Review E, 74:016115.
Posted: February 3, 2008

LinkedData Planet and LDOW Set the Pace for 2008

Linked Data follows recommended practices for identifying, exposing and connecting data on the semantic Web. A robust Linking Open Data (LOD) community has rapidly developed around the practice since its approval as a formal project of the W3C’s Semantic Web Education and Outreach (SWEO) Interest Group in March 2007. Though counts rapidly become dated, in less than a year the amount of Linked Data on the Web has grown to several billion RDF triples.

This foundation of interlinkable data comes from the highest-value reference sources available, and includes many of the most notable place, people, event, book, music, cultural, language and government entities. The following official figure of the LOD community, maintained by one of its founders, Richard Cyganiak, is updated frequently (click on the figure below to get the most recent interactive version), and shows well the breadth of this data value:

Linked Data Web

It would be putting it mildly to say that the LOD project has been a roaring success. New initiatives like the Billion Triple Challenge will continue to rapidly push forward the size and frontiers of Linked Data. We also have two signal events coming up in 2008 that demonstrate just how much Linked Data is coming of age.

LinkedData Planet Conference


The newly announced LinkedData Planet Conference and Expo being held in New York City on June 17th and 18th is notable for a number of reasons (besides Tim Berners-Lee being the special keynote speaker).

First, the conference represents the first direct exposure of Linked Data to the business and enterprise community. For years, the semantic Web community has largely been an academic one with its own set of meetings and venues. That began to change with the recent series of Semantic Technology Conferences, held as usual in San Jose in May (this year’s is May 18-22). The 2007 meeting drew more than 800 attendees and marked the first time that a significant presence from the academic and research communities occurred.

Though valuable and chock-a-block with enterprise case studies, the Semantic Technology conference is also challenged by the amorphous understanding of what the “semantic Web” is. Reaching common understandings and getting cross-fertilization between the business and research communities can be a challenge.

Second, the LinkedData Planet Conference is occurring in NYC with an anticipated strong participation from East Coast financial interests. Like the Semantic Technology Conferences, it is important that the nascent technologies and applications supporting Linked Data receive the venture and funding attention they deserve.

Third, Jupitermedia is the event manager for the conference. Jupitermedia has a long history in producing quality industry events such as Internet World, Search Engine Strategies and ISPcon. Its meetings typically blend excellent content with strong community outreach and participation.

But, last, and to my mind most important, the very topic of Linked Data is focused and pragmatic. There are real methods, real techniques and real applications available now to take advantage of Linked Data. The business community need not wait on full semantics and total data exposure and automation in order to receive real value today.

Last July I wrote a piece entitled, More Structure, More Terminology and (hopefully) More Clarity. It, and related posts on the structured Web, had as its thesis that the Web was naturally evolving from a document-centric basis to a “Web of Data”. We already have much structured data available and the means through RDFizers and other techniques to convert that structure to Linked Data. Linked Data thus represented a very doable and pragmatic way station on the road to the semantic Web. It is a journey we can take today; indeed, many already are as the growth figures noted attest.

Here is a repeat of the diagram I used last July to make that argument (now highlighting the Linked Data phase in red):

Transition in Web Structure

Document Web (circa 1993)

  • Document-centric
  • Document resources
  • Unstructured data and semi-structured data
  • HTML
  • URL-centric

Structured Web (circa 2003)

  • Data-centric
  • Structured data
  • Semi-structured data and structured data
  • XML, JSON, RDF, etc.
  • URI-centric

Linked Data (circa 2007)

  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S
  • URI-centric

Semantic Web (circa ???)

  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S, OWL
  • URI-centric

Hopefully, the LinkedData Planet Conference will act as a similar catalyst within the business community as the LOD project has been in the research one. And, hopefully, academia, research, venture and business interests can all come together over these two days to exploit the Linked Data value so readily at hand.

LDOW at WWW2008

Another signal event coming up is the Linked Data on the Web (LDOW) workshop at the 17th International World Wide Web Conference (WWW2008) in Beijing. LDOW is a full-day session involving a mix of papers and demos on April 22. A significant roster of very interesting submissions has been made (disclosure: I am both on the program committee and a submitting author).

Linked Data was arguably the highlight at WWW2007 in Banff, and LDOW will certainly show just how far this approach has come in one short year. LDOW will likely provide a preview of many of the applications and approaches that will receive fuller attention at the LinkedData Planet Conference two months later.

* * * *

It is exciting to see Linked Data emerging as today’s pragmatic focus for bringing further structure, connections and semantics to the Web. For those of you new to the concept, I encourage you to become active and involved in 2008. And, for those of you already active, I look forward personally to working with you further in the coming year.

Posted: January 28, 2008


Where Has the Biology Community Been Hiding this Gem?

I never cease to be amazed at how often wonderful and powerful tools are easily overlooked. The most recent example is Cytoscape, a winner in our recent review of more than 25 tools for large-scale RDF graph visualization.

We began this review because the UMBEL subject concept “backbone” ontology will involve literally thousands of concepts. Graph visualization software suitable to very large graphs would aid UMBEL’s construction and refinement.

Cytoscape describes itself as a bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data. Cytoscape is partially based on GINY and Piccolo, among other open-source toolkits. What is more important to our immediate purposes, however, is that its design also lends itself well to general network and graph manipulation.

Cytoscape was first brought to our attention by François Belleau of Bio2RDF.org. Thanks François, and also for the strong recommendation and tips. Special thanks are also due to Frédérick Giasson of Zitgist for his early testing and case examples. Thanks, Fred!

Requirements

We had a number of requirements and items on our wish list prior to beginning our review. We certainly did not expect most or all of these items to be met:

  • Large scale – the UMBEL graph will likely have about 20,000 nodes or so; we would also like to be able to scale to instance graphs of hundreds of thousands or millions of nodes. For example, here is one representation of the full UMBEL graph (with nodes in pink-orange and different colored lines representing different relationships or predicates):
Full UMBEL Graph
  • Graph filtering – the ability to filter out the graph display by attribute, topology, selected nodes or other criteria. Again, here is an example using the ‘Organic’ layout produced by selecting on the Music node in UMBEL (click for full size):
Music Sub-graph, 'Organic' Layout
  • Graph analysis – the ability to analyze edge (or relation) lengths, cyclic aspects, missing nodes, imbalances across the full graph, etc.
  • Extensibility – the ability to add new modules or plugins to the system
  • Support for RDF – the ease for direct incorporation of RDF graphs
  • Graph editing – the interactive ability to add, edit or modify nodes and relations, to select colors and display options, to move nodes to different locations, cut-and-paste operations and other standard edits, and
  • Graph visualization – the ease of creating sub-graphs and to plot the graphs with a variety of layout options.

Cytoscape met or exceeded our wish list in all areas save one: it does not support direct ingest of RDF (other than some pre-set BioPAX formats). However, that proved to be no obstacle because of the clean input format support of the tool. Simple parsing of triples into a CSV file is sufficient for input. Moreover, as described below, there are other cool attribute management functions that this clean file format supports as well.
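Because the import format is so forgiving, a few lines of script cover that conversion. Here is a minimal sketch of the triples-to-CSV step, assuming an N-Triples input; the regular expression is deliberately naive (it ignores blank nodes, datatypes and escaping), and the file names are illustrative only:

```python
# Hedged sketch: flatten N-Triples into a source/interaction/target CSV
# that Cytoscape's delimited-text import can read.
import csv
import re

# Naive pattern for one N-Triples line: <subj> <pred> <obj-or-"literal"> .
TRIPLE_RE = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+(<[^>]+>|"[^"]*")\s*\.')

def ntriples_to_csv(nt_path, csv_path):
    with open(nt_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["source", "interaction", "target"])
        for line in src:
            match = TRIPLE_RE.match(line.strip())
            if match:
                s, p, o = match.groups()
                writer.writerow([s, p, o.strip('<>"')])

ntriples_to_csv("umbel.nt", "umbel_edges.csv")  # illustrative file names
```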

Features and Attractions

The following screen shot shows the major Cytoscape screen. We will briefly walk through some of its key views (click for full size):

Cytoscape-UMBEL Main Screen

This Java tool has a fairly standard Eclipse-like interface and design. The main display window (A) shows the active portion of the current graph view. (Note that in this instance we are looking at a ‘Spring’ layout for the same Music sub-graph presented above.) Selections can easily be made in this main display (the red box) or by directly clicking on a node. The display itself represents a zoom (B) of the main UMBEL graph, which can also be easily panned (the blue box on B) or itself scaled (C). Those items that are selected in the main display window also appear as editable nodes or edges and attributes in the data editing view (D).

The appearance of the graph is fully editable via the VizMapper (E). An interesting aspect here is that every relation type in the graph (its RDF properties, or predicates) can be visually displayed in a different manner. The graphs or sub-graphs themselves can be selected, but also most importantly, the display can respond to a very robust and flexible filtering framework (F). Filters can be easily imported and can apply to nodes, edges (relations), the full graph or other aspects (depending on plugin). A really neat feature is the ability to search the graph in various flexible ways (G), which alters the display view. Any field or attribute can be indexed for faster performance.

In addition to these points, Cytoscape supports the following features:

  • Load and save previously-constructed interaction networks in GML format (Graph Markup Language); a small writer sketch follows this list
  • Load and save networks and node/edge attributes in an XML document format called XGMML (eXtensible Graph Markup and Modeling Language)
  • Load and save arbitrary attributes on nodes and edges. For example, input a set of custom annotation terms or confidence values
  • Load and save the state of the Cytoscape session in a Cytoscape Session (.cys) file. A Cytoscape Session file includes networks, attributes (for node/edge/network), desktop states (selected/hidden nodes and edges, window sizes), properties, and visual styles (which are nameable)
  • Customize network data display using powerful visual styles
  • Map node color, label, border thickness, or border color, etc. according to user-configurable colors and visualization schemes
  • Layout networks in two dimensions. A variety of layout algorithms are available, including cyclic and spring-embedded layouts
  • Zoom in/out and pan for browsing the network
  • Use the network manager to easily organize multiple networks, with this structure savable in a session file
  • Use the bird’s eye view to easily navigate large networks
  • Easily navigate large networks (100,000+ nodes and edges) via an efficient rendering engine
  • Multiple plugins are available for areas such as subset selections, analysis, path analysis, etc. (see below).
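As promised above, here is a hedged sketch of a bare-bones GML writer, so that an arbitrary edge list can be opened in Cytoscape; the node labels and output path are illustrative only:

```python
# Hedged sketch: emit a minimal GML file from an edge list.
def write_gml(edges, path):
    nodes = sorted({n for edge in edges for n in edge})
    ids = {n: i for i, n in enumerate(nodes)}  # GML wants integer node ids
    with open(path, "w") as f:
        f.write("graph [\n  directed 1\n")
        for n in nodes:
            f.write(f'  node [ id {ids[n]} label "{n}" ]\n')
        for s, t in edges:
            f.write(f"  edge [ source {ids[s]} target {ids[t]} ]\n")
        f.write("]\n")

write_gml([("Music", "Jazz"), ("Music", "Blues")], "subgraph.gml")  # illustrative
```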

Other Cytoscape Resources

The Cytoscape project also offers a number of additional official resources.

Unfortunately, other than these official resources, there appears to be a dearth of general community discussion and tips on the Web. Here’s hoping that situation soon changes!

Plugins

There is a broad suite of plugins available for Cytoscape, and directions to developers for developing new ones.

The master page also includes third-party plugins. The candidates useful to UMBEL and its graphing needs — also applicable to standard semantic Web applications — appear to be:

  • AgilentLiteratureSearch – creates a CyNetwork based on searching the scientific literature. Download from here
  • BubbleRouter – this plugin allows users to layout a network incrementally and in a semi-automated way. Bubble Router arranges specific nodes in user-drawn regions based on a selected attribute value. Bubble Router works with any node attribute file. Download from here
  • Cytoscape Plugin (Oracle) – enables a read/write interface between the Oracle database and the Cytoscape program. In addition, it also enables some network analysis functions from Cytoscape. The README.txt file within the zip file has instructions for installing and using this plugin. Download from here
  • DOT – interfaces with the GraphViz package for graph layout. The plugin now supports both simple and rank-cluster layouts. This software uses the dot layout routine from the GraphViz open-source software developed at AT&T Labs. Download from here
  • EnhancedSearch – performs search on multiple attribute fields. Download from here
  • HyperEdgeEditor – add, remove, and modify HyperEdges in a Cytoscape Network. Download from here
  • MCODE – MCODE finds clusters (highly interconnected regions) in a network. Clusters mean different things in different types of networks. For instance, clusters in a protein-protein interaction network are often protein complexes and parts of pathways, while clusters in a protein similarity network represent protein families. Download from here
  • MONET – is a genetic interaction network inference algorithm based on Bayesian networks, which enables reliable network inference with large-scale data (e.g., microarray) and genome-scale network inference from expression data. Network inference can be finished in reasonable time with parallel processing techniques using supercomputing center resources. This option may also be applicable to generic networks. Download from here
  • NamedSelection – this plugin provides the ability to “remember” a group of selected nodes. Download from here
  • NetworkAnalyzer – computes network topology parameters such as diameter, average number of neighbors, and number of connected pairs of nodes. It also displays diagrams for the distributions of node degrees, average clustering coefficients, topological coefficients, and shortest path lengths. Download: http://med.bioinf.mpi-inf.mpg.de/netanalyzer/index.html
  • SelConNet – is used to select the connected part of a network. Actually, running this plugin is like calling Select -> Nodes -> First neighbors of selected nodes many times until all the connected part of the network containing the selected nodes is selected. Download from here
  • ShortestPath – is a plugin for Cytoscape 2.1 and later to show the shortest path between 2 selected nodes in the current network. It supports both directed and undirected networks and it gives the user the possibility to choose which node (of the selected ones) should be used as source and target (useful for directed networks). The plugin API makes it possible to use its functionality from another plugin. Download from here
  • sub-graph – is a flexible sub-graph creation, node identification, cycle finder, and path finder in directed and undirected graphs. It also has a function to select the p-neighborhood of a selected node or group of nodes which can be selected by pathway name, node type, or by a list in a file. This generates a set of plug-ins called: path and cycle finding, p-neighborhoods and sub-graph selection. Download from here.

Importantly, please note there is a wealth of biology- and molecular-specific plugins also available that are not included in the generic listing above.

Initial Use Tips

Our initial use of the tool suggests some use tips:

  • Try Cytoscape with the yFiles layouts; they are quicker to perform and give interesting results
  • Try the Organic yFiles layout as one of the first
  • Try the search feature
  • Check the manual for examples of layouts
  • Hold the right mouse button down in the main screen: moving the cursor from the center outward zooms in; moving it from the exterior inward zooms out
  • Moving and panning nodes can be done in real time without issues
  • The “edge attribute browser” is really nice to find what node links to what other node by clicking on a link (so you don’t have to pan and check, etc)
  • Export to PDF often works best as an output display (though SVG is also supported)
  • If you select an edge and then Ctrl-left-click on the edge, an edge “handle” will appear. This handle can be used to change the shape of the line
  • Use the CSV file to make quick modifications, and then check it with the Organic layout
  • A convenient way to check the propagation of a network is to select a node, then press Ctrl+6 again and again (Ctrl+6 selects the neighborhood nodes of a selected node, so it “shows” you the network created by a node and its relationships)
  • If you want to analyze a sub-graph, search for a node, press Ctrl+6 a couple of times, then create another graph from that selected node (File -> New -> Network from selected node)
  • If you begin to see slow performance, then save and re-load your session; there appear to be some memory leaks in the program
  • Also, for very large graphs, avoid repeated use of certain layouts (Hierarchy, Orthogonal, etc.) that take very long times to re-draw.

Concluding Observations and Comments

Cytoscape was first released in 2002 and has undergone steady development since. Most recently, the 2.x versions, and especially 2.3 forward, have seen a flurry of general developments that have greatly broadened the tool’s appeal and capabilities. It was perhaps only these more recent developments that have positioned Cytoscape for broader use.

I suspect another reason that this tool has been overlooked by the general semWeb community is the fact that its sponsors have positioned it mostly in the biological space. Their short descriptor for the project, for example, is: “Cytoscape is an open source bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data.” That statement hardly makes it sound like a general tool!

Another reason for the lack of attention, of course, is the common tendency for different disciplines not to share enough information. Indeed, one reason for my starting the Sweet Tools listing was hopefully as a means of overcoming artificial boundaries and assembling relevant semantic Web tools in one central place.

Yet despite the product’s name and its positioning by sponsors, Cytoscape is indeed a general graph visualization tool, and arguably the most powerful one reviewed from our earlier list. Cytoscape can easily accommodate any generalized graph structure, is scalable, provides all conceivable visualization and modeling options, and has a clean extension and plugin framework for adding specialized functionality.

With just minor tweaks or new plugins, Cytoscape could directly read RDF and its various serializations, could support processing any arbitrary OWL or RDF-S ontology, and could support other specific semWeb-related tasks. As well, a tool like CPath (http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1660554), which enables querying of biological databases and then storing them in Cytoscape format, offers some tantalizing prospects for a general model for other Web query options.

For these reasons, I gladly announce Cytoscape as the next deserving winner of the (highly coveted, but cheesy! :) ) AI3 Jewels & Doubloons award.

Cytoscape’s sponsors — the U.S. National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH), the U.S. National Science Foundation (NSF) and Unilever PLC — and its developers — the Institute for Systems Biology, the University of California – San Diego, the Memorial Sloan-Kettering Cancer Center, L’Institut Pasteur and Agilent Technologies — are to be heartily thanked for this excellent tool!

An AI3 Jewels & Doubloons Winner


AI3 Assembles 26 Candidate Tools

The pending UMBEL subject concept “backbone” ontology will involve literally thousands of concepts. In order to manage and view such a large structure, a concerted effort to find suitable graph visualization software was mounted. This post presents the candidate listing, as well as some useful starting resources and background information.

A subsequent post will present the surprise winner of our evaluation.

Starting Resources


Various Example Visualizations

For grins, you may also like to see various example visualizations, most with a large-graph bent:

Software Options

Here is the listing of 26 candidate graph visualization programs assembled to date:

  • Cytoscape – this tool, based on GINY and Piccolo (see below), is under active use by the bioinformatics community and highly recommended by Bio2RDF.org
  • GINY implements a very innovative system for sub-graphing and allows for stunning visuals. GINY is open source, provides a number of layout algorithms, and is designed to be a very intuitive API. Uses Piccolo
  • graphviz – graphviz is a set of graph drawing tools and libraries. It supports hierarchical and mass-spring drawings; although the tools are scalable, their emphasis is on making very good drawings of reasonably-sized graphs. Package components include batch layout filters and interactive editors
  • HyperGraph is an open source project that provides Java code to work with hyperbolic geometry and especially with hyperbolic trees. It provides a very extensible API to visualize hyperbolic geometry, to handle graphs and to layout hyperbolic trees
  • Hypertree is an open source project very similar to the HyperGraph project. As the name implies, Hypertree is restricted to hyperbolic trees
  • The InfoVis Toolkit – is an interactive graphics toolkit written in Java to ease the development of Information Visualization applications and components
  • IsaViz – IsaViz is a visual environment for browsing and authoring Resource Description Framework (RDF) models represented as graphs
  • IVC Software Framework – the InfoVis Cyberinfrastructure (IVC) software framework is a set of libraries that provide a simple and uniform programming-interface to algorithms and user-interface to end-users by leveraging the power of the Eclipse Rich Client Platform (RCP)

  • JGraph – according to its developers, this is the most powerful, easy-to-use, feature-rich and standards-compliant open source graph component available for Java. Many implementation options are shown on its screenshots page, including possibly JGraphT
  • JUNG — the Java Universal Network/Graph Framework is a software library that provides a common and extendible language for the modeling, analysis, and visualization of data that can be represented as a graph or network. It is written in Java, which allows JUNG-based applications to make use of the extensive built-in capabilities of the Java API, as well as those of other existing third-party Java libraries. RDF Gravity uses JUNG
  • LGL – LGL (Large Graph Library) is a compendium of applications for making the visualization of large networks and trees tractable. LGL was specifically motivated by the need to make the visualization and exploration of large biological networks more accessible
  • LibSea – LibSea is both a file format and a Java library for representing large directed graphs on disk and in memory. Scalability to graphs with as many as one million nodes has been the primary goal. Additional goals have been expressiveness, compactness, and support for application-specific conventions and policies
  • Mondrian – is a general purpose statistical data-visualization system written in Java. It features outstanding visualization techniques for categorical data, geographical data and large datasets
  • OpenDX – OpenDX is a uniquely powerful, full-featured software package for the visualization of scientific, engineering and analytical data. OpenDX is the open source software version of IBM’s Visualization Data Explorer Product. The last release of Data Explorer from IBM was 3.1.4B; the open source version is based on this version
  • Otter – Otter is a historical CAIDA tool used for visualizing arbitrary network data that can be expressed as a set of nodes, links or paths. Otter was developed to handle visualization tasks for a wide variety of Internet data, including data sets on topology, workload, performance, and routing. Otter is in maintenance rather than development mode
  • Pajek – Pajek (Slovene word for Spider) is a program, for Windows, for analysis and visualization of large networks. It is freely available for noncommercial use, and has been called by others the “best available”. See also the PDF reference manual for Pajek
  • Piccolo – is a toolkit that supports the development of 2D structured graphics programs, in general, and Zoomable User Interfaces (ZUIs), in particular. It is used to develop full-featured graphical applications in Java and C#, with visual effects such as zooming, animation and multiple representations
  • Prefuse – Prefuse is a user interface toolkit for building highly interactive visualizations of structured and unstructured data. This includes any form of data that can be represented as a set of entities (or nodes) possibly connected by any number of relations (or edges). Examples of data supported by prefuse include hierarchies (organization charts, taxonomies, file systems), networks (computer networks, social networks, web site linkage) and even non-connected collections of data (timelines, scatterplots). See also Jeff Heer, the author of Prefuse (http://jheer.org/)
  • RDF Gravity – RDF Gravity is a tool for visualising RDF/OWL Graphs/ ontologies. It is implemented by using the JUNG Graph API and Jena semantic web toolkit
  • SemaSpace – is a fast and easy to use graph editor for large knowledge networks
  • TouchGraph is a set of interfaces for Graph Visualization using spring-layout and focus+context techniques. Current applications include a utility for organizing links, a visual Wiki Browser, and a Google Graph Browser which uses the Google API; see also the commercial site at http://www.touchgraph.com/
  • TreeMap is a tool for treemap visualisation
  • Tulip – Tulip is a software system dedicated to the visualization of huge graphs. It manages graphs with up to 1 M elements (node and edges) on a personal computer. Its SuperGraph technology architecture provides the following features: 3D visualizations, 3D modifications, plugin support, support for clusters and navigation, automatic graph drawing, automatic clustering of graphs, automatic selection of elements, and automatic coloring of elements according to a metric
  • Visual Browser is a Java application that can visualise the data in RDF schemes
  • Walrus – Walrus is a tool for interactively visualizing large directed graphs in three-dimensional space. By employing a fisheye-like distortion, it provides a display that simultaneously shows local detail and the global context. It is technically possible to display graphs containing a million nodes or more, but visual clutter, occlusion, and other factors can diminish the effectiveness of Walrus as the number of nodes, or the degree of their connectivity, increases. Thus, in practice, Walrus is best suited to visualizing moderately sized graphs that are nearly trees. A graph with a few hundred thousand nodes and only a slightly greater number of links is likely the best target size
  • xSiteable is a complete small-to-medium-size site development kit created in XSLT, with a PHP administration package.
Posted: January 26, 2008

Proprietary is a Four-Letter Word

I wrote the following in response to a recent press inquiry asking me to define the terms “semantic Web” and “industry standards”. It got me to thinking about how some new companies are misappropriating the terms:

The “Semantic Web” is a vision first promoted by Tim Berners-Lee, founder of the WWW and director of the W3C standards consortium [1,2,3]. In its full sense, understood to require many years to reach fruition, today’s document Web evolves into a Web of data with machines being able to understand the meaning of that data and to interoperate and take action on it, performing many useful tasks for people such as finding relevant and desired information and doing and interconnecting stuff automatically. This longer-term vision is often expressed as the “uppercase” Semantic Web.

Nearer term, the evolution to a Web of data still occurs, but the aspirations are more immediate and at hand. Important Web data is broken out and expressed in ways that aid interconnections and interoperability. Key sources, like Wikipedia, Census data and much else, are now expressed at the more atomic data and object (as opposed to Web page) level, which leads to meaningful linkages and interoperability.

This partial vision, also supported by Berners-Lee (and, of course, many others), is being demonstrated by the linked data initiative [4], bringing meaningful results to both machines and humans, and is often called the “lowercase” semantic Web. Others have also called this phase in the Web’s evolution “Web 3.0” (a phrase I dislike, however, because it conveys little meaning and implies no compliance with any standards).

Many wonderful and dedicated people have been working towards these visions for a decade or more. Some adhere more to the “pure” uppercase expression of the vision; others are more near-term and pragmatic lowercase in nature. The press sometimes likes to see these differences in viewpoint as expressions of controversy or dispute in the community, but, to my own lights, I think they are more differences in perspective than objectives.

The common thread is the “semantics”, or the meaning, of the data. If we know that two pieces of information or data are related in meaning, then we can act accordingly upon them.

In any case, the mechanisms by which semantic interoperability occurs are via standards, nearly all developed and promulgated by the W3C. Key semantic Web standards include URIs (of course), the Resource Description Framework (RDF) [5] that defines the “triples” for expressing data relationships between subjects and objects (the two pieces of data), RDF Schema and the Web Ontology Language (OWL) [6] for describing data domains and their structure and vocabularies, GRDDL [7] for converting common information to RDF, SPARQL [8] for querying compliant semantic data stores, and of course many others.
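To make those standards concrete, here is a small hand-rolled illustration (not part of the original press response): one RDF triple expressed in Turtle and then queried with SPARQL, using the Python rdflib library; the example URIs are invented for the sketch:

```python
# Hedged sketch: one RDF triple in Turtle, queried via SPARQL (rdflib).
from rdflib import Graph

TURTLE = """
@prefix dc: <http://purl.org/dc/terms/> .
<http://example.org/article/semantic-web>
    dc:subject <http://dbpedia.org/resource/Semantic_Web> .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")

# SPARQL: find every article whose subject is the Semantic Web resource
results = g.query("""
    PREFIX dc: <http://purl.org/dc/terms/>
    SELECT ?article
    WHERE { ?article dc:subject <http://dbpedia.org/resource/Semantic_Web> . }
""")
for row in results:
    print(row.article)  # -> http://example.org/article/semantic-web
```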

By “industry compliant”, we mean that it conforms to all of these open standards guiding this evolution to the Semantic Web. And, obviously, via this compliance, we are then able to easily interoperate with others that also so conform.

While there are certainly cases and issues where I disagree with the definitions or specific uses of semantic Web concepts by the World Wide Web consortium (W3C), without these standards there would be chaos and no interoperability moving forward.

So, while I think it is fair game to criticize and lobby for changes in the W3C’s promulgations, it is not “compliant” to not use the standards. Beware of emerging companies that claim the mantle of the semantic Web — or worse, still, push the fluff of Web 3.0 — but do not adhere to these standards. For in the end, they are not the semantic Web, but just another example of the proprietary dinosaurs of the past.


[1] http://en.wikipedia.org/wiki/Semantic_web
[2] http://en.wikipedia.org/wiki/Tim_Berners-Lee
[3] http://en.wikipedia.org/wiki/World_Wide_Web_Consortium
[4] http://en.wikipedia.org/wiki/Linked_Data
[5] http://en.wikipedia.org/wiki/Resource_Description_Framework
[6] http://en.wikipedia.org/wiki/Web_Ontology_Language
[7] http://en.wikipedia.org/wiki/GRDDL
[8] http://en.wikipedia.org/wiki/SPARQL
