November 6, 2008

UMBEL (Upper Mapping and Binding Exchange Layer)

Version 5.0.9 includes UMBEL Class Lookups and Named Entity Extraction

I first wrote about OpenLink Software’s stellar suite of structured Web-related software back in April 2007, with a spotlight on Virtuoso, the company’s flagship ‘universal server’ product. As it has for years, OpenLink continues a steady drumbeat of new releases and extensions. The most recent version upgrade, 5.0.9, was announced today.

In the intervening period I have personally had the chance to experience Virtuoso first hand, both as the standard hosting platform for Zitgist’s linked data products and services, and as the hosting environment for UMBEL’s various and growing Web services. I can state quite categorically that our ability to get things done fast with few resources depends critically on the unbelievably high-productivity platform that Virtuoso provides. (And, hehe, given our close relationship to OpenLink, we also get great responsiveness and technical support! :) Though, truthfully, OpenLink continues to amaze with its outreach and embrace of all of the important initiatives within the semantic Web community.)

I normally let these standard Virtuoso release announcements pass without comment. But today’s release v. 5.0.9 has an especially important feature from my parochial perspective: the first support for UMBEL.

Virtuoso Reprised

Just to refresh memories, OpenLink’s Virtuoso is a cross-platform universal server for SQL, XML, and RDF data, including data management. It includes a powerful virtual database engine, full-text indexing, native hosting of existing applications, Web Services (WS*) deployment platform, Web application server, and bridges to numerous existing programming languages. Now in version 5.0, Virtuoso is also offered in an open source version. The basic technical architecture of Virtuoso and its robust capabilities is:

[Figure: Virtuoso Architecture]

From an RDF and linked data perspective, Virtuoso is the most scalable and fastest platform on the market. Critical from Zitgist’s perspective are Virtuoso’s more than 100 built-in RDF-izers (or “Sponger cartridges”), which address all major data formats, serializations, relational data and Web 2.0 APIs. But don’t take my word for it: Check out OpenLink’s impressive list of these cartridges and their various linkages throughout the linked data space.

UMBEL Support

The key aspect of the new UMBEL support in Virtuoso is its incorporation of UMBEL lookups and its use of Named Entity extraction into the RDF-izer cartridges. This is but the first of growing support anticipated for UMBEL.

Other New Features

In addition to UMBEL, this version 5.0.9 includes significant performance optimizations to the SQL Engine, SPARQL+RDF Engine, and the ODBC and JDBC drivers.

Other new features include:

  • An Excel mime-type output option in the SPARQL endpoint
  • Enhanced triple options for bif:contains plus new options for transitivity
  • New RDF-izer Cartridges for the Sponger RDF Middleware Layer
  • Support for very large HTTP client requests
  • A sparql-auth endpoint with digest authentication for using SPARUL via SPARQL Protocol
  • New commands for the Ubiquity Firefox plugin.

Finally, per usual, there are also minor bug-fixes:

  • Memory leaks
  • SQL query syntax handling
  • SPARQL ‘select distinct’
  • XHTML and Javascript validation and other UI issues in the ODS application suite.

For More Details

For more details, you can see these Virtuoso release notes:

You can also get information on the Virtuoso open source edition or download it.

September 27, 2008

Zotero Bibliographic Plug-in

Infringement of EndNote Formats Claimed

Zotero has long been one of my favorite Firefox plug-ins, being a productive and trusted sidekick for collecting and reporting my voluminous citation and bibliographic data. I think perhaps my review of Zotero from January 2007 was one of my most glowing write-ups.

If you go to the Zotero home page, you will see at the lower left the steady increase of functionality that has come out in this free and open source tool. For example, Zotero now supports more than 1100 bibliographic sources, can capture Web pages and many standard Web sources, and has MS Office and WordPress support. Zotero has been developed and is distributed by the Center for History and New Media at George Mason University.

Thomson Reuters Sues George Mason University, Virginia

According to the Courthouse News Service with a copy of this complaint filed September 5, Thomson Reuters is suing George Mason University and, as a state institution, the Commonwealth of Virginia, for $10 million in damages and an injunction on further distribution of a beta version of Zotero. Thomson is seeking a jury trial.

Thomson claims that a July 8 beta release of Zotero (version 1.5) included a new feature to read and convert Thomson’s 3,500-plus proprietary .ens style files within the EndNote software into free, open source Zotero .csl files. Thomson claims this is in direct violation of GMU’s current license for EndNote. The Zotero beta release introduces a server-side synchronization function; the standard Zotero release without this feature and the EndNote support is version 1.07.

EndNote is a proprietary and popular citation software used by many academics and researchers. EndNote has very similar functionality to Zotero. It allows users to search online bibliographic databases, organize them, and store and re-format citations in various publication styles. Single user licenses are $250 with volume and academic discounts available. Thomson claims “millions” of ultimate users.

Thomson Reuters is also the firm behind the Open Calais named entity extraction service noted much in the semantic Web community (and which this week announced a commercial version).

File format ingest and conversions have long been a mainstay of interoperable software systems. This lawsuit will bear close monitoring.

Hat tip to Rafael Sidi for this link.

September 22, 2008

The Linkage of UMBEL’s 20,000 Subject Concepts and Inferencing Brings New Capabilities

Thanks to Kingsley Idehen and OpenLink Software, DBpedia has been much enriched with its mapping to UMBEL’s 20,000 class-based subject concepts. DBpedia is the structured data version of Wikipedia that I (among many) wrote about in depth in April of last year shortly after its release.

We have also recently gotten an updated estimate of the size of the semantic Web and a new release of the linking open data (LOD) cloud diagram.

A New Instance of the LOD Cloud Diagram

Since DBpedia’s release, it has become the central hub of linked open data as shown by this now-famous (and recently updated!) LOD diagram [1]:

[Figure: LOD cloud diagram]

Each version of the diagram adds new bubbles (datasets) and new connections. The use of linked data, which is based on the RDF data model and uses Web protocols to name and access data, is proving to be a powerful framework for interconnecting disparate and heterogeneous information. As the diagram above shows, all types of information from a variety of public sources now make up the LOD cloud [2].

A Beginning Basis for Estimating the Size of the Semantic Web

The most recent analysis of this LOD cloud is by Michael Hausenblas and colleagues as presented at I-Semantics08 in September [3]. About 50 major datasets comprising roughly two billion triples and three million interlinks were contained in the cloud at the time of their analysis. They partitioned their analysis into two distinct types: 1) single-point-of-access datasets (akin to conventional databases), such as DBpedia or Geonames, and 2) distributed records characterized by RDF ontologies such as FOAF or SIOC. Their paper [3] should be reviewed for its own conclusions. In general, though, most links appear to be of low value (though a minority are quite useful).

Simple measures such as triples or links have little meaning in themselves. Moreover, and this is most telling, all of the LOD relationships in the diagram above and the general nature of linked data to date have based their connections on instance-level data. Often this takes the form that a specific person, place or thing in one dataset is related to that very same thing in another dataset using the owl:sameAs property; sometimes it is that one person knows another person; or, it may be in other examples that one entry has an associated photo. Entities are related to other entities and their attributes, but little is provided about the conceptual or structural relationships amongst those entities.

Instance-level mappings are highly useful for aggregating various attributes or facts about given entities or things. But they only scratch the surface of the structure that can be made available through linked data and the conceptual relationships between and amongst all of those things. For those relationships to be drawn or inferred, a different level of linkage needs to be made: the class, collection or schema view of the data.
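To make the instance-level case concrete, here is a minimal sketch of how owl:sameAs links let facts about the same thing be pooled across datasets. The triples, the ex: properties and the other: dataset prefix are invented for illustration; this is not how any particular triple store implements it.

```python
from collections import defaultdict

# Hypothetical triples, invented for illustration only.
TRIPLES = [
    ("dbpedia:Saab_900", "owl:sameAs", "other:saab-900"),
    ("dbpedia:Saab_900", "ex:manufacturer", "Saab"),
    ("other:saab-900", "ex:bodyStyle", "hatchback"),
]


def find(parent, x):
    """Return the representative of x's identity group (simple union-find)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps chains short
        x = parent[x]
    return x


def aggregate_facts(triples):
    """Pool the facts of all resources connected by owl:sameAs links."""
    parent = {}
    for s, p, o in triples:
        if p == "owl:sameAs":
            parent[find(parent, s)] = find(parent, o)
    facts = defaultdict(set)
    for s, p, o in triples:
        if p != "owl:sameAs":
            facts[find(parent, s)].add((p, o))
    return dict(facts)


merged = aggregate_facts(TRIPLES)
```

Note what the result does and does not contain: the two records collapse into one bundle of attributes, but nothing in it says what kind of thing the entity is or how it relates to other kinds of things.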

The UMBEL Subject Concept ‘Backbone’

UMBEL, or similar conceptual frameworks, can provide this structural backbone.

UMBEL (Upper Mapping and Binding Exchange Layer) is a lightweight reference ontology of about 20,000 subject concepts and their logical and semantic relationships. The UMBEL ontology is a direct derivation of the proven Cyc knowledge base from Cycorp, Inc.

UMBEL’s subject concepts provide mapping points for the many (indeed, millions of) named entities that are their notable instances. Examples might include the names of specific physicists, cities in a country, or a listing of financial stock exchanges. UMBEL mappings enable us to link a given named entity to the various subject classes of which it is a member.

And, because of relationships amongst subject concepts in the backbone, we can also relate that entity to other related entities and concepts. The UMBEL backbone traces the major pathways through the content graph of the Web.
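The idea of relating an entity to broader concepts through the backbone can be sketched in a few lines. This is only an illustration under stated assumptions: the hierarchy below is hypothetical (umbel:Automobile and the specific sub-class links are invented, not taken from the UMBEL files), and real inferencing happens inside the triple store rather than in application code.

```python
# Hypothetical slice of a subject-concept hierarchy: class -> direct super-classes.
SUBCLASS_OF = {
    "umbel:Automobile_GasolineEngine": ["umbel:Automobile"],
    "umbel:Automobile": ["umbel:RoadVehicle"],
    "umbel:RoadVehicle": [],
}

# Hypothetical mapping of a named entity to its directly asserted classes.
INSTANCE_TYPES = {
    "dbpedia:Saab_900": ["umbel:Automobile_GasolineEngine"],
}


def inferred_classes(entity, instance_types, subclass_of):
    """Return the asserted classes of an entity plus all of their ancestors."""
    seen = set()
    stack = list(instance_types.get(entity, []))
    while stack:
        cls = stack.pop()
        if cls not in seen:
            seen.add(cls)
            stack.extend(subclass_of.get(cls, []))
    return seen


def instances_of(cls, instance_types, subclass_of):
    """Return every entity that belongs to cls, directly or by inference."""
    return sorted(
        entity
        for entity in instance_types
        if cls in inferred_classes(entity, instance_types, subclass_of)
    )
```

With this toy data, asking for instances of umbel:RoadVehicle returns the Saab entity even though it was only ever typed with the most specific class.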

The UMBEL backbone provides structure and relationships at large or small scale. For example, in its full extent, UMBEL’s complete structure resembles:

UMBEL Big Graph

But, we can dive into that structure with respect to automobiles or related concepts . . .

UMBEL Big Saab

. . . all the way down to seeing the relationships to Saab cars:

UMBEL Saab Neighborhood

It is this ability to provide context through structure and relations that can help organize and navigate large datasets of instances such as DBpedia. Until the application of UMBEL — or any subject or class structure like it — most of the true value within DBpedia has remained hidden.

But no longer.

Some Example Queries

UMBEL had already mapped most DBpedia instances to its own internal classes. By a simple mapping of files and then inferencing against the UMBEL classes, this structure has now been brought to DBpedia itself. Any SPARQL queries applied against DBpedia can now take advantage of these relationships.

Below are some sample queries Kingsley used to announce these UMBEL capabilities to the LOD mailing list [4]. You can test these queries yourself or try alternative ones by using a standard SPARQL query.

For example, go to one of DBpedia’s query endpoints and cut-and-paste one of these highlighted code snippets into the ‘Query text’ box:

Example Query 1

define input:inference ''
prefix umbel: <>
select ?s
where {
  ?s a umbel:RoadVehicle
}

Example Query 2

define input:inference ''
prefix umbel: <>
select ?s
where {
  ?s a umbel:Automobile_GasolineEngine
}

Example Query 3

define input:inference ''
prefix umbel: <>
select ?s
where {
  ?s a umbel:Project
}

Example Query 4

define input:inference ''
prefix umbel: <>
select ?s
where {
  ?s a umbel:Person
}

Example Query 5

define input:inference ''
prefix umbel: <>
select ?s
where {
  ?s a umbel:Graduate;
     a umbel:Boxer.
}

Example Query 6

define input:inference ''
prefix umbel: <>
prefix yago: <>
select ?s
where {
  ?s a yago:FemaleBoxers;
     a umbel:Graduate;
     a umbel:Boxer.
}
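All six queries share one shape, so they can be generated rather than retyped. The sketch below is a rough illustration only: the helper names are my own, and the rule-set IRI, the umbel: prefix IRI and the endpoint URL are example.org placeholders that you would need to replace with the real values. The only protocol detail relied on is the standard `query` parameter of the SPARQL Protocol.

```python
from urllib.parse import urlencode


def build_class_query(class_names, rule_set_iri, umbel_prefix_iri):
    """Build a query for subjects that belong to every listed UMBEL class."""
    if not class_names:
        raise ValueError("at least one class name is required")
    patterns = " ;\n      ".join(f"a umbel:{name}" for name in class_names)
    return (
        f"define input:inference '{rule_set_iri}'\n"
        f"prefix umbel: <{umbel_prefix_iri}>\n"
        "select ?s\n"
        "where {\n"
        f"  ?s {patterns} .\n"
        "}"
    )


def build_request_url(endpoint, query):
    """Encode the query as a GET request URL using the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Placeholder IRIs: assumptions for illustration, not the real values.
query = build_class_query(
    ["Graduate", "Boxer"],
    rule_set_iri="http://example.org/rules/umbel#",
    umbel_prefix_iri="http://example.org/umbel/",
)
url = build_request_url("http://example.org/sparql", query)
```

Nothing is sent over the network here; the resulting URL can be pasted into a browser or handed to whatever HTTP client you prefer.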

Creating Your Own Mapping

By going to UMBEL’s technical documentation page, you can download the files to create your own mappings (assuming you have a local instance of DBpedia).

The example below also assumes you are using the OpenLink Virtuoso server as your triple store. If you are using a different system, you will need to adjust your commands accordingly.

1. Load linkages (owl:sameAs) between UMBEL named entities and DBpedia resources

File: umbel_dbpedia_linkage.n3

select ttlp_mt (file_to_string_output ('umbel_dbpedia_linkage.n3'), '', '');

2. Load inferred DBpedia types (rdf:types) based on UMBEL named entities

File: umbel_dbpedia_types.n3

select ttlp_mt (file_to_string_output ('umbel_dbpedia_types.n3'), '', '');

3. Load Virtuoso-specific file containing the rules for inferencing

File: umbel_virtuoso_inference_rules.n3

select ttlp_mt (file_to_string_output ('umbel_virtuoso_inference_rules.n3'), '', '');

4. Load UMBEL External Ontology Mapping into a Named Graph (owl:equivalentClasses)

File: umbel_external_ontologies_linkage.n3

select ttlp_mt (file_to_string_output ('umbel_external_ontologies_linkage.n3'), '', '');

5. Create UMBEL Inference Rules

rdfs_rule_set ('', '');
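If you script the load, the five steps above reduce to a small generator of Virtuoso commands. This is a sketch under assumptions: the target graph IRIs and the rule-set name shown are example.org placeholders of my own (the values to use come from the UMBEL documentation), and the helper names are hypothetical. It only builds the command strings; running them is left to isql or your client of choice.

```python
def sql_literal(value):
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def load_statements(file_graph_pairs, rule_set_name, rule_graph_iri):
    """Return one ttlp_mt load per (file, graph) pair plus the rdfs_rule_set call."""
    statements = [
        "select ttlp_mt (file_to_string_output ({}), '', {});".format(
            sql_literal(path), sql_literal(graph)
        )
        for path, graph in file_graph_pairs
    ]
    statements.append(
        "rdfs_rule_set ({}, {});".format(
            sql_literal(rule_set_name), sql_literal(rule_graph_iri)
        )
    )
    return statements


# Graph IRIs and rule-set name below are placeholders, not the real values.
GRAPH = "http://example.org/graph/dbpedia"
RULES = "http://example.org/graph/umbel-rules"
FILES = [
    ("umbel_dbpedia_linkage.n3", GRAPH),
    ("umbel_dbpedia_types.n3", GRAPH),
    ("umbel_virtuoso_inference_rules.n3", RULES),
    ("umbel_external_ontologies_linkage.n3", RULES),
]
commands = load_statements(FILES, "http://example.org/rules/umbel#", RULES)
```

Printing `commands` one per line gives a script in the same order as the numbered steps.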


A new era of interacting with DBpedia is at hand. Within a period of just more than a year, the infrastructure and data are now available to show the advantages of the semantic Web based on a linked Web of data. DBpedia has been a major reason for showing these benefits; it is now positioned to continue to do so.

[1] This new LOD diagram is still being somewhat updated based on review. The version shown above is based on the one posted at the W3C’s SWEO wiki with my own updates of the two-way UMBEL links and the blue highlighting of DBpedia and UMBEL. There is also a clickable version of the diagram that will take you to the home references for the constituent data sources in this diagram.
[2] The objective of the Linking Open Data community is to extend the Web with a data commons by publishing various open datasets as RDF on the Web and by setting RDF links between data items from different data sources. All of the sources on the LOD diagram are such open data. However, the best practices of linked data can also be applied to proprietary or intranet information as well; see this FAQ.
[3] Michael Hausenblas, Wolfgang Halb, Yves Raimond and Tom Heath, 2008. What is the Size of the Semantic Web?, paper presented at the International Conference on Semantic Systems (I-Semantics08) at TRIPLE-I, Sept. 2008.

March 5, 2008


Another Innovative Faceted Browser from UVa and the Humanities

A bit over a year ago I spotlighted Collex, a set of tools for COLLecting and EXhibiting information in the humanities. Collex was developed for the NINES project (which stands for the Networked Infrastructure for Nineteenth-century Electronic Scholarship, a trans-Atlantic federation of scholars). Collex has now spawned Blacklight, a library faceted browser and discovery tool.

Project Blacklight

Blacklight is intended as a general faceted browser with keyword inclusion for use by libraries and digital collections. As with Collex, Blacklight is based on the Lucene/Solr facet-capable full-text engine. The name Blacklight is based on the combination of Solr + UV(a).

Blacklight is being prototyped on UVa’s Digital Collections Repository. It was first shown at the 2007 code4lib meeting, but has recently been unveiled on the Web and released as an open source project. More on this aspect can be found at the Project Blacklight Web site.

Blacklight was developed by Erik Hatcher, the lead developer of Flare and Collex, with help from library staff Bess Sadler, Bethany Nowviskie, Erin Stalberg, and Chris Hoebeke. You can experiment yourself with Blacklight at:

The figure below shows a typical output. Various pre-defined facets, such as media type, source, library held, etc., can be combined with standard keyword searches.
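As a rough illustration of what combining facets with a keyword search means, here is a small in-memory sketch. It is not Blacklight or Solr code: the records, field names and helper functions are invented for the example.

```python
from collections import Counter

# Hypothetical catalog records, invented for illustration only.
RECORDS = [
    {"title": "Monticello furniture inventory", "media_type": "Book", "library": "Library A"},
    {"title": "Views of Monticello", "media_type": "Image", "library": "Library B"},
    {"title": "Jefferson letters", "media_type": "Book", "library": "Library B"},
]


def search(records, keyword="", **facets):
    """Keep records whose title contains the keyword and that match every facet."""
    needle = keyword.lower()
    return [
        record
        for record in records
        if needle in record["title"].lower()
        and all(record.get(field) == value for field, value in facets.items())
    ]


def facet_counts(records, field):
    """Count how many records fall under each value of a facet field."""
    return Counter(record[field] for record in records if field in record)


hits = search(RECORDS, "monticello")
by_media = facet_counts(hits, "media_type")
images = search(RECORDS, "monticello", media_type="Image")
```

The counts are what a faceted interface shows next to each facet value; adding or removing a facet is just another call to `search` with different keyword arguments.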

Many others have pursued facets, and the ones in this prototype are not uniquely interesting. What is interesting, however, is the interface design and the relative ease of adding, removing or altering the various facets or queries to drive results:

Blacklight Faceted Browser


An extension of this effort, BlacklightDL, provides image and other digital media support to the basic browser engine. This instance, drawn from a separate experiment at UVa, shows a basic search of ‘Monticello’ when viewed through the Image Gallery:

BlacklightDL - Monticello

Like the main Blacklight browser, flexible facet selections and modification are offered. With the current DL prototype, using similar constructs from Collex, there are also pie chart graphics to show the filtering effects of these various dimensions (in this case, drilling down on ‘Monticello’ by searching for ‘furniture’):

BlacklightDL - Monticello Furniture

BlacklightDL is also being developed in conjunction with OpenSource Connections (a resource worth reviewing in its own right).

Blacklight has just been released as an open source OPAC (online public access catalog). That means libraries (or anyone else) can use it to allow people to search and browse their collections online. Blacklight uses Solr to index and search, and uses Ruby on Rails for front-end configuration. Currently, Blacklight can index, search, and provide faceted browsing for MARC records and several kinds of XML documents, including TEI, EAD, and GDMS; the code is available for downloading here.

Faceted Browsing

There is a rich and relatively long history of faceted browsing in the humanities and library science community. Notable examples range from Flamenco, one of the earliest (dating from 2001 and still active), to MIT’s SIMILE Exhibit, which I have written of numerous times. Another online example is Footnote, a repository of nearly 30 million historical images. It has a nice interface and an especially nifty way of using a faceted timeline. Also see Solr in Libraries from Ryan Eby.

In fact, faceted browsing and search, especially as it adapts to more free-form structure, will likely be one of the important visualization paradigms for the structured Web. (It is probably time for me to do a major review of the area. :) )

The library and digital media and exhibits communities (such as museums) are working hard at the intersection of the Web, search, display, metadata and semantics. For example, we also have recently seen the public release of the Omeka exhibits framework from the same developers as Zotero, one of my favorite Firefox plug-ins. And Talis continues to be a leader in bringing the semantic Web to the library community.

The humanities and library/museum communities have clearly joined the biology community as key innovators of essential infrastructure to the semantic Web. Thanks, community. The rest of us should be giving more than a cursory wave to these developments.

* * *

BTW, I’d very much like to thank Mark Baltzegar for bringing many of these initiatives to my attention.

February 18, 2008

W3C Semantic Web / Wikipedia

Most Comprehensive Reference List Available Shows Impressive Depth, Breadth

Since about 2005 — and at an accelerating pace — Wikipedia has emerged as the leading online knowledge base for conducting semantic Web and related research. The system is being tapped for both data and structure. Wikipedia has arguably replaced WordNet as the leading lexicon for concepts and relations. Because of its scope and popularity, many argue that Wikipedia is emerging as the de facto structure for classifying and organizing knowledge in the 21st century.

Our work on the UMBEL lightweight reference subject concept structure has stated since the project’s announcement in July 2007 that Wikipedia is a key intended resource for identifying subject concepts and entities. For the past few months I have been scouring the globe attempting to find every drop of research I could find on the use of Wikipedia for semantic Web, information extraction, categorization and related issues.

Thus, I’m pleased to offer up herein the most comprehensive such listing available anywhere: more than 99 resources and counting! (I say “more than” because some entries below have multiple resources; I just liked the sound of 99 as a round number!)

Wikipedia itself maintains a listing of academic studies using Wikipedia as a resource; fewer than one-third of the listings below are on that list (which itself may be an indication of the current state of completeness within Wikipedia). Some bloggers and other sources around the Web also maintain listings in lesser degrees of completeness.

The tremendous growth of content and topics within Wikipedia is well documented (see, as examples, the W1, W2, W3, W4, W5, W6 and W7 internal Wikipedia sources for gory details), with about 2.25 million articles in English as of early 2008 and versions in 256 languages and variants.

Download access to the full knowledge base has enabled the development of notable core references to the Linked Data aspects of the semantic Web such as DBpedia [5,6] and YAGO [72,73]. Entire research teams, such as Ponzetto and Strube [61-65] (and others as well; see below), are moving toward creating full-blown ontologies or structured knowledge bases useful for semantic Web purposes based on Wikipedia. So, one of the first and principal uses of Wikipedia to date has been as a data source of concepts, entities and relations.

But much broader data mining and text mining and analysis is being conducted against Wikipedia, which is currently defining the state of the art in these areas, too:

  • Ontology development and categorization
  • Word sense disambiguation
  • Named entity recognition
  • Named entity disambiguation
  • Semantic relatedness and relations.

Pursuing these objectives, in turn, involves mining and extracting these various kinds of structure from Wikipedia:

  • Articles
    • First paragraph — Definitions
    • Full text — Description of meaning; related terms; translations
    • Redirects — Synonymy; spelling variations, misspellings; abbreviations
    • Title — Named entities; domain specific terms or senses
    • Subject — Category suggestion (phrase marked in bold or in first paragraph)
    • Section heading — Category suggestions
  • Article links
    • Context — Related terms; co-occurrences
    • Label — Synonyms; spelling variations; related terms
    • Target — Link graph; related terms
    • LinksTo — Category suggestion
    • LinkedBy — Category suggestion
  • Categories
    • Category — Category suggestion
    • Contained articles — Semantically related terms (siblings)
    • Hierarchy — Hyponymic and meronymic relations between terms
  • Disambiguation pages
    • Article links — Sense inventory
  • Infobox Templates
    • Name –
    • Item — Category suggestion; entity suggestion
  • Lists
    • Hyponyms

These are some of the specific uses that are included in the 99 resources listed below.
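To give a flavor of one entry in the list above, redirects as a source of synonymy, here is a minimal sketch. The redirect pairs are invented for illustration and the function name is my own; a real pipeline would read the pairs from a Wikipedia dump.

```python
from collections import defaultdict

# Hypothetical (redirect title, target article) pairs, invented for illustration.
REDIRECTS = [
    ("NYC", "New York City"),
    ("New York, New York", "New York City"),
    ("Big Apple", "New York City"),
    ("Saab Automobile AB", "Saab Automobile"),
]


def synonym_sets(redirects):
    """Group redirect titles by target article to get candidate synonyms."""
    groups = defaultdict(set)
    for source, target in redirects:
        groups[target].add(source)
    return {target: sorted(names) for target, names in groups.items()}


synonyms = synonym_sets(REDIRECTS)
```

Each article title ends up with a sorted list of alternate names, abbreviations and spelling variants, which is the kind of raw material the disambiguation papers below build on.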

This is an exciting (and, for almost all of us just a few years back, unanticipated) use of the Web in socially relevant and contextual knowledge and research. I’m sure such a listing one year from now will be double in size or larger!

BTW, suggestions for new or overlooked entries are very much welcomed! :)

A – B

  1. Sisay Fissaha Adafre and Maarten de Rijke, 2006. Finding Similar Sentences across Multiple Languages in Wikipedia, in EACL 2006 Workshop on New Text–Wikis and Blogs and Other Dynamic Text Sources, April 2006.
  2. Sisay Fissaha Adafre, V. Jijkoun and M. de Rijke. Fact Discovery in Wikipedia, in 2007 IEEE/WIC/ACM International Conference on Web Intelligence.
  3. Sisay Fissaha Adafre and Maarten de Rijke, 2005. Discovering Missing Links in Wikipedia, in LinkKDD 2005, August 21, 2005, Chicago, IL.
  4. David Ahn, Valentin Jijkoun, Gilad Mishne, Karin Müller, Maarten de Rijke, and Stefan Schlobach. 2004. Using Wikipedia at the TREC QA Track, in Proceedings of TREC 2004.
  5. Sören Auer, Chris Bizer, Jens Lehmann, Georgi Kobilarov, Richard Cyganiak and Zachary Ives, 2007. DBpedia: A nucleus for a web of open data, in Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC2007), Busan, South Korea, volume 4825 of LNCS, pages 715–728, November 2007.
  6. Sören Auer and Jens Lehmann, 2007. What Have Innsbruck and Leipzig in Common? Extracting Semantics from Wiki Content, in The Semantic Web: Research and Applications, pages 503-517, 2007.
  7. Somnath Banerjee, 2007. Boosting Inductive Transfer for Text Classification Using Wikipedia, at the Sixth International Conference on Machine Learning and Applications (ICMLA).
  8. Somnath Banerjee, Krishnan Ramanathan, Ajay Gupta, 2007. Clustering Short Texts using Wikipedia, poster presented at Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, pp. 787-788.
  9. F. Bellomi and R. Bonato, 2005. Lexical Authorities in an Encyclopedic Corpus: A Case Study with Wikipedia, online reference not found.
  10. F. Bellomi and R. Bonato, 2005. Network Analysis for Wikipedia, presented at Wikimania 2005.
  11. Bibauw (2005) analyzed the lexicographical structure of Wiktionary (in French).
  12. Abhijit Bhole, Blaž Fortuna, Marko Grobelnik and Dunja Mladenić, 2007. Mining Wikipedia and Relating Named Entities over Time.
  13. Razvan Bunescu and Marius Pasca. 2006. Using Encyclopedic Knowledge for Named Entity Disambiguation, in Proceedings of the 11th Conference of the EACL, pages 9-16, Trento, Italy.
  14. Razvan Bunescu, 2007. Learning for Information Extraction: From Named Entity Recognition and Disambiguation To Relation Extraction, Ph.D. thesis for the University of Texas, August 2007, 168 pp.
  15. Luciana Buriol, Carlos Castillo, Debora Donato, Stefano Leonardi, and Stefano Millozzi. 2006. Temporal Analysis of the Wikigraph, in Proceedings of Web Intelligence, Hong Kong.
  16. Davide Buscaldi and Paolo Rosso, 2007. A Comparison of Methods for the Automatic Identification of Locations in Wikipedia, in Proceedings of GIR’07, November 9, 2007, Lisbon, Portugal, pp 89-91.

C – F

  17. Maria Ruiz-Casado, Enrique Alfonseca and Pablo Castells, 2005. Automatic Assignment of Wikipedia Encyclopedic Entries to WordNet Synsets, in AWIC, pages 380-386.
  18. Maria Ruiz-Casado, Enrique Alfonseca and Pablo Castells, 2006. From Wikipedia to Semantic Relationships: a Semi-automated Annotation Approach, in ESWC2006.
  19. Maria Ruiz-Casado, Enrique Alfonseca and Pablo Castells, 2007. Automatising the Learning of Lexical Patterns: an Application to the Enrichment of WordNet by Extracting Semantic Relationships from Wikipedia.
  20. Sergey Chernov, Tereza Iofciu, Wolfgang Nejdl, and Xuan Zhou, 2006. Extracting Semantic Relationships between Wikipedia Categories, from SEMWIKI 2006.
  21. Aron Culotta, Andrew McCallum and Jonathan Betz, 2006. Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text, in Proceedings of HLT-NAACL-2006.
  22. Silviu Cucerzan, 2007. Large-Scale Named Entity Disambiguation Based on Wikipedia Data, in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
  23. Cyclopedia, from the Cyc Foundation, an online version of Wikipedia that enables browsing the encyclopedia by concepts.
  24. Wisam Dakka and Silviu Cucerzan, 2008. Augmenting Wikipedia with Named Entity Tags, to be published in IJCNLP.
  25. Wisam Dakka and Silviu Cucerzan, 2008. Also, see the online tool (not yet implemented).
  26. Turdakov Denis, 2007. Recommender System Based on User-generated Content.
  27. EachWiki, online system from Fu et al.
  28. Linyun Fu, Haofen Wang, Haiping Zhu, Huajie Zhang, Yang Wang and Yong Yu, 2007. Making More Wikipedians: Facilitating Semantics Reuse for Wikipedia Authoring, at ISWC 2007.

G – H

  29. Evgeniy Gabrilovich and Shaul Markovitch. 2006. Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge, in AAAI, pages 1301-1306, Boston, MA.
  30. Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007.
  31. Evgeniy Gabrilovich, 2006. Feature Generation for Textual Information Using World Knowledge, Ph.D. Thesis for The Technion – Israel Institute of Technology, Haifa, Israel, December 2006, 218 pp. (There is also an informative video.)
  32. See also Gabrilovich’s Perl Wikipedia tool, WikiPrep.
  33. Rudiger Gleim, Alexander Mehler and Matthias Dehmer, 2007. Web Corpus Mining by Instance of Wikipedia, in Proc. 2nd Web as Corpus Workshop at EACL 2006.
  34. Andrew Gregorowicz and Mark A. Kramer, 2006. Mining a Large-Scale Term-Concept Network from Wikipedia, Mitre Technical Report, October 2006.
  35. Andreas Harth, Hannes Gassert, Ina O’Murchu, John Breslin and Stefan Decker, 2005. WikiOnt: An Ontology for Describing and Exchanging Wiki Articles, presented at Wikimania, Frankfurt, 5th August 2005.
  36. Martin Hepp, Daniel Bachlechner and Katharina Siorpaes, 2006. Harvesting Wiki Consensus – Using Wikipedia Entries as Ontology Elements.
  37. Martin Hepp, Katharina Siorpaes, Daniel Bachlechner, 2007. Harvesting Wiki Consensus: Using Wikipedia Entries as Vocabulary for Knowledge Management, in IEEE Internet Computing, Vol. 11, No. 5, pp. 54-65, Sept-Oct 2007. Sample data is also available.
  38. A. Herbelot and Ann Copestake, 2006. Acquiring Ontological Relationships from Wikipedia Using RMRS, in Proc. International Semantic Web Conference 2006 Workshop, Web Content Mining with Human Language Technologies, Athens, GA, 2006.
  39. Ryuichiro Higashinaka, Kohji Dohsaka and Hideki Isozaki, 2007. Learning to Rank Definitions to Generate Quizzes for Interactive Information Presentation, in Companion Volume to the Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 117-120.
  40. Todd Holloway, Miran Bozicevic, and Katy Börner. 2005. Analyzing and Visualizing the Semantic Coverage of Wikipedia and Its Authors. ArXiv Computer Science e-prints, cs/0512085.
  41. Wei Che Huang, Andrew Trotman, and Shlomo Geva, 2007. Collaborative Knowledge Management: Evaluation of Automated Link Discovery in the Wikipedia, in SIGIR 2007 Workshop on Focused Retrieval, July 27, 2007, Amsterdam, The Netherlands.

I – K

  1. Jonathan Isbell and Mark H. Butler, 2007. Extracting and Re-using Structured Data from Wikis, Hewlett-Packard Technical Report HPL-2007-182, 14th November, 2007, 22 pp. See
  2. Maciej Janik and Krys Kochut, 2007. Wikipedia in Action: Ontological Knowledge in Text Categorization, University of Georgia, Computer Science Department Technical Report no. UGA-CS-TR-07-001. See
  3. Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath and Gerhard Weikum, 2007. NAGA: Searching and Ranking Knowledge, Technical Report, Max-Planck-Institut f¨ur Informatik, MPI–I–2007–5–001, March 2007, 42 pp.See
  4. Jun’ichi Kazama and Kentaro Torisawa, 2007. Exploiting Wikipedia as External Knowledge for Named Entity Recognition, in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 698–707, Prague, June 2007.See
  5. Daniel Kinzler, 2005. WikiSense Mining the Wiki, V 1.1, presented at Wikimania 2005, 10 pp. See
  6. A. Krizhanovsky, 2006. Synonym Search in Wikipedia: Synarcher, in 11th International Conference “Speech and Computer” SPECOM’2006. Russia, St. Petersburg, June 25-29, 2006, pp. 474-477. See (also PDF).
  7. Natalia Kozlova, 2005. Automatic Ontology Extraction for Document Classification, a Master’s Thesis for Saarland University, February 2005, 90 pp. See$FILE/Masterarbeit-Kozlova-Nat-2005.pdf.
  8. M. Krötzsch, D. Vrandečić and M. Völkel, 2005. Wikipedia and the Semantic Web: the Missing Links, at Wikimania 2005. See

L – N

  1. Rada Mihalcea and Andras Csomai, 2007. Wikify! Linking Documents to Encyclopedic Knowledge, in Proceedings of the Sixteenth ACM conference on Conference on information and knowledge management CIKM ’07, November 6-8, 2007, pp. 233-241. See ACM portal retrieval (
  2. Rada Mihalcea, 2007. Using Wikipedia for Automatic Word Sense Disambiguation, in Proceedings of NAACL HLT 2007, pages 196–203, April 2007. See
  3. David Milne, 2007. Computing Semantic Relatedness using Wikipedia Link Structure; see also the Wikipedia Miner Toolkit ( provided by the author
  4. D. Milne, O. Medelyan and I. H. Witten, 2006. Mining Domain-Specific Thesauri from Wikipedia: A Case Study, in Proceedings of the International Conference on Web Intelligence (IEEE/WIC/ACM WI’2006), Hong Kong. Also Olena Medelyan’s home page,
  5. David Milne, Ian H. Witten and David M. Nichols, 2007. A Knowledge-Based Search Engine Powered by Wikipedia, at CIKM ’07.
  6. Lev Muchnik, Royi Itzhack, Sorin Solomon and Yoram Louzoun, 2007. Self-emergence of Knowledge Trees: Extraction of the Wikipedia Hierarchies, in Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), Vol. 76, No. 1.
  7. Nadeau, D., Turney, P., Matwin, S., 2006. Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity, at 19th Canadian Conference on Artificial Intelligence. Québec City, Québec, Canada. June 7, 2006. See Doesn’t specifically use Wikipedia, but techniques are applicable.
  8. Kotaro Nakayama, Takahiro Hara and Shojiro Nishio, 2007. Wikipedia Mining for an Association Web Thesaurus Construction, in Web Information Systems Engineering – WISE 2007, Vol. 4831 (2007), pp. 322-334.
  9. Dat P.T. Nguyen, Yutaka Matsuo and Mitsuru Ishizuka, 2007. Relation Extraction from Wikipedia Using Subtree Mining, from AAAI ’07.

O – P

  1. Yann Ollivier and Pierre Senellart, 2007. Finding Related Pages Using Green Measures: An Illustration with Wikipedia, in Proceedings of the AAAI-07 Conference. See See also for tools and data.
  2. Simon Overell and Stefan Rüger, 2006. Identifying and Grounding Descriptions of Places, in SIGIR Workshop on Geographic Information Retrieval, pages 14–16, 2006. See
  3. Simone Paolo Ponzetto and Michael Strube, 2006. Exploiting Semantic Role Labeling, WordNet and Wikipedia for Coreference Resolution, in NAACL 2006. See
  4. Simone Paolo Ponzetto and Michael Strube, 2007a. Deriving a Large Scale Taxonomy from Wikipedia, in Association for the Advancement of Artificial Intelligence (AAAI2007).
  5. Simone Paolo Ponzetto and Michael Strube, 2007b. Knowledge Derived From Wikipedia For Computing Semantic Relatedness, in Journal of Artificial Intelligence Research 30 (2007) 181-212. See also these PPT slides (in PDF format): Part I, Part II, and References.
  6. Simone Paolo Ponzetto and Michael Strube, 2007c. An API for Measuring the Relatedness of Words in Wikipedia, in Companion Volume to the Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics; see pages 49-52 within
  7. Simone Paolo Ponzetto, 2007. Creating a Knowledge Base from a Collaboratively Generated Encyclopedia, in Proceedings of the NAACL-HLT 2007 Doctoral Consortium, pp 9-12, Rochester, NY, April 2007. See

R – S

  1. Tyler Riddle, 2006. Parse::MediaWikiDump. URL
  2. Ralf Schenkel, Fabian Suchanek and Gjergji Kasneci, 2007. YAWN: A Semantically Annotated Wikipedia XML Corpus, in BTW2007.
  3. Péter Schönhofen, 2006. Identifying Document Topics Using the Wikipedia Category Network, in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 456-462, 2006. See
  4. Börkur Sigurbjörnsson, Jaap Kamps, and Maarten de Rijke. 2006. Focused Access to Wikipedia. In Proceedings DIR-2006.
  5. Suzette Kruger Stoutenburg, 2007. Research Proposal: Acquiring the Semantic Relationships of Links between Wikipedia Articles, class proposal to
  6. Strube, M. and Ponzetto, S. P. 2006. WikiRelate! Computing Semantic Relatedness Using Wikipedia, in AAAI, pages 1419-1424, Boston, MA.
  7. Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum, 2007. Yago – A Core of Semantic Knowledge, in 16th international World Wide Web Conference (WWW 2007), Banff, Alberta. See
  8. Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum, 2007. Yago: A Large Ontology from Wikipedia and WordNet, in Technical Report, submitted to the Elsevier Journal of Web Semantics 67 pp. See
  9. Fabian M. Suchanek, Georgiana Ifrim and Gerhard Weikum, 2006. Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents, in Knowledge Discovery and Data Mining (KDD 2006). See
  10. S. Suh, H. Halpin and E. Klein, 2006. Extracting Common Sense Knowledge from Wikipedia, in Proc. International Semantic Web Conference 2006 Workshop, Web Content Mining with Human Language Technologies, Athens, GA, 2006. See
  11. Zareen Syed, Tim Finin, and Anupam Joshi, 2008. Wikipedia as an Ontology for Describing Documents, from Proceedings of the Second International Conference on Weblogs and Social Media, AAAI, March 31, 2008. See

T – V

  1. J. A. Thom, J. Pehcevski and A. M. Vercoustre, 2007. Use of Wikipedia Categories in Entity Ranking, in Proceedings of the 12th Australasian Document Computing Symposium, Melbourne, Australia (2007). See
  2. Antonio Toral and Rafael Muñoz, 2007. Towards a Named Entity Wordnet (NEWN), in Proceedings of the 6th International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets (Bulgaria), pp. 604-608, September 2007. See
  3. Antonio Toral and Rafael Muñoz, 2006. A Proposal to Automatically Build and Maintain Gazetteers for Named Entity Recognition by using Wikipedia, in Workshop on New Text, 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento (Italy), April 2006. See
  4. Anne-Marie Vercoustre, Jovan Pehcevski and James A. Thom, 2007. Using Wikipedia Categories and Links in Entity Ranking, in Pre-proceedings of the Sixth International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2007), Dec 17, 2007. See
  5. Anne-Marie Vercoustre, James A. Thom and Jovan Pehcevski, 2008. Entity Ranking in Wikipedia, in SAC’08 March 16-20, 2008, Fortaleza, Ceara, Brazil. See
  6. Max Völkel, Markus Krötzsch, Denny Vrandecic, Heiko Haller and Rudi Studer, 2006. Semantic Wikipedia, in Proceedings of WWW2006, pp. 585-594. See
  7. Jakob Voss. 2006. Collaborative Thesaurus Tagging the Wikipedia Way. ArXiv Computer Science e-prints, cs/0604036. See
  8. Denny Vrandecic, Markus Krötzsch and Max Völkel, 2007. Wikipedia and the Semantic Web, Part II, in Phoebe Ayers and Nicholas Boalch, Proceedings of Wikimania 2006 – The Second International Wikimedia Conference, Wikimedia Foundation, Cambridge, MA, USA, August 2007. See

W – Z

  1. Y. Wang, H. Wang, H. Zhu and Y. Yu, 2007. Exploit Semantic Information for Category Annotation Recommendation in Wikipedia, in Natural Language Processing and Information Systems (2007), pp. 48-60.
  2. Gang Wang, Yong Yu and Haiping Zhu, 2007. PORE: Positive-Only Relation Extraction from Wikipedia Text, at ISWC 2007. See
  3. Yotaro Watanabe, Masayuki Asahara and Yuji Matsumoto, 2007. A Graph-based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields, in EMNLP-CoNLL 2007, 29th June 2007, Prague, Czech Republic. Has a useful CRF overview. See and
  4. Timothy Weale, 2006. Utilizing Wikipedia Categories for Document Classification.
  5. Nicolas Weber and Paul Buitelaar, 2006. Web-based Ontology Learning with ISOLDE. See
  6. Wikify!, online service to automatically turn selected keywords into Wikipedia-style links. See
  7. Wikimedia Foundation. 2006. Wikipedia. URL
  8. Wikipedia, full database. See
  9. Wikipedia, general references. See especially (but only a small subset) for the Semantic Web, RDF, RDF Schema, OWL, SPARQL, GRDDL, W3C, Linked Data, many, many ontologies and controlled vocabularies such as FOAF, SKOS, SIOC, Dublin Core, etc.; description logic areas (such as FOL), etc.
  10. Wikipedia, (as a) research source; see
  11. Fei Wu and Daniel S. Weld, 2007. Autonomously Semantifying Wikipedia.
  12. Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita & Giuseppe Attardi, 2007. Ranking Very Many Typed Entities on Wikipedia, in CIKM ’07: Proceedings of the Sixteenth ACM International Conference on Information and Knowledge Management. See
  13. Torsten Zesch, Iryna Gurevych, Max Mühlhäuser, 2007. Analyzing and Accessing Wikipedia as a Lexical Semantic Resource, and the longer technical report. See
  14. Torsten Zesch and Iryna Gurevych, 2007. Analysis of the Wikipedia Category Graph for NLP Applications, in Proceedings of the TextGraphs-2 Workshop (NAACL-HLT).
  15. Vinko Zlatic, Miran Bozicevic, Hrvoje Stefancic, and Mladen Domazet. 2006. Wikipedias: Collaborative Web-based Encyclopedias as Complex Networks, in Physical Review E, 74:016115.