This Week Marks Two New Milestones on the Road to the Semantic Web
Encyclopedia of Life
EOL is meant to be the Wikipedia of all 1.8 million known living species, backed with real money and real prestige. The effort joins a growing list of impressive and open data sources being compiled and presented on the Web.
The Encyclopedia of Life is a planned 10-year effort to create a free Internet resource to catalog and describe every one of the planet’s organisms. Initially seeded with $12.5 million in backing from two US philanthropies, the John D. and Catherine T. MacArthur Foundation and the Alfred P. Sloan Foundation, the effort is anticipated to cost $100 million by completion.
EOL is based on an idea by Harvard biologist and noted author Edward O. Wilson, a 2007 TED winner for the idea (as is better seen and explained in his acceptance video). Other initial project backers are Harvard University, the Smithsonian Institution, Missouri Botanical Garden, the Biodiversity Library Project with its international participants, and the Chicago Field Museum of Natural History.
Each species will get its own Web page with a structured record, similar to the “infoboxes” within Wikipedia. Information will include photos, technical name and phylogeny, lay description, range maps, and place within the Tree of Life. More prominent species will also get information on their genetics and behavior, along with compiled scientific research articles, some from centuries ago.
EOL is two and one-half orders of magnitude more ambitious than the earlier Tree of Life Web Project from the University of Arizona with its 5000 pages, and 20 times more ambitious than Wikipedia’s own Wikispecies.
Potential applications range from conservation to mapping to identifying the odd plant or animal species. But, as has been found with important predecessor data sets such as GenBank (65 billion genetic sequences for 100,000 organisms), FishBase (30,000 species) or Conabio (biodiversity in Mexico), among literally hundreds of others, popularity and use can exceed the wildest expectations. Conservationists, researchers, school children, amateur naturalists, and outdoor lovers will all find use for the site.
Linking Open Data
EOL is but the newest poster child of massive and hugely important datasets being exposed on the Internet. One initiative whose sole purpose is to promote this trend, and the interoperability of the data using RDF, is the Linking Open Data project. The LOD project is part of the W3C’s Semantic Web Education and Outreach interest group (yes, the unwieldy SWEO-IG).
The LOD project is holding its first meetings ever this week at WWW2007 in Banff, Alberta. Though a mere six months old, the SWEO-IG is stimulating much broader interest in the semantic Web through sponsored products, FAQs (also announced yesterday, and really got dugg!), use cases, supporting information, and growing and useful compilations such as datasets and semantic Web tools.
Anyone interested in the semantic Web should really be familiar with the ESW wiki (which unfortunately is in need of re-organization and clean-up; but keep poking, there are many riches hidden in the nooks and crannies!). You may also want to track the public mailing list of the Linking Open Data group.
I already wrote extensively on the exciting DBpedia initiative (Did you Blink? The Structured Web Just Arrived), itself one of the catalytic datasets at the core of the Linking Open Data efforts. Other datasets actively being pursued for inclusion by the group include:
- Geonames data, including its ontology and 6 million geographical places and features, including implementation of RDF data services
- 700 million triples of U.S. Census data from Josh Tauberer
- Revyu.com, the RDF-based reviewing and rating site, including its links to FOAF, the Review Vocab and Richard Newman's Tag Ontology
- The "RDFizers" from MIT Simile Project (not to mention other tools), plus 50 million triples from the MIT Libraries catalog covering about a million books
- GEMET, the GEneral Multilingual Environmental Thesaurus of the European Environment Agency
- And, WordNet through the YAGO project, and its potential for an improved hierarchical ontology to the Wikipedia data.
Additional candidate datasets of interest have also been identified by the SWEO interest group on this page:
- Gene Ontology database and its 6 million annotations
- Gene fruitfly embryogenesis images from the Berkeley Drosophila Genome Project
- Roller Blog entries using the Atom/OWL vocabulary
- Various semantic Web interest group and conference materials
- Various FOAF-enabled profiles
- The UniProt protein database with its 300 million triples
- OpenGuides, a network of wiki-based city guides
- dbtune, an RDF-enabled version of the Magnatune music database using the Music Ontology
- The SKOS Data Zone
- Other MIT Simile data collections, including the CIA's World Factbook, Library of Congress' Thesaurus of Graphic Materials, National Cancer Institute's cancer thesaurus, W3C's technical reports
- The RDF version of the DMOZ Open Directory Project
- GovTrack.us of U.S. Congress legislator and voting records
- Chef Moz restaurant and review guides from DMOZ
- DOAP Store and its DOAP project descriptions.
The SWEO-IG and its Linking Open Data initiative are catalyzing a new phase of excitement with relevant information, data and demos.
If anything, all I can say is: Set your sights higher! There’s data galore that needs to be “RDFized”.
So, folks, keep up the great work! And good luck with this week’s meetings in the beauty of Canada.
[BTW, please make sure that EOL makes its data available in RDF!]
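To make that request a bit more concrete, here is a minimal sketch of how one species record could be "RDFized" as N-Triples and linked to another open dataset. Everything EOL-specific in it is an assumption made up for illustration: the `example.org` namespaces, the property names, the record fields and the `species_to_ntriples` helper are hypothetical, and the DBpedia URI is only an illustrative link target. Only `owl:sameAs` is an existing W3C term. A real effort would more likely use an RDF library and an agreed vocabulary.

```python
# Hypothetical namespaces: the article does not describe any EOL vocabulary.
EOL = "http://example.org/eol/"
VOCAB = "http://example.org/eol/vocab#"
# owl:sameAs is a real W3C term, commonly used to link records across datasets.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Format a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def species_to_ntriples(record):
    """Turn a simple species record (a dict) into a list of N-Triples lines."""
    subject = uri(EOL + "species/" + record["id"])
    lines = []
    # Descriptive fields become literal-valued triples.
    for field in ("scientificName", "commonName", "description"):
        if field in record:
            predicate = uri(VOCAB + field)
            lines.append(f"{subject} {predicate} {literal(record[field])} .")
    # Links to other datasets become URI-valued owl:sameAs triples.
    for target in record.get("sameAs", []):
        lines.append(f"{subject} {uri(OWL_SAME_AS)} {uri(target)} .")
    return lines


# Illustrative record; the identifier and link target are assumptions.
record = {
    "id": "danaus-plexippus",
    "scientificName": "Danaus plexippus",
    "commonName": "Monarch butterfly",
    "sameAs": ["http://dbpedia.org/resource/Monarch_butterfly"],
}

for line in species_to_ntriples(record):
    print(line)
```

The `owl:sameAs` line is the part that matters for Linking Open Data: it is what would let a species page join up with DBpedia, Geonames and the rest, rather than sit in its own silo.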