Since about 2005 — and at an accelerating pace — Wikipedia has emerged as the leading online knowledge base for conducting semantic Web and related research. The system is being tapped for both data and structure. Wikipedia has arguably replaced WordNet as the leading lexicon for concepts and relations. Because of its scope and popularity, many argue that Wikipedia is emerging as the de facto structure for classifying and organizing knowledge in the 21st century.
Our work on the UMBEL lightweight reference subject concept structure has stated since the project’s announcement in July 2007 that Wikipedia is a key intended resource for identifying subject concepts and entities. For the past few months I have been scouring the globe for every drop of research I could find on the use of Wikipedia for the semantic Web, information extraction, categorization and related issues.
Thus, I’m pleased to offer up herein the most comprehensive such listing available anywhere: more than 99 resources and counting! (I say “more than” because some entries below have multiple resources; I just liked the sound of 99 as a round number!)
Wikipedia itself maintains a listing of academic studies using Wikipedia as a resource; fewer than one-third of the listings below are on that list (which itself may be an indication of the current state of completeness within Wikipedia). Some bloggers and other sources around the Web also maintain listings, in varying degrees of completeness.
The tremendous growth of content and topics within Wikipedia is well documented (see, as examples, the W1, W2, W3, W4, W5, W6 and W7 internal Wikipedia sources for the gory details); as of early 2008 it counts about 2.25 million articles in English, with versions in 256 languages and variants.
Download access to the full knowledge base has enabled the development of notable core references to the Linked Data aspects of the semantic Web such as DBpedia [5,6] and YAGO [72,73]. Entire research teams, such as Ponzetto and Strube [61-65] (and others as well; see below), are moving toward creating full-blown ontologies or structured knowledge bases useful for semantic Web purposes based on Wikipedia. So, one of the first and principal uses of Wikipedia to date has been as a data source of concepts, entities and relations.
But much broader data mining, text mining and analysis is being conducted against Wikipedia, work that is currently defining the state of the art in these areas, too:
These objectives, in turn, rely on mining and extracting various kinds of structure within Wikipedia:
These are some of the specific uses that are included in the 99 resources listed below.
This is an exciting (and, for most all of us just a few years back, unanticipated) use of the Web in socially relevant and contextual knowledge and research. I’m sure such a listing one year from now will be double in size or larger!
BTW, suggestions for new or overlooked entries are very much welcomed!
I still never cease to be amazed at how wonderful and powerful tools are so often and easily overlooked. The most recent example is Cytoscape, a winner in our recent review of more than 25 tools for large-scale RDF graph visualization.
We began this review because the UMBEL subject concept “backbone” ontology will involve literally thousands of concepts. Graph visualization software suitable to very large graphs would aid UMBEL’s construction and refinement.
Cytoscape describes itself as a bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data. Cytoscape is partially based on GINY and Piccolo, among other open-source toolkits. What is more important to our immediate purposes, however, is that its design also lends itself well to general network and graph manipulation.
Cytoscape was first brought to our attention by François Belleau of Bio2RDF.org. Thanks François, and also for the strong recommendation and tips. Special thanks are also due to Frédérick Giasson of Zitgist for his early testing and case examples. Thanks, Fred!
We had a number of requirements and items on our wish list prior to beginning our review. We certainly did not expect most or all of these items to be met:
Cytoscape met or exceeded our wish list in all areas save one: it does not support direct ingest of RDF (other than some pre-set BioPAX formats). However, that proved to be no obstacle because of the clean input format support of the tool. Simple parsing of triples into a CSV file is sufficient for input. Moreover, as described below, there are other cool attribute management functions that this clean file format supports as well.
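As a sketch of the kind of pre-processing involved, the conversion from triples to Cytoscape’s delimited input amounts to writing each triple as a source/interaction/target row. The file names and the naive N-Triples parsing below are my own assumptions, not part of the Cytoscape distribution:

```python
import csv

def triples_to_csv(ntriples_path, csv_path):
    """Convert simple N-Triples lines (subject predicate object .) into a
    Cytoscape-friendly CSV edge list with source, interaction, target columns."""
    with open(ntriples_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["source", "interaction", "target"])
        for line in src:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Naive split: adequate for URIs and literals without embedded spaces
            s, p, o = line.rstrip(" .").split(None, 2)
            writer.writerow([s.strip("<>"), p.strip("<>"), o.strip('<>"')])
```

The resulting CSV can then be loaded through Cytoscape’s standard table import dialog.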
The following screen shot shows the major Cytoscape screen. We will briefly walk through some of its key views (click for full size):
This Java tool has a fairly standard Eclipse-like interface and design. The main display window (A) shows the active portion of the current graph view. (Note that in this instance we are looking at a ‘Spring’ layout for the same Music sub-graph presented above.) Selections can easily be made in this main display (the red box) or by directly clicking on a node. The display itself represents a zoom (B) of the main UMBEL graph, which can also be easily panned (the blue box on B) or itself scaled (C). Those items that are selected in the main display window also appear as editable nodes or edges and attributes in the data editing view (D).
The appearance of the graph is fully editable via the VizMapper (E). An interesting aspect here is that every relation type in the graph (its RDF properties, or predicates) can be visually displayed in a different manner. The graphs or sub-graphs themselves can be selected, but also most importantly, the display can respond to a very robust and flexible filtering framework (F). Filters can be easily imported and can apply to nodes, edges (relations), the full graph or other aspects (depending on plugin). A really neat feature is the ability to search the graph in various flexible ways (G), which alters the display view. Any field or attribute can be indexed for faster performance.
In addition to these points, Cytoscape supports the following features:
The Cytoscape project also offers:
Unfortunately, other than these official resources, there appears to be a dearth of general community discussion and tips on the Web. Here’s hoping that situation soon changes!
There is a broad suite of plugins available for Cytoscape, and directions to developers for developing new ones.
The master page also includes third-party plugins. The candidates useful to UMBEL and its graphing needs — also applicable to standard semantic Web applications — appear to be:
Importantly, please note there is a wealth of biology- and molecular-specific plugins also available that are not included in the generic listing above.
Our initial use of the tool suggests some use tips:
Cytoscape was first released in 2002 and has undergone steady development since. Most recently, the 2.x and especially 2.3 versions forward have seen a flurry of general developments that have greatly broadened the tool’s appeal and capabilities. It was perhaps only these more recent developments that have positioned Cytoscape for broader use.
I suspect another reason that this tool has been overlooked by the general semWeb community is the fact that its sponsors have positioned it mostly in the biological space. Their short descriptor for the project, for example, is: Cytoscape is an open source bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data. That statement hardly makes it sound like a general tool!
Another reason for the lack of attention, of course, is the common tendency for different disciplines not to share enough information. Indeed, one reason for my starting the Sweet Tools listing was to overcome such artificial boundaries by assembling relevant semantic Web tools in one central place.
Yet despite the product’s name and its positioning by sponsors, Cytoscape is indeed a general graph visualization tool, and arguably the most powerful one reviewed from our earlier list. Cytoscape can easily accommodate any generalized graph structure, is scalable, provides all conceivable visualization and modeling options, and has a clean extension and plugin framework for adding specialized functionality.
With just minor tweaks or new plugins, Cytoscape could directly read RDF and its various serializations, could support processing any arbitrary OWL or RDF-S ontology, and could support other specific semWeb-related tasks. As well, a tool like CPath (http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1660554), which enables querying of biological databases and then storing them in Cytoscape format, offers some tantalizing prospects for a general model for other Web query options.
For these reasons, I gladly announce Cytoscape as the next deserving winner of the (highly coveted, but cheesy! ) AI3 Jewels & Doubloons award.
Cytoscape’s sponsors — the U.S. National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH), the U.S. National Science Foundation (NSF) and Unilever PLC — and its developers — the Institute for Systems Biology, the University of California – San Diego, the Memorial Sloan-Kettering Cancer Center, L’Institut Pasteur and Agilent Technologies – are to be heartily thanked for this excellent tool!
This new WP version adds enhanced support for tagging, among other features (see Aaron Brazell’s 10 things you should know about this release).
I will be doing my own testing when a stable release is issued, but this is good news for this popular plug-in. You may get version 0.5 of Advanced TinyMCE from here.
The essence of the Web is the link. We use it to navigate, discover, form communities and get high rankings (or not!) for our Web pages on search engines. But, each link carries much more behind it than what has generally been exposed. That is, until now . . . .
Frédérick Giasson is a pragmatic innovator of the structured Web and semantic Web. Most recently, his efforts have included Ping the Semantic Web (that aggregates RDF published on the Web), the Zitgist semantic Web browser (that enables that RDF data to be viewed in useful ways), TalkDigger (for finding and sharing topical Web discussions), and efforts on a variety of ontologies, including jointly with me on UMBEL.
I have been an aggressive “linker” for some time and try to refer to Wikipedia often for definitions or background as well. Thus, Fred’s most recent efforts to continue to add value to the link as the basic coin of the Web realm really caught my eye.
In the early days of the Web, links were used solely to visit specific Web pages or locations within those documents. Somewhat later, actions such as searching or purchasing items could be associated with a link. Most recently, with the emergence of the semantic Web, the very nature of the link has become ambiguous, potentially representing any of the link’s former uses or either direct or indirect references to data and resources.
Thus, we see that links can fulfill three different purposes, in rough order of their emergence:
The emergence of linked data and the semantic Web (or at least the provision of data via the structured Web) are making the use of the link more complicated and ambiguous. Moreover, sometimes a link is an indirect reference to where data exists, and not the actual resource itself.
What Zitgist’s zLinks does is to make these uses explicit and to remove their ambiguities. Further, if a link is not to an actual resource but only a reference to it, zLinks resolves to the link’s correct destination. And, still further, a zLinks link is the gateway to still additional links from its reference destination, making the service a powerful jumping-off point in the true spirit of the interlinked Web.
To my knowledge, zLinks is the first and purest implementation of what Kingsley Idehen has termed the “enhanced anchor” or <a++>. RDFa and embedded RDF have similar objectives but are not premised on resolving the existing link.
Like the SIOC Import Plug-in, which imports SIOC metadata into a WordPress blog, the zLinks tool recognizes the importance of standard blogging software and automated background tools to expose data and capabilities. Since WordPress has many hundreds of thousands of site owners and bloggers — not to mention hundreds of millions of visitors — zLinks could be an important first exposure for many to the real power of linking and the semantic Web.
As a site owner, zLinks works identically to other plug-ins: simply install it and then it works smoothly and easily.
As a site user who might encounter a zLinks icon in a WordPress blog, all you need to do is mouse over the zLinks launcher icon at the end of any visible link. You will first get an alert that the system is working, retrieving all of the necessary background link information. You will then get a popup showing the results, similar to this one for my own AI3 blog:
The zLinks popup offers direct and related links, with the icons and other associated information an indicator as to the nature of the link and its purpose. In our example case, I click on my name reference, which brings up my FOAF file in the Zitgist browser:
Note how picture, mapping and other information is automatically “meshed” with my FOAF file. From this Zitgist browser location, I could obviously continue to explore still further links and relationships. In this manner, zLinks adds an entirely new dynamic dimension to the concept of ‘interlinking.’
If the initial zLinks link references data, that data is now resolved to its proper direct location, and is presented as RDF with further meshing and manipulation available. Other resources may take you directly to a Web page or perform other actions. Some of those actions, for example, may be to format data results in specific views (timelines, maps, charts, tables, graphs, structured reports, etc.). If the sources are data, the ability to make transformations or present the data in various views opens a rich horizon of options.
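The mechanics behind that resolution step are essentially linked-data content negotiation: the client asks the server for a data representation of the resource rather than an HTML page. A minimal sketch of building such a request (the target URI is hypothetical):

```python
from urllib.request import Request

def rdf_request(uri):
    """Build an HTTP request that asks for RDF rather than HTML.
    A linked-data-aware server will respond with (or redirect to)
    the data representation of the resource."""
    return Request(uri, headers={"Accept": "application/rdf+xml"})

# Hypothetical resource URI for illustration only
req = rdf_request("http://example.org/people/fred")
```

Sending such a request (e.g., via `urllib.request.urlopen`) against a linked-data endpoint would return RDF suitable for further meshing.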
I made some minor tweaks to the Zitgist distribution as provided. First, I replaced the initial link icon with a smaller one more in keeping with my local WordPress theme. I did this simply by replacing the mini_rdf.gif image in the /public_html/wp-content/plugins/zitgist-browser-linker/imgs/ directory.
Then, also in keeping with my local theme, I made the text in the popup a bit smaller. I did this simply by adding a font-size: 80%; property to the style.css stylesheet in the /public_html/wp-content/plugins/zitgist-browser-linker/css/ directory.
And, that was it! Simple and sweet.
It is also important to realize that this is just a first-release prototype. Some initial bugs have been discovered and worked out, sometimes the server site is down, and longer-term potentialities are only now beginning to emerge. But, this is still professional software with much thought behind it and much potential in front of it. If it breaks, so what? It is free and it is fun.
To all of you out there new to RDF and structured, linked data, I say: Play and enjoy!
zLinks is only beginning to touch the tip of the iceberg. It is pretty clear that the use and usefulness of links are only now being understood. Harking back to the original listing of three possible uses for a link, it is clear that “actions” and the use of the link itself as a referrer and “mini-banner” on the Web are still not appreciated, let alone exploited.
It is interesting that AdaptiveBlue has also come out with a SmartLinks approach that differs somewhat from the Zitgist approach (items and linkages are constructed and then referred to from a central location), but their screenshot does affirm the untapped potential of links.
The W3C semantic Web community continues to grapple with resource/link terminology and nuances, the implications of which will be deferred to another day and another blog entry. However, suffice it to say that with a growing ‘Web of data’ and linked data, not to mention the original document vision and then one of commerce and services, the once lowly link is growing mighty indeed!
David Huynh, a Ph.D. grad student developer par excellence from MIT’s Simile program, has just announced the beta availability of Potluck. Potluck allows casual users to mashup data on the Web using direct manipulation and simultaneous editing techniques, generally (but not exclusively!) based on Exhibit-powered pages.
Besides Potluck and Exhibit, David has also been the lead developer on such innovative Simile efforts as Piggy Bank, Timeline, Ajax, Babel, and Sifter, as well as a contributor to Longwell and Solvent. Each merits a look. Those familiar with these other projects will notice David’s distinct interface style in Potluck.
There is a helpful 6-min movie on Potluck that gives a basic overview of use and operation. I recommend you start here. Those who want more details can also read the Potluck paper in PDF, just accepted for presentation at ISWC 2007. And, after playing with the online demo, you can also download the beta source code directly from the Simile site.
Please note that Firefox is the browser of choice for this beta; Internet Explorer support is limited.
To invoke Potluck, you simply go to the demo page, enter two or more appropriate source URLs for mashup, and press Mix Data:
(You can also get to the movie from this page.)
Once the datasets are loaded, all fields from the respective sources are rendered as field tags. To combine different fields from different datasets, the respective field tags (color coded by dataset) to be matched are simply dragged to a new column. Differences in field value formats between datasets can be edited with an innovative approach to simultaneous group editing (see below). Once fields are aligned, they then may be assigned as browsing facets. The last step in working with the Potluck mashup is choosing either tabular or map views for the results display.
Potluck is designed to mashup existing Exhibit displays (JSON format), and is therefore lightweight in design. (Generally, Exhibit should be limited to about 500 data records or so per set.)
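For readers unfamiliar with the Exhibit JSON format that Potluck reads natively, it is essentially an object whose `items` array holds the records, each with a `label` plus arbitrary property fields. A minimal example generated in Python (the field names and values are illustrative only):

```python
import json

# An Exhibit-style dataset: an "items" array of records, where each
# record carries a "label" and any number of property fields.
dataset = {
    "items": [
        {"label": "George Washington", "birth-place": "Virginia", "type": "President"},
        {"label": "John Adams", "birth-place": "Massachusetts", "type": "President"},
    ]
}

exhibit_json = json.dumps(dataset, indent=2)
```

A file like this, published alongside an Exhibit page, is exactly the kind of lightweight source Potluck can pull into a mashup.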
However, with the addition of the appropriate type name when specifying one of the sources to mash up, you can also use spreadsheet (xls), BibTeX, N3 or RDF/XML formats. The demo page contains a few sample data links. Additional sample data files for different MIME types are listed below (note that each entry uses a space followed by a type designator at the end):
Besides the standard tabular display, you can also map results. For example, use the BibTeX example above and drop the “address” field into the first drop target area. Then, choose Map at the top of the display to get a mapping of conference locations.
In my own case, I mashed up this source and the xls sample on presidents, and then plotted out location in the US:
Given the capabilities in some of the other Simile tool sets, incorporating timelines or other views should be relatively straightforward.
Different datasets name similar or identical things differently and characterize their data differently. You can’t combine data from different datasets without resolving these differences. These various heterogeneities — which by some counts can be 40 or so classes of possible differences — were tabulated in one of my recent structured Web posts.
There has been considerable discussion in recent days on various ontology and semantic Web mailing lists about whether certain practices may or may not solve questions of semantic matching. Some express sentiments that proper use of URIs, use of similar namespaces and use of some predicates like owl:sameAs may largely resolve these matters.
However, discussion in David’s ISWC 2007 paper and use of the Potluck demo readily show the pragmatic issues in such matches. Section 2 in the paper presents a readable scenario for real-world challenges in how a historian without programming skills would go about matching and merging data. Despite best practices, and even if all are pursued, actually “meshing” data together from different sources requires judgment and reconciliation. One of the great values of Potluck is as a heuristic and learning tool for making prominent these real-world semantic heterogeneities.
The complementary value of Potluck is its innovative interface design for actually doing such meshing. Potluck is a case argument that pragmatic solutions and designs only come about by just “doing it.”
(Note: Though a diagram illustrates some points below, it is no substitute for using Potluck yourself.)
Potluck uses a simple drag-and-drop model for matching fields from different datasets. In the left-hand oval in the diagram below, the user clicks on a field name in a record, drags it to a column, and then repeats that process for matching fields in a record of a different dataset. In the instance below, we are matching the field names of “address” and “birth-place”, which then also get color coded by dataset:
This process can be repeated for multiple field matches. The merged fields themselves can be subsequently dragged-and-dropped to new columns for renaming or still further merging.
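In data terms, this drag-and-drop merge amounts to mapping two source field names onto one target field. A rough sketch of that alignment (the record sets and field names are illustrative, not Potluck’s internal API):

```python
def merge_fields(records_a, records_b, field_a, field_b, merged_name):
    """Combine two record sets, renaming field_a (in set A) and
    field_b (in set B) to a single merged field name."""
    merged = []
    for rec in records_a:
        out = dict(rec)
        if field_a in out:
            out[merged_name] = out.pop(field_a)
        merged.append(out)
    for rec in records_b:
        out = dict(rec)
        if field_b in out:
            out[merged_name] = out.pop(field_b)
        merged.append(out)
    return merged

people = [{"name": "Adams", "birth-place": "Massachusetts"}]
venues = [{"name": "ISWC 2007", "address": "Busan, Korea"}]
combined = merge_fields(people, venues, "birth-place", "address", "place")
```

Once aligned under a common name, the merged field can serve as a browsing facet or be merged again with further fields.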
The core innovation at the heart of Potluck is what happens next. By clicking on Edit for any record in a merged field, the dialog shown above pops up. This dialog supports simultaneous group editing based on LAPIS, another MIT tool for editing text with lightweight structure developed by Rob Miller and team.
The first grouping mostly ensures that data formatted differently in different datasets are displayed in their own column. One data form is used for the merged field, and all other columns are group edited to conform. The actual patterns are based on runs of digits, letters, white spaces, or individual punctuation marks and symbols, which are then “greedy” aligned for first the column grouping and then for cursor alignment within columns on highlighted patterns.
The net result is very fast and efficient bulk editing. This approach points the way to more complicated pattern matches and other substitution possibilities (such as unit changes or date and time formats).
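That pattern detection can be approximated by splitting each value into runs of digits, runs of letters, whitespace runs, and individual punctuation characters, which then serve as the alignment units. A minimal sketch of such a tokenizer (my own simplification, not the LAPIS implementation):

```python
import re

def runs(value):
    """Split a string into runs of digits, runs of letters, whitespace runs,
    and single punctuation marks or symbols -- the lightweight structure
    that drives column grouping and cursor alignment during group editing."""
    return re.findall(r"\d+|[A-Za-z]+|\s+|.", value)
```

Aligning the run sequences of, say, “May 5, 1999” and “1999-05-05” makes it clear where a bulk edit must reorder or reformat components.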
I was tempted to award Potluck one of AI3‘s Jewels and Doubloons Awards, but the tool is still premature with rough spots and gaps. For example, IE and other browser support needs to be improved; and it would be helpful to be able to delete a record from inclusion in the mashup. (Sometimes only after combining is it clear some records don’t belong together.)
Another big issue is that whole classes of functionality, such as writing out combined results or more data view options, are missing.
Of course, this code is not claimed to be commercial grade. What is most important is its pathbreaking approach to semantic mashups (actually, what some others such as Jonathan Lathem have called ‘smashups’) and interfaces and approaches to group editing techniques.
I hope that others pick up on this tool in earnest. David Huynh is himself getting close to completing his degree and may not have much time in the foreseeable future to continue Potluck development. Besides Potluck’s potential to evolve into a real production-grade utility, I think its potential to act as a learning test bed for new UI approaches and techniques for resolving semantic heterogeneities is even greater.