I’m pleased to present a timeline of 100 or so of the most significant events and developments in the innovation and management of information and documents from cave paintings ( ca 30,000 BC) to the present. Click on the link to the left or on the screen capture below to go to the actual interactive timeline.
This timeline has fast and slow scroll bands — including bubble popups with more information and pictures for each of the entries offered. (See the bottom of this posting for other usage tips.)
Note that the timeline presents only non-electronic innovations and developments, from alphabets and writing to printing and the conventions for organizing information. Because so many innovations are concentrated in the last 100 years, digital and electronic communications are somewhat arbitrarily excluded from the listing.
I present below some brief comments on why I created this timeline, some caveats about its contents, and some basic use tips. I conclude with thanks to the kind contributors.
Readers of this AI3 blog or my detailed bio know that information — whether biological, embodied in genes, or cultural, embodied in human artefacts — has been my lifelong passion. I enjoy making connections between the biological and the cultural with respect to human adaptivity and future prospects, and I like to dabble on occasion as an amateur economic or information science historian.
About 18 months ago I came across David Huynh‘s nifty Exhibit lightweight data display widget, gave it a glowing review, and then proceeded to convert my growing Sweet Tools listing of semantic Web and related tools to that format. Exhibit still powers the listing (which I just updated yesterday for the twelfth time or so).
At the time of first rolling out Exhibit I also noted that David had earlier created another lightweight timeline display widget that looked similarly cool (and which was also the first API for rendering interactive timelines in Web pages). (In fact, Exhibit and Timeline are but two of the growing roster of excellent lightweight tools from David.) Once I completed adopting Exhibit, I decided to find an appropriate set of chronological or time-series data to play next with Timeline.
I had earlier been ruminating on one of the great intellectual mysteries of human development: Why, roughly beginning in 1820 to 1850 or so, did the historical economic growth patterns of all prior history suddenly take off? I first wrote on this about two years ago in The Biggest Disruption in History: Massively Accelerated Growth Since the Industrial Revolution, with a couple of follow-ups and expansions since then.
I realized that, in developing my thesis that wood pulp paper and mechanized printing were the key drivers for this major inflection change in growth (as they effected literacy and broadscale access to written information), I already had the beginnings of a listing of various information innovations throughout history. So, a bit more than a year ago, I began adding to that list in terms of how humans learned to write, print, share, organize, collate, reproduce and distribute information, and when those innovations occurred.
There are now about 100 items in this listing (I’m still looking for and researching others; please send suggestions at any time). Here are some of the current items in chronological order from upper left to lower right:
calendars | tree diagram | encyclopedia | pencil (mass produced)
cuneiform | quill pen | capitalization | rotary perfection press
papyrus (paper) | library catalog | magazines | catalogues
hieroglyphs | movable type | taxonomy (binomial classification) | typewriter
alphabet | paper (rag) | timeline | chemical pulp (sulfite)
Phaistos Disc | word spaces | data graphs | classification (Dewey)
scrolls | printing press | punch cards | kraft process (pulp)
manuscripts | advertising (poster) | steam-powered (mechanized) papermaking | flexography
glossaries | bookbinding | book (machine-paper) | classification (LoC)
dictionaries | pagination | chemical symbols | classification (UDC)
parchment (paper) | punctuation | mechanical pencil | offset press
bibliographies | library catalog (printed) | chromolithography | screenprinting
concept of categories | public lending library | paper (wood pulp) | ballpoint pen
library | dictionaries (alphabetic) | rotary press | xerographic copier
classification system (library) | newspapers | mail-order catalog | hyperlink
zero | information graphics | fountain pen | metadata (MARC)
So, off and on, I have been working with and updating the data and display of this timeline in draft. (I may someday also post my notes about how to effectively work with the Timeline widget.)
With the listing above, the compilation felt sufficiently complete to finally post this version. One of the neat things about Timeline is the ability to drive the display from a simple XML listing. I will update the timeline when I next have an opportunity to fill in some of the items still missing from my innovations list, such as alphabetization, citations, and tables of contents, among many others.
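As a hedged sketch of that idea, an XML event listing of the kind a SIMILE Timeline display consumes might be generated like this. The two sample entries, their dates, and the date formats are illustrative stand-ins, not the actual timeline data:

```python
import xml.etree.ElementTree as ET

# Build a minimal <data><event .../></data> listing of the general shape
# Timeline reads. Entries here are illustrative, not the real dataset.
data = ET.Element("data")
for start, title, note in [
    ("-30000", "cave paintings", "Earliest known information artifacts."),
    ("1844", "paper (wood pulp)", "Wood pulp enables cheap mass printing."),
]:
    event = ET.SubElement(data, "event", start=start, title=title)
    event.text = note  # shown in the bubble popup

xml_listing = ET.tostring(data, encoding="unicode")
print(xml_listing)
```

Because the widget reads this file directly, updating the timeline is just a matter of editing or regenerating the XML.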
Of course, rarely can an innovation be traced to a single individual or a single moment in time. Historians are increasingly documenting the cultural milieu and multiple individuals that affect innovation.
In these regards, then, a timeline such as this one is simplistic and prone to much error and uncertainty. We have no real knowledge, for example, of the precise times certain historical innovations occurred, and others (the ballpoint pen being one case in point) are a matter of interpretation as to what and when constituted the first expression. For instances where the record indicates multiple dates, I chose to use the date of release to the public.
Nonetheless, given the time scales here of more than 30,000 years, I do think broad trends and rough time frames can be discerned. As long as one interprets this timeline as indicative, and not as definitive in any scholarly sense, I believe it can inform and provide some insight and guidance for how information has evolved over human history.
The operation of Timeline is pretty straightforward and intuitive. Here are a couple of tips to get a bit more out of playing with it:
For the sake of consistency, nearly all entries and pictures on the timeline are drawn from the respective entries within Wikipedia. Subsequent updates may add to this listing by reference to original sources, at which time all sources will be documented.
The fantastic Timeline was developed by David Huynh while he was a graduate student at MIT. Timeline and its sibling widgets were developed under funding from MIT’s Simile program. Thanks to all in the program and best wishes for continued funding and innovation.
Finally, my sincere thanks go to Professor Michael Buckland of the School of Information at the University of California, Berkeley, for his kind suggestions, input and provision of additional references and sources. Of course, any errors or omissions are mine alone. I also thank Professor Buckland for his admonitions about use and interpretation of the timeline dates.
The recent LinkedData Planet conference in NYC marked, I think, a real transition point. The conference signaled the beginning movement of the Linked Data approach from the research lab to the enterprise. As a result, there was something of a schizophrenic aspect at many different levels to the conference: business and research perspectives; realists and idealists; straight RDF and linked data RDF; even the discussions in the exhibit area versus some of the talks presented from the podium.
As with any new concept, my sense was of a struggle around terminology and common language, and of the need to bridge different perspectives and world views. As in all human matters, communication and dialog were at the core of the attendees’ attempts to bridge gaps and find common ground. Based on what I saw, much great progress occurred.
The reality, of course, is that Linked Data is still very much in its infancy, and its practice within the enterprise is just beginning. Much of what was heard at the conference was theory versus practice and use cases. That should and will change rapidly.
In an attempt to help move the dialog further, I offer a definition and Zitgist’s perspective on some of the questions posed in one way or another during the conference.
Sources such as the four principles of Linked Data in Tim Berners-Lee’s Design Issues: Linked Data and the introductory statements on the Linked Data Wikipedia entry approximate — but do not completely express — an accepted or formal or “official” definition of Linked Data per se. Building from these sources and attempting to be more precise, here is the definition of Linked Data used internally by Zitgist:
All references to Linked Data below embrace this definition.
I’m sure many other questions were raised, but listed below are some of the more prominent ones I heard in the various conference Q&A sessions and hallway discussions.
Yes. Though other approaches can also model the first-order predicate logic of subject-predicate-object at the core of the Resource Description Framework data model, RDF is the one based on the open standards of the W3C. RDF and FOL are powerful because of their simplicity, their ability to express complex schema and relationships, and their suitability for modeling all extant data frameworks for unstructured, semi-structured and structured data.
No. Linked Data represents a set of techniques applied to the RDF data model that names all objects as URIs and makes them accessible via the HTTP protocol (as well as other considerations; see the definition above and further discussion below).
Some vendors and data providers claim Linked Data support, but if their data is not accessible via HTTP using URIs for data object identification, it is not Linked Data. Fortunately, it is relatively straightforward to convert non-compliant RDF to Linked Data.
There are some excellent references for how to publish Linked Data. Examples include a tutorial, How to Publish Linked Data on the Web, and a white paper, Deploying Linked Data, using the example of OpenLink’s Virtuoso software. There are also recommended approaches and ways to use URI identifiers, such as the W3C’s working draft, Cool URIs for the Semantic Web.
However, there are not yet published guidelines for how also to meet the Zitgist definition above, with its added emphasis on class and context matching. A number of companies and consultants, including Zitgist, presently provide such assistance.
The key principles, however, are to make links aggressively between data items with appropriate semantics (properties or relations; that is, the predicate edges between the subject and object nodes of the triple) using URIs for the object identifiers, all being exposed and accessible via the HTTP Web protocol.
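Those key principles can be sketched in a few lines of Python. This is a toy illustration, not a normative compliance test: the DBpedia URI is a real Linked Data identifier, while the GeoNames URI and the check itself are shown for illustration only:

```python
from urllib.parse import urlparse

# One triple asserting an equivalence link between two data sources,
# with every term named by an HTTP URI (the GeoNames URI is illustrative).
triples = [
    ("http://dbpedia.org/resource/Quebec_City",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://sws.geonames.org/6325494/"),
]

def is_linked_data_style(triple):
    """Check that every term is an HTTP URI, per the principles above."""
    return all(urlparse(term).scheme == "http" for term in triple)

print(all(is_linked_data_style(t) for t in triples))
```

The point is only that identification (URIs), access (HTTP) and linkage (the predicate) are all visible in a single triple.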
Absolutely not, though this is a source of some confusion at present.
The Semantic Web is probably best understood as a vision or goal where semantically rich annotation of data is used by machine agents to make connections, find information or do things automatically in the background on behalf of humans. We are on a path toward this vision or goal, but under this interpretation the Semantic Web is more of a process than a state. By understanding that the Semantic Web is a vision or goal we can see why a label such as ‘Web 3.0’ is perhaps simplistic and incomplete.
Linked Data is a set of practices somewhere in the early middle of the spectrum from the initial Web of documents to this vision of the Semantic Web. (See my earlier post at bottom for a diagram of this spectrum.)
Linked Data is here today, doable today, and pragmatic today. Meaningful semantic connections can be made and there are many other manifest benefits (see below) with Linked Data, but automatic reasoning in the background or autonomic behavior is not yet one of them.
Strictly speaking, then, Linked Data represents doable best practices today within the context both of Web access and of this yet unrealized longer-term vision of the Semantic Web.
Definitely not, though early practice has been interpreted by some as such.
One of the stimulating, but controversial, keynotes of the conference was from Dr. Anant Jhingran of IBM, who made the strong and absolutely correct observation that Linked Data requires the interplay and intersection of people, instances and schema. From his vantage, early exposed Linked Data has been dominated by instance data from sources such as Wikipedia and has lacked the schema (class) relationships that enterprises are based upon. The people aspect, in terms of connections, collaboration and joint buy-in, is also the means for establishing trust in and authority for the data.
In Zitgist’s terminology, class-level mappings ‘explode the domain’ and produce information benefits similar to Metcalfe’s Law as a function of the degree of class linkages. While this network effect is well known to the community, it has not yet been shown much in current Linked Data sets. As Anant pointed out, schemas define enterprise processes and knowledge structures. Demonstrating schema (class) relationships is the next appropriate task for the Linked Data community.
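As a rough back-of-envelope illustration of this network effect (the numbers are simple arithmetic, not measured data): the count of potential pairwise linkages grows roughly quadratically with the number of linked classes, which is the intuition behind the Metcalfe's Law comparison.

```python
# Potential pairwise links among n linked classes: n * (n - 1) / 2.
# Value in Metcalfe's formulation grows on the order of n squared.
def potential_links(n):
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(n, potential_links(n))
# 10 -> 45, 100 -> 4950, 1000 -> 499500
```

A ten-fold increase in linked classes thus yields roughly a hundred-fold increase in potential connections, which is why class-level mappings 'explode the domain'.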
In an RDF context, “ontologies” are the vocabularies and structures that capture the schema structures noted above. Ontologies embody the class and instance definitions and the predicate (property) relations that enable legacy schemas and data to be transformed into Linked Data graphs.
Though many public RDF vocabularies and ontologies presently exist, and should be re-used where possible and where the semantics match the existing legacy information, enterprises will require specific ontologies reflective of their own data and information relationships.
Despite the newness or intimidation perhaps associated with the “ontology” term, ontologies are no more complex — indeed, are simpler and more powerful — than the standard relational schema familiar to enterprises. If you’d like, simply substitute schema for ontology and you will be saying the same thing in an RDF context.
Neither, really, though the rationale and justification for Linked Data is grounded in federating widely disparate sources of data that can also vary widely in existing formalism and structure.
Because Linked Data is a set of techniques and best practices for expressing, exposing and publishing data, it can easily be applied to either centralized or federated circumstances.
However, the real world where any and all potentially relevant data can be interconnected is by definition a varied, distributed, and therefore federated world. Because of its universal RDF data model and Web-based techniques for data expression and access, Linked Data is the perfect vehicle, finally, for data integration and interoperability without boundaries.
The simple case is where two data sources refer to the exact same entity or instance (individual) with the same identity. The standard sameAs predicate is used to assert the equivalence in such cases.
The more important case is where the data sources refer to similar subjects or concepts, in which case a structure of well-defined reference classes is employed. Furthermore, if these classes can themselves be expressed in a graph structure capturing the relationships amongst the concepts, we now have some fixed points in the conceptual information space for relating and tying together disparate data. Still further, such a conceptual structure also provides the means to relate the people, places, things, organizations, events, etc., of the individual instances of the world to one another as well.
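The simpler sameAs case can be sketched as follows. This is a toy in-memory store, not any actual system's API; the property names and values are invented for illustration:

```python
# Toy store: facts recorded separately under two identifiers that an
# owl:sameAs assertion declares equivalent. Values are illustrative.
facts = {
    "http://dbpedia.org/resource/Quebec_City": {"population": "500000"},
    "http://sws.geonames.org/6325494/": {"lat": "46.81", "long": "-71.21"},
}
same_as = [("http://dbpedia.org/resource/Quebec_City",
            "http://sws.geonames.org/6325494/")]

def merged_description(uri):
    """Union of properties over all identifiers equivalent to `uri`."""
    equivalents = {uri}
    for a, b in same_as:
        if a in equivalents or b in equivalents:
            equivalents.update((a, b))
    combined = {}
    for ident in equivalents:
        combined.update(facts.get(ident, {}))
    return combined

print(merged_description("http://dbpedia.org/resource/Quebec_City"))
```

The merged description combines what each source knows about the same individual, which is exactly the payoff of asserting the equivalence.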
Any reference structure that is composed of concept classes that are properly related to each other may provide this referential “glue” or “backbone”.
One such structure provided in open source by Zitgist is the 21,000 subject concept node structure of UMBEL, itself derived from the Cyc knowledge base. In any event, such broad reference structures may often be accompanied by more specific domain conceptual ontologies to provide focused domain-specific context.
No, absolutely not.
While, to date, it is the case that Linked Data has been demonstrated using public Web data and many desire to expose more through the open data movement, there is nothing preventing private, proprietary or subscription data from being Linked Data.
The Linking Open Data (LOD) group, formed about 18 months ago to showcase Linked Data techniques, began with open data. To sever the idea that the approach applies only to open data, François-Paul Servant has specifically identified Linking Enterprise Data as a parallel concept (and see also the accompanying slides).
For example, with Linked Data (and not the more restrictive LOD sense), two or more enterprises or private parties can legitimately exchange private Linked Data over a private network using HTTP. As another example, Linked Data may be exchanged on an intranet between different departments, etc.
So long as the principles of URI naming, HTTP access, and linking predicates where possible are maintained, the approach qualifies as Linked Data.
Absolutely yes, without reservation. Indeed, non-transactional legacy data should perhaps be expressed as Linked Data in order to gain its manifest benefits. See #14 below.
Of course. Since Linked Data can be applied to any data formalism, source or schema, it is perfectly suited to integrating data from inside and outside the firewall, open or private.
The basic query language for Linked Data is SPARQL (pronounced “sparkle”), which bears a close resemblance to SQL, only applied to an RDF data graph rather than relational tables. The actual datastores used for RDF may also add a fourth aspect to the tuple for graph namespaces, which can bring access and scale efficiencies. In these cases, the system is known as a “quad store”. Additional techniques may be applied to data filtering prior to the SPARQL query for further efficiencies.
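The quad idea can be sketched as follows. All names and values are invented for illustration, and a real quad store would index by graph rather than scan a list, but the fourth term's role is the same:

```python
# Each statement carries a fourth term naming its graph, so queries can
# be scoped by graph for access control and scale efficiencies.
quads = [
    ("ex:QuebecCity", "ex:population", "500000", "graph:dbpedia"),
    ("ex:QuebecCity", "ex:lat", "46.81", "graph:geonames"),
]

def in_graph(graph_name):
    """Return the plain triples scoped to one named graph."""
    return [(s, p, o) for s, p, o, g in quads if g == graph_name]

print(in_graph("graph:geonames"))
```

Scoping a query to one named graph is also the natural hook for per-source filtering before the full SPARQL query runs.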
Templated SPARQL queries and other techniques can lead to very efficient and rapid deployment of various Web services and reports, two techniques often applied by Zitgist and other vendors. For example, all Zitgist DataViewer views and UMBEL Web services are expressed using such SPARQL templates.
This SPARQL templating approach may also be combined with the use of templating standards such as Fresnel to bind instance data to display templates.
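A minimal sketch of the templated-SPARQL approach, using Python's standard string.Template; the query skeleton here is invented for illustration and is not Zitgist's actual service code:

```python
from string import Template

# A fixed query skeleton parameterized per request; a service would then
# send the substituted query to its SPARQL endpoint.
QUERY = Template("""
SELECT ?property ?value
WHERE {
  <$subject> ?property ?value .
}
LIMIT $limit
""")

query = QUERY.substitute(
    subject="http://dbpedia.org/resource/Quebec_City", limit=10)
print(query)
```

Because only the parameters vary, the same vetted skeleton can safely back many Web services and reports.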
In Zitgist’s view, access control or security occurs at the layer of the HTTP access and protocols, and not at the Linked Data layer. Thus, the same policies and procedures that have been developed for general Web access and security are applicable to Linked Data.
However, standard data-level or Web server access and security may be enhanced by the choice of the system hosting the data. Zitgist, for example, uses OpenLink’s Virtuoso universal server, which has proven and robust security mechanisms. Additionally, it is possible to express security and access policies using RDF ontologies as well. These potentials are largely independent of Linked Data techniques.
The key point is that there is nothing unique or inherent to Linked Data with respect to access or control or security that is not inherent with standard Web access. If a given link points to a data object from a source that has limited or controlled access, its results will not appear in the final results graph for those users subject to access restrictions.
For more than 30 years — since the widespread adoption of electronic information systems by enterprises — the Holy Grail has been complete, integrated access to all data. With Linked Data, that promise is now at hand. Here are some of the key enterprise benefits to Linked Data, which provide the rationales for adoption:
Linked Data is well suited to traditional knowledge base or knowledge management applications. Its near-term application to transactional or material process applications is less apparent.
Of special use is the value added from connecting existing internal and external content via the network effect of these linkages.
Johnnie Linked Data is starting to grow up. Our little semantic Web toddler is moving beyond ga-ga-goo-goo to saying his first real sentences. Language acquisition will come rapidly, and, like what all of us have seen with our own children, they will grow up faster than we can imagine.
There were so many at this meeting that had impact and meaning to this exciting transition point that I won’t list specific names at risk of leaving other names off. Those of you who made so many great observations or stayed up late interacting with passion know who you are. Let me simply say: Thanks!
The LinkedData Planet conference has shown, to me, that enterprises are extremely interested in what our community has developed and now proven. They are asking hard questions and will be difficult task masters, but we need to listen and respond. The attendees were a selective and high-quality group, understanding of their own needs and looking for answers. We did an OK job of providing those answers, but we can do much, much better.
I reflect on these few days now knowing something I did not truly know before: the market is here and it is real. The researchers who have brought us to this point will continue to have much to research. But, those of us desirous of providing real pragmatic value and getting paid for it, can confidently move forward knowing both the markets and the value are real. Linked Data is not magic, but when done with quality and in context, it delivers value worth paying for.
To all of the fellow speakers and exhibitors, to all of the engaged attendees, and to the Jupitermedia organizers and Bob DuCharme and Ken North as conference chairs, let me add my heartfelt thanks for a job well done.
The next LinkedData Planet conference and expo will be October 16-17, 2008, at the Santa Clara Hyatt in Santa Clara, California. The agenda has not been announced, but hopefully we will see a continuing enterprise perspective and some emerging use cases.
Zitgist as a company will continue to release and describe its enterprise products and services, and I will continue to blog on Linked Data matters of specific interest to the enterprise. Pending topics include converting legacy data to Linked Data, converting relational data and schema to Linked Data, placing context to Linked Data, and many others. We think you will like the various announcements as they arise.
Zitgist is also toying with the use of a distinctive icon to indicate the availability of Linked Data conforming to the principles embodied in the questions above. (The color choice is an adoption of the semantic Web logo from the W3C.) The use of a distinctive icon is similar to what RSS feeds or microformats have done to alert users to their specific formats. Drop me a line and let us know what you think of this idea.
In a recent blog post, Kingsley Idehen picked up on the UMBEL project’s mantra of “context, Context, CONTEXT!” as contained in our recent slideshow. He likened context to the real estate phrase of “location, location, location”. Just so. I like Kingsley’s association because it reinforces the idea that context places concepts and things into some form of referential road map with respect to other things and concepts.
To me, context describes the relationships and environmental proximities of what UMBEL calls subject concepts and their instance sub-concepts and named entity members, the whole of which might be visualized as a graph of reference nodes in the firmament of a global knowledge space.
Indeed, it is this very ‘cloud’ of subject concept nodes that we tried to convey in an earlier piece on what UMBEL’s backbone structure of 21,000 subject concepts might look like, shown at right. (Of course, this visualization results from the combination of UMBEL’s OpenCyc contextual framework and specific modeling algorithms; the graph would vary considerably if based on other frameworks or models.)
Yet in a comment to Kingsley’s post, Giovanni Tummarello said, “If you believe in context so much then the linking open data idea goes bananas. Why? Because ‘sameAs’ is fundamentally wrong.. an entity on DBpedia IS NOT sameAs one on GeoNames because the context is different and bla bla… so it all crumbles.” 
Well, hmmm. I must beg to differ.
I suspect as we now are seeing Linked Data actually enter into practice, new implications and understandings are coming to the fore. And, as we try new approaches, we also sometimes suffer from the sheer difficulty of explicating those new understandings in the context of the shaky semantics of the semantic Web.
Giovanni’s comment raises two issues:
Therefore, since UMBEL is putting forth the argument for the importance of context in Linked Data, it is appropriate to be precise about the semantics of what is meant.
What is context? The tenth edition of Merriam-Webster’s Collegiate Dictionary (and the online version) defines it as:
context \ˈkän-ˌtekst\ n. [ME, weaving together of words, fr. Latin contextus connection of words, coherence, fr. contexere to weave together, fr. com- + texere to weave] (ca. 1586)
1: the parts of a discourse that surround a word or passage and can throw light on its meaning
2: the interrelated conditions in which something exists or occurs: environment, setting <the historical context of the war>.
Both of these references, of course, base their perspective on language and language relationships. But, both also provide the useful perspective that context also conveys the senses of environment, surroundings, interrelationships, connections and coherence.
Context has itself been a focus of much research from linguistics to philosophy and computer science. Each field has its specific take on the concept, but I believe it fair to say that context is consensually used as a holistic reference structure that tries to put all worlds and views, including that of the observer and observed, into a consistent framework. Indeed, when that framework and its assertions fit and make sense, we give that a word, too: coherent.
Hu Yijun, for example, examines the interplay of language, semantics and the behavior and circumstances of human actors to frame context. Hu observes that an invariably applied research principle is that meaning is determined by context. Context refers to the environmental conditions surrounding a discourse and the parts related to it, and provides the framework to interpret that discourse. There are world views, relationships and interrelationships, and assertions by human actors that combine to establish the context of those assertions and the means to interpret them.
In the concept of context, therefore, we see all of the components and building blocks of RDF itself. We have things or concepts (subjects or objects) that are related to one another (via properties or predicates) to form the basic assertions (triples). These are combined together and related in still more complex structures attempting to capture a world view or domain (ontology). These assertions have trust and credibility based on the actors (provenance) that make them.
In short, context is the essence of the semantic Web and Linked Data, not somehow in variance or conflict with it.
Without context, there is no meaning.
One interpretation might be that the characteristics of an individual (say, Quebec City) are oriented to latitude and longitude in a GeoNames source, while the characteristics of that same individual have a different context (say, population or municipal government) in the DBpedia (Wikipedia) source. But we need to be very careful about what is meant by context here. The identity of the individual (Quebec City) remains the same in both sources. The context does not change the individual nor its identity, only the nature of the characteristics used to provide different coherent information about it.
With the growth in Linked Data, we are starting to hear rumblings about possible misuse and misapplication of the sameAs predicate. Frankly, this is good, because I share the view that there has been some confusion regarding the predicate and some misapplication of its semantics.
The built-in OWL property owl:sameAs links an individual to an individual. Such an owl:sameAs statement indicates that two URI references actually refer to the same thing: the individuals have the same “identity”.
A link is a predicate is an assertion. It by nature ties (“glues”) two resources to one another. Such an assertion can either: (1) be helpful and “correct”; (2) be made incorrectly; (3) assert the wrong or perhaps semantically poor relationship; or (4) be used maliciously or to deceive.
(Unlike email spam, #4 above has not occurred anywhere to my knowledge for Linked Data. Unfortunately, and most sadly, deceitful links will occur at some point, however. This inevitability is a contingency the community must be cognizant of as it moves forward.)
To date, almost all inter-source Linked Data links have occurred via owl:sameAs. If we liken this situation to early child language acquisition, it is like we only have one verb to describe the world. And because our vocabulary is relatively spare, we have tended to apply sameAs to situations and relations that, comparatively, have a bit of semblance to baby-talk.
So long as we have high confidence two disparate sources are referring to the same individual with the same identity, sameAs is the semantically correct RDF link. In all other cases, the use of this predicate should be suspect.
Simple string or label matches are insufficient to make a sameAs assertion. If sameAs cannot be confidently asserted, as might be the case where the relation of individual referents is likely but uncertain, we need to invoke new predicates or make no assertion at all. And, if the resources at hand are not individuals at all but classes, the need for new semantics increases still further.
As we increase the size of the Linked Data ‘cloud’ or show rapid growth in Linked Data, we should be aware that quality, not size, may be the most important metric powering acceptance. The community has made unbelievable progress in finally putting real data behind the semantic Web promise. The challenge now is to add to our vocabulary and ensure quality assertions for the linkages we publish.
One of UMBEL’s purposes, for example, is to broaden our relations to the class level of subject concepts. As we move beyond the early days of FOAF and other early vocabularies, we will see further richening of our predicates. We also need predicates and predicate language that reflect the open-world nature of public Linked Data and the semantic Web.
So, while sameAs helps us aggregate related information about the same identifiable individual, the predicates of class relations in context to other classes helps to put all information into context. And, if done right — that is, if the semantics and assertions are relatively correct — these desired contextual relations and interlinkages can blossom.
The new predicates forthcoming from the UMBEL project, related to these purposes and to be published with technical documentation this month, will include:
Assertions such as these that are open to ambiguity or uncertainty, while appropriate for much of the open-world nature of the semantic Web, may also be difficult predicates for the community to achieve consensus on. Like our early experience with sameAs, these predicates — or others that can just as easily arise in their stead — will certainly prove subject to some growing pains.
Most people active in the semantic Web and Linked Data communities believe a decentralized Web environment leads to innovation and initiative. Open software, standards activities, and vigorous community participation affirm these beliefs daily.
The idea of context and global frames of reference, such as represented by UMBEL or perhaps any contextual ontology, could appear to be at odds with those ideals of decentralization. But one paradox is that without context, the basis for RDF linkages is made much poorer and therefore the potential for the benefits (and thus adoption) of Linked Data lessen.
The object lesson should therefore not be a rejection of context. Indeed, any context is better than no context at all.
Of course, whether that context gets provided by UMBEL or by some other framework(s) remains to be seen. This is for the market to decide. But the ability of contextual frameworks to enrich our semantics should be clear.
The past year, with the growth and acceptance of Linked Data, has affirmed that the mechanisms for linking and relating data are now largely in place. We have a simple, yet powerful and extensible data model in RDF. We have beginning vocabularies and constructs for conducting the data discourse. We have means for moving legacy data and information into this promising new environment.
Context and Linked Data are not in any way at odds, nor are context and sameAs. Indeed, context itself is an essential framework for how we can orient and grow our semantics. Human language required its referents in the real world in order to grow and blossom. Context is just as essential to derive and grow the semantics and meaning of the semantic Web.
The early innovators of the Linked Open Data community are the very individuals best placed to continue this innovation. Let’s accept sameAs for what it is (one kind of link in a growing menagerie of RDF link predicates) and get on with the mission of putting our enterprise in context. I think we’ll find our data has a lot more meaningful things to say, and with more coherence.
UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight reference structure for placing Web content, named entities and data in context with other data. It is comprised of about 21,000 subject concepts and their relationships — with one another and with external vocabularies and named entities.
I still never cease to be amazed at how wonderful and powerful tools are so often and easily overlooked. The most recent example is Cytoscape, a winner in our recent review of more than 25 tools for large-scale RDF graph visualization.
We began this review because the UMBEL subject concept “backbone” ontology will involve literally thousands of concepts. Graph visualization software suitable to very large graphs would aid UMBEL’s construction and refinement.
Cytoscape describes itself as a bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data. Cytoscape is partially based on GINY and Piccolo, among other open-source toolkits. What is more important for our immediate purposes, however, is that its design also lends itself well to general network and graph manipulation.
Cytoscape was first brought to our attention by François Belleau of Bio2RDF.org. Thanks François, and also for the strong recommendation and tips. Special thanks are also due to Frédérick Giasson of Zitgist for his early testing and case examples. Thanks, Fred!
We had a number of requirements and items on our wish list prior to beginning our review. We certainly did not expect most or all of these items to be met:
Cytoscape met or exceeded our wish list in all areas save one: it does not support direct ingest of RDF (other than some pre-set BioPAX formats). However, that proved to be no obstacle because of the tool’s clean input format support. Simple parsing of triples into a CSV file is sufficient for input. Moreover, as described below, this clean file format supports other cool attribute management functions as well.
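As a sketch of that conversion workflow (assuming well-formed, one-triple-per-line N-Triples input; the sample data and the source/interaction/target column order are my own choices for a typical Cytoscape delimited import, not a requirement verified against every Cytoscape version):

```python
import csv
import io
import re

# Matches a simple N-Triples line: <subj> <pred> <obj-URI or "literal"> .
NT_LINE = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+(<[^>]+>|"[^"]*")\s*\.')

def ntriples_to_rows(nt_text):
    """Yield (source, interaction, target) rows for each parsed triple,
    stripping angle brackets from URIs and quotes from literals."""
    for line in nt_text.splitlines():
        m = NT_LINE.match(line.strip())
        if m:
            s, p, o = m.groups()
            yield s, p, o.strip('<>"')

# Hypothetical sample triples, for illustration only.
sample = '''<http://example.org/Beatles> <http://example.org/genre> <http://example.org/Rock> .
<http://example.org/Beatles> <http://example.org/name> "The Beatles" .'''

buf = io.StringIO()
csv.writer(buf).writerows(ntriples_to_rows(sample))
rows = buf.getvalue().splitlines()
assert len(rows) == 2
assert rows[0] == "http://example.org/Beatles,http://example.org/genre,http://example.org/Rock"
```

In practice one would write to a real file and load it through Cytoscape’s delimited-text import dialog; the predicate column then doubles as the edge attribute that the visual styles and filters can key on.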
The following screen shot shows the major Cytoscape screen. We will briefly walk through some of its key views (click for full size):
This Java tool has a fairly standard Eclipse-like interface and design. The main display window (A) shows the active portion of the current graph view. (Note that in this instance we are looking at a ‘Spring’ layout for the same Music sub-graph presented above.) Selections can easily be made in this main display (the red box) or by directly clicking on a node. The display itself represents a zoom (B) of the main UMBEL graph, which can also be easily panned (the blue box on B) or itself scaled (C). Those items that are selected in the main display window also appear as editable nodes or edges and attributes in the data editing view (D).
The appearance of the graph is fully editable via the VizMapper (E). An interesting aspect here is that every relation type in the graph (its RDF properties, or predicates) can be visually displayed in a different manner. The graphs or sub-graphs themselves can be selected, but also most importantly, the display can respond to a very robust and flexible filtering framework (F). Filters can be easily imported and can apply to nodes, edges (relations), the full graph or other aspects (depending on plugin). A really neat feature is the ability to search the graph in various flexible ways (G), which alters the display view. Any field or attribute can be indexed for faster performance.
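The filtering idea itself is quite general. Here is a rough Python sketch, on toy data of my own (none of this is Cytoscape’s actual API), of the kind of edge-attribute filtering the tool performs when it restricts the display to one relation type:

```python
# Toy edge list: (source, relation, target). The relation plays the role
# of an RDF predicate, the attribute Cytoscape can filter and style on.
edges = [
    ("Beatles", "memberOf", "Lennon"),
    ("Beatles", "genre",    "Rock"),
    ("Stones",  "genre",    "Rock"),
    ("Lennon",  "bornIn",   "Liverpool"),
]

def filter_edges(edges, relation):
    """Keep only edges whose relation matches: a minimal analogue of an
    edge-attribute filter driving what the graph view displays."""
    return [e for e in edges if e[1] == relation]

genre_view = filter_edges(edges, "genre")
assert genre_view == [("Beatles", "genre", "Rock"),
                      ("Stones",  "genre", "Rock")]
```

Cytoscape layers this same idea with a UI: filters can be composed, saved, and bound to visual styles, so each predicate type can be rendered (or hidden) differently.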
In addition to these points, Cytoscape supports the following features:
The Cytoscape project also offers:
Unfortunately, other than these official resources, there appears to be a dearth of general community discussion and tips on the Web. Here’s hoping that situation soon changes!
There is a broad suite of plugins available for Cytoscape, and directions to developers for developing new ones.
The master page also includes third-party plugins. The candidates useful to UMBEL and its graphing needs — also applicable to standard semantic Web applications — appear to be:
Note, too, that a wealth of biology- and molecular-specific plugins is also available beyond the generic listing above.
Our initial use of the tool suggests some use tips:
Cytoscape was first released in 2002 and has undergone steady development since. Most recently, the 2.x and especially 2.3 versions forward have seen a flurry of general developments that have greatly broadened the tool’s appeal and capabilities. It was perhaps only these more recent developments that have positioned Cytoscape for broader use.
I suspect another reason this tool has been overlooked by the general semWeb community is that its sponsors have positioned it mostly in the biological space. Their short descriptor for the project, for example, is: “Cytoscape is an open source bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data.” That statement hardly makes it sound like a general tool!
Another reason for the lack of attention, of course, is the common tendency for different disciplines not to share enough information. Indeed, one reason for my starting the Sweet Tools listing was the hope of overcoming such artificial boundaries by assembling relevant semantic Web tools in one central place.
Yet despite the product’s name and its positioning by sponsors, Cytoscape is indeed a general graph visualization tool, and arguably the most powerful one reviewed from our earlier list. Cytoscape can easily accommodate any generalized graph structure, is scalable, provides all conceivable visualization and modeling options, and has a clean extension and plugin framework for adding specialized functionality.
With just minor tweaks or new plugins, Cytoscape could directly read RDF and its various serializations, could support processing any arbitrary OWL or RDF-S ontology, and could support other specific semWeb-related tasks. As well, a tool like CPath (http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1660554), which enables querying biological databases and storing the results in Cytoscape format, offers a tantalizing general model for other Web query options.
For these reasons, I gladly announce Cytoscape as the next deserving winner of the (highly coveted, but cheesy!) AI3 Jewels & Doubloons award.
Cytoscape’s sponsors (the U.S. National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH), the U.S. National Science Foundation (NSF) and Unilever PLC) and its developers (the Institute for Systems Biology, the University of California – San Diego, the Memorial Sloan-Kettering Cancer Center, L’Institut Pasteur and Agilent Technologies) are to be heartily thanked for this excellent tool!
An AI3 Jewels & Doubloons Winner