Marko Rodriguez has been one of the most exciting voices in graph applications and theory relevant to the semantic Web over the past five years. He is personally innovating an entire ecosystem of graph systems and tools of which all of us should be aware.
The other thing I like about Marko is that he puts thoughtful attention and graphics into all of his posts. (He also likes logos and whimsical product names.) The result is that, when he presents a new post, it is more often than not a gem.
Today Marko posted what I think is a keeper on graph-related stuff:
I personally think it is a nice complement to my own Age of the Graph of a few months back. In any event, put Marko’s blog in your feed reader. He is one of the go-to individuals in this area.
Virtually everywhere one looks we are in the midst of a transition for how we organize and manage information, indeed even relationships. Social networks and online communities are changing how we live and interact. NoSQL and graph databases — married to their near cousin Big Data — are changing how we organize and store information and data. Semantic technologies, backed by their ontologies and RDF data model, are showing the way for how we can connect and interoperate disparate information in ways only dreamed about a decade ago. And all of this, of course, is being built upon the infrastructure of the Internet and the Web, a global, distributed network of devices and information that is undoubtedly one of the most important technological developments in human history.
There is a shared structure across all of these developments — the graph. Graphs are proving to be the new universal paradigm for how we organize and manage information. Graphs have an inherently expandable nature, and one which can also capture any existing structure. So, as we see all of the networks, connections, relationships and links — both physical and informational — grow around us, it is useful to step back a bit and contemplate the universal graph structure at the core of these developments.
Understanding that we now live in the Age of the Graph means we can begin studying and using the concept of the graph itself to better analyze and manage our interconnected world. Whether we are trying to understand the physical networks of supply chains and infrastructure or the information relationships within ontologies or knowledge graphs, the various concepts underlying graphs and graph theory, themselves expressed through a rich vocabulary of terms, provide the keys for unlocking still further treasures hidden in the structure of graphs.
The use of “graph” as a mathematical concept is not much more than 100 years old. The first explication of the various classes of problems that can be addressed by graph theory is probably no older than 300 years. The use of graphs for expressing logic structures is probably not much older than 100 years, with intellectual roots beginning with Charles Sanders Peirce. Though trade routes and their affiliated roads and primitive transportation or nomadic infrastructures were perhaps the first expressions of physical networks, the emergence and prevalence of networks is a fairly recent phenomenon. The Internet and the Web are surely the catalyzing development that has brought graphs and networks to the forefront.
In mathematics, a graph is an abstract representation of a set of objects in which pairs of the objects are connected. The objects are most often known as nodes or vertices; the connections between the objects are called edges. Typically, a graph is depicted in diagrammatic form as a set of dots or bubbles for the nodes, joined by lines or curves for the edges. If the relationship between connected nodes runs in a particular direction, the edge is directed, and the graph is known as a directed graph. Various structures or topologies can be expressed through this conceptual graph framework. Graphs are one of the principal focuses of study in discrete mathematics. The word “graph” was first used in this sense, as a mathematical structure, by J.J. Sylvester in 1878.
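In code, such a structure is compact. A minimal sketch of a directed graph as an adjacency mapping, with purely illustrative node names:

```python
# A minimal sketch of a directed graph as an adjacency mapping.
# Node and edge names here are illustrative, not from any real dataset.
graph = {
    "A": {"B", "C"},   # edges A->B and A->C
    "B": {"C"},        # edge  B->C
    "C": set(),        # C has no outgoing edges
}

nodes = set(graph)
edges = [(u, v) for u, targets in graph.items() for v in targets]

print(sorted(nodes))   # ['A', 'B', 'C']
print(sorted(edges))   # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

An undirected graph is the same structure with every edge recorded in both directions.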
As representative of various data models, particularly in our company’s own interests in the Resource Description Framework (RDF) model, the nodes can represent “nouns” or subjects or objects (depending on the direction of the links) or attributes. The edges or connections represent “verbs” or relationships, properties or predicates. Thus, the simple “triple” of the basic statement in RDF (consisting of subject – predicate – object) is one of the constituent barbells that make up what becomes the eventual graph structure.
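A set of such triples maps directly onto a graph. A small sketch, using invented example prefixes and identifiers rather than real RDF tooling:

```python
from collections import defaultdict

# Hypothetical triples in (subject, predicate, object) form; each one
# contributes one labeled, directed edge ("barbell") to the graph.
triples = [
    ("ex:CocaCola", "rdf:type", "ex:Beverage"),
    ("ex:CocaCola", "ex:madeBy", "ex:TheCocaColaCompany"),
    ("ex:TheCocaColaCompany", "rdf:type", "ex:Organization"),
]

# Build an adjacency view: subject -> list of (predicate, object) pairs.
adjacency = defaultdict(list)
for s, p, o in triples:
    adjacency[s].append((p, o))

print(adjacency["ex:CocaCola"])
# [('rdf:type', 'ex:Beverage'), ('ex:madeBy', 'ex:TheCocaColaCompany')]
```

The point is only that subjects and objects become nodes while predicates become edge labels; a real triple store adds indexing and inference on top of exactly this shape.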
The manipulation and analysis of graph structures comes under the rubric of graph theory. The first recognized paper in that field is the Seven Bridges of Königsberg, written by Leonhard Euler in 1736. The problem was to find a walking path through the city that would cross each of its seven bridges once and only once. Euler proved that the problem has no solution:
Euler’s approach represented the path problem as a graph, by treating the land masses as nodes and the bridges as edges. Euler’s proof postulated that if every bridge has been traversed exactly once, it follows that, for each land mass (except for the ones chosen for the start and finish), the number of bridges touching that land mass must be even (the number of connections to a node we now call “degree”). Since that is not true for this instance, there is no solution. Other researchers, including Leibniz, Cauchy and L’Huillier applied this approach to similar problems, leading to the origin of the field of topology.
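Euler’s degree argument is easy to verify computationally. A sketch, with illustrative labels for the four land masses:

```python
from collections import Counter

# The seven bridges of Königsberg as a multigraph. Labels are illustrative:
# "A" is the central island, "B" and "C" the river banks, "D" the eastern mass.
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

# Euler's condition: a walk crossing every edge exactly once exists (in a
# connected graph) only if zero or two nodes have odd degree.
odd = [n for n, d in degree.items() if d % 2 == 1]
print(dict(degree))  # {'A': 5, 'B': 3, 'C': 3, 'D': 3}
print(len(odd))      # 4 -> no such walk exists
```

All four land masses have odd degree, so the walk is impossible, exactly as Euler concluded.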
Later, Cayley broadened the approach to study tree structures, which have many implications in theoretical chemistry. By the 20th century, the fusion of ideas coming from mathematics with those coming from chemistry formed the origin of much of the standard terminology of graph theory.
Graph theory forms the core of network science, the applied study of graph structures and networks. Besides graph theory, the field draws on methods including statistical mechanics from physics, data mining and information visualization from computer science, inferential modeling from statistics, and social structure from sociology. Classical problems embraced by this realm include the four color problem of maps, the traveling salesman problem, and the six degrees of Kevin Bacon.
Graph theory and network science are the suitable disciplines for a variety of information structures and many additional classes of problems. This table lists many of these applicable areas, most with links to still further information from Wikipedia:
| Graph Structures | Graph Problems |
| --- | --- |
| Subgraphs, induced subgraphs, and minors | Search and navigation |
| | Subsumption and unification |
| | Route (path) problems |
| | Visibility graph problems |
Graphs are among the most ubiquitous models of both natural and human-made structures. They can be used to model many types of relations and process dynamics in physical, biological and social systems. Many problems of practical interest can be represented by graphs. This breadth of applicability makes network science and graph theory two of the most critical analytical areas for study and breakthroughs for the foreseeable future. I touch on this more in the concluding section.
Surely the first examples of graph structures were early trade and nomadic routes. Here, for example, are the trade routes of the Radhanites, dating from about 870 AD:
It is not surprising that routes such as these, or other physical networks as exemplified by the bridges of Königsberg, were the stimulus for early mathematics and analysis related to efficient use of networks. Minimizing the time to complete a trade circuit or visiting multiple markets efficiently has clear benefits. These economic rationales apply to a wide variety of modern, physical networks, including:
Of course, included in the latter category is the Internet itself. It is the largest graph in existence, with an estimated 2.2 billion users and their devices all connected in one way or another in all parts of the globe.
Graphs and graph theory also have broad applicability to natural systems. For example, graph theory is used extensively to study molecular structures in chemistry and physics. A graph makes a natural model for a molecule, where vertices represent atoms and edges bonds. Similarly, in biology or ecology, graphs can readily express such systems as species networks, ecological relationships, migration paths, or the spread of diseases. Graphs are also proper structures for modeling biological and chemical pathways.
Some of the exemplar natural systems that lend themselves to graph structures include:
As with physical networks, a graph representation for natural systems provides real benefits in computer processing and analysis. Once expressed as a graph, all graph algorithms and perspectives from graph theory and network science can be brought to bear. Statistical methods are particularly applicable to representing connections between interacting parts of a system, as well to representing the physical dynamics of natural systems.
Parallel with the growth of the Internet and Web has been the growth of social networks. Social network analysis (SNA) has arguably been the single most important driver for advances in graph theory and analysis algorithms in recent years. New and interesting problems and challenges — from influence to communities to conflicts — are now being elucidated through techniques pioneered for SNA.
Second only in size to the Internet has been the graph of interactions arising from Facebook. Facebook had about 900 million users as of May 2012, half of whom accessed the service via mobile devices. Facebook famously embraced the graph with its own Open Graph protocol, which makes it easy for users to access and tie into Facebook’s social network. A representation of the Facebook social graph as of December 2010 is shown in this well-known figure:
The suitability of the graph structure to capture relationships has been a real boon to better understanding of social and community dynamics. Many new concepts have been introduced as the result of SNA, including such things as influence, diversity, centrality, cliques and so forth. (The opening diagram to this article, for example, models centrality, with blue the maximum and red the minimum.)
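Centrality, one of the concepts just mentioned, is easy to make concrete. A minimal sketch of degree centrality over a tiny, made-up friendship graph (all names invented):

```python
# A toy, undirected friendship graph; every link is recorded in both directions.
friends = {
    "ann": {"bob", "cid", "dee"},
    "bob": {"ann", "cid"},
    "cid": {"ann", "bob"},
    "dee": {"ann"},
}

# Degree centrality: a node's degree divided by the maximum possible (n - 1).
n = len(friends)
centrality = {person: len(links) / (n - 1) for person, links in friends.items()}

print(max(centrality, key=centrality.get))  # 'ann' is the most central
```

Real SNA toolkits offer many richer measures (betweenness, closeness, eigenvector centrality), but all are computed from exactly this kind of adjacency structure.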
Particular areas of social interaction that lend themselves to SNA include:
Entirely new insights have arisen from SNA including finding terrorist leaders, analyzing prestige, or identifying keystone vendors or suppliers in business ecosystems.
Given the ubiquity of graphs as representations of real systems and networks, it is certainly not surprising to see their use in computer science as a means of information representation. We already saw in the table above the many data structures that can be represented as graphs, but the paradigm has even broader applicability.
The critical breakthroughs have come through using the graph as a basis for data models and logic models. These, in turn, provide the basis for crafting entire graph-based vocabularies and languages. Once such structures are embraced, it is a natural extension to also extend the mindset to graph databases as well.
Some of the notable information representations that have a graph as their basis include:
A key point of graphs noted earlier was their inherent extensibility. Once graphs are understood as a great basis for representing both logic and data structures, it is a logical next step to see their applicability extend to knowledge representations and knowledge bases as well.
Graph-theoretic methods have proven particularly useful in linguistics, since natural language often lends itself well to discrete structure. So, not only can graphs represent syntactic and compositional structure, but they can also capture the interrelationships of terms and concepts within those languages. The usefulness of graph theory to linguistics is shown by the various knowledge bases such as WordNet (in various languages) and VerbNet.
Domain ontologies are similar structures, capturing the relationships amongst concepts within a given knowledge domain. These are also known as knowledge graphs, and Google has famously just released its graph of entities to the world. Semantic networks and neural networks are similar knowledge representations.
The following interactive diagram, of the UMBEL knowledge graph of about 25,000 reference concepts for helping to orient disparate datasets, shows that some of these graph structures can get quite large:
What all of these examples show is the nearly universal applicability of graphs, from the abstract to the physical, from the small to the large, and every gradation between. We also see how basic graph structures and concepts can be built upon with more structure. This breadth points to the many synergies and innovations that may be transferred from diverse fields to advance the usefulness of graph theories.
Despite the many advances that have occurred in graph theory and the increased attention from social network analysis, many, many graph problems remain some of the hardest in computation. Optimizations, partitioning, mapping, inferencing, traversing and graph structure comparisons remain challenging. And, some of these challenges are only growing due to the growth in the size of networks and graphs.
Applying the lessons of the Internet in such areas as non-relational databases, distributed processing, and big data and MapReduce-oriented approaches will help some in this regard. We’re learning how to divide and conquer big problems, and we are discovering data and processing architectures more amenable to graph-based problems.
The fact that we have now entered the Age of the Graph also suggests that further scrutiny and attention will lead to more analytic breakthroughs and innovation. We may be in an era of Big Data, but the structure underlying all of that data is the graph. And that reality, I predict, will result in accelerated advances in graph theory.
The crowning achievement of the semantic Web is the simple use of URIs to identify data. Further, if the URI identifier can resolve to a representation of that data, it becomes an integral part of the HTTP access protocol of the Web while providing a unique identifier for the data. These innovations provide the basis for distributed data at global scale, all accessible via Web devices such as browsers and smartphones that are now a ubiquitous part of our daily lives.
Yet, despite these profound and simple innovations, the semantic Web’s designers and early practitioners and advocates have been mired for at least a decade in a muddled, metaphysical argument over what these URIs mean, what they reference, and what their actual true identity is. These muddles about naming and identity, it might be argued, are due to computer scientists and programmers trying to grapple with issues more properly the domain of philosophers and linguists. But that would be unfair. For philosophers and linguists themselves have for centuries grappled with these same conundrums.
As I argue in this piece, part of the muddle results from attempting to do too much with URIs, while another part results from not doing enough. I am also not trying to directly enter the fray of current standards deliberations. (Despite a decade of controversy, I optimistically believe that the messy process of argument and consensus building will work itself out.) What I am trying to do in this piece, however, is to look to one of America’s pre-eminent philosophers and logicians, Charles Sanders Peirce (pronounced “purse”), to inform how these controversies of naming, identity and meaning may be dissected and resolved.
The Web began as a way to hyperlink between documents, generally Web pages expressed in the HTML markup language. These initial links were called URLs (uniform resource locators), and each pointed to various kinds of electronic resources (documents) that could be accessed and retrieved on the Web. These resources could be documents written in HTML or other encodings (PDFs, other electronic formats), images, streaming media like audio or video, and the like.
All was well and good until the idea of the semantic Web, which postulated that information about the real world — concepts, people and things — could also be referenced and made available for reasoning and discussion on the Web. With this idea, the scope of the Web was massively expanded from electronic resources that could be downloaded and accessed via the Web to now include virtually any topic of human discourse. The rub, of course, was that ideas such as abstract concepts or people or things could not be “dereferenced” nor downloaded from the Web.
One of the first things that needed to change was to define a broader concept of a URI “identifier” above the more limited concept of a URL “locator”, since many of these new things that could be referenced on the Web went beyond electronic resources that could be accessed and viewed. But, since what the referent of the URI now actually might be became uncertain — was it a concept or a Web page that could be viewed or something else? — a number of commentators began to note this uncertainty as the “identity crisis” of the Web. The topic took on much fervor and metaphysical argument, such that by 2003, Sandro Hawke, a staffer of the standards-setting W3C (World Wide Web Consortium), could say, “This is an old issue, and people are tired of it”.
Yet, for many of the reasons described more fully below, the issue refused to go away. The Technical Architecture Group (TAG) of the W3C took up the issue, under a rubric that came to be known as httpRange-14. The issue was first raised in March 2002 by Tim Berners-Lee and accepted for TAG deliberation in February 2003, with a resolution then offered in June 2005. (Refer to the original resolution and other information to understand the nuances of this resolution, since particular commentary on that approach is not the focus of this article.) Suffice it to say here, however, that this resolution posited an entirely new distinction of Web content into “information resources” and “non-information resources”, and also recommended use of the HTTP 303 redirect code for when agents requesting a URI should be directed to concepts versus viewable documents.
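The mechanics of that recommendation can be sketched in miniature. This toy model of the server-side rule uses invented URIs and an invented registry; it illustrates only the 200-vs-303 decision, not real Web plumbing:

```python
# Toy httpRange-14 logic. Both URI sets below are invented for illustration.
DOCUMENTS = {"http://example.com/doc/toucan"}          # "information resources"
THINGS = {                                             # "non-information resources",
    "http://example.com/id/toucan": "http://example.com/doc/toucan",
}                                                      # mapped to a description

def respond(uri):
    """Return (status_code, location) the way an httpRange-14 server might."""
    if uri in DOCUMENTS:
        return 200, uri                 # serve the document itself
    if uri in THINGS:
        return 303, THINGS[uri]         # 303 See Other: redirect to a description
    return 404, None

print(respond("http://example.com/id/toucan"))
# (303, 'http://example.com/doc/toucan')
```

The extra round trip imposed by that 303 redirect is one of the practical inefficiencies critics point to.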
This “resolution” has been anything but. Not only can no one clearly distinguish these de novo classes of “information resources”, but the whole approach feels arbitrary and kludgy.
Meanwhile, the confusions caused by the “identity crisis” and httpRange-14 continued to perpetuate themselves. In 2006, a major workshop on “Identity, Reference and the Web” (IRW 2006) was held in conjunction with the Web’s major WWW2006 conference in Edinburgh, Scotland, on May 23, 2006. The various presentations and the workshop summary (by Harry Halpin) are very useful for understanding these issues. What was starting to jell at this time was the understanding that identity and meaning on the Web posed new questions, ones on which philosophers, logicians and linguists needed to be consulted.
The fiat of the TAG’s 2005 resolution has failed to take hold. Over the ensuing years, various eruptions have occurred on mailing lists and within the TAG itself (now expressed as Issue 57) to revisit these questions and bring the path forward into some coherent new understanding. Though linked data has been premised on best-practice implementation of these resolutions, and has been a qualified success, many (myself included) would claim that the extra steps and inefficiencies required by the TAG’s httpRange-14 guidance have been hindrances, not facilitators, of the uptake of linked data (or the semantic Web).
Today, despite the efforts of some to claim the issue closed, it is not. Issue 57 and the periodic bursts from notable semantic Web advocates such as Ian Davis, Pat Hayes and Harry Halpin, Ed Summers, Xiaoshu Wang, David Booth and TAG members themselves, such as Larry Masinter and Jonathan Rees, point to continued irresolution and discontent within the advocate community. Issue 57 currently remains open. Meanwhile, I think, all of us interested in such matters can express concern that linked data, the semantic Web and interoperable structured data have seen less uptake than any of us had hoped or wanted over the past decade. As I have stated elsewhere, unclear semantics and muddled guidelines help to undercut potential use.
As each of the eruptions over these identity issues has occurred, the competing camps have often been characterized as “talking past one another”; that is, not communicating in such a way as to help resolve to consensus. While it is hardly my position to do so, I try to encapsulate below the various positions and prejudices as I see them in this decades-long debate. I also try to share my own learning that may help inform some common ground. Forgive me if I overly simplify these vexing issues by returning to what I see as some first principles . . . .
One legacy of the initial document Web is the perception that Web addresses have meaning. We have all heard of multi-million-dollar purchases of domains and the adjudication that may occur when domains are hijacked from their known brands or trademark owners. This legacy has tended to imbue URIs with a perceived value. It is not by accident, I believe, that many within the semantic Web and linked data communities still refer to “minting” URIs. Some believe that ownership and control over URIs may be equivalent to grabbing up valuable real estate. It is also the case that many believe the “name” given to a URI acts to name the referent to which it refers.
This perception is partially true, partially false, but moreover incomplete in all cases. We can illustrate these points with the global icon, “Coca-Cola”.
As for the naming aspects, let’s dissect what we mean when we use the label “Coca-Cola” (in a URI or otherwise). Perhaps the first thing that comes to mind is “Coca-Cola,” the beverage (which has a description on Wikipedia, among other references). Because of its ubiquity, we may also recognize the image of the Coca-Cola bottle to the left as a symbol for this same beverage. (Though, in the hilarious movie The Gods Must Be Crazy, Kalahari Bushmen, who had no prior experience of Coca-Cola, took the bottle to be magical, with evil powers.) Yet even as a reference to the beverage, the naming aspects are a bit cloudy, since we could just as well use the synonyms “Coke”, “Coca-cola” (small C), “Classic Coke”, or any of the hundreds of language variants worldwide.
On the other hand, the label “Coca-Cola” could just as easily conjure The Coca-Cola Company itself. Indeed, the company web site is the location pointed to by the URI of http://www.thecoca-colacompany.com/. But, even that URI, which points to the home Web page of the company, does not do justice to conveying an understanding or description of the company. For that, additional URIs may need to be invoked, such as the description at Wikipedia, the company’s own company description page, plus perhaps the company’s similar heritage page.
Of course, even these links and references only begin to scratch the surface of what the company Coca-Cola actually is: headquarters, manufacturing facilities, 140,000 employees, shareholders, management, legal entities, patents and Coke recipe, and the like. Whether in human languages or URIs, in any attempt to signify something via symbols or words (themselves another form of symbol), we risk ambiguity and incompleteness.
URI shorteners also undercut the idea that a URI necessarily “names” something. Using the service bitly, we can shorten the link to the Wikipedia description of the Coke beverage to http://bit.ly/xnbA6 and we can shorten the link to The Coca-Cola Company Web site to http://bit.ly/9ojUpL. I think we can fairly say that neither of these shortened links “name” their referents. The most we can say about a URI is that it points to something. With the vagaries of meaning in human languages, we might also say that URIs refer to something, denote something or identify (but not in the sense of completely define) something.
From this discussion, we can assert with respect to the use of URIs as “names” that:
In summary, I think we can say that URIs may act as names, but not in all or most cases, and when used as such are often ambiguous. Absolutely associating URIs as names is way too heavy a burden, and incorrect in most cases.
The “name” discussion above masks that in some cases we are talking about a readable Web document or image (such as the Wikipedia description of the Coke beverage or its image) versus the “actual” thing in the real world (the Coke beverage itself or even the company). This distinction is what led to the so-called “identity crisis”, for which Ian Davis has used a toucan as his illustrative thing.
As I note in the conclusion, I like Davis’ approach to the identity conundrum insofar as Web architecture and linked data guidance are concerned. But here my purpose is more subtle: I want to tease apart still further the apparent distinction between an electronic description of something on the Web and the “actual” something. Like Davis, let’s use the toucan.
In our strawman case, we too use a description of the toucan (on Wikipedia) to represent our “information resource” (the accessible, downloadable electronic document). We contrast to that a URI that we mean to convey the actual physical bird (a “non-information resource” in the jumbled jargon of httpRange-14), which we will designate via the URI of http://example.com/toucan.
Despite the tortured (and newly conjured) distinction between “information resource” and “non-information resource”, the first-blush reaction is that, sure, there is a difference between an electronic representation that can be accessed and viewed on the Web and the true, “actual” thing. Of course people cannot actually be rendered and downloaded on the Web, but their bios, descriptions and portrait images may be. While in the abstract such distinctions appear true and obvious, in the specifics that get presented to experts, there is surprising disagreement as to what is actually an “information resource” vs. a “non-information resource”. Moreover, as we inspect the real toucan further, even that distinction is quite ambiguous.
When we inspect what might be a definitive description of “toucan” on Wikipedia, we see that the term more broadly represents the family of Ramphastidae, which contains five genera and forty different species. The picture we are showing to the right is of but one of those forty species, the keel-billed toucan (Ramphastos sulfuratus). Viewing the images of the full list of toucan species shows just how divergent these various “physical birds” are from one another. Across all species, average sizes vary by more than a factor of three, with great variation in bill size, coloration and range. Further, if I assert that the picture to the right is actually that of my pet keel-billed toucan, Pretty Bird, then we can also understand that this representation is for a specific individual bird, and not the physical keel-billed toucan species as a whole.
The point of this diversion is not a lecture on toucans, but an affirmation that distinctions between “resources” occur at multiple levels and dimensions. Just as there are no self-evident criteria as to what constitutes an “information resource”, there is also no self-evident and fully defining set of criteria as to what is the physical “toucan” bird. The meaning of what we call a “toucan” bird is not embodied in its label or even its name, but in the context and accompanying referential information that place the given referent into a context that can be communicated and understood. A URI points to (“refers to”) something that causes us to conjure up an understanding of that thing, be it a general description of a toucan, a picture of a toucan, an understanding of a species of toucan, or a specific toucan bird. Our understanding or interpretation results from the context and surrounding information accompanying the reference.
In other words, a “resource” may be anything, which is just the way the W3C has defined it. There is not a single dimension which, magically, like “information” and “non-information,” can cleanly and definitively place a referent into some state of absolute understanding. To assert that such magic distinctions exist is a flaw of Cartesian logic, which can only be reconciled by looking to more defensible bases in logic.
The logic behind these distinctions and nuances leads us to Charles Sanders Peirce (1839 – 1914). Peirce (pronounced “purse”) was an American logician, philosopher and polymath of the first rank. Along with Frege, he is acknowledged as the father of predicate calculus and the notation system that formed the basis of first-order logic. His symbology and approach arguably provide the logical basis for description logics and other aspects underlying the semantic Web building blocks of the RDF data model and, eventually, the OWL language. Peirce is the acknowledged founder of pragmatism, the philosophy of linking practice and theory in a process akin to the scientific method. He was also the first formulator of existential graphs, an essential basis to the whole field now known as model theory. Though often overlooked in the 20th century, Peirce has lately been enjoying a renaissance with his voluminous writings still being deciphered and published.
The core of Peirce’s world view is based in semiotics, the study and logic of signs. In his seminal writing on this, “What is in a Sign?”, he wrote that “every intellectual operation involves a triad of symbols” and “all reasoning is an interpretation of signs of some kind”. Peirce had a predilection for expressing his ideas in “threes” throughout his writings.
Semiotics is often split into three branches: 1) syntactics – relations among signs in formal structures; 2) semantics – relations between signs and the things to which they refer; and 3) pragmatics – relations between signs and the effects they have on the people or agents who use them.
Peirce’s logic of signs in fact is a taxonomy of sign relations, in which signs get reified and expanded via still further signs, ultimately leading to communication, understanding and an approximation of “canonical” truth. Peirce saw the scientific method as itself an example of this process.
A given sign is a representation amongst the triad of the sign itself (which Peirce called a representamen, the actual signifying item that stands in a well-defined kind of relation to the two other things), its object and its interpretant. The object is the actual thing itself. The interpretant is how the agent or the perceiver of the sign understands and interprets the sign. Depending on the context and use, a sign (or representamen) may be either an icon (a likeness), an indicator or index (a pointer or physical linkage to the object) or a symbol (understood convention that represents the object, such as a word or other meaningful signifier).
An interpretant in its barest form is a sign’s meaning, implication, or ramification. For a sign to be effective, it must represent an object in such a way that it is understood and used again. This makes the assignment and use of signs a community process of understanding and acceptance, as well as a truth-verifying exercise of testing and confirming accepted associations.
John Sowa has done much to make some of Peirce’s obscure language and terminology more accessible to lay readers. He has expressed Peirce’s basic triad of sign relations as follows, based around the Yojo animist cat figure used by the character Queequeg in Herman Melville’s Moby-Dick:
In this figure, object and symbol are the same as the Peirce triad; concept is the interpretant in this case. The use of the word ‘Yojo’ conjures the concept of cat.
This basic triad representation has been used in many contexts, with various replacements or terms at the nodes. Its basic form is known as the Meaning Triangle, as popularized by Ogden and Richards in 1923.
The key aspect of signs for Peirce, though, is the ongoing process of interpretation and reference to further signs, a process he called semiosis. A sign of an object leads to interpretants, which, as signs, then lead to further interpretants. In the Sowa example below, we show how meaning triangles can be linked to one another, in this case by abstracting that the triangles themselves are concepts of representation; we can abstract the ideas of both concept and symbol:
We can apply this same cascade of interpretation to the idea of the sign (or representamen), which in this case shows that a name can be related to a word symbol, which in itself is a combination of characters in a string called ‘Yojo’:
According to Sowa:
Catherine Legg notes that the semiotic process is really one of community involvement and consensus. Each understanding of a sign and each subsequent interpretation helps come to a consensus of what a sign means. It is a way of building a shared understanding that aids communication and effective interpretation. In Peirce’s own writings, the process of interpretation can lead to validation and an eventual “canonical” or normative interpretation. The scientific method itself is an extreme form of the semiotic process, leading ultimately to what might be called accepted “truths”.
So, how do Peircean semiotics help inform us about the role and use of URIs? Does this logic help provide guidance on the “identity crisis”?
The Peircean taxonomy of signs has three levels with three possible sign roles at each level, leading to a possible 27 combinations of sign representations. However, because not all sign roles are applicable at all levels, Peirce actually postulated only ten distinct sign representations.
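The arithmetic behind these counts can be checked directly. Under one common formalization (an assumption made here for illustration), each of the three trichotomies takes one of Peirce’s three categories, and each successive trichotomy may not exceed the category of the one before it; this reproduces the counts of 27 and ten:

```python
from itertools import product

# Peirce's three categories: 1 = Firstness, 2 = Secondness, 3 = Thirdness.
CATEGORIES = (1, 2, 3)

# Unconstrained: three trichotomies x three categories each = 27 combinations.
all_combinations = list(product(CATEGORIES, repeat=3))

# One common formalization of Peirce's constraint: each successive
# trichotomy cannot exceed the category of the preceding one,
# leaving ten valid sign classes.
valid_classes = [c for c in all_combinations if c[0] >= c[1] >= c[2]]

print(len(all_combinations), len(valid_classes))  # 27 10
```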
Common to all roles, the URI “sign” is best seen as an index: the URI is a pointer to a representation of some form, be it electronic or otherwise. This representation bears a relation to the actual thing that this referent represents, as is true for all triadic sign relationships. However, in some contexts, again in keeping with additional signs interpreting signs in other roles, the URI “sign” may also play the role of a symbolic “name” or even as a signal that the resource can be downloaded or accessed in electronic form. In other words, by virtue of the conventions that we choose to assign to our signs, we can supply additional information that augments our understanding of what the URI is, what it means, and how it is accessed.
Of course, in these regards, a URI is no different than any other sign in the Peircean world view: it must reside in a triadic relationship to its actual object and an interpretation of that object, with further understanding only coming about by the addition of further signs and interpretations.
In shortened form, this means that a URI, acting alone, can at most play the role of a pointer between an object and its referent. A URI alone, without further signs (information), cannot inform us well about names or even what type of resource may be at hand. For these interpretations to be reliable, more information must be layered on, either by accepted convention of the current signs or by the addition of still further signs and their interpretations. Since the attempt to settle the nature of a URI resource by fiat, as stipulated by httpRange-14, meets neither the standard of consensus nor that of empirical validity, it cannot by definition become “canonical”. This does not mean that httpRange-14 and its recommended practices cannot help in providing more information and aiding interpretation of what the nature of a resource may be. But it does mean that httpRange-14, acting alone, is insufficient to resolve ambiguity.
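The httpRange-14 convention itself can be sketched as a decision rule: the HTTP status code returned for a URI acts as one more sign about the nature of its referent. This is a toy restatement of the rule, not an implementation:

```python
# Toy restatement of the httpRange-14 rule: the HTTP response code is
# treated as one added sign about what the URI denotes.
def interpret_resource(status_code: int) -> str:
    if 200 <= status_code < 300:
        # A 2xx response marks the URI as an information resource (a document).
        return "information resource"
    if status_code == 303:
        # A 303 "See Other" means the referent may be any kind of resource;
        # the redirect target may describe it.
        return "resource of unstated kind; see the redirect target"
    # Anything else leaves the nature of the referent undetermined.
    return "undetermined"

print(interpret_resource(200))  # information resource
```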
Moreover, what we see in the general nature of Peirce’s logic of signs is the usefulness of adding more “triads” of representation as the process to increase understanding through further interpretation. Kind of sounds like adding on more RDF triples, does it not?
Names, references, identity and meaning are not absolutes. They are not philosophically, and they are not in human language. To expect machine communications to hold to different standards and laws than human communications is naive. To effect machine communications, our challenge is not to devise new rules, but to observe and apply the best rules and practices that human communications instruct.
There has been an unstated hope at the heart of the semantic Web enterprise that simply expressing statements in the right way (syntax) and in the right form (RDF) is sufficient to facilitate machine communications. But this hope, too, is naive and silly. Just as we do not accept all human utterances as truth, neither will we accept all machine transmissions as reliable. Some of the information will be posted in error; some will be wrong or ill-fitting to our world view; some will be malicious or intended to deceive. Spam and occasionally lousy search results on the Web tell us that Web documents are subject to these sources of unsuitability; why should the same not be true of data?
Thus, global data access via the semantic Web is not — and can never be — indiscriminate nor unambiguous. We need to understand and come to trust sources and provenance; we need interpretation and context to decide appropriateness and validity; and we need testing and validation to ensure messages as received are indeed correct. Humans need to do these things in their normal courses of interaction and communication; our machine systems will need to do the same.
These confirmations and decisions as to whether the information we receive is actionable or not will come about via still more information. Some of this information may come about via shared conventions. But most will come about because we choose to provide more context and interpretation for the core messages we hope to communicate.
Nearly five years ago Hayes and Halpin put forth a proposal to add ex:describedBy to the standard RDF vocabulary as a way for authors to provide context and explanation for what constitutes a specific RDF resource. In various ways, many of the other individuals cited in this article have come to similar conclusions. The simple redirect suggestions of both Ian Davis and Ed Summers appear particularly helpful.
Over time, we will likely need further representations about resources regarding such things as source, provenance, context and other interpretations that would help remove ambiguities as to how the information provided by that resource should be consumed or used. These additional interpretations can mechanically be provided via referenced ontologies or embedded RDFa (or similar). These additional interpretations can also be aided by judicious, limited additions of new predicates to basic language specifications for RDF (such as the Hayes and Halpin suggestions).
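The layering idea can be illustrated with a toy triple set (plain tuples rather than a real RDF library). The `ex:describedBy` predicate follows the Hayes and Halpin suggestion above; the other URIs and prefixes are invented for the example:

```python
# A toy in-memory "graph" of triples; each added triple is one more sign
# aiding interpretation of the URI. All names here are illustrative.
uri = "http://example.org/id/yojo"

# The bare URI is only an index: a pointer with little interpretable content.
graph = {(uri, "rdf:type", "ex:Cat")}

# Layering on further signs (triples) narrows the interpretation:
graph |= {
    (uri, "rdfs:label", "Yojo"),
    (uri, "ex:describedBy", "http://example.org/doc/yojo"),
}

def interpretations(subject, triples):
    """Everything the graph lets us assert about a subject."""
    return {(p, o) for (s, p, o) in triples if s == subject}

print(len(interpretations(uri, graph)))  # 3
```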
In the end, of course, any frameworks that achieve consensus and become widely adopted will be simple to use, easy to understand, and straightforward to deploy. The beauty of best practices in predicates and annotations is that failures to provide them are easy to test. Parties that wish to have their data consumed have incentive to provide sufficient information so as to enable interpretation.
There is absolutely no reason that these additions cannot co-exist with the current httpRange-14 approach. By adding a few other options and making clear the optional use of httpRange-14, we would be very Peirce-like in our go-forward approach: pragmatic, while adding more means to improve our interpretations of what a Web resource is and is meant to be.
See the 2007 thread on this issue, mostly by Sean Palmer and Noah Mendelsohn, the latter acknowledging that various experts may only agree on 85% of the items.
We have been maintaining Sweet Tools, AI3’s listing of semantic Web and -related tools, for a bit over five years now. Though we have switched to a structWSF-based framework that allows us to update it on a more regular, incremental schedule, like all databases the listing needs to be reviewed and cleaned up on a periodic basis. We have just completed the most recent cleaning and update. We are also now committing to do so on an annual basis.
Thus, this is the inaugural ‘State of Tooling for Semantic Technologies’ report, and, boy, is it a humdinger. There have been more changes — and more important changes — in this past year than in all four previous years combined. I think it fair to say that semantic technology tooling is now reaching a mature state, the trends of which likely point to future changes as well.
In this past year more tools have been added, more tools have been dropped (or abandoned), and more tools have taken on a professional, sophisticated nature. Further, for the first time, the number of semantic technology and -related tools has passed 1000. This is remarkable, given that more tools have been abandoned or retired than ever before.
We first present our key findings and then overall statistics. We conclude with a discussion of observed trends and implications for the near term.
Some of the key findings from the 2011 State of Tooling for Semantic Technologies are:
Many of these points are elaborated below.
The updated Sweet Tools listing now includes nearly 50 different tool categories. The most prevalent categories, each with over 6% of the total, are information extraction, general RDF tools, ontology tools, browser tools (RDF, OWL), and parsers or converters. The relative share by category is shown in this diagram:
Since the last listing, the fastest growing categories have been SPARQL, linked data, knowledge bases and all things related to ontologies. The relative changes by tool category are shown in this figure:
Though it is true that some of this growth is the result of discovery, based on our own tool needs and investigations, we have also been monitoring this space for some time and serendipity is not a compelling explanation alone. Rather, I think that we are seeing both an increase in practical tools (such as for querying), plus the trends of linked data growth matched with greater sophistication in areas such as ontologies and the OWL language.
The languages these tools are written in have also been pretty constant over the past couple of years, with Java remaining dominant. Java has represented half of all tools in this space, which continues with the most recent tools as well (see below). More than a dozen programming or scripting languages have at least some share of the semantic tooling space:
With only 160 new tools it is hard to draw firm trends, but it does appear that some languages (Haskell, XSLT) have fallen out of favor, while popularity has grown for Flash/Flex (from a small base), Python and Prolog (with the growth of logic tools):
PHP will likely continue to see some emphasis because of relations to many content management systems (WordPress, Drupal, etc.), though both Python and Ruby seem to be taking some market share in that area.
The higher incidence of Prolog is likely due to the parallel increase in reasoners and inference engines associated with ontology (OWL) tools.
The increase in comprehensive tool suites and use of Eclipse as a development environment would appear to secure Java’s dominance for some time to come.
These dry statistics tend to mask the feel one gets when looking at most of the individual tools across the board. Older academic and government-funded project tools are finally getting cleaned out and abandoned. Those tools that remain have tended to get some version upgrades and improved Web sites to accompany them.
The general feel one gets with regard to semantic technology tooling at the close of 2011 has these noticeable trends:
I have said this before, and been wrong about it before, but it is hard to see the tooling growth curve continue at its current slope into the future. I think we will see many individual tools spring up on the open source hosting sites like Google and Github, perhaps at relatively the same steady release rate. But older projects, I think, will increasingly be abandoned and will not tend to remain available for as long. While a relatively few established open source standards, like Solr and Jena, will be the exception, I think we will see shorter shelf lives for most open source tools moving forward. This will lead to a younger tools base than was the case five or more years ago.
I also think we will continue to see the dominance of open source. Proprietary software has increasingly been challenged in the enterprise space. And, especially in semantic technologies, we tend to see many open source tools that are as capable as proprietary ones, and generally more dynamic as well. The emphasis on open data in this environment also tends to favor open source.
Yet, despite the professionalism, sophistication and complexity trends, I do not yet see massive consolidation in the semantic technology space. While we are seeing a rapid maturation of tooling, I don’t think we have yet seen a similar maturation in revenue and business models. While notable semantic technology start-ups like Powerset and Siri have been acquired and are clear successes, these wins still remain much in the minority.
For some time now, Structured Dynamics (SD) has been touting the unique advantages of ODapps, or ontology-driven applications. ODapps are modular, generic software applications designed to operate in accordance with the specifications contained in one or more ontologies. The relationships and structure of the information driving these applications are based on the standard functions and roles of ontologies (namely as domain ontologies), as supplemented by UI and instruction sets and validations and rules. When these supplements are added to standard ontology functions, we collectively term them adaptive ontologies.
To further the discussion around ODapps, today we are publishing two new documents, using the semantic technology foundation of the open semantic framework (OSF). OSF is a comprehensive, open source stack of SD and external tools that provides a turnkey environment for enterprises to adopt semantic technologies and approaches. OSF has been designed from the ground up to be an ontology-driven application framework.
The first new document, posted on Fred Giasson’s blog, provides a detailed discussion of the dozen or so roles ontologies can play within an OSF installation. Fred’s document is geared more to specific properties and configurations useful to deploy this framework; that is, the “drivers” in an ODapp setting. The second new document — this one — is more of a broad overview of the modularization and architecture of the constituent ontologies that make up an OSF installation. Both documents have also been posted to SD’s open content TechWiki, which now has about 360 technical articles on understanding and implementing an OSF installation, importantly including its ontologies.
As presently configured, an OSF installation may typically utilize most or all of the following internal ontologies:
(Note: the internal wiki links to each of these ontologies also provides links to the actual ontology specifications on Github.)
Depending on the specific OSF installation, of course, multiple external ontologies may also be employed. Some of the common external ones used in an OSF installation are described by the external ontologies document on the TechWiki. These external ontologies are important — indeed essential in order to ensure linkage to the external world — but have little to do with internal OSF control structures. That is why the rest of this discussion is focused on internal ontologies only.
The actual relationships between these ontologies are shown in the following diagram. Note that the ontologies tend to cluster into two main areas:
This ontology architecture supports the broader open semantic framework:
The WSF ontology plays a special role in that it sets the overall permission and access rights to the other components and ontologies. The UMBEL ontology (or other upper-level ontologies that might be chosen) is also optional. Such vocabularies are included when interoperability with external applications or knowledge bases is desired.
We can further disaggregate these ontology splits with respect to the specific dozen or so ontology roles discussed in Fred’s complementary piece on ontology roles in OSF. These dozen roles are shown by the rows with interactions marked for the various ontologies:
- Define record descriptions
- Inform interface displays
- Integrate different data sources
- Define component selections
- Define component behaviors
- Guide template selection
- Provide reasoning and inference
- Guide content filtering (with and without inference)
- Tag concepts in text documents
- Help organize and navigate Web portals
- Manage datasets and ontologies
- Set access permissions and registrations
One of the unique aspects of adaptive ontologies is their added role in informing user interfaces and supporting specific semantic tools. Note, for example, the role of the content ontologies in informing interface displays, as well as their use in tagging concepts (via information extraction). These additional roles are the reason that these ontologies are shown as straddling both content and administrative functions in the first figure.
See Fred’s piece to learn more about these dozen roles.
Naturally, a simple drawn arrow between ontologies (first figure) or a checkmark on a matrix (table above) can hide important details of how these interactions between ontologies and components actually work. In an earlier article, we discussed how the whole workflow takes place between users and user interface selections affecting the types of data returned by those selections, and then the semantic components (widgets) used to display them. This example interaction is shown by the following animation:
The blue nodes show the ontology interactions. These, in turn, instruct how the various components (yellow) and code (green) need to operate. These interactions are the essence of an ontology-driven app. The software is expressly designed to respond to specifications in the ontology(ies) used, and the ontologies themselves embrace some additional properties specific to driving those apps.
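A minimal sketch of this pattern, with invented property and component names (not the actual OSF vocabulary), shows how generic code can defer display decisions to an ontology:

```python
# Illustrative ontology fragment: the ontology, not the application code,
# declares which display component handles each class of record. The
# class, property, and component names here are invented for the sketch.
ontology = {
    "geo:Feature":  {"displayComponent": "MapWidget"},
    "bibo:Article": {"displayComponent": "TableWidget"},
}

def select_component(record_type: str) -> str:
    # Generic behavior: changing the display means editing the ontology,
    # not the application code.
    spec = ontology.get(record_type, {})
    return spec.get("displayComponent", "GenericListWidget")

print(select_component("geo:Feature"))  # MapWidget
```

The point of the sketch is the division of labor: the function stays generic, while the ontology carries the application-specific instructions.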
ODapps are a relatively new paradigm, from which we continue to learn more about uses and potentials. We have wanted to write the first versions of these two new documents for some time, but have held off as we learned and exploited further the latent potentials in this design. As it stands, we see further potentials in this approach, and will therefore be likely adding new ontologies and capabilities to the general system for some time.
Some of the areas that look promising to us include:
These potentials arise from the native power of the design basis for ontology-driven apps. Conceptually, the design is simplicity itself. Operationally, the system is extremely flexible and robust. Strategically, it means that development and specification efforts can now move from coding and programmers to ontologies and the subject matter users who define and depend on them. With these advantages, who can argue with that?