Posted: June 2, 2014

Dawn of Artificial Intelligence: Eight Massive Trends are Waking AI from Its Dark Winters

When I inaugurated this AI3 blog in 2005 I made this statement in the about section to clarify that the “three AIs” stood for adaptive information, adaptive innovation, and adaptive infrastructure, and not the AI of artificial intelligence:

. . . I personally believe artificial intelligence to be a lot of hooey and hype at best, and a misnomer and misdirection at worst. . . . ‘Artificial intelligence’ is a misdirection of attention and energy.

Gulp. OK. Time to take my medicine.

I am today formally retracting those statements — probably should have done so some time ago — and want to explain why. As much as anything, it has to do with the changing understanding of what artificial intelligence is, recently affirmed by global-scale applications and technologies working effectively right now.

Many Winters within AI

Though the idea of automatons and intelligent agents standing in for humans is about as old as human storytelling, the basic ideas around artificial intelligence became current as part of the World War II effort and were finally given a name at a famous 1956 conference at Dartmouth. Initial namers and advocates of artificial intelligence included such founders as John McCarthy, Herbert Simon, Claude Shannon and Marvin Minsky. Money to support early interest in artificial intelligence came from the part of the US military that eventually became ARPA (now DARPA), with the funding going to individual researchers to use as they wished, as opposed to specific projects. Along with many futuristic visions of the 1950s to 1970s, the promises for artificial intelligence were bold, including being able to capture and automate most notable basic human capabilities.

Popular movies and books promoted the ideas of autonomous robots that we could speak with and command and that would anticipate our needs and wishes so as to act as simulacrum agents lessening our burdens and adding to our leisure and capabilities [1]. Algorithms would be discovered and codified that would mimic the basis of human thought and intelligence. The idea of the Turing machine established a defensible basis for foreseeing that any problem of mathematical logic could be captured and taken on by computers.

The predictable failure of this vision to deliver caused a backlash, sufficient that the US Congress prohibited further open-ended funding via the Mansfield Amendments of 1969 and 1973, such that by 1974 AI funding in the US had largely dried up. Similar restrictions were applied to the British research community. This backlash caused the first of what would prove to be many “winters” of funding and acceptance for AI.

Roughly a decade later, in response to the perceived Japanese threat of “fifth-generation” computing in the mid-1980s, a number of AI programs were again funded. While hardware developments were proceeding apace, efforts around McCarthy’s AI-oriented language Lisp and common sense logic frameworks (what are now called ontologies or knowledge graphs) such as Cyc began to receive sponsorship again. The mid-1980s were also the time of “expert systems,” to be populated by knowledge engineers charged with interviewing internal subject matter experts (SMEs) to codify their knowledge for later reuse. These efforts, too, disappointed in terms of the lack of practical benefits delivered. More AI winters ensued.

AI (“artificial intelligence”) again came to lose its credibility. Some researchers moved into specific algorithmic disciplines — Bayesian statistics and neural networks predominant — while others shifted into such areas as “hyperlinks” and what became the semantic Web. Today, one could argue that the lost mojo of AI has affected those in the semantic Web in an almost dialectic way. First, there are those who embrace the idea of intelligent agents and global knowledge structures, more-or-less in keeping with some sort of vision of artificial intelligence. Second, there are those who have seen the failures of the past, do not want to repeat them, and are more inclined to support “loosely bounded” structure focused on bottom-up assertions. OWL modelers and ontologists tend to occupy the first camp; linked data advocates more the second.

The natural community for knowledge representation and management has thus tended to bifurcate a bit: global, “visionary” AI types, with history to overcome and challenged by the sheer scale of what emerged from the Internet; and incrementalists, happy to accept a bit of RDF structured data in the hopes of an ongoing evolution to more structure and interoperability.

Ten years ago, when I made the conscious decision to reject the AI of artificial intelligence as a label for this blog, an algorithmic vision of AI seemed “wrong” and not in keeping with the general trends of the Web. That was the basis and justification for my then-statements on AI. But a funny thing happened on the way to a cogent forecast: a massive disruption called the Internet came about that — while it took a decade to gestate — changed the whole underlying substrate over which AI could take place. Like so much of history, innovation presented to us an entirely different reality upon which to “understand” and develop artificial intelligence. It is those changes — plus the fruits from them — that are defining AI in a new light.

Eight AI Megatrends

There are, by my reckoning, at least eight major trends that have been improving AI’s prospects, especially over the past decade. (Numbers #3 to #7 below are quite specific to AI; the other three are general trends.) Some of the proven wonders we now see in use — speech recognition, speech synthesis, language translation, entity recognition, image and facial recognition, computer vision, question answering, autocompletion and spell correction, recommendation systems, sentiment analysis, information extraction, document categorization, natural language processing, machine learning, reasoning, optical character recognition, word sense disambiguation, search and information retrieval, and text generation and summarization — with their many additional categories and sub-categories, are proof these trends are making a difference. None individually constitutes what may be called “AI”, but, in combination, they show compellingly that much of AI’s initial vision is indeed being fulfilled to some degree and in some specific aspect today.

Nearly all of these applications correspond to the Grand Challenges for symbolic computing identified in the 1980s. Until a decade ago, very few of them save search and initial NLP were producing results with sufficient quality and accuracy. Now, all are.

In the past ten years, most evident in the past five, tremendous breakthroughs have occurred across the entire spectrum of artificial intelligence applications. We can point to at least the eight following megatrends enabling these breakthroughs.

#1 Computer Power

A constant river of innovation has fueled the exponential power improvements in computers since the first transistor. Moore’s law has led to massive improvements in hardware cost, numbers of computation cycles, and amounts of bits stored. Networking capabilities are now truly global and the number of interconnected devices exceeds billions. Computer software innovations lead to faster and better procedures and methods; as a category, software innovation likely exceeds hardware improvements as a source of computing productivity. What today fits in the palm of our hand required entire rooms thirty years ago, and those rooms could not do one billionth of what can be done today.

The rich savanna of computing has itself encouraged a bloom of innovations, many of which contribute to artificial intelligence prospects.

#2 The Internet (and Web)

Though clearly related to the general improvements in computing and hardware, the advent of the Internet and its more relevant offspring, the Web, has had, I believe, the most fundamental impact on the change in prospects for artificial intelligence. The sheer scale of the Web network has made available crowdsourced innovations like Wikipedia and other crowdsourced data and knowledge bases. More broadly, global content across the entire Web, accessible via a common HTTP protocol, multiplied every individual’s access to information — pay close attention — by a factor of a billion or more.

Because the entire Web is interconnected, the sheer raw grist of connected data available to analyze such things as relatedness or similarity is game-changing. Manual constructs and derived relations from years past can now be multiplied and magnified at Web scale. Any relationship test or validation can be accomplished nearly instantaneously and at (essentially) zero cost. Phenomenal!
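To make the point concrete, here is a toy Python sketch of estimating relatedness directly from overlap in connections; the link sets are invented placeholders standing in for the vastly larger graphs actually available on the Web:

    # Toy illustration: estimate "relatedness" from overlap in outbound links.
    # The link sets below are invented placeholders, not real Web data.
    links = {
        "semantic_web": {"rdf", "owl", "sparql", "ontology", "linked_data"},
        "linked_data":  {"rdf", "uri", "http", "sparql", "lod_cloud"},
        "baseball":     {"bat", "ball", "innings", "pitcher"},
    }

    def jaccard(a, b):
        """Share of links two topics have in common (0 = none, 1 = identical)."""
        return len(links[a] & links[b]) / len(links[a] | links[b])

    print(jaccard("semantic_web", "linked_data"))  # high overlap -> related
    print(jaccard("semantic_web", "baseball"))     # no overlap -> unrelated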

#3 Expectations

The discrediting of AI and its holdover smell has itself been a factor working in its favor. By being discredited, it has been possible for the many components of AI, a number of them listed herein, to be developed and attended to in relative isolation. Each of today’s piece parts of AI could be focused upon on its own, without taint from the broader “AI” brush. Because the constituents were recognizable and justifiable on their own, they did not need to fulfill the past overblown visions and expectations for “AI” writ large. The piece parts could develop in peace.

This observation, if true, means that grand visions like “artificial intelligence” are perhaps rarely (ever?) the result of a grand top-down plan. Rather, like a good stew, it is individual components that need to mature and become available to create the final meal. Since these ingredients need to stand or contribute on their own for their own purposes, the actual resulting stew may vary as to its ultimate ingredients. If one ingredient is not ripe or available, we vary our recipe according to what is available. There is no one single recipe leading to a tasty stew.

Put another way, AI has been flying under the radar for at least the last ten to fifteen years. Portions of the older AI agenda have benefited from specific attention. Better still, the newly emerging idea of artificial intelligence is also more toned down and practical. Artificial intelligence is now, I believe, understood to be part of a process and not some autonomous embodiment. Human interaction and communication are themselves imprecise and subject to error. Why should artificial means to boost those same human capabilities not be as well?

From the standpoint of expectations, artificial intelligence has evolved from science fiction to essentially zero awareness, meanwhile delivering, on a broad scale, focused wonders such as (nearly) instantaneous translation across 60 leading human languages.

#4 Global Knowledge Bases

How can a system promise useful suggestions or alternatives if it is bereft of information?

At the local or personal level we well understand that we need to describe ourselves via attributes, the more the merrier in terms of a more complete description. A pretty good record for me would include such things as a physical description, an image, work and economic descriptions, family and life descriptions, an education description, text narratives from the fun to the historical, etc. The more complete description of me requires many sources, many attributes and many perspectives. But, of course, I do not live alone in the world. To describe my world, which constantly changes, I need to describe the thousands of other entities I encounter daily. Each of these, too, has many attributes and relationships to other entities. Each of these entities also changes over time (has histories) and place. So, context becomes another critical dimension.

The growth of the Web at scale has resulted in some tremendous knowledge bases of entities and concepts. Freebase and Wikipedia are two of the best known, but virtually every domain has its own sources and richness. These knowledge bases, in turn, are often open for use by others. Text mining and digital data mean these data can be combined and made to interoperate. That process is only just beginning.

Though early efforts in artificial intelligence understood that capturing and modeling common sense was both an essential and surprisingly difficult task — the impetus, for example, behind the thirty-year effort of the Cyc knowledge base — what is new in today’s circumstance is how these massive knowledge bases can inform and guide symbolic computing. The literally thousands of research papers regarding the use of Wikipedia data alone [2] show how these massive knowledge bases are providing base knowledge around which AI algorithms can work.

The abiding impression is that the availability of these data sources has fundamentally changed how AI is done. Unlike the early years of mostly algorithms and rules, AI has now evolved to explicitly embrace Web-scale content and data and the statistics that may be derived from global corpora.

#5 Deep Learning

Machine learning is a core AI concept used to determine discriminative characteristics or patterns within source input data. It has been a constant emphasis of AI since the beginning.

Various machine learning algorithms — such as Markov chains, neural networks, conditional random fields, Bayesian statistics, and many other options — can be characterized along many dimensions. Some are supervised, meaning they need to be trained against a standard corpus in order to estimate parameters; others require little or no training, but may be less accurate as a result. Some are statistically based; others are based on pattern matching of various forms.

A more recent trend has been to combine multiple techniques in what is known as deep learning, where the problem set is modeled as a layered hierarchy of distributed representations, with each layer using (often) neural network techniques for unsupervised learning, followed by supervised feedback (often termed “back-propagation”) to fine-tune parameters. While computationally slower than other techniques, this approach has the advantage of automating the supervised learning phase and is proving generally most effective across a range of AI applications.
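For readers who want a more concrete picture of that recipe, the following is a minimal toy sketch (Python with NumPy, on random stand-in data) of unsupervised, layer-wise pre-training followed by supervised back-propagation to fine-tune. It illustrates the pattern only, not any production system:

    # Minimal sketch of the two-phase "deep learning" recipe described above.
    # Data and labels are synthetic placeholders purely for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((200, 20))                       # 200 unlabeled examples, 20 features
    y = (X[:, 0] + X[:, 1] > 1.0).astype(float)     # toy labels for the supervised phase
    n = len(X)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Phase 1: unsupervised pre-training of one hidden layer (an autoencoder)
    W1 = rng.normal(scale=0.1, size=(20, 10))       # encoder weights (kept)
    W2 = rng.normal(scale=0.1, size=(10, 20))       # decoder weights (discarded later)
    for _ in range(500):
        H = sigmoid(X @ W1)                         # hidden representation
        X_hat = sigmoid(H @ W2)                     # reconstruction of the input
        grad_out = (X_hat - X) * X_hat * (1 - X_hat)
        dW2 = H.T @ grad_out / n
        grad_hidden = (grad_out @ W2.T) * H * (1 - H)
        dW1 = X.T @ grad_hidden / n
        W2 -= 0.5 * dW2
        W1 -= 0.5 * dW1

    # Phase 2: supervised fine-tuning with back-propagation through both layers
    V = rng.normal(scale=0.1, size=(10, 1))         # output layer on top of pre-trained W1
    for _ in range(500):
        H = sigmoid(X @ W1)
        p = sigmoid(H @ V)
        grad_out = (p - y.reshape(-1, 1)) / n       # cross-entropy gradient at the output
        dV = H.T @ grad_out
        grad_hidden = (grad_out @ V.T) * H * (1 - H)
        dW1 = X.T @ grad_hidden
        V -= 1.0 * dV
        W1 -= 0.5 * dW1                             # fine-tune, do not retrain from scratch

    pred = sigmoid(sigmoid(X @ W1) @ V).ravel() > 0.5
    print("training accuracy:", (pred == (y > 0.5)).mean())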

More fundamentally, there is a virtuous circle of feedback occurring between AI machine learning algorithms and reference knowledge and statistical bases (see next). This can extend the accuracy, completeness and efficiency of supervised methods. Some notable academic departments have relied on Web-scale corpora (University of Washington and Carnegie Mellon University are two prominent examples in the US). The most dominant player in this realm, however, has been Google (though all of the major search engine and social networking companies have smaller initiatives of similar character).

#6 Big Statistical Data

Using both statistical techniques and results from machine learning, massive datasets of entities, relationships and facts are being extracted from the Web. Some of these efforts, such as the academic NELL (CMU) or KnowItAll and Open IE (UWash), involve extractions from the open Web. Others, such as the terabyte (TB) n-gram listings from Google, are derived from Web-scale pages or Google Books. These examples are but a sampling of the various datasets and corpora available.

These various statistical datasets may be used directly for research on their own, or may contribute to further bootstrapping of still further-refined AI techniques. Similar datasets are aiding advertising placements, search term disambiguation and machine (language) translation. In some cases, while the full datasets may not be available, open APIs may be available for areas such as entity identification or tabular data.
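As a simple illustration of how such statistical data gets applied — say, to rank a correct spelling or completion over a bogus one — consider this toy Python sketch; the counts are invented placeholders, not actual figures from any n-gram corpus:

    # Toy sketch of using n-gram counts: prefer the candidate phrase that the
    # corpus statistics say is most probable. Counts are made-up placeholders.
    bigram_counts = {
        ("artificial", "intelligence"): 2_500_000,
        ("artificial", "inteligence"):      1_200,   # common misspelling
        ("machine", "learning"):         1_800_000,
        ("machine", "yearning"):             9_000,
    }

    def best_completion(first_word, candidates):
        """Rank candidate next words by their observed bigram frequency."""
        return max(candidates, key=lambda w: bigram_counts.get((first_word, w), 0))

    print(best_completion("artificial", ["intelligence", "inteligence"]))  # -> intelligence
    print(best_completion("machine", ["learning", "yearning"]))            # -> learning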

What is important about these trends is that data, statistics and algorithms are all now being combined in various ways with the aim of achieving acceptable AI-backed results at Web scale. It is really via the combination of these techniques that we are seeing the most impressive AI results.

#7 Big Structure

A more nascent area, really in just its first stages of effectiveness, is the application of “big structure” to all of this information. By “big structure” I mean the application of domain and knowledge graphs to help arrange and place the concepts and entities at hand.

At Web scale, the early Yahoo! directory and Open Directory were the first examples of structuring domains. Wikipedia next became the most widely used category structure; Freebase, for example, used Wikipedia to initially bootstrap its own structure. A portion of Freebase now underlies Google’s own Knowledge Graph. DBpedia also created its own ontology out of the infobox structure of Wikipedia. The major search engines have also put forward the schema.org structure as a means of (mostly) organizing entity and attribute information and structured data. The schema.org structure is putatively an input to the Google Knowledge Graph, but the exact mechanism and the ability to trace the results are pretty opaque.
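A small sketch helps show what “big structure” adds in practice. Assuming the Python rdflib library is installed, the snippet below types an illustrative entity against the shared schema.org vocabulary so that its attributes have an agreed place to land; the entity URI is invented for the example:

    # Sketch only: attach schema.org typing to an entity so its attributes sit
    # within a shared structure. The entity URI is a made-up example.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    SCHEMA = Namespace("http://schema.org/")
    g = Graph()

    museum = URIRef("http://example.org/entity/some-museum")   # illustrative URI only
    g.add((museum, RDF.type, SCHEMA.Museum))                   # the "big structure" part
    g.add((museum, SCHEMA.name, Literal("An Example Museum")))
    g.add((museum, SCHEMA.location, Literal("Example City")))

    print(g.serialize(format="turtle"))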

The need for big structure is rapidly emerging as one of the key challenges for Web-scale AI. The Web and crowdsourcing appear well suited to generating entity and attribute data. What remains unclear is how this information can be coherently organized at the scale of the Web. This problem is becoming acute, because the success of “big data” on the Web ultimately needs to find an organized, coherent expression in the aggregate. This is one major AI challenge that remains distinctly unsolved, though promising first steps exist.

#8 Open Source and Content

The major theme of these AI breakthroughs is leveraging the global content of the Web. This enabler, in turn, has been critically dependent on open source AI algorithms, software code, infrastructure and architecture, and on open content and (generally) open APIs. Open code, algorithms, datasets and knowledge have expanded the pool of human intelligence that can be brought to bear on the question of artificial intelligence. The positive feedbacks greased through open channels of information, code and data have been absolutely essential to the amazing AI progress of the past few years.

To be sure, open does not mean a level playing field. (See discussion on Google, next.) But, without open source and open content and data, I think no one could argue that progress would have been anywhere near as rapid as it has been. The synergy arising from open source and content has thus been another essential factor in the recent and rapid progress in AI.

The Race to Intelligence

Since innovation is the source of wealth creation, it is also no surprise that the megatrends surrounding AI have also drawn significant investment interest. This interest is in the form of a race to acquire the most innovative AI startups and human expertise (capital) in AI. Since Google has been my common touchstone in this piece — and because Google is the biggest gorilla in the room — we can use them to illustrate the scope and pace of this race. (Though Amazon, Facebook, Microsoft and IBM are also clearly entrants in this race.)

A number of recent articles, notably ones in the Washington Post and The Economist, have highlighted the total dollars at stake in this AI race. Over the past few years, there has been perhaps more than $20 billion in AI-related company acquisitions, with Nest Technologies (Google, $3.2 B), Kiva Systems (Amazon, $775 M), and DeepMind (Google, $660 M) among the largest.

Within Google alone, there has been a buying spree in search improvements (~ $1.4 B total), robotics ($80 M), machine synthesis and recognition ($250 M), machine learning ($700 M), smart devices ($3.6 B), compression technologies ($200 M), natural language processing ($80 M), and a smattering of others ($50 M), not to mention its internal efforts in self-driving cars. I don’t monitor Google on a constant basis and likely missed some major and relevant acquisitions, but it does appear that Google has perhaps spent over $6 billion over the past five years or so for AI-related acquisitions [3].

As important as start-up acquisitions has been Google’s commitment to hire and partner with many of the leading AI researchers in the world. Besides the strong partnerships Google maintains with such institutions as the University of Washington, Carnegie Mellon University, MIT, Stanford, UC Berkeley and others, it has also staffed its research ranks with prominent names from those institutions and others.

Peter Norvig, one of the early advocates for combining algorithmic and statistical AI, joined Google in 2001 and is now its Director of Research. Most recently and notably, Ray Kurzweil joined Google as Director of Engineering in 2012. Other notable AI researchers at Google include Alon Halevy (Fusion Tables), Ramanathan Guha (schema.org), Geoffrey Hinton (deep learning), Evgeniy Gabrilovich (search and machine learning), and many others whose research I am less familiar with. There is probably more AI talent combined at Google than has ever been assembled in one institution before.

With IBM’s Watson getting its own division and Facebook funding an AI center to the tune of $10 B, plus Apple making a similar commitment to robotic manufacturing, it is clear that all of the major players in the computing space are making big bets on AI moving into the future.

AI is Itself But One Beneficiary of These Trends

Since the early winters in artificial intelligence, a phenomenon has developed called the “AI effect”. It really has meant two different things.

First, AI researchers have tended to call their research anything but artificial intelligence. One of the broader and trendier substitutes is known as cognitive computing. Many of the domains and disciplines I noted above got their names and prominent use as substitutes for what used to be labeled as AI. In any case, we can see that AI indeed is a big tent with many components and thrusts.

Second, the “AI effect” also refers to the fact that once an AI technique is embedded in some everyday use, it is no longer perceived as something AI and is taken as a given. Douglas Hofstadter expressed the AI effect concisely by quoting Tesler’s Theorem: “AI is whatever hasn’t been done yet.”

I was perhaps right to initially reject the algorithm-centric view of AI from the early years. But now, when matched with big data, big statistics and big structure, all embedded into phenomenal advances in computing power, it is also clear that a new age of AI is dawning. One only needs to look at the wondrous progress over the past five years on many of what had seemed to be impossible Grand Challenges to gain an appreciation of the pace and breadth of new developments to come.

These developments will reify and foster similar emphases in semantic technologies, graph structures and analysis, and functional programming and homoiconicity (“data as code, code as data”) that my colleague, Fred Giasson, is now actively exploring. We will find that representational paradigms and the basis of how our tools and algorithms work will increasingly align. There appear to be natural underpinnings to these phenomena, including the pivot of language and meaning, that are closely aligned with the thoughts and writings of that great American pragmatist and logician, Charles S. Peirce. We will increasingly come to see that the wondrous innovations of self-driving cars, talking smartphones, warehouses of fulfillment robots, and computer vision systems can trace their roots back to basic truths of how to see and understand our world.

Understanding these forces will itself help to formulate guidelines and ideas that can foster further innovation. So, in the end, while I still don’t like the term “artificial” intelligence, it is merely a sign or a term. Adaptive innovations expressed by machines are simply part of the intelligence and structure embodied in the universe, for which we are now gaining the tools and understanding to exploit.


[1] Douglas Adams’ Hyperland is a great exposition on this vision, with my 2007 blog post pointing to the online video.
[2] Wikipedia maintains its own page of research that relies on Wikipedia; I have earlier captured about 250 selected sources called SWEETpedia that relate specifically to semantic technologies and AI.
[3] These are merely estimates, and likely quite wrong in many specifics. The estimates were compiled by reviewing a listing of Google acquisitions (since 2009), supplemented by individual company searches when the acquisition amounts were not listed, followed by analysis of Google’s SEC Edgar filings in a manner similar to this analysis (which was also used for the robotics estimate).
Posted: April 24, 2014

Another Expansion in Documentation for the Open Semantic Framework

The Open Semantic Framework is a complete foundation to bring semantic technology capabilities to the enterprise. OSF has applications from enterprise information integration to collaboration networks and open government. It has been under development since 2009, leveraging a set of robust open source engines and connecting Web services and architecture, and is now in its third major version. OSF is fully integrated as a semantic technology extension to the Drupal content management system.

Structured Dynamics, the developer of OSF, with the generous support of SD clients, has been committed to providing excellent documentation and tech transfer support for OSF since its inception. For example, OSF now has a technical support library of nearly 500 documents, plus many automated means for installing and testing the OSF stack.

Yet, as all of us know, written documentation is not always discovered or read. The paradigm for technology transfer is shifting to online tutorials and screencasts.

In keeping with that trend, SD has committed itself to develop a (hopefully) complete suite of online screencasts and tutorials geared to the nuts-and-bolts of how to install, configure, test, manage and use an OSF installation. Our intent is to aid users to bring semantics into the enterprise without the need for external support or cost.

We call this curriculum of tech transfer screencasts and video tutorials the OSF Academy.

Over the past week we have been releasing the first dozen screencasts in this series. With this foundation, it is now time to make a broader announcement of the OSF Academy.

So, On With it Now

Welcome!

We are on pace to release many dozens of specific screencasts on all use and management aspects of OSF. Please stay tuned over the coming weeks.

You can always see the complete contents of the YouTube channel at the Open Semantic Framework Academy.

Also, as basic grounding, know that the OSF wiki section on screencasts is another central access point to this content.

The Series Begins

Most of the screencasts are quite specific to particular aspects of using the Open Semantic Framework. However, tutorial #1 is a useful overview of OSF and the series:

The Next Ten Screencasts

SD’s CTO, Fred Giasson, is the key demo jockey for most of the OSF Academy screencasts. Many of these screencasts are technical, and all are specific and focused. Access each screencast by number below. There is also a blog post associated with each screencast that provides useful background information and links.


Where Next?

We have nearly four dozen additional screencasts in our plan to round out introductory material to OSF. Please monitor our OSF channel on YouTube to stay on top of these releases.

Posted: February 24, 2014

Smell the Money: To Combat a Decline in Mindshare, Follow What is Pragmatic

A secret of the semantic Web community is that energy, innovation and participation have slipped over, say, the past three or four years. This has been obvious for some time. I began collecting statistics on such things as prevalence in Google searches, attendance at SemTech or xSWC meetings, postings to user groups, blog postings, heck, even stupid and lengthy controversies on the mailing lists, or the sale and then sale and then sale of SemTech itself.

Fortunately, I realized that my observation of a decline did not depend on having documentary backup: the trend was obvious. So, I could stop collecting time-sucking statistics. I’m sure many of the participants in the formation of the semWeb know exactly of this decline in energy and focus of which I speak.

Other endeavors have kept me from worrying too much about such matters, but recent griping in public forums about the state of the semantic Web got me again thinking about premises and the state of semantic technologies. Such re-thinks are useful because they help put current circumstances into context, and because they help guide how to spot emerging opportunities.

While I am not feeling overwhelmingly passionate about such matters, there does appear to be a villain in this story, what I might term the FYN crowd [1]. But, like all good villains and stories, villainy is mostly a matter of context, with the winners being the ones writing the history. So, accept my thoughts as arising as much from my own worldview as from anything else . . . .

Galileo’s Balls

Once one embraces an intellectual domain with the premise of semantics, then meaning and context a priori become first citizens. Depending on viewpoint, what the semantic Web means to one individual can differ substantially from another individual. Moreover, the space becomes a sort of cipher for expressing any worldview, legitimately. For example, one tension at the heart of the semantic Web enterprise has been bottom up v top down; another has been anything goes v more structure and formalism. Hot buttons arise when worldviews differ, as they always surely do. The semantic Web is no exception.

Yet the stated bases for these semantic Web hot buttons, I would claim, are simplistic. What really occurs in the semantic technology space is something more akin to the Galileo thermometer: multiple viewpoints finding multiple resting points. Only in the semantic Web case, the natural resting points don’t simply occur along a single dimension of, say, formalism, but along other viewpoints as well. So, what we end up with is something more akin to a 3D or multi-dimensional column. There are an infinite number of resting points in reasoned discourse.

Why should this be strange or threatening? Of course, upon inspection, it is not. The understanding that needs to arise is that semantics is truly about differences at all levels of human experience, perceptions and language. A pragmatic semantics must reflect this reality.

I don’t think that these sentiments will ever translate into precision or algorithms. But they can be modeled approximately with algorithms and refined with judgment. Much of their essence can also be captured by ontologies. These are viewpoints that can be captured in silico and used to help humans make better decisions. Semantics are essential to these prospects. At the heart of any pragmatic semantics must be an accommodation of viewpoints and terminology.

The real point in all of this — actually, also the major reason for semantic technologies in the first place — is that for any topic of normal human discourse there is a variety of viewpoints. Only a system expressly designed to respect these differences can be an effective digital means of interoperability.

Tribal Diversities

There are many tribes within the semantic technology space. Academic researchers are the most visible tribe. Because of funding nuances and general interest and tradition (though there are real differences between the US, the Commonwealth countries, the EU and Asia), academics have set — and sometimes continue to set — the tone for the semantic Web community. This has been useful to establish a coherent and (generally) logical basis to the underpinnings of the semantic Web. But most in the community would also acknowledge this basis is not sufficient to achieve commercial breakthrough.

In the US, there is a strange mix, with many semantic researchers flying below the radar, because they work for the three-letter intelligence agencies. Also, there is a very strong biomedical community, often funded from the National Library of Medicine. The biomedical community has been an exemplar innovator. Because of this community’s efforts, we now can see how an entire domain — biomedical — can develop and leverage ontologies, establish common vocabularies or standards, or cooperate on tools development. There is no public community more advanced in semantic technology developments than the biomedical one.

Another tribe in this space is the successful hunter, able to use semantic technology capabilities to attract and secure paying customers. Most of the activities of these tribe members are hidden from view, because their paying efforts are by nature infrastructural and concentrated on enterprise and commercial customers. But, also, many individuals within this tribe actively contribute to public efforts and conferences. Many of the more visible semantic technology companies, including my own, occupy this space.

But the most enriched tribe of the semantic Web has been the background semantic orchestrator, generally through infrastructure-based initiatives like broadscale knowledge representation, statistical analysis of massive text corpora, well-considered ontologies, or knowledge structures. The semantic efforts of the search engine vendors, including Bing and Google’s knowledge graph, are members of this tribe, as is Siri, now part of Apple.

These differences in market focus and visibility have tended to play out in expected ways. Academic researchers, Web enthusiasts and those committed to open data have been most vocal about “linked data”. They tend to be the more visible participants in semantic Web mailing lists and forums. Casual followers of the semantic technology space, or those new to it, mostly hear these same voices. By default, the apparent health and status of the semantic Web is more-or-less defined by these voices.

When I said in the intro that the semantic Web has slipped over the past few years, that perception is mostly the result of the lowered volume and fewer messages coming from the vocal tribe. But there are two problems with the accuracy of that perception. The first, as argued above, is that the vocal and visible linked data advocates are not the only representatives of the community. And the second, which I’ll get to in a moment, is that the vocal community’s prescriptions for the semantic Web, in my opinion, are no longer the most meaningful ones.

Branding, Terminology and Marketing Messages

Many early proponents of the semantic Web, I think it fair to observe, would say that two positioning mistakes (from their perspective) have kept the paradigm from grabbing greater hold. The first reason often cited is the use of XML as the initial syntax of RDF. At first blush, I agree with this observation, given that when I was first entering into the dark chambers of the semantic Web it was at times difficult to separate XML from RDF. Today, though, most semWeb practitioners prefer alternative serializations. I personally don’t think that any difficulties semantic Web understanding and adoption may pose today are any longer influenced by a decade-old XML confusion. In Web years, these are eons.

The second reason seems to have been the flat-out retreat from “semantic Web” terminology. The conscious decision to switch to the “linked data” branding began in earnest about 2008. I find this shift interesting. I think it relates to looking to the wrong measures of success. What seemed like a clever re-branding at that time has both set the focus in the wrong direction and consequently set the wrong targets for measuring success.

In the areas of standards and movements, moral authority, suasion and prominence often become the bases for who is viewed as “owning” a new concept. There has been much of this posturing around the “semantic Web” and “linked data”, with parry thrusts from “Web 3.0” and “big data” and “open this or that”. So, I’m not surprised that branding many of the concepts of the semantic Web with a new term — “linked data” — was pushed and took hold. But why the original semantic Web advocates adopted this term, and its shift in focus from an ecosystem to data representation and exchange, does surprise me.

The strange thing, in my opinion, is the monadic emphasis on “linked data” that acts to partially kill the semantic Web mindshare. Whether by design or fallout, “linked data” inexorably shifts the focus to how data is represented and transmitted. It is a royal pain in the ass for publishers to publish “linked data” and then, when done, there is surprisingly little consumption of it. The MusicBrainz announcement last week that it was dropping RDFa is telling [2]. We are seeing the representation of structured Web data being driven on other bases, as evidenced by the success of JSON, something that linked data enthusiasts have only lately come to embrace, and the schema.org initiative of the major search engines.

Once linked data was raised as the lead banner, other branding messages followed. The first add-on message was “follow-your-nose”. FYN represents clicking from link to link, following data references of interest on the Web [1]. In order for that to be facilitated, but also as a means to clear up some confusions about linked data, the quality standard of “5-star linked data” was also put forth. To achieve all five stars, linked data should conform to open standards such as RDF and link to other data for context [3].
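For those who have not seen FYN in practice, here is a minimal Python sketch using the rdflib library; it assumes the library is installed, that network access is available, and that the example DBpedia URI still dereferences to RDF:

    # Follow-your-nose sketch: dereference a URI, inspect what it links to, then
    # follow one of those links in turn. The DBpedia URI is a familiar example;
    # this requires network access and an endpoint that still serves RDF.
    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL

    start = URIRef("http://dbpedia.org/resource/Tim_Berners-Lee")
    g = Graph()
    g.parse(start)                      # dereference the URI and load its triples

    links = [o for o in g.objects(start, None) if isinstance(o, URIRef)]
    print(len(links), "outbound links found")

    for same in g.objects(start, OWL.sameAs):
        print("following the nose to:", same)
        g.parse(same)                   # each hop adds more triples to the graph
        break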

Today, on virtually all “official” semantic Web forums you will see mention of the brands of linked data, FYN, 5-star linked data, and open data. Publishing of data according to best practices that enables global links from datum to datum across the Giant Global Graph has become the sort of gold standard associated with this new branding.

What is the Measure of Success?

Success is always measured against our premises and values. In the case of the vocal tribe, the premises and values relate to linked and open data. By these measures, the semantic Web is a mixed bag. On the positive front, many laudable sources of quality data — most recently the Getty Museum [4], but also the Library of Congress, arts and humanities publishers across Europe, many science realms beyond biology, and of course hundreds of others made famous by the LOD cloud [5] — are published as linked data, or are in the process of being so. Open data sets are coming from government at all levels [6].

On the negative front, the growth of published linked data has fallen behind the pace of publishing structured data in general, and notable evidence for where the consumption of linked data has made a difference is pretty hard to find. Linked data advocates only rarely discuss integration with “closed”, proprietary data or enterprise use, integration and realities. Shitty sameAs assertions abound everywhere. Markets find it hard to get excited when the arguments and reference frameworks don’t relate well to their actual problems and pain points. DBpedia can only go so far, and a mountain of links to it without relevance, context or quality is just so much more noise [7].

The point here is not to mount a screed against linked data, but to caution: Be careful how you brand yourself. By the measures of growth and penetration and uptake of linked data, moreover linked open data, the semantic Web space is generally not attracting developer interest, media attention or venture dollars. I hope the release of meaningful linked data continues, but setting that goal as the measure of the semantic Web’s success is selling the wrong product.

Rather than setting a FYN objective as to whether our semantic technology efforts to date have been a success, I suggest we adopt a “follow the money” (FT$) premise. Who is investing or making money off of this stuff, and how and why? Herein lies a different measure of success.

If we look to the approaches taken by those making money in this market, we find that the:

  • Challenges of meaningful connections
  • Interoperability
  • Integration across document and structured data
  • Discovering new patterns and relationships
  • Facilitating semantic understanding across disparate communities and legacy data sources, and
  • Providing quality characteristics for new entities,

are where the bucks are being made. These activities are all at the heart of the knowledge worker’s job responsibilities. Even the earliest advocates of the semantic Web must have aspired for it to address these meaningful challenges.

Another secret to systems like Freebase, Google knowledge graph, Bing, Watson, Siri, or similar innovations is their use and reliance on Wikipedia, at least in their formative stages. Though often DBpedia was the structural form of ingest, the core basis of these systems’ capabilities comes from content — Wikipedia — the access to which was only made easier via DBpedia.

The sentiment to follow the money is not a sell out or a political statement. It is a recognition that work worth doing is work others appreciate and are willing to pay for. It is the best signal amidst the noise of what is valuable to work on.

It’s Time for the Side B Hit

I’ve been a fairly active participant in the semantic Web for nearly 10 years. I sometimes have the image of an aspiring music artist from the ’50s or ’60s arguing with the record execs over which song should be the favored Side A cut on the 45. The visible voices of the semantic Web want to push FYN and linked data as Side A, but it really isn’t selling, according to the advocates’ own success measures.

The Side B of interoperability, RDF and OWL is not just “filler” to the main promotion, but where I clearly think the hit resides. Some have heard that track, buy it, and are enthused about it. It would be nice if the record execs could see what is right before their face and begin promoting it as well.

FYN and its vocal proponents risk the perception of failure of the semantic Web enterprise from the simple fact of putting linked data front and center. Sure, it is a good approach with potentially rich information so long as you can trust the source both for the content itself and the quality of its RDF expression. No one is arguing with that.

But SGML and ASN.1, one could argue in a similar vein — among literally dozens of others — were great and useful notations, yet are now mostly historical footnotes. If a trusted source is going to serve me up 5-star linked data, I will take it. Yet the truth is I would take structured data in any form from a trusted source, but take no linked data from an unknown source or one with poor linkages. We spend much time looking at these issues for our clients, and it is the rare linked data set that becomes part of our solution. Even then, we carefully scrutinize all assumed connections.

The Side B semantic Web of vetted and interlinked, interoperable data organized by competent graphs is the winning side. It is the only location where true economic transactions are taking place around the semantic Web. To understand where the semantic Web makes sense, follow the Side B money to your answers.

The insight gained from a FT$ approach clearly points to the failure of FYN. I say, do linked data if you can, it is the best ingest format around. But don’t get too hung up on that. Spend your time figuring out how to bridge meaningful gaps in semantics or data across any enterprise, global or local. Information is not truffles, and following your nose is not the primary argument for the semantic Web.

[1] FYN, or Follow Your Nose, is the general practice of performing Web retrieval on URIs in a knowledge base to obtain more knowledge. Two W3C articles provide additional commentary. In the linked data context, FYN represents clicking from link to link, following data references of interest. FYN is a specific pattern of linked data. Ed Summers provided one of the better overviews of the use of FYN in the context of linked data and the Web of Data.
[2] See the MusicBrainz blog from February 18, 2014.
[3] Tim Berners-Lee describes 5-star linked open data in this article.
[4] The Getty Museum recently made a portion of its Arts and Architecture Thesaurus (AAT) open source using linked data; see http://blogs.getty.edu/iris/art-architecture-thesaurus-now-available-as-linked-open-data/.
[5] The linked open data (LOD) cloud diagram and supporting information is maintained at http://lod-cloud.net/.
[6] I have often written on the problems with linked and open data as presently practiced. See Practical P-P-P-Problems with Linked Data (October 4, 2010) and The Nature of Connectedness on the Web (November 22, 2010) as two examples. Specific commentary on open data in government is provided in When Linked Data Rules Fail (November 16, 2009).
[7] For another assessment of the state of the semantic Web, see Brian Sletten’s recent Keep On Keeping On article on semanticweb.com (January 13, 2014).
Posted: February 4, 2014

Some Thoughts on SD’s Gestation of Civic Dynamics

Structured Dynamics (SD) announced yesterday that, in association with its partner Buzzr, it was spinning off a new software company, Civic Dynamics Inc., headquartered in Québec City, Canada. Included in the launch was the introduction of the new company’s Civic Dynamics Platform. CDP is open-source software and supporting systems that assist municipalities in publishing dynamic open government data, and that provide citizens a set of tools for viewing, searching, filtering and analyzing that data.

The announcements of those releases stand on their own. My purpose is not to duplicate them. Rather, now that the efforts needed for the new launch are behind us, I wanted to reflect on why and how such a spin off occurred in the first place. I think these reflections offer some insight into imperatives that face new software ventures, especially those geared to enterprise IT.

A Bit of History

It was just about five years ago that Fred Giasson and I began Structured Dynamics. (This was also after a year working together at Zitigist under the sponsorship of OpenLink Software.) Our mission at SD’s inception was to create a workable platform for bringing semantic technology capabilities to enterprises. Our specific interest was in using semantic technologies and RDF to solve the decades-old challenge of information interoperability in larger organizations. By serendipity, we were able to secure an enterprise client on virtually the first day we started SD. That forced us to grapple immediately with the then-current woeful state of semantic technologies for enterprises.

We observed a number of problems at that time. Here is a short list of some of those problems from five years ago, and brief statements of what we initiated to address them:

  • Search — native triple stores at that time were not performant in search, and none captured the full text of documents. Further, semantic search offers unique opportunities in structure and inference. As a result, we were one of the first to adopt Solr for semantic technologies
  • Portal framework — there was a (general) absence of portal front ends that met acceptance in the marketplace. We evaluated and chose Drupal; over time a design choice to have loose coupling with Drupal has transitioned to become more integrated
  • CRUD — basic database management capabilities, such as create, read, update or delete, were not often exposed at the application level. Our choice here was to decouple this access and adopt a distributed design by embracing RESTful web services, endpoints and APIs, all of which were geared to provide a universal abstraction for dealing with all data engines (collectively expressed as a “repository”); see the sketch after this list
  • Architecture — though complete frameworks had been put forward, mostly by academic researchers, most had short lives and all lacked basic enterprise capabilities. We designed an architecture that favored integration and expansion — largely though APIs — while leveraging existing components. We also at this time made a commitment to open-source for all key components of the architecture
  • Stack — there were no complete software (deployment) stacks. Creating one required fragmented piece parts with gaps; and there certainly were not standard deployment or installation abilities. Much of SD’s effort over the past five years has been addressing this gap
  • Access control (security) — virtually all enterprises need to control access to privileged information, and no security existed five years ago for semantic applications. In the early versions of the Open Semantic Framework, the foundation to Civic Dynamics’ CDP platform, we used a simple IP authorization approach based on the interaction of tool (endpoint), dataset and role. Subsequently we have established middleware integrations with third-party security and key-based permission mechanisms when OSF is used standalone
  • Version control — any enterprise content system or repository must also have ways to track revisions and enforce version control. Early semantic technologies completely lacked these considerations. OSF has made progress in integrating with the Drupal revisioning system and in establishing middleware methods for interfacing with third-party version control systems
  • Workflow support — managing enterprise content in general, and more specifically managing the semantic aspects of integration, requires formal workflow and governance procedures. However, historically, and up to and including today, there is zero workflow support in semantic technologies. In fact, there is virtually no discussion of this topic at all. We are only at the beginning stages of incorporating formal workflow methods into OSF. We have development methodologies and best practices, though, and have identified suitable workflow engines to extend the system with formal workflow methods
  • Data ingest — five years ago there was little recognition that data in the wild would not be compliant with W3C standards nor RDF, and as a result demo systems lacked ingest capabilities for legacy information, particularly enterprise database info. OpenLink Software and its Virtuoso system (one of the core engines in OSF) did, however, recognize this need. The OSF design has very much followed this approach of using “converters” or “RDFizers” for getting all wild data forms into a canonical RDF basis internal to the system
  • Reference vocabularies — the ultimate means of integrating enterprise information to achieve interoperability is premised on semantic approaches and technologies. Yet, apart from some minor vocabularies, there were no suitable vocabularies five years ago in many areas. We have constructed and supported an across-domain reference vocabulary (UMBEL) and a means of representing instance data and records (irON) since then to redress these gaps
  • Tooling — the means to design, manage, and test components of a semantic enterprise stack were nearly totally lacking, since most early semantic efforts were proofs-of-concept and not production-grade systems. The areas of ontology design and maintenance were (and still somewhat are) weak. We have since developed many new tools, some geared at the user level, with administrator and developer tools including test suites, command-line utilities, and automated installers
  • Templates, widgets and visualization — the highly structured nature of RDF data lends itself well to templating records by type, page layouts by type, and widgets by type, which may be further leveraged using inheritance and inference. The recognition of the role of semantic technologies as publishing platforms did not exist five years ago. Our response has been to develop a template inheritance system and semantic data-driven widgets. Our earliest widgets were based on Flash; the libraries are now migrating to JavaScript (d3.js) and HTML5
  • Lack of documentation — lack of documentation is the bane of most open source projects, and early semantic technologies were most often developed for academic theses or as proofs-of-concept. As a result, documentation of use to practitioners and administrators was totally lacking. SD has made a concerted commitment to improved and complete documentation. Our OSF wiki with its nearly 500 technical and methodology articles and accompanying images is one expression of that commitment
  • Lack of enterprise rigor — across all of these fronts, early semantic technologies were clearly not designed and developed with enterprise objectives and use cases in mind. SD’s overall commitment has been to rectify this gap.
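As referenced in the CRUD bullet above, the sketch below shows the general pattern of exposing create-read-update-delete as RESTful web services. The endpoint paths, parameters and base URL are hypothetical placeholders for illustration — not the actual OSF API — and the Python requests library is assumed:

    # Illustrative sketch only: endpoint paths, parameters and the base URL are
    # hypothetical, not the actual OSF web service API. The pattern is the point:
    # CRUD exposed as RESTful web services so any data engine behind the
    # "repository" is reached the same uniform way. A service must actually be
    # listening at BASE before these functions are called.
    import requests

    BASE = "http://localhost/ws"                      # placeholder web services base URL
    DATASET = "http://example.org/datasets/demo/"     # placeholder dataset URI

    def create_record(uri, label):
        """Create: register a new record in a dataset."""
        return requests.post(f"{BASE}/crud/create",
                             json={"dataset": DATASET,
                                   "document": {"uri": uri, "prefLabel": label}})

    def read_record(uri):
        """Read: retrieve a record back by its URI."""
        return requests.get(f"{BASE}/crud/read",
                            params={"dataset": DATASET, "uri": uri})

    # Update and delete would follow the same endpoint pattern (crud/update, crud/delete).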

What we did have five years ago was a growing list of (often) unproven open standards (principally those from the W3C) and a large roster of prototype and research tools [1], most from the academic community. Still, there were some proven engines suitable to a semantic stack (most adopted as core to the Open Semantic Framework), so there were building blocks upon which a complete framework could be based. With the right design and architecture, and appropriate “glue” to tie it all together, it appeared quite feasible to create a working semantic stack suitable for enterprise use. Multi-component, open-source packages — ranging from Alfresco to Talend or Pentaho — were showing the path to such next-generation platforms.

With the development model of an integrated semantic technology stack based on open source components and consistent “glue” in mind, we could then turn our attention to the business model and strategy behind the nascent Open Semantic Framework.

The Business Philosophy

I don’t speak much about my prior ventures because, well, they are in the past. But I have financed ventures via angel funding, venture capital, grants and client revenues. I also have background in ventures ranging across many aspects of enterprise (mostly) and consumer (less so) software.

Our funding prejudice in starting SD was to be self-financed via clients. A customer focus keeps one from getting too abstract or falling in love with innovations for which there may not be a real market. Revenue financing also means that we need not alter business strategy or approach based on a financier’s perspective. Customers call the shots; not the money interests. This funding prejudice has kept us market focused and, as a consequence, profitable since day one.

Our staffing prejudice was to not hire, at least during the framework development phase. Setting the vision for a framework is not a democratic activity, and every hire means less development productivity. To fulfill our work, we have partnered with and employed consultants and sub-contractors, but have not diluted our own efforts with managing employees. We could stay focused by feeding only our own mouths and our vision.

Such narrow bandwidth also carries other implications. We could not take on too many clients at a given time. We needed to be extremely productive and leveraged, finding opportunities wherever we could to re-purpose prior writings or reusing or generalizing code. We also needed to be quite selective in what projects and what clients we chose. When attempting to make progress on a new platform, it is important to not become simply a contract fulfillment shop. Customers have many options for IT contracting or outsourcing; platform development and growth requires a certain self-selection by clients.

Our standard contract emphasizes that (most) efforts are intended to be open source, and our intellectual property clauses make that explicit. At first we did not know how the market might react to this insistence. For prospects serious enough to commit monies to us, however, we have found a good appreciation that open source leads to lower current project costs because the client is leveraging what has already been developed before. It seems only fair that new developments should be made available to later customers as well. Some of our prior clients are now seeing the lower costs and benefits of leveraging intermediate work in upgrading to the latest versions and functionality.

Our fulfillment prejudice has been to complete work on time and under budget, document and train the customer in the work, and move on. Though we know maintenance fees are profitable and the bread-and-butter of most enterprise vendors, we have not sought recurring annuities from our clients. By keeping our eye squarely on successful tech transfer, we discipline ourselves to document as we go, provide tooling and support infrastructure as well as application software, and find efficiencies in fulfillment. Meanwhile, we are able to progress rapidly on our overall development roadmap without getting bogged down in handholding. We would rather teach the customer how to fish than do the fishing for them.

Of course, not all enterprises understand or embrace these philosophies. That is fine under our development approach, where market understanding and refinements drive decisions, not maximizing revenues for an ever-growing staff count. We have been blessed to have new clients arise whenever they were needed, and to have them be real partners in furthering the vision. We have actively rejected some customer prospects because the philosophical fit was not good. We have also actively weaned ourselves from some engagements by insisting on sunsets for our support and encouraging more tech transfer and training.

These prejudices may change as we see the underlying Open Semantic Framework nearing fulfillment of its development vision. But, for an open source platform in a hurry (even considering it has been five years!), we believe these philosophies have served us and our clients well.

An Emphasis on the Open Semantic Framework

The net outcome for the Open Semantic Framework has been to emphasize a generic, enterprise-ready design that can be rapidly embraced and adopted by multiple markets. We have called OSF a platform of ontology-driven applications (ODapps). ODapps are modular, generic software applications designed to operate in accordance with the specifications contained in one or more ontologies; they fulfill specific generic tasks. Examples of current ontology-driven apps include imports and exports in various formats, dataset creation and management, data record creation and management, reporting, browsing, searching, data visualization and manipulation, user access rights and permissions, and similar. These applications provide their specific functionality in response to the specifications in the ontologies fed to them.
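To make the ODapp idea a bit more concrete, here is a minimal sketch (in Python, using the rdflib library) of what “the ontology drives the application” can look like. The inline ontology, namespace and class names are hypothetical stand-ins for illustration only; they are not OSF’s actual web services or data structures, which are implemented quite differently.

```python
# A minimal sketch of the ontology-driven application (ODapp) pattern: a generic
# tool reads its instructions from an ontology rather than from hard-coded logic.
from rdflib import Graph, RDF, RDFS

# A tiny stand-in ontology; real deployments would feed in full domain models.
ONTOLOGY_TTL = """
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/schema#> .
ex:BuildingPermit a rdfs:Class ; rdfs:label "Building permit" .
ex:Neighbourhood  a rdfs:Class ; rdfs:label "Neighbourhood" .
"""

class GenericBrowser:
    """A generic 'browse' ODapp: it surfaces whatever the ontology declares."""

    def __init__(self, ontology_ttl: str):
        self.graph = Graph()
        self.graph.parse(data=ontology_ttl, format="turtle")

    def facets(self):
        # The browsable facets are not hard-coded; they are whatever classes
        # the ontology declares, along with their human-readable labels.
        for cls in self.graph.subjects(RDF.type, RDFS.Class):
            yield str(self.graph.value(cls, RDFS.label) or cls)

# Feeding the same generic tool a different domain ontology changes what it
# browses; the application code itself never changes.
if __name__ == "__main__":
    for facet in GenericBrowser(ONTOLOGY_TTL).facets():
        print(facet)
```

The design point is that swapping in a municipal, health or publishing ontology re-purposes the identical generic tool for a new domain, with no application code changes.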

The ODapp vision underlying the design of OSF means we can leverage an architecture of generic tools to respond to virtually any knowledge application or any enterprise domain. The basic idea is shown by this diagram, which we first published about three years ago:

(Figure: The Open Semantic Framework can Spawn Many Different Domain Instances)

In the five years of development of OSF, now at version 3.x (recently announced), we have had the good fortune to have clients and uses in publishing, tech transfer of R&D, group collaboration, health, automotive, air traffic control, sustainability, community indicators and local government. Demand in the latter two areas has been particularly strong. The strength of that market interest was the source of the dilemma for Structured Dynamics.

Unique Demands of Municipal Markets

The idea of rapid and nimble development of a new platform — especially one expressly designed to be generic across multiple domains — does not readily square with focusing on a specific market segment. This disconnect is particularly true for quite unique markets, as is the case for local governments.

In a past life I spent nearly ten years working for APPA, the trade association that represents municipally owned electric utilities. APPA has members ranging from huge municipalities such as Los Angeles, Toronto, Seattle and San Antonio to the smallest towns and burgs of the plains of North America. In my former role running the R&D and technical programs for that association, I personally interacted with hundreds of these wide-ranging communities.

In the larger communities, the electric utilities were separate departments from the local government per se, and were directed by professional utility managers. But for mid-size and smaller communities, there was often close interaction with all municipal departments.

Though sales lead times are long for all enterprise markets, they are particularly long (and often political) in government. Budgets are perennially tight; they need to be proposed, argued before councils and management, and approved before work can begin. Staff are stretched across multiple functions, so ease of use and maintenance are key factors, as are concerns about longer-term support contracts. Portals and Web sites must serve all constituencies, and content and tone need to be suitable for taxpayer-supported venues. Yet, because of the number and diversity of communities [2], across the entire market there is surprising innovation and experimentation. Finding better ways to do more with less is a key motivator in the local government market.

Specifically, in our own use of OSF in this market, we also observed some other unique aspects related to open data and Web sites. What constitutes open data, and whether and how to make it “open,” varies widely by community. Capturing local needs and perspectives often leads to comparatively high costs in theming and customizing the Web sites. The lack of dedicated, trained staff to care for and feed a new Web site is a constant challenge.

Structured Dynamics, with its generic platform interests and avoidance of staffing, is clearly not the right vehicle to pursue this market. Specific focus on the unique aspects of the local government market is required, plus modifications and specializations of the platform to address government needs. Possible integration or incorporation of standard local government Web site(s) may also be required. Though we were seeing keen interest from this market, addressing it properly required a different vehicle with different venture imperatives.

Doing Justice to the Local Government Market

Early on, our good colleague and friend, Steve Ardire, helped point out some gaps in our business development. We saw that three things were missing within Structured Dynamics itself to do the local government and open government data markets justice. First, we needed a dedicated company to focus solely on this market. Second, we needed an executive familiar with the OSF platform and municipal government to head the effort. And, third, we somehow needed a way to overcome the time and costs associated with tailoring the portal for local community needs.

It was actually the last of these things that showed the first solution. We were approached about eighteen months ago by Ed Sussman, the CEO of Buzzr, about possibilities of partnering for the local government market. Buzzr has a one-click solution for theming and customizing individual Web portals, built around the Drupal content management system (CMS). Buzzr, a NYC-based company, has impeccable Drupal chops, having been co-founded by one of the leading Drupal shops, Lullabot. Buzzr has proven the applicability of its approach to specific verticals, including retail and education. That Buzzr found us and saw a good fit for the municipal market made for a formative discussion. We welcomed Buzzr’s outreach because their approach squarely addresses one of the cost and effort sticking points we were observing.

When Ed first contacted us, the OSF platform was still not sufficiently mature to be a market foundation. We needed more time to refine the platform, as well as to gain more market insight from use and use cases. Fortunately, Ed and Buzzr kept their interest strong while we refined things in the background. By the time we were able to address the other missing items, Buzzr was there to partner with us on the new venture.

Our second requirement was met by hiring Kelly Goldstrand, formerly the project manager for the NOW (Neighbourhoods of Winnipeg) portal, to head up the venture’s business development. NOW is one of the flagship installations of OSF. Kelly’s career focus has been on service planning, delivery and evaluation in the areas of community health, protection and development. She has significant management experience in local government and clearly understood OSF; her guidance had been pivotal in much of the system’s functionality. Kelly also has a proven track record in mentoring projects through local approvals and in training city staff in the use and maintenance of new technologies. After early retirement, Kelly was ready to consider our opportunity and graciously agreed to join us.

The last piece of the puzzle was forming the new venture. We had been working with the Civic Dynamics name for some time, and had also played around a bit with a logo and Web site. Once the other things fell into place, we incorporated Civic Dynamics, Inc. in Québec (where it is also known as Dynamique Civique), given the strong market interest shown in Canada to date, and began preparing for the formal launch of the venture. We also needed to await the completion of OSF v 3.0.

A Report Card on SD’s Multi-year Plan

It now appears likely that the five-year plan we set for ourselves at the founding of Structured Dynamics may actually take six to seven years to achieve. This extension derives from the realities of our client work over that period. One reality is that client-specific needs have caused us to divert from our own internal development path; not all development can contribute to fulfilling a generic platform, since every client has unique needs and circumstances that are not generalizable to others. A second reality is that only through real client engagements can market requirements be truly discerned. Customer-centric development is absolutely essential to keep software grounded.

Meanwhile, Back at Civic Dynamics

We are as curious as the next person to see whether a dedicated spin-off is the right way to handle a specific vertical market. It will also be interesting to see how coordination and support can best be provided between the dynamics duo (Structured and Civic).

Nonetheless, we are excited about finally getting positioned to pursue the growing market for open, local government data. We’d like to thank Kelly and Ed and all of our original sponsors for helping to gestate the venture to this point. Now that it has been birthed, we hope to nurture it and get it on its own two feet as soon as possible. Before we know it, and assuming we’ve raised it properly, Civic Dynamics will be celebrating its own life events!


[1] See our Sweet Tools listing of about 1,000 semantic technology tools.
[2] There are about 24,000 municipal governments across the United States and Canada.
Posted:January 27, 2014


Open Government Data
It’s Time to Move Beyond Static Dataset Dumps

It would be an understatement to say that open data has been transforming how government does business. Over the past five years — ranging from national governments such as the United States and the United Kingdom to hundreds of local governments and municipalities and all forms of government in between — a veritable revolution in opening up data to the public has been underway. The open government data (OGD) movement has spawned an entirely new cottage industry of open data advocacy and tools. Literally hundreds of government organizations are committed to open data, supported by an ecosystem of advocacy, technology and consulting groups.

Open data, of course, is not limited to governments. Open data in science and from the Web and for-profit entities are legitimate focal points in their own right. But, because data generated by governments are both sanctioned and developed using taxpayer monies, open government data occupies a special place in the conversation.

Now, with experience and practice, we are beginning to see a generational shift in how open data is being handled by governments. The first generation, still mostly the current practice, was built around the idea of just making the data public and open. This current generation is characterized by the publishing of datasets via catalogs. The datasets are static, unconnected and dumb. Mostly, too, the data within those datasets are poorly described and documented, often lacking standard metadata. What is now exciting, however, is the emergence of what can best be called dynamic open data. What it is and how it offers advantages is the focus of this article.

The 8 Initial Principles of Open Government Data

In October 2007, 30 open government advocates met in Sebastopol, California to discuss how government could open up electronically-stored government data for public use. Up until that point, the federal and state governments had made some data available to the public, usually inconsistently and incompletely, which had whetted the advocates’ appetites for more and better data. The conference, led by Carl Malamud and Tim O’Reilly and funded by a grant from the Sunlight Foundation, resulted in eight principles that, if implemented, would empower the public’s use of government-held data. These principles, no longer online, were summarized by Joshua Tauberer in his Open Government Data book as:

  1. Data Must be Complete
    All public data are made available. Data are electronically stored information or recordings, including but not limited to documents, databases, transcripts, and audio/visual recordings. Public data are data that are not subject to valid privacy, security or privilege limitations, as governed by other statutes.
  2. Data Must be Primary
    Data are published as collected at the source, with the finest possible level of granularity, not in aggregate or modified forms.
  3. Data Must be Timely
    Data are made available as quickly as necessary to preserve the value of the data.
  4. Data Must be Accessible
    Data are available to the widest range of users for the widest range of purposes.
  5. Data Must be Machine Processable
    Data are reasonably structured to allow automated processing of it.
  6. Access Must be Non-Discriminatory
    Data are available to anyone, with no requirement of registration.
  7. Data Formats Must be Non-Proprietary
    Data are available in a format over which no entity has exclusive control.
  8. Data Must be License-free
    Data are not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed as governed by other statutes.

These basic principles were then updated and re-phrased by the Sunlight Foundation in August 2010 to now number 10 principles, including the use of open standards, making data permanent, and keeping usage costs to an absolute minimum. All of these are laudable points. Each may or may not be provided in a fully open way by any given governmental entity.

This first step in the open data process has led to systems oriented to posting and publishing downloadable datasets. Existing open government data platforms, such as Socrata or DKAN, can best be described as catalog systems. Listings of datasets with associated descriptions and metadata are presented; users may then choose among one or more machine-readable formats to download the entire dataset.
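As a rough illustration of this catalog pattern, the sketch below (Python) asks a CKAN-style catalog API (CKAN being another widely used catalog platform of this same generation) for a dataset’s metadata and then downloads one of its dumps in full. The portal URL and dataset identifier are placeholders, not a real endpoint, and this is not meant to mirror Socrata’s or DKAN’s exact APIs.

```python
# Sketch of the current-generation "catalog" pattern: look a dataset up in the
# catalog, then download its dump whole. Portal URL and dataset id are placeholders.
import requests

PORTAL = "https://data.example.gov"       # hypothetical open data portal
DATASET_ID = "building-permits-2013"      # hypothetical dataset identifier

# CKAN-style catalogs describe each dataset, including the download URLs of its
# distributions ("resources"); the consumer picks a format and takes the whole file.
meta = requests.get(f"{PORTAL}/api/3/action/package_show",
                    params={"id": DATASET_ID}, timeout=30).json()

for resource in meta["result"]["resources"]:
    if resource.get("format", "").lower() == "csv":
        dump = requests.get(resource["url"], timeout=60)
        with open(f"{DATASET_ID}.csv", "wb") as f:
            # Static, unconnected data: any filtering happens only after download.
            f.write(dump.content)
        break
```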

The 5 Added Principles of Dynamic Open Data

Of course, simply throwing data over the fence does not make it useful. Once we get past the first threshold of making data publicly accessible, we next face the challenge of making that data meaningful and relevant. Since relevance is in the eye of the user, we can no longer think about information solely in terms of static, dumb datasets. We now need to expose the underlying data dynamically, such that users may request, filter and correlate what they need, and only what they need.

Thus, there are five principles — or dimensions — by which we need to judge next-generation dynamic open data:

  1. Data Should be Filterable
    Data should be selectable by type (class), attribute or value such that only the data of interest is exposed to the user. This means the data should be structured in some way with facets that can be used dynamically to filter and make those selections.
  2. Data Should be Atomic
    Data should be exposed as individual entities or concepts with their attributes and values. The unit of manipulation thus becomes the datum, rather than the dataset.
  3. Data Should be Connected
    Because we are now collecting by datum and not dataset, connections between relevant things must be made explicit across relevant datasets. Similar things should be retrievable together. To achieve this aim, some schema or data definition framework must be layered over the data and datasets.
  4. Data Should be Expandable
    Since new data and new instances and new datasets will constantly arise, the design of the overall data management system must itself be “open”, enabling expansion of the available datastore at acceptable cost and effort.
  5. Data Should be Documented
    In order for these dynamic selections to be achievable, the data in the system must be fully documented, specifically including the full description and units used for attributes and values and the scope of entities and concepts. Only through such complete documentation can accurate connections and relevant selections per above be made.

There is no set order to the principles above; they are presented in the order shown to help remember them through the FACED mnemonic. A brief sketch below illustrates what such datum-level, filterable access can look like in practice.
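In contrast with the whole-file download shown earlier, the sketch below (again Python with rdflib) shows the dynamic alternative: typed, documented records are queried through a schema-aware request so that only the matching records come back. The vocabulary, record identifiers and ward names are invented for illustration and do not represent any particular portal’s data.

```python
# A minimal sketch of the "dynamic" alternative: instead of pulling a whole
# dataset, the consumer asks a schema-aware question and receives only the
# matching records. All names below are hypothetical.
from rdflib import Graph, Literal, Namespace, RDF

CITY = Namespace("http://example.org/city#")

g = Graph()
# Two typed, documented records: the "atomic" and "documented" principles.
g.add((CITY.permit101, RDF.type, CITY.BuildingPermit))
g.add((CITY.permit101, CITY.ward, Literal("Fort Rouge")))
g.add((CITY.permit102, RDF.type, CITY.BuildingPermit))
g.add((CITY.permit102, CITY.ward, Literal("St. Boniface")))

# A filterable request: only building permits in one ward, not the whole dump.
results = g.query("""
    PREFIX city: <http://example.org/city#>
    SELECT ?permit WHERE {
        ?permit a city:BuildingPermit ;
                city:ward "Fort Rouge" .
    }
""")
for row in results:
    print(row.permit)
```

In a deployed portal the same request would go to a query endpoint (SPARQL or a structured search service) rather than an in-memory graph, but the consumer experience of asking for exactly what is needed is the point.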

Parallels with Linked Data

Though the principles above do not call out linked data as a requirement, they share many parallels with the early growth and maturation of linked data. A number of years back Fred Giasson and I commented on When Linked Data Rules Fail. Two of the points made in that article were the absence of suitable data descriptions and missing or incorrect connections in the data.gov and NY Times datasets. I subsequently expanded on these types of problems in Practical P-P-P-Problems with Linked Data.

Official data from governments can avoid many of the provenance issues associated with general linked data, but in other areas there are important parallels. Like any emerging practice, it takes a while to learn and formalize what works best. It is not surprising that open government data needs to transition from dumb datasets to actionable information. Making data actionable is when government information assets will finally become effective for the broader public.

Also, like linked data, it is likely that platforms built around semantic technologies and knowledge graphs (schema) will come to the fore. Our own Open Semantic Framework is one such example, and a few others are now emerging in the linked data and semantic technology space. It will be through different practices and these newer platforms that we will see the next generation of open government data truly emerge.
