Posted: August 16, 2010

Ecumenical
Contrasted with Some Observations on Linked Data

At the SemTech conference earlier this summer there was a kind of vuvuzela-like buzzing in the background. And, like the World Cup games on television, in play at the same time as the conference, I found the droning to be just as irritating.

That droning was a combination of the sense of righteousness in the superiority of linked data matched with a reprise of the “chicken-and-egg” argument that plagued the early years of semantic Web advocacy [1]. I think both of these premises are misplaced. So, while I have been a fan and explicator of linked data for some time, I do not worship at its altar [2]. And, for those that do, this post argues for a greater sense of ecumenism.

My main points are not against linked data. I think it a very useful technique and good (if not best) practice in many circumstances. But my main points get at whether linked data is an objective in itself. By making it such, I argue our eye misses the ball. And, in so doing, we miss making the connection with meaningful, interoperable information, which should be our true objective. We need to look elsewhere than linked data for root causes.

Observation #1: What Problem Are We Solving?

When I began this blog more than five years ago — and when I left my career in population genetics nearly three decades before that — I did so because of my belief in the value of information to confer adaptive advantage. My perspective then, and my perspective now, was that adaptive information through genetics and evolution was being uniquely supplanted within the human species. This change has occurred because humanity is able to record and carry forward all information gained in its experiences.

Adaptive innovations from writing to bulk printing to now electronic form uniquely position the human species to both record its past and anticipate its future. We no longer are limited to evolution and genetic information encoded in surviving offspring to determine what information is retained and moves forward. Now, all information can be retained. Further, we can combine and connect that information in ways that break to smithereens the biological limits of other species.

Yet, despite the electronic volumes and the potentials, chaos and isolated content silos have characterized humanity’s first half century of experience with digital information. I have spoken before about how we have been steadily climbing the data federation pyramid, with Internet technologies and the Web being prime factors for doing so. Now, with a compelling data model in RDF and standards for how we can relate any type of information meaningfully, we also have the means for making sense of it. And connecting it. And learning and adapting from it.

And, so, there is the answer to the rhetorical question: The problem we are solving is to meaningfully connect information. For, without those meaningful connections and recombinations, none of that information confers adaptive advantage.

Observation #2: The Problem is Not A Lack of Consumable Data

One of the “chicken-and-egg” premises in the linked data community is that more linked data must be exposed before some threshold is reached that triggers the network effect. This attitude, I suspect, is one of the reasons why hosannas are always forthcoming each time some outfit announces it has posted another chunk of triples to the Web.

Fred Giasson and I earlier tackled that issue with When Linked Data Rules Fail regarding some information published for data.gov and the New York Times. Our observations on the lack of standards for linked data quality proved to be quite controversial. Rehashing that piece is not my objective here.

My objective here is to hammer home that we do not need linked data in order to have data available to consume. Far from it. Though linked data volumes have been growing, I suspect that growth has actually been slower than the growth of data availability in toto. On the Web alone we have searchable deep Web databases, JSON, XML, microformats, RSS feeds, Google snippets, yada, yada, all in a veritable deluge of formats, contents and contexts. We are having a hard time inventing the next 1,000-fold prefix beyond zettabyte and yottabyte to even describe this deluge [3].

There is absolutely no voice or observer anywhere that is saying, “We need linked data in order to have data to consume.” Quite the opposite. The reality is we are drowning in the stuff.

Furthermore, when one dissects what most of this data is about, it is about ways to describe things. Or, put another way, most data is neither schema nor descriptions of conceptual relationships; it consists of records, with attributes and their values used to describe those records. Where is a business located? What political party does a politician belong to? How tall are you? What is the population of Hungary?

These are simple constructs with simple key-value pair ways to describe and convey them. This very simplicity is one reason why naïve data structs or simple data models like JSON or XML have proven so popular [4]. It is one of the reasons why the so-called NoSQL databases have also been growing in popularity. What we have are lots of atomic facts, located everywhere, and representable with very simple key-value structures.
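
To make that concrete, here is a small, purely illustrative sketch (the names and figures are invented) of such atomic facts expressed as simple key-value records in JSON:

```python
import json

# Purely illustrative records: names and figures are made up.
records = [
    {"type": "Business", "name": "Acme Hardware", "city": "Iowa City", "state": "IA"},
    {"type": "Politician", "name": "Jane Doe", "party": "Independent", "heightCm": 172},
    {"type": "Country", "name": "Hungary", "population": 10000000},
]

# Each record is nothing more than attribute-value pairs describing one thing;
# no schema or conceptual relationships are needed to publish or consume it.
print(json.dumps(records, indent=2))
```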

While having such information available in linked data form makes it easier for agents to consume it, that extra publishing burden is by no means necessary. There are plenty of ways to consume that data — without loss of information — in non-linked data form. In fact, that is how the overwhelming percentage of such data is expressed today. This non-linked data is also often easy to understand.

What is important is that the data be available electronically with a description of what the records contain. But that hurdle is met in many, many different ways and from many, many sources without any reference whatsoever to linked data. I submit that any form of desirable data available on the Web can be readily consumed without recourse to linked data principles.

Observation #3: An Interoperable Data Model Does Not Require a Single Transmittal Format

The real advantage of RDF is the simplicity of its data model, which can be extended and augmented to express vocabularies and relationships of any nature. As I have stated before, that makes RDF like a universal solvent for any extant data structure, form or schema.

What I find perplexing, however, is how this strength somehow gets translated into a parallel belief that such a flexible data model is also the best means for transmitting data. As noted, most transmitted data can be represented through simple key-value pairs. Sure, at some point one needs to model the structural assumptions of the data model from the supplying publisher, but that complexity need not burden the actual transmitted form. So long as schema can be captured and modeled at the receiving end, data record transmittal can be made quite a bit simpler.

Under this mindset RDF provides the internal (canonical) data model. Prior to that, format and other converters can be used to consume the source data in its native form. A generalized representation of how this can work uses Structured Dynamics’ structWSF Web services framework middleware as the mediating layer.
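
As a rough sketch of that converter idea — not Structured Dynamics’ actual structWSF code — the snippet below shows a minimal “RDFizer” that maps a simple key-value record onto RDF triples using the rdflib Python library; the ex: namespace and property names are invented for illustration:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/vocab/")  # hypothetical vocabulary namespace

def rdfize(record, subject):
    """Map a simple key-value record onto RDF triples (the 'RDFizer' idea)."""
    g = Graph()
    g.bind("ex", EX)
    g.add((subject, RDF.type, EX[record.pop("type")]))
    for attribute, value in record.items():
        g.add((subject, EX[attribute], Literal(value)))
    return g

record = {"type": "Country", "name": "Hungary", "population": 10000000}
print(rdfize(record, EX["Hungary"]).serialize(format="turtle"))
```

The schema-level modeling can then happen once, at the receiving end, rather than burdening every transmitted record.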

Of course, if the source data is already in linked data form with understood concepts, relationships and semantics, much of this conversion overhead can be bypassed. If available, that is a good thing.

But it is not a required or necessary thing. Insistence on publishing data in certain forms suffers from the same narrowness as cultural or religious zealotry. Why certain publishers or authors prefer different data formats has a diversity of answers. Reasons can range from what is tried and familiar to available toolsets or even what is trendy, as one might argue linked data is in some circles today. There are literally scores of off-the-shelf “RDFizers” for converting native and simple data structs into RDF form. New converters are readily written.

Adaptive systems, by definition, do not require wholesale changes to existing practices and do not require effort where none is warranted. By posing the challenge as a “chicken-and-egg” one where publishers themselves must undertake a change in their existing practices to conform, or else they fail the “linked data threshold”, advocates are ensuring failure. There is plenty of useful structured data to consume already.

Accessible structured data, properly characterized (see below), should be our root interest; not whether that data has been published as linked data per se.

Observation #4: A Technique Can Not Carry the Burden of Usefulness or Interoperability

Linked data is nothing more than some techniques for publishing Web-accessible data using the RDF data model. Some have tried to use the concept of linked data as a replacement for the idea of the semantic Web, and some have recently tried to re-define linked data as not requiring RDF [5]. Yet the real issue with all of these attempts — correct or not, and a fact of linked data since first formulated by Tim Berners-Lee — is that a technique alone can not carry the burden of usefulness or interoperability.

Despite billions of triples now available, we in fact see little actual use or consumption of linked data, except in the life science domain. Indeed, a new workshop by the research community called COLD (Consuming Linked Data) has been set up for the upcoming ISWC conference to look into the very reasons why this lack of usage may be occurring [6].

It will be interesting to monitor what comes out of that workshop, but I have my own views as to what might be going on here. A number of factors, applicable frankly to any data, must be layered on top of linked data techniques in order for it to be useful:

  • Context and coherence (see below)
  • Curation and quality control (where provenance is used as the proxy), and
  • Timeliness and currency.

These requirements apply to any data ranging from Census CSV files to Google search results. But because relationships can also be more readily asserted with linked data, these requirements are even greater for it.

It is not surprising that the life sciences have seen more uptake of linked data. That community has keen experience with curation, and the quality and linkages asserted there are much superior to other areas of linked data [7].

In other linked data areas, it is really in limited pockets such as FactForge from Ontotext or curated forms of Wikipedia by the likes of Freebase that we see the most use and uptake. There is no substitute for consistency and quality control.

It is really in this area of “publish it and they will come” that we see one of the threads of parochialism in the linked data community. You can publish it and they still will not come. And, like any data, they will not come because the quality is poor or the linkages are wrong.

As a technique for making data available, linked data is thus nothing more than a foot soldier in the campaign to make information meaningful. Elevating it above its pay grade sets the wrong target and causes us to lose focus for what is really important.

Observation #5: 50% of Linked Data is Missing (that is, the Linking part)

There is another strange phenomenon in the linked data movement: the almost total disregard for the linking part. Sure, data is getting published as triples with dereferenceable URIs, but where are the links?

At most, what we are seeing is owl:sameAs assertions and a few others [8]. Not only does this miss the whole point of linked data, but one can question whether equivalence assertions are correct in many instances [9].

For a couple of years now I have been arguing that the central gap in linked data has been the absence of context and coherence. By context I mean the use of reference structures to help place and frame what content is about. By coherence I mean that those contextual references make internal and logical sense, that they represent a consistent world view. Both require a richer use of links to concepts and subjects describing the semantics of the content.
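
To illustrate the distinction, here is a hedged sketch (hypothetical namespaces throughout) contrasting a bare owl:sameAs assertion with the kind of concept links — expressed here with dcterms:subject and skos:broader — that supply context and coherence:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, DCTERMS, SKOS

DATA = Namespace("http://example.org/data/")      # hypothetical local dataset
REF = Namespace("http://example.org/concepts/")   # hypothetical reference ontology
DBR = Namespace("http://dbpedia.org/resource/")

g = Graph()

# The common pattern today: a bare identity assertion, and little else.
g.add((DATA["IowaCity"], OWL.sameAs, DBR["Iowa_City,_Iowa"]))

# The missing half: links that place the record into a conceptual frame of reference.
g.add((DATA["IowaCity"], DCTERMS.subject, REF["Municipality"]))
g.add((REF["Municipality"], SKOS.broader, REF["AdministrativeDistrict"]))

print(g.serialize(format="turtle"))
```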

It is precisely through these kinds of links that data from disparate sources and with different frames of reference can be meaningfully related to other data. This is the essence of the semantic Web and the purported purpose of linked data. And it is exactly these areas in which linked data is presently found most lacking.

Of course, these questions are not the sole challenge of linked data. They are the essential challenge in any attempt to connect or interoperate structured data within information systems. So, while linked data is ostensibly designed from the get-go to fulfill these aims, any data that can find meaning outside of its native silo must also be placed into context in a coherent manner. The unique disappointment for much linked data is its failure to provide these contexts despite its design.

Observation #6: Pluralism is a Reality; Embrace It

Yet, having said all of this, Structured Dynamics is still committed to linked data. We present our information as such, and provide great tools for producing and consuming it. We have made it one of the seven foundations to our technology stack and methodology.

But we live in a pluralistic data world. There are reasons and roles for the multitude of popular structured data formats that presently exist. This inherent diversity is a fact in any real-world data context. Thus, we have yet to meet a form of structured data that we didn’t like, especially if it is accompanied with metadata that puts the data into coherent context. It is a major reason why we developed the irON (instance record and object notation) non-RDF vocabulary to provide a bridge from such forms to RDF. irON clearly shows that entities can be usefully described and consumed in either RDF or non-RDF serialized forms.

Attitudes that dismiss non-linked data forms or arrogantly insist that publishers adhere to linked data practices are anything but pluralistic. They are parochial and short-sighted and are contributing, in part, to keeping the semantic Web from going mainstream.

Adoption requires simplicity. The simplest way to encourage the greater interoperability of data is to leverage existing assets in their native form, with encouragement for minor enhancements to add descriptive metadata for what the content is about. Embracing such an ecumenical attitude makes all publishers potentially valuable contributors to a better information future. It will also nearly instantaneously widen the tools base available for the common objective of interoperability.

Parochialism and Root Cause Analysis

Linked data is a good thing, but not an ultimate thing. By making linked data an objective in itself we unduly raise publishing thresholds; we set our sights below the real problem to be solved; and we risk diluting the understanding of RDF from its natural role as a flexible and adaptive data model. Paradoxically, too much parochial insistence on linked data may undercut its adoption and the realization of the overall semantic objective.

Root cause analysis for what it takes to achieve meaningful, interoperable information suggests that describing source content in terms of what it is about is the pivotal factor. Moreover, those contexts should be shared to aid interoperability. Whichever organizations do an excellent job of providing context and coherent linkages will be the go-to ones for data consumers. As we have seen to date, merely publishing linked data triples does not meet this test.

I have heard some state that first you celebrate linked data and its growing quantity, and then hope that the quality improves. This sentiment holds if indeed the community moves on to the questions of quality and relevance. The time for that transition is now. And, oh, by the way, as long as we are broadening our horizons, let’s also celebrate properly characterized structured data no matter what its form. Pluralism is part of the tao to the meaning of information.


[1] See, for example, J.A. Hendler, 2008. “Web 3.0: Chicken Farms on the Semantic Web,” Computer, January 2008, pp. 106-108. See http://www.comp.leeds.ac.uk/webscience/talks/hendler_web_3.pdf. While I can buy Hendler’s arguments about commercial tool vendors holding off major investments until the market is sizable, I think we can also see via listings like Sweet Tools that a lack of tools is not in itself limiting.
[2] An earlier treatment of this subject from a different perspective is M.K. Bergman, 2010. “The Bipolar Disorder of Linked Data,” AI3:::Adaptive Information blog, April 28, 2010.
[3] So far only prefixes for units up to 10^24 (“yotta”) have names; for 10^27, a student campaign on Facebook is proposing “hellabyte” (from the Northern California slang “hella,” meaning “a whole lot of”) for adoption by the scientific bodies. See http://scitech.blogs.cnn.com/2010/03/04/hella-proposal-facebook/.
[4] One of the more popular posts on this blog has been M.K. Bergman, 2009. “‘Structs’: Naïve Data Formats and the ABox,” AI3:::Adaptive Information blog, January 22, 2009.
[5] See, for example, the recent history on the linked data entry on Wikipedia or the assertions by Kingsley Idehen regarding entity attribute values (EAV) (see, for example, this blog post.)
[6] See further the 1st International Workshop on Consuming Linked Data (COLD 2010), at the 9th International Semantic Web Conference (ISWC 2010), November 8, 2010, Shanghai, China.
[7] For example, in the early years of GenBank, some claimed that annotations of gene sequences due to things like BLAST analyses may have had as high as 30% to 70% error rates due to propagation of initially mislabeled sequences. In part, the whole field of bioinformatics was formed to deal with issues of data quality and curation (in addition to analytics).
[8] See, for example: Harry Halpin, 2009. “A Query-Driven Characterization of Linked Data,” paper presented at the Linked Data on the Web (LDOW) 2009 Workshop, April 20, 2009, Madrid, Spain, see http://events.linkeddata.org/ldow2009/papers/ldow2009_paper16.pdf; Prateek Jain, Pascal Hitzler, Peter Z. Yeh, Kunal Verma and Amit P. Sheth, 2010. “Linked Data is Merely More Data,” in Dan Brickley, Vinay K. Chaudhri, Harry Halpin, and Deborah McGuinness, eds., Linked Data Meets Artificial Intelligence, Technical Report SS-10-07, AAAI Press, Menlo Park, California, 2010, pp. 82-86, see http://knoesis.wright.edu/library/publications/linkedai2010_submission_13.pdf; among others.
[9] Harry Halpin and Patrick J. Hayes, 2010. “When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web,” presented at LDOW 2010, April 27th, 2010, Raleigh, North Carolina. See http://events.linkeddata.org/ldow2010/papers/ldow2010_paper09.pdf.

Posted: August 9, 2010

Ontologies are the structural frameworks for organizing information on the semantic Web and within semantic enterprises. They provide unique benefits in discovery, flexible access, and information integration due to their inherent connectedness; that is, their ability to represent conceptual relationships. Ontologies can be layered on top of existing information assets, which means they are an enhancement and not a displacement for prior investments. And ontologies may be developed and matured incrementally, which means their adoption may be cost-effective as benefits become evident [1].

What Is an Ontology?

Ontology may be one of the more daunting terms for those exposed for the first time to semantic technologies. Not only is the word long and without common antecedents, but it is also a term that has widely divergent use and understanding within the community. It can be argued that this not-so-little word is one of the barriers to mainstream understanding of the semantic Web.

The root of the term is the Greek ontos, or being or the nature of things. Literally — and in classical philosophy — ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined ontology as a “formal specification of a conceptualization.”

Much like taxonomies or relational database schema, ontologies work to organize information. No matter what the domain or scope, an ontology is a description of a world view. That view might be limited and minuscule, or it might be global and expansive. However, unlike hierarchical views of concepts such as taxonomies, ontologies often have a linked or networked “graph” structure. Multiple things can be related to other things, all in a potentially multi-way series of relationships.

Example Taxonomy Structure vs. Example Ontology Structure — a distinguishing characteristic of ontologies compared to conventional hierarchical structures is their degree of connectedness, their ability to model coherent, linked relationships.

Ontologies supply the structure for relating information to other information in the semantic Web or the linked data realm. Ontologies thus provide a similar role for the organization of data that is provided by relational data schema. Because of this structural role, ontologies are pivotal to the coherence and interoperability of interconnected data.

When one uses the idea of “world view” as synonymous with an ontology, it is not meant to be cosmic, but simply a way to convey how a given domain or problem area can be described. One group might choose to describe and organize, say, automobiles, by color; another might choose body styles such as pick-ups or sedans; or still another might use brands such as Honda and Ford. None of these views is inherently “right” (indeed multiples might be combined in a given ontology), but each represents a particular way — a “world view” — of looking at the domain.
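
As a toy sketch of that automobile example (hypothetical names throughout, using the rdflib Python library), the different “world views” — body style as a small class hierarchy, color and brand as simple properties — can coexist in one graph:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

AUTO = Namespace("http://example.org/auto/")  # hypothetical vocabulary
g = Graph()
g.bind("auto", AUTO)

# One world view: organize automobiles by body style (a small class hierarchy).
for body_style in ("Sedan", "Pickup"):
    g.add((AUTO[body_style], RDFS.subClassOf, AUTO["Automobile"]))

# Other world views can coexist in the same graph: color and brand as properties.
g.add((AUTO["car123"], RDF.type, AUTO["Sedan"]))
g.add((AUTO["car123"], AUTO["color"], Literal("red")))
g.add((AUTO["car123"], AUTO["brand"], Literal("Honda")))

print(g.serialize(format="turtle"))
```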

Though there is much latitude in how a given domain might be described, there are both good ontology practices and bad ones. We offer some views as to what constitutes good ontology design and practice in the concluding section.

What Are Its Benefits?

A good ontology offers a composite suite of benefits not available to taxonomies, relational database schema, or other standard ways to structure information. Among these benefits are:

  • Coherent navigation by enabling the movement from concept to concept in the ontology structure
  • Flexible entry points because any specific perspective in the ontology can be traced and related to all of its associated concepts; there is no set structure or manner for interacting with the ontology
  • Connections that highlight related information and aid and prompt discovery without requiring prior knowledge of the domain or its terminology
  • Ability to represent any form of information, including unstructured (say, documents or text), semi-structured (say, XML or Web pages) and structured (say, conventional databases) data
  • Inferencing, whereby specifying one concept (say, mammals) also tells us about related concepts (say, that mammals are a kind of animal); a small sketch of this follows the list
  • Concept matching, which means that even though we may describe things somewhat differently, we can still match to the same idea (such as glad or happy both referring to the concept of a pleasant state of mind)
  • The ability to integrate external content, which follows from this matching and mapping of concepts
  • A framework for disambiguation by nature of the matching and analysis of concepts and instances in the ontology graph, and
  • Reasoning, which is the ability to use the coherence and structure itself to inform questions of relatedness or to answer questions.
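
Here is a minimal sketch of the inferencing bullet above, with a hypothetical vocabulary and a SPARQL property path standing in for a full reasoner:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

BIO = Namespace("http://example.org/bio/")  # hypothetical vocabulary
g = Graph()
g.add((BIO["Mammal"], RDFS.subClassOf, BIO["Animal"]))
g.add((BIO["Dog"], RDFS.subClassOf, BIO["Mammal"]))

# "What is a kind of Animal?" -- the property path follows subClassOf transitively.
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?cls WHERE { ?cls rdfs:subClassOf+ <http://example.org/bio/Animal> }
"""
for row in g.query(query):
    print(row.cls)  # prints the Mammal and Dog URIs
```

A production system would use an actual RDFS/OWL reasoner, but the transitive query conveys the idea.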

How Are Ontologies Used?

The relationship structure underlying an ontology provides an excellent vehicle for discovery and linkages. “Swimming through” this relationship graph is the basis of the Concept Explorer (also known as the Relation Browser) and similar widgets.

The most prevalent use of ontologies at present is in semantic search. Semantic search has benefits over conventional search in terms of being able to make inferences and matches not available to standard keyword retrieval.

The relationship structure also is a more general, more powerful and more nuanced way to organize information. Concepts can relate to other concepts through a rich vocabulary. Such predicates might capture subsumption, precedence, part-of relationships (mereology), preferences, or importance along virtually any metric. This richness of expression and relationships can also be built incrementally over time, allowing ontologies to grow and develop in sophistication and use as desired.

The pinnacle application for ontologies, therefore, is as coherent reference structures whose purpose is to help map and integrate other structures and information. Given the huge heterogeneity of information both within and without organizations, the use of ontologies as integration frameworks will likely emerge as their most valuable use.

What Makes for a Good Ontology?

Good ontology practice has aspects both in terms of scope and in terms of construction.

Scope Considerations

Here are some scoping and design questions that we believe should be answered in the positive in order for an ontology to meet good practice standards:

  • Does the ontology provide balanced coverage of the subject domain? This question gets at the issue of properly scoping and bounding the subject coverage of the ontology. It also means that the breadth and depth of the coverage are roughly equivalent across its scope
  • Does the ontology embed its domain coverage into a proper context? A major strength of ontologies is their potential ability to interoperate with other ontologies. Re-using existing and well-accepted vocabularies and including concepts in the subject ontology that aid such connections is good practice. The ontology should also have sufficient reference structure for guiding the assignment of what content “is about”
  • Are the relationships in the ontology coherent? The essence of coherence is that it is a state of logical, consistent connections, a logical framework for integrating diverse elements in an intelligent way. So while context supplies a reference structure, coherence means that the structure makes sense. Is the hip bone connected to the thigh bone, or is the skeleton incorrect?
  • Has the ontology been well constructed according to good practice? See next.

If these questions can be answered affirmatively, then we would deem the ontology ready for production-grade use.

Fundamental to the whole concept of coherence is the fact that experts and practitioners within domains have been looking at the questions of relationships, structure, language and meaning for decades. Though today we finally have a broadly useful data and logic model in RDF, the fact remains that massive time and effort have already been expended to codify some of these understandings in various ways and at various levels of completeness and scope. Good practice also means, therefore, that maximum leverage is made of existing structural and vocabulary assets as springboards for new ontologies.

And, because good ontologies also embrace the open world approach, working toward these desired end states can also be incremental. Thus, in the face of common budget or deadline constraints, it is possible initially to scope domains as smaller or to provide less coverage in depth or to use a small set of predicates, all the while still achieving productive use of the ontology. Then, over time, the scope can be expanded incrementally.

Construction Considerations

To achieve their purposes, ontologies must be both human-readable and machine-processable. Also, because they represent conceptual structures, they must be built with a certain composition.

Good ontologies therefore are constructed such that they have:

  • Concept definitions – the matching and alignment of things is done on the basis of concepts (not simply labels) which means each concept must be defined
  • A preferred label that is used for human readable purposes and in user interfaces
  • A “semset” – which means a series of alternate labels and terms to describe the concept. These alternatives include true synonyms, but may also be more expansive and include jargon, slang, acronyms or alternative terms that usage suggests refers to the same concept
  • Clearly defined relationships (also known as properties, attributes, or predicates) for relating two things to one another
  • All of which is written in a machine-processable language such as OWL or RDF Schema (among others); a brief sketch of these elements follows this list.
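
Here is a small, hedged sketch of those construction elements for a single concept, using SKOS terms via the rdflib Python library; the namespace and labels are invented, and skos:altLabel merely stands in for the “semset” idea:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

ONT = Namespace("http://example.org/ontology/")  # hypothetical ontology namespace
g = Graph()
g.bind("skos", SKOS)

concept = ONT["PleasantStateOfMind"]
g.add((concept, RDF.type, SKOS.Concept))
g.add((concept, SKOS.definition, Literal("A positive, agreeable emotional state.", lang="en")))
g.add((concept, SKOS.prefLabel, Literal("happiness", lang="en")))

# The "semset": alternate labels, synonyms, jargon or slang for the same concept.
for alt in ("glad", "happy", "contentment"):
    g.add((concept, SKOS.altLabel, Literal(alt, lang="en")))

print(g.serialize(format="turtle"))
```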

In the case of ontology-driven applications using adaptive ontologies, there are also additional instructions contained in the system (often via administrative ontologies) that tell the system which types of widgets need to be invoked for different data types and attributes. This is different than the standard conceptual schema, but is nonetheless essential to how such applications are designed.


[1] This posting was at the request of a couple of Structured Dynamics’ customers who wanted a way to describe ontologies to non-technical management. For a more in-depth treatment, see M.K. Bergman, 2007. “An Intrepid Guide to Ontologies,” AI3:::Adaptive Information blog, May 16, 2007.
Posted: August 2, 2010

Citizen Dan
Discover and Play with this Demo of the Open Semantic Framework

Today, Structured Dynamics is pleased to make its Citizen Dan application available for public viewing, play and downloading for the first time.

Citizen Dan is a free, open source system available to any community and its citizens to measure and track indicators of local well being. It can be branded and themed for local needs. It is under active development by Structured Dynamics with support from a number of innovative cities.

Citizen Dan is an exemplar instance of Structured Dynamics’ open semantic framework (OSF), a generalized framework for deploying semantic platforms for any domain. By changing its guiding ontologies and source content and data, what appears here as Citizen Dan can be adapted for virtually any subject area.

As configured, the Citizen Dan OSF instance is a:

  • Appliance for slicing-and-dicing and analyzing data specific to local community indicators
  • Framework for dynamically navigating, interacting with, or browsing data and concepts
  • Means to visualize local data over time or by neighborhood
  • Meeting place for the public to upload and share local data and information
  • Web data portal that can be individually tailored by any local community
  • Potential node in a global network of communities across which to compare indicators of community well-being.

Citizen Dan’s information sources may include Census data, the Web, real-time feeds, government datasets, municipal government information systems, or crowdsourced data. Information can range from standard structured data to local narratives, including from minutes and reports, contributed stories, blogs or news outlets. The ‘raw’ input data can come in essentially any format, which is then converted to a standard form with consistent semantics.

Text and narratives and the concepts and entities they describe are integrally linked into the system via information extraction and tagging. All ingested information, whether structured or text sources, with their semantics, can be exported in multiple formats. A standard organizing schema, also open source and extensible or modifiable by all users, is provided via the optional MUNI ontology (with vocabulary details in development here), being developed expressly for Citizen Dan and its community indicator system purposes.

All of the community information contained within a Citizen Dan instance is available as linked data.

Overview of Features

Here are the main components or widgets to this Citizen Dan demo:

  • Concept Explorer — this Flex widget (also called the Relation Browser) is a dynamic navigator of the concept space (ontology) that is used to organize the content on the instance. Clicking on a bubble causes it to assume the central position in the diagram, with all of its connecting concepts shown. Clicking on a branch concept then causes that new node to assume the central position, enabling one to “swim through” the overall concept graph. For this instance of Citizen Dan, the MUNI ontology is used; a diagram shows the full graph of the MUNI structure. See further the concept explorer’s technical documentation
  • Story Viewer — any type of text content (such as stories, blog posts, news articles, local government reports, city council minutes, etc.) can be submitted to the system. This content is then tagged using the scones system (subject concepts or named entities), which then provides the basis for linking the content with concepts and other data. The story viewer is a Flex widget that highlights these tags in the content and allows searches for related content based on selected tags. See further the story viewer’s technical documentation
  • Map Viewer — the map viewer is a Flex widget that presents layered views of different geographic areas. The title bar of the viewer allows different layers to be turned on and off. Clicking on various geographic areas can invoke specific data and dashboard views. See further the map viewer’s technical documentation
  • Charting Widgets — the system provides a variety of charting options for numeric data, including pie, line and bar charts. These can be called directly or sprinkled amongst other widgets based on a dashboard specification (see below)
  • Filter Component — the filter, or browse, component provides the ability to slice-and-dice the information space by a choice of dataset, type of data or data attribute. These slices then become filter selections which can be persisted across various visualizations or exports. See further the browse component’s technical documentation
  • Search Component — this component provides full-text, faceted search across all content in the system; it may be used in conjunction with the filtering above to restrict the search space to the current slice. See further the search tool’s technical documentation
  • Dashboard Viewer — a dashboard is a particular layout of one or more visualization widgets and a set (or not) of content filtering conditions to be displayed on a canvas. Dashboard views are created in the workbench (see next) and given a persistent name for invoking and use at any other location in the application
  • Workbench — this rather complex component is generally intended to be limited to site administrators. Via the workbench, records and datasets and attributes may be selected, and then particular views or widgets obtained. When no selections are made in the left-hand panel, all are selected by default. Then, in the records viewer (middle upper), either records or attributes are selected. For each attribute (column), a new display widget appears. All display widgets interact (a selection in one reflects in the others). The nature of the data type or attribute selected determines which available widgets are available to display it; sometimes there are multiples which can be selected via the lower left dropdown list in any given display panel. These various display widgets may then be selected for a nameable layout as a persistent dashboard view (functionality not shown in this public demo)
  • Exporter — the exporter component appears in multiple locations across the appliance, either as a tab option (e.g., Filter component) or as a dropdown list to the lower right of many screens. A growing variety of export formats is available. When it appears as a dropdown list, the export is limited to the currently active slice. When invoked via tab, more export selection options are available. See further the technical documentation for this component

Limitations of the Online Demo

A number of other tools are available to admins in the actual appliance, but are not exposed in the demo:

  • Importer — like the exporter, there are a variety of formats supported for ingesting data or content into the system. Prominent ones include spreadsheets (CSV), XML and JSON. The irON notation is especially well suited for dataset staging for ingest. At import time, datasets can also be appended or merged. See further the technical documentation for this component
  • Dataset Submission and Management — new datasets can be defined, updated, deleted, appended and granted various access rights and permissions, including to the granularity of individual components or tools. For example, see further this technical documentation
  • Records Manager — every dataset can have its records managed via so-called CRUD rights. Depending on the dataset permissions, a given user may or may not see these tools. See further the technical documentation for each of these create, read, update and delete tools.

In addition, it is not possible in the demo to save persistent dashboard views or submit stories or documents for tagging, nor to register as a user or view the admin portions of the Drupal instance.

Sample Data and Content in the Demo

The sample data and content in the demo is for the Iowa City (IA) metropolitan statistical area. This area embraces two counties (Johnson and Washington) and the census tracts and townships that comprise them, and about two dozen cities. Two of the notable cities are Iowa City itself, home of the University of Iowa, and Coralville, where Structured Dynamics, the developer of Citizen Dan and the open semantic framework (OSF), is headquartered.

The text content on this site is drawn from Wikipedia articles dealing with this area. About 30 stories are included.

The data content on the site is drawn from US Census Bureau data. Shape files for the various geographic areas were obtained from here, and the actual datasets by geographic area can be obtained from here.

An Instance of the Open Semantic Framework

Citizen Dan is an exemplar instance of Structured Dynamics’ open semantic framework (OSF), a generalized framework for deploying semantic platforms for specific domains.

OSF is a combination of a layered architecture and modular software. Most of the individual open source software products developed by Structured Dynamics and available on the OpenStructs site are components within the open semantic framework.

A Part of the ‘Total Open Solution’

The software that makes up the Citizen Dan appliance is one of the four legs that provide a stable, open source solution. These four legs are software, structure, methods and documentation. When all four are provided, we can term this a total open solution.

For Citizen Dan, the complements to this software are:

  • MUNI ontology, which provides the structure specification upon which the software runs, and
  • DocWiki (with its TechWiki subset of technical documentation) that provides the accompanying knowledge base of methods, best practices and other guidance.

In its entirety, the total open solution amounts to a form of capacity building for the enterprise.

The Potential for a Citizen Dan Network

Inherent in the design and architecture of Citizen Dan is the potential for each instance (single installation) to act as a node in a distributed network of nodes across the Web. Via the structWSF Web service endpoints and appropriate dataset permissions, it is possible for any city in the Citizen Dan network to share (or not) any or all of its data with other cities.

This collaboration aspect has been “baked into the cake” from Day One. The system also supports differential access, rights and roles by dataset and Web service. Thus, city staffs across multiple communities could share data differently than what is provided to the general public.

Since all data management aspects of each Citizen Dan instance are also oriented around datasets, expansion to a network mode is quite straightforward.

How to Get the System

The Citizen Dan appliance is based on the Drupal content management system, which means any community can easily theme or add to the functionality of the system with any of the roughly 6,500 available open source modules that extend the basic Drupal functionality.

All other components, including the multiple third-party ones, are also open source.

To install Citizen Dan for your own use, you need to:

  1. Download and install all of the software components. You may also want to check out the OSF discussion forum for tips and ideas about alternative configuration options
  2. Install a baseline vocabulary. In the case of Citizen Dan, this is the MUNI ontology. MUNI’s public release is imminent; please contact the project if you need an early copy
  3. Install your own datasets. You may want to inspect the sample Citizen Dan datasets and learn more about the irON notation, especially its commON (spreadsheet) use case.

(Note: there will also be some more updates in August, including the MUNI release.)

For questions and additional info, please consult the TechWiki or the OpenStructs community site.

Finally, please contact us if you’d like to learn more about the project, investigate funding or sponsorship opportunities, or contribute to development. We’d welcome your involvement!

Posted: July 26, 2010

While Also Discovering Hidden Publication and Collaboration Potentials

A few weeks back I completed a three-part introductory series on what Structured Dynamics calls a ‘total open solution’. A total open solution, as we defined it, comprises software, structure, methods and documentation. When provided in toto, these components provide all of the necessary parts for an organization to adopt new open source solutions on its own (or with the choice of its own consultants and contractors). A total open solution fulfills SD’s mantra that, “We’re successful when we’re not needed.”

Two of the four legs to this total open solution are provided by documentation and methods. These two parts can be seen as a knowledge base that instructs users on how to select, install, maintain and manage the solution at hand.

Today, SD is releasing publicly for the first time two complementary knowledge bases for these purposes: TechWiki, which is the technical and software documentation complement, in this case based around SD’s Open Semantic Framework and its associated open source software projects; and DocWiki, the process methodology and project management complement that extends this basis, in this case based around the Citizen Dan local community open data appliance.

All of the software supporting these initiatives is open source. And, all of the content in the knowledge bases is freely available under a Creative Commons 3.0 license with attribution.

Mindset and Objectives

In setting out the design of these knowledge bases, our mindset was to enable single-point authoring of document content, while promoting easy collaboration and rollback of versions. Thus, the design objectives became:

  • A full document management system
  • Multiple author support
  • Authors to document in a single, canonical form
  • Collaboration support
  • Mixing-and-matching of content from multiple pages and articles to re-purpose for different documents, and
  • Excellent version/revision control.

Assuming these objectives could be met, we then had three other objectives on our wish list:

  • Single source publishing: publish in multiple formats (HTML, PDF, doc, csv, RTF?)
  • Separate theming of output products for different users, preferably using CSS, and
  • Single-click export of the existing knowledge base, followed by easy user modification.

Our initial investigations looked at conventional content and document management systems, matched with version control systems such as SVN. Somewhat surprisingly, though, we found the Mediawiki platform to fulfill all of our objectives. Mediawiki, as detailed below, has evolved to become a very mature and capable documentation platform.

While most of us know Mediawiki as a kind of organic authoring and content platform — as it is used on Wikipedia and many other leading wikis — we also found it perfect for our specific knowledge base purposes. To our knowledge, no one has yet set up and deployed Mediawiki in the specific pre-packaged knowledge base manner as described herein.

TechWiki v DocWiki

TechWiki is a Mediawiki instance designed to support the collaborative creation of technical knowledge bases. The TechWiki design is specifically geared to produce high-quality, comprehensive technical documentation associated with the OpenStructs open source software. This knowledge base is meant to be the go-to source for any and all documentation for the codes, and includes information regarding:

  • Coding and code development
  • Systems configurations and architectures
  • Installation
  • Set-up and maintenance
  • Best practices in these areas
  • Technical background information, and
  • Links to external resources.

As of today, TechWiki contains 187 articles under 56 categories, with a further 293 images. The knowledge base is growing daily.

DocWiki is a sibling Mediawiki instance that contains all TechWiki material, but has a broader purpose. Its role is to be a complete knowledge base for a given installation of an Open Semantic Framework (in the current case, Citizen Dan). As such, it needs to include much of the technical information in the TechWiki, but also extends that in the following areas:

  • Relation and discussion of the approach vis-à-vis other information development initiatives
  • Use of a common information management framework and vocabulary (MIKE2.0)
  • A five-phased, incremental approach to deployment and use
  • Specific tasks, activities and phases under which this deployment takes place, including staff roles, governance and outcome measurement
  • Supporting background material useful for executive management and outside audiences.

The methodology portions of the DocWiki are drawn from the broader MIKE2.0 (Method for an Integrated Knowledge Environment) approach. I have previously written about this open source methodology championed by BearingPoint and Deloitte.

As of today, DocWiki contains 357 articles and 394 structured tasks in 70 activity areas under 77 categories. Another 115 images support this content. This knowledge base, too, is growing daily.

Both of these knowledge bases are open source and may be exported and installed locally. Then, users may revise and modify and extend that pre-packaged information in any way they see fit.

Basic Wiki Overview

The basic design of these systems is geared to collaboration and embeds what we think are really responsive work flows. These extend from supporting initial idea noodling to full-blown public documentation. The inherent design of the system also supports single-source publishing and book or PDF creation from the material that is there. Here is the basic overview of the design:

Wiki Architectural Overview


Mediawiki provides the standard authoring and collaboration environment. There is a choice of editing methods. As content is created, it is organized in a standard way and stored in the knowledge base. The Mediawiki API supports the export of information in either XHTML or XML, which in turn allows the information to be used in external apps (including other Mediawiki instances) or for various single-source publication purposes. The Collection extension is one means by which PDFs or even entire books (that is, multi-page documents with potentially chapters, etc.) may be created. Use of a well-designed CSS ensures that outputs can be readily styled and themed for different purposes or audiences.
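
As a small illustration of that export path (a sketch only: the wiki URL and page title are hypothetical, though Special:Export is a standard MediaWiki page), one page’s export XML can be pulled like this:

```python
import requests

WIKI = "https://techwiki.example.org/index.php"  # hypothetical wiki address

def export_page(title):
    """Fetch one page as MediaWiki export XML via the standard Special:Export page."""
    resp = requests.get(f"{WIKI}?title=Special:Export/{title}")
    resp.raise_for_status()
    return resp.text  # XML wrapping the wikitext in <mediawiki>/<page>/<revision>

xml = export_page("Installation_Guide")  # hypothetical page title
print(xml[:300])
```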

Because these wikis are designed from the get-go to be reusable — and then downloaded and installed locally — it is important that we maintain quality and consistency across content. (After download, users are free to do with it as they wish, but it is important the initial database be clean and coherent.) The overall interaction with the content thus occurs via one of three levels: 1) simple reading, which is publicly available without limitation to any visitor, including source inspection and export; 2) editing and authoring, which is limited to approved contributors; and 3) draft authoring and noodling, which is limited to the group in #2 but for which the in-progress content is not publicly viewable. Built-in access rights in the system enable these distinctions.

Features and Benefits

Besides meeting all of the objectives noted at the opening of this post, these wikis (knowledge bases) also have these specific features:

  • Relatively complete (and growing) knowledge base content
  • Book, PDF, or XHTML publishing
  • Single-click exports and imports
  • Easy branding and modification of the knowledge bases for local use (via the XML export files)
  • Pre-designed, standard categorization systems for easy content migration
  • Written guidance on use and best practices
  • Ability to keep content in-development “hidden” from public viewing
  • Controlled, assisted means for assigning categories to content
  • Direct incorporation of external content
  • Efficient multi-category search and filtering
  • Choice of regular wikitext, WikED or rich-text editing
  • Standard embeddable CSS objects
  • Semantic and readily themed CSS for local use and for specialty publications
  • Standard templates
  • Sharable and editable images (SVG inclusion in process)
  • Code highlighting capabilities (GeSHi, for TechWiki)
  • Pre-designed systems for roles, tasks and activities (DocWiki)
  • Semantic Mediawiki support and forms (DocWiki)
  • Guided navigation and context (DocWiki).

Many of these features come from the standard extensions in the TechWiki/DocWiki packages.

The net benefits from this design are easily shared and modified knowledge bases that users and organizations may either contribute to for the broader benefit of the OpenStructs community, or download and install with simple modifications for local use and extension. There is actually no new software in this approach, just proper attention to packaging, design, standardization and workflow.

A Smooth Workflow

Via the sharing of extensions, categories and CSS, it is quite easy to have multiple instances or authoring environments in this design. For Structured Dynamics, that begins with our own internal wiki. Many notes are taken and collected there, some of a proprietary nature and the majority not intended or suitable for seeing public release.

Content that has developed to the point of release, however, can be simply tagged using conventions in the workflow. Then, with a single Export command, the relevant content is sent to an XML file. (This document can itself be edited, such as for example changing all ‘TechWiki’ references to something like ‘My Content Site’; see further here.)

Depending on the nature of the content, this exported content may then be imported with a single Import command to either the TechWiki or DocWiki sites. (Note: Import does require admin rights.) A simple migration may also occur from the TechWiki to the DocWiki. Also, of course, initial authoring may begin at any of the sites, with collaborators an explicit feature of the TechWiki or DocWiki versions.
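
The “edit the export file” step mentioned above can be as simple as a one-pass rename over the exported XML before import; a minimal sketch, assuming hypothetical file names:

```python
from pathlib import Path

source = Path("techwiki-export.xml")   # hypothetical name of the exported XML file
target = Path("my-content-site.xml")

# Rebrand the export before importing it into a locally installed wiki.
text = source.read_text(encoding="utf-8")
target.write_text(text.replace("TechWiki", "My Content Site"), encoding="utf-8")
```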

Any DocWiki can also be specifically configured for different domains and instance types. In terms of our current example, we are using Citizen Dan, but that could be any such Open Semantic Framework instance type:

Content Flow Across Wikis


Under this design, then, the workflow suggests that technical content authoring and revision take place within the TechWiki, process and methodology revision in the DocWiki. Moreover, most DocWikis are likely to be installed locally, such that once installed, their own content would likely morph into local methods and steps.

So long as page titles are kept the same, newer information can be updated on any target wiki at any time. Prior versions are kept in the version history and can be reinstated. Alternatively, if local content is clearly diverging yet updates of the initial source material are still desired, the local content need only be saved under a new title to preserve it from import overwrites.

Where Is It Going from Here?

We are really excited by this design and have already seen benefits in our own internal work and documentation. We see, for example, easier management of documentation and content, permanent (canonical) URLs for specific content items, and greater consistency and common language across all projects and documentation. Also, when all documentation is consolidated into one point with a coherent organizational and category structure, documentation gaps and inconsistencies also become apparent and can readily be fixed.

Now, with the release of these systems to the OpenStructs (Open Semantic Framework) and Citizen Dan communities, we hope to see broader contributions and expansion of the content. We encourage you to check on these two sites periodically to see how the content volume continues to grow! And, we welcome all project contributors to join in and help expand these knowledge bases!

We think this general design and approach — especially in relation to a total open solution mindset — has much to recommend it for other open source projects. We think these systems, now that we have designed and worked out the workflows, are amazingly simple to set up and maintain. We welcome other projects to adopt this approach for their own. Let us know if we can be of assistance, and we welcome ideas for improvement!

Posted: July 15, 2010

Cisco Video is a Good Starting Intro for Management

Like the seminal linked data publication by PricewaterhouseCoopers of about a year ago (see PWC Dedicates Quarterly Technology Forecast to Linked Data, May 29, 2009), a video released by Cisco yesterday is another signal of the emergence of the semantic enterprise.

The Cisco tech brief on The Semantic Enterprise is a quite accessible — but a bit eerie — seven-minute introduction.  The video was prepared by Cisco’s Internet Business Solutions Group (IBSG), with Shaun Kirby, its Director of Innovations Architectures, as the narrator:

YouTube: http://www.youtube.com/watch?v=3lUzs2I8BKI

Well, as for being eerie, when the video first came up, I thought I was looking at an advanced, next generation avatar, perhaps a reincarnation of Douglas Adams’ Hyperland. Maybe this semantic stuff was closer at hand than we thought!

But, as it turned out, that first blush was only a reaction to how the video was shot. As it gets rolling, the Cisco video is extremely well done and informative. It is a great intro for sharing with management when contemplating your own moves to becoming a semantic enterprise.

I suggest you first view — and then bookmark — this one.
