Posted:November 8, 2009

Must Read: ‘Data Smoke and Mirrors’

Mazzocchi Sounds a Warning to Linked Data Advocates

Stefano Mazzocchi has been a clear thinker for years and an innovative contributor to the community since his early leadership of the Apache Cocoon project. One of his best qualities is he speaks his mind. Now at Freebase, but previously with MIT’s Simile program, he is one of my dedicated reads via his Stefano’s Linotype blog.

His aforementioned post, Data Smoke and Mirrors, stands on its own, and I highly recommend it. He particularly focuses on the conversion of data.gov datasets to “linked data” (my quotes are purposeful). Combined with the recent poor conversion of New York Times datasets to linked data, I think he is the canary sending out a warning about a disturbing trend.

Posting linked data for its own sake — whatever the reasons — risks undercutting the premise.

We have now moved beyond “proof of concept” to the need for actual useful data of trustworthy provenance and proper mapping and characterization. Recent efforts are a disappointment that no enterprise would or could rely upon.

Listen up, folks.

Posted:November 2, 2009

Structured Dynamics’ Product Stack

A New Slide Show Consolidates, Explains Recent Developments

Much has been happening on the Structured Dynamics front of late. Besides welcoming Steve Ardire as a senior advisor to the company, we also have been issuing a steady stream of new products from our semantic Web pipeline.

This new slide show attempts to capture these products and relate them to the various layers in Structured Dynamics’ enterprise product stack:

Structured Dynamics’s Semantic Technologies Product Stack

View more presentations from mkbergman.

The show indicates the role of scones, irON, structWSF, UMBEL, conStruct and others and how they leverage existing information assets to enable the semantic enterprise. And, oh, by the way, all of this is done via Web-accessible linked data and our practical technologies.

Enjoy!

Posted:October 18, 2009

irON: Semantic Web for Mere Mortals

New Cross-Scripting Frameworks for XML, JSON and Spreadsheets

On behalf of Structured Dynamics, I am pleased to announce our release into the open source community of irON — the instance record and Object Notation — and its family of frameworks and tools [1]. With irON, you can now author and conduct business solely in the formats and tools most familiar and comfortable to you, all the while enabling your data to interact with the semantic Web.

irON is an abstract notation and associated vocabulary for specifying RDF triples and schema in non-RDF forms. Its purpose is to allow users and tools in non-RDF formats to stage interoperable datasets using RDF. The notation supports writing RDF and schema in JSON (irJSON), XML (irXML) and comma-delimited (CSV) formats (commON).

The surprising thing about irON is that — by following its simple conventions and vocabulary — you will be authoring and creating interoperable RDF datasets without doing much different than your normal practice.

This first specification for the irON notation includes guidance for creating instance records (including in bulk), linkages to existing ontologies and schema, and schema definitions. In this newly published irON specificatiion, profiles and examples are also provided for each of the irXML, irJSON and commON serializations. The irON release also includes a number of parsers and converters of the specification into RDF [2]. Data ingested in the irON frameworks can also be exported as RDF and staged as linked data.

UPDATE: Fred Giasson announced on his blog today (10/20) the release of the irJSON and commON parsers.

Background and Rationale

The objective of irON is to make it easy for data owners to author, read and publish data. This means the starting format should be a human readable, easily writable means for authoring and conveying instance records (that is, instances and their attributes and assigned values) and the datasets that contain them. Among other things, this means that irON‘s notation does not use RDF “triples“, but rather the native notations of the host serializations.

irON is premised on these considerations and observations:

RDF (Resource Description Framework) is a powerful canonical data model for data interoperability [3]
However, most existing data is not written in RDF and many authors and publishers prefer other formats for various reasons
Many formats that are easier to author and read than RDF are variants of the attribute-value pair construct [4], which can readily be expressed as RDF, and
A common abstract notation for converting to RDF would also enable non-RDF formats to become somewhat interchangeable, thus allowing the strengths of each to be combined.

The irON notation and vocabulary is designed to allow the conceptual structure (“schema”) of datasets to be described, to facilitate easy description of the instance records that populate those datasets, and to link different structures for different schema to one another. In these manners, more-or-less complete RDF data structures and instances can be described in alternate formats and be made interoperable. irON provides a simple and naïve information exchange notation expressive enough to describe most any data entity.

The notation also provides a framework for extending existing schema. This means that irON and its three serializations can represent many existing, common data formats and standards, while also providing a vehicle for extending them. Another intent of the specification is to be sparse in terms of requirements. For instance, this reserved vocabulary is fairly minimal and optional in most all cases. The irON specification supports skeletal submissions.

irON Concepts and Vocabulary

The aim of irON is to describe instance records. An instance record is simply a means to represent and convey the information (”attributes”) describing a given instance. An instance is the thing at hand, and need not represent an individual; it could, for example, represent the entire holdings or collection of books in a given library. Such instance records are also known as the ABox [5]. The simple design of irON is in keeping with the limited roles and work associated with this ABox role.

Attributes provide descriptive characteristics for each instance. Every attribute is matched with a value, which can range from descriptive text strings to lists or numeric values. This design is in keeping with simple attribute-value pairs where, in using the terminology of RDF triples, the subject is the instance itself, the predicate is the attribute, and the object is the value. irON has a vocabulary of about 40 reserved attribute terms, though only two are ever required, with a few others strongly recommended for interoperability and interface rendering purposes.

A dataset is an aggregation of instance records used to keep a reference between the instance records and their source (provenance). It is also the container for transmitting those records and providing any metadata descriptions desired. A dataset can be split into multiple dataset slices. Each slice is written to a file serialized in some way. Each slice of a dataset shares the same <id> of the dataset.

Instances can also be assigned to types, which provide the set or classificatory structure for how to relate certain kinds of things (instances) to other kinds of things. The organizational relationships of these types and attributes is described in a schema. irON also has conventions and notations for describing the linkage of attributes and types in a given dataset to existing schema. These linkages are often mapped to established ontologies.

Each of these irON concepts of records, attributes, types, datasets, schema and linkages share similar notations with keywords signaling to the irON parsers and converters how to interpret incoming files and data. There are also provisions for metadata, name spaces, and local and global references.

In these manners, irON and its three serializations can capture virtually the entire scope and power of RDF as a data model, but with simpler and familiar terminology and constructs expected for each serialization.

The Three Serializations

For different reasons and for different audiences, the formats of XML, JSON and CSV (spreadsheets) were chosen as the representative formats across which to formulate the abstract irON notation.

XML, or eXtensible Markup Language, has become the leading data exchange format and syntax for modern applications. It is frequently adopted by industry groups for standards and standard exchange formats. There is a rich diversity of tools that support the language, importantly including capable parsers and query languages. There is also a serialization of RDF in XML. As implemented in the irON notation, we call this serialization irXML.

JSON, the JavaScript Object Notation, has become very popular as a Web 2.0 data exchange format and is often the format of choice to drive JavaScript applications. There is a growing richness of tools that support JSON, including support from leading Web and general scripting languages such as JavaScript, Python, Perl, Ruby and PHP. JSON is relatively easy to read, and is also now growing in popularity with lightweight databases, such as CouchDB. As implemented in the irON notation, we call this serialization irJSON.

CSV, or comma-separated values, is a format that has been in existence for decades. It was made famous by Microsoft as a spreadsheet exchange format, which makes CSV very useful since spreadsheets are the most prevalent data authoring environment in existence. CSV is less expressive and capable as a data format than the other irON serializations, yet still has a attribute-value pair orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc. As implemented in the irON notation, we call this serialization commON.

The following diagram shows how these three formats relate to irON and then the canonical RDF target data model:

We have used the unique differences amongst XML, JSON and CSV to guide the embracing abstract notations within irON. Note the round-tripping implications of the framework.

One exciting prospect for the design is how, merely by following the simple conventions within irON, each of these three data formats — and RDF !! — can be used more-or-less interchangeably, and can be used to extend existing schema within their domains.

Links, References and More

This first release of irON is in version 0.8. Updates and revisions are likely with use. Here are some key links for irON:

The irON specification, also available in download as a PDF
The irON code and vocabulary release site, and
The Google discussion group for the irON notation.

Mid-week, the parsers and converters for structWSF [6] will be released and announced on Fred Giasson’s blog.

In addition, within the next week we will be publishing a case study of converting the Sweet Tools semantic Web and -related tools dataset to commON.

The irON specification and notation by Structured Dynamics LLC is licensed under a Creative Commons Attribution-Share Alike 3.0. irON‘s parsers or converters are available under the Apache License, Version 2.0.

Editors’ Notes

irON is an important piece in the semantic enterprise puzzle that we are building at Structured Dynamics. It reflects our belief that knowledge workers should be able to author and create interoperable datasets without having to learn the arcana of RDF. At the same time we also believe that RDF is the appropriate data model for interoperability. irOn is an expression of our belief that many data formats have appropriate places and uses; there is no need to insist on a single format.

We would like to thank Dr. Jim Pitman for his advocacy of the importance of human-readable and easily authored datasets and formats. Via his leadership of the Bibliographic Knowledge Network (BKN) project and our contractual relationship with it [7], we have learned much regarding the BKN’s own format, BibJSON. Experience with this format has been a catalytic influence in our own work on irON.

— Mike Bergman and Fred Giasson, editors

[1] Please see here for how irON fits within Structured Dynamics’ vision and family of products.

[2] Presently parsers and converters are available for the irJSON and commON serializations, and will be released this week. We have tentatively spec’ed the irXML converter, and would welcome working with another party to finalize a converter. Absent an immediate contribution from a third party, contractual work will likely result in our completing the irXML converter within the reasonable future.

[3] A pivotal premise of irON is the desirability of using the RDF data model as the canonical basis for interoperable data. RDF provides a data model capable of representing any extant data structure and any extant data format. This flexibility makes RDF a perfect data model for federating across disparate data sources. For a detailed discussion of RDF, see Michael K. Bergman, 2009. “Advantages and Myths of RDF,” in AI3 blog, April 8, 2009. See https://www.mkbergman.com/483/advantages-and-myths-of-rdf/.

[4] An attribute-value system is a basic knowledge representation framework comprising a table with columns designating “attributes” (also known as properties, predicates, features, parameters, dimensions, characteristics or independent variables) and rows designating “objects” (also known as entities, instances, exemplars, elements or dependent variables). Each table cell therefore designates the value (also known as state) of a particular attribute of a particular object. This is the basic table presentation of a spreadsheet or relational data table.

Attribute-values can also be presented as pairs in the form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the “object” is implied, or is introduced as the name of the array.

[5]We use the reference to the “ABox” and “TBox” in accordance with this working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”

[6] structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data, with generic tools driven by underlying data structures. Its central perspective is that of the dataset. Access and user rights are granted around these datasets, making the framework enterprise-ready and designed for collaboration. Since a structWSF layer may be placed over virtually any existing datastore with Web access — including large instance record stores in existing relational databases — it is also a framework for Web-wide deployments and interoperability.

[7] BKN is a project to develop a suite of tools and services to encourage formation of virtual organizations in scientific communities of various types. BKN is a project started in September 2008 with funding by the NSF Cyber-enabled Discovery and Innovation (CDI) Program (Award # 0835851). The major participating organizations are the American Institute of Mathematics (AIM), Harvard University, Stanford University and the University of California, Berkeley.

Posted:October 11, 2009

The Law of Linked Data

A Marshal to Bring Order to the Town of Data Gulch

Though not the first, I have been touting the Linked Data Law for a couple of years now [1]. But in a conversation last week, I found that my colleague did not find the premise very clear. I suspect that is due both to cryptic language on my part and the fact no one has really tackled the topic with focus. So, in this post, I try to redress that and also comment on the related role of linked data in the semantic enterprise.

Adding connections to existing information via linked data is a powerful force multiplier, similar to Metcalfe’s law for how the value of a network increases with more users (nodes). I have come to call this the Linked Data Law: the value of a linked data network is proportional to the square of the number of links between data objects.

“In the network economy, the connections are as important as the nodes.” [2]

An early direct mention of the semantic Web and its possible ability to generate network effects comes from a 2003 Mitre report for the government [3]. In it, the authors state, “At present a very small proportion of the data exposed on the web is marked up using Semantic Web vocabularies like RDF and OWL. As more data gets mapped to ontologies, the potential exists to achieve a ‘network effect’.” Prescient, for sure.

In July 2006, both Henry Story and Dion Hinchliffe discussed Metcalfe’s law, with Henry specifically looking to relate it to the semantic Web [4]. He noted that his initial intuition was that “the value of your information grows exponentially with your ability to combine it with new information.” He noted he was trying to find ways to adapt Metcalfe’s law for applicability to the semantic Web.

I picked up on those observations and commented to Henry at that time and in my own post, “The Exponential Driver of Combining Information.” I have been enamoured of the idea ever since, and have begun to weave the idea into my writings.

More recently, in late 2008, James Hendler and Jennifer Golbeck devoted an entire paper to Metcalfe’s law and the semantic Web [5]. In it, they note:

“This linking between ontologies, and between instances in documents that refer to terms in another ontology, is where much of the latent value of the Semantic Web lies. The vocabularies, and particularly linked vocabularies using URIs, of the Semantic Web create a graph space with the ability to link any term to any other. As this link space grows with the use of RDF and OWL, Metcalfe’s law will once again be exploited – the more terms to link to, and the more links created, the more value in creating more terms and linking them in.”

A Refresher on Metcalfe’s Law

Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of users of the system (n²) (note: it is not exponential, as some of the points above imply). Robert Metcalfe formulated it about 1980 in relation to Ethernet and fax machines; the “law” was then named for Metcalfe and popularized by George Gilder in 1993.

These attempts to estimate the value of physical networks were in keeping with earlier efforts to estimate the value of a broadcast network. That value is almost universally agreed to be proportional to the number of users, as accepted as Sarnoff’s law (see further below).

The actual algorithm proposed by Metcalfe calculates the number of unique connections in a network with n nodes to be n(n − 1)/2, which is proportional to n². This makes Metcalfe’s law a quadratic growth equation.

As nodes get added, then, we see the following increase in connections:

‘Network Effect’ for Physical Networks

This diagram, modified from Wikipedia to be a horizontal image, shows how two telephones can make only one connection, five can make 10 connections, and twelve can make 66 connections, etc.

By definition, a physical network is a connected network. Thus, every time a new node is added to the network, connections are added, too. This general formula has also been embraced as a way to discuss social connections on the Internet [6].

Analogies to Linked Data

Like physical networks, the interconnectedness of the semantic Web or semantic enterprise is a graph.

The idea behind linked data is to make connections between data. Unlike physical telecommunication networks, however, the nodes in the form of datasets and data are (largely) already there. What is missing are the connections. The build-out and growth that produces the network effects in a linked data context do not result from adding more nodes, but from the linking or connecting of existing nodes.

The fact that adding a node to a physical network carries with it an associated connection has tended to conjoin these two complementary requirements of node and connection. But, to grok the real dynamics and to gain network effects, we need to realize: Both nodes and connections are necessary.

One circumstance of the enterprise is that data nodes are everywhere. The fact that the overwhelming majority are unconnected is why we have adopted the popular colloquialism of data “silos”. There are also massive amounts of unconnected data on the Web in the form of dynamic databases only accessible via search form, and isolated data tables and listings virtually everywhere.

Thus, the essence of the semantic enterprise and the semantic Web is no more complicated than connecting — meaningfully — data nodes that already exist.

As the following diagram shows, unconnected data nodes or silos look like random particles caught in the chaos of Brownian motion:

‘Network Effect’ for Coherent Linked Data

As initial connections get made, bits of structure begin to emerge. But, as connections are proliferated — exactly equivalant to the network effects of connected networks — coherence and value emerge.

Look at the last part in the series diagram above. We not only see that the same nodes are now all connected, with the inferences and relationships that result from those connections, but we can also see entirely new structures emerge by virtue of those connections. All of this structure and meaning was totally absent prior to making the linked data connections.

Quantifying the Network Effect

So, what is the benefit of this linked data? It depends on the product of the value of the connections and the multiplier of the network effect:

linked data benefit = connections value X network effect multiplier

Just as it is hard to have a conversation via phone with yourself, or to collaborate with yourself, the ability to gain perspective and context from data comes from connections. But like some phone calls or some collaborations, the value depends on the participants. In the case of linked data, that depends on the quality of the data and its coherence [7]. The value “constant” for connected linked data depends in some manner on these factors, as well as the purposes and circumstances to which that linked data might be applied.

Even in physical networks or social collaboration contexts, the “value” of the network has been hard to quantify. And, while academics and researchers will appropriately and naturally call for more research on these questions, we do not need to be so timid. Whatever the alpha constant is for quantifying the value of a linked data network, our intuition should be clear that making connections, finding relationships, making inferences, and making discoveries can not occur when data is in isolation.

Because I am an advocate, I believe this alpha constant of value to be quite large. I believe this constant is also higher for circumstances of business intelligence, knowledge management and discovery.

The second part of the benefit equation is the multiplier for network effects. We’ve mentioned before the linear growth advantage due to broadcast networks (Sarnoff law) and the standard quadratic growth assumption of physical and social networks (Metcalfe law). Naturally, there have been other estimates and advocacies.

David Reed [8], for example, also adds group effects and has asserted an exponential multiplier to the network effect (like Henry Story’s initial intuition noted above). As he states,

“[E]ven Metcalfe’s Law understates the value created by a group-forming network [GFN] as it grows. Let’s say you have a GFN with n members. If you add up all the potential two-person groups, three-person groups, and so on that those members could form, the number of possible groups equals 2ⁿ. So the value of a GFN increases exponentially, in proportion to 2ⁿ. I call that Reed’s Law. And its implications are profound.”

Yet not all agree with the assertion of an exponential multiplier, let alone the quadratic one of Metcalfe. Odlyzko and Tilly [9] note that Metcalfe’s law would hold if the value that an individual gets personally from a network is directly proportional to the number of people in that network. But, then they argue that does not hold because of local preferences or different qualities of interaction. In a linked data context, such arguments have merit, though you may also want to see Metcalfe’s own counter-arguments [6].

Hinchliffe’s earlier commentary [4] provided a nice graphic that shows the implications of these various multiplers on the network effect, as a function of nodes in a network:

Potency of the Network Effect from Dion Hinchliffe

Various Estimates for the ‘Network Effect’

I believe we can dismiss the lower linear bound of this question and likely the higher exponential one as well (that is, Reed’s law, because quality and relevance questions make some linked data connections less valuable than others). Per the above, that would suggest that the multiplier of the linked data network is perhaps closer to the Metcalfe estimate or similar.

In any event, it is also essential to point out that connecting data indiscriminantly for linked data’s sake will likely deliver few, if any, benefits. Connections must still be coherent and logical for the value benefits to be realized.

The Role and Contribution of Linked Data

I elsewhere discuss the role of linked data in the enterprise and will continue to do so. But, there are some implications in the above that warrant some further observations.

It should be clear that the graph and network basis of linked data, not to mention some of the uncertainties as to quantifying benefits, suggests the practice should be considered apart from mission-critical or transactional uses in the enterprise. That may change with time and experience.

There are also open questions about data quality in terms of inputs to linked data and possible erroneous semantics and ontologies to guide the linked connections. Operational uses should be kept off the table for now. Like physical networks, not all links perform well and not all have usefulness. Similarly to how poor connections may be encountered in physical networks, they should be either taken off-ledger or relegated to a back-up basis. Linked data should be understood and treated no differently than networks of variable quality.

Such realism is important — for both internal and external linked data advocates — to allow linked data to be applied in the right venues at acceptable risk and with likely demonstrable benefits. Elsewhere I have advocated an approach that builds on existing assets; here I advocate a clear and smart understanding of where linked data can best deliver network effects in the near term.

And, so, in the nearest term, enterprise applications that best fit linked data promises and uncertainties include:

Establishing frameworks for data federation
Business intelligence
Discovery
Knowledge management and knowledge resources
Reasoning and inference
Development of internal common language
Learning and adopting data-driven apps [10], and
Staging and analysis for data cleaning.

A New Deputy Has Come to Town

As in the Wild West, the new deputy marshal and his tin badge did not guarantee prosperity. But a good marshal would deliver law and order. And those are the preconditions for the town folk to take charge of building their own prosperity.

Linked data is a practice for starting to bring order and connections to your existing data. Once some order has been imposed, the framework then becomes a basis for defining meanings and then gaining value from those connections.

Once order has been gained, it is up to the good citizens of Data Gulch to then deliver the prosperity. Broad participation and the network effect are one way to promote that aim. But success and prosperity still depends on intelligence and good policies and practice.

[1] I first put forward this linked data aspect in What is Linked Data?, dated June 23, 2008. I then formalized it in Structure the World, dated August 3, 2009.

[2] Paul Tearnen, 2006. “Integration in the Network Economy,” Information Management Special Reports, October 2006. See http://www.information-management.com/specialreports/20061010/1064941-1.html.

[3] Salim K. Semy, Mark Linderman and Mary K. Pulvermacher, 2003. “Information Management Meets the Semantic Web,” DOD Report by MITRE Corporation, November 2003, 10 pp. See http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA460265&Location=U2&doc=GetTRDoc.pdf.

[4] On July 15, 2006, Dion Hinchcliffe wrote, Web 2.0’s Real Secret Sauce: Network Effects. He produced a couple of useful graphics and expanded upon some earlier comments to the Wall Street Journal. Shortly thereafter, on July 29, Story wrote his own post, RDF and Metcalfe’s law, as noted. I commented on July 30.

[5] James Hendler and Jennifer Golbeck, 2008. “Metcalfe’s Law, Web 2.0, and the Semantic Web,” in Journal of Web Semantics 6(1):14-20, 2008. See http://www.cs.umd.edu/~golbeck/downloads/Web20-SW-JWS-webVersion.pdf.

[6] Robert Metcalfe, 2006. Metcalfe’s Law Recurses Down the Long Tail of Social Networking, see http://vcmike.wordpress.com/2006/08/18/metcalfe-social-networks/.

[7] See my chronological listing for additional candidates.

[8] From David P. Reed, 2001. “The Law of the Pack,” Harvard Business Review, February 2001, pp 23-4. For more on Reed’s position, see Wikipedia’s entry on Reed’s law.

[9] Andrew Odlyzko and Benjamin Tilly, 2005. A Refutation of Metcalfe’s Law and a Better Estimate for the Value of Networks and Network Interconnections, personal publication; see http://www.dtc.umn.edu/~odlyzko/doc/metcalfe.pdf.

[10] Data-driven applications are the term we have adopted for modular, generic tools that operate and present results to users based on the underlying data structures that feed them. See further the discussion of Structured Dynamics’s products.

Posted:October 9, 2009

Brown Bag Lunch: ‘Long Tails’ Have ‘Teeny Heads’

Pinheads Sometimes Get the Last Laugh

The idea of the ‘long tail’ was brilliant, and Chris Anderson’s meme has become part of our current lingo in record time. The long tail is the colloquial name for a common feature in some statistical distributions where an initial high-amplitude peak within a population distribution is followed by a rapid decline and then a relatively stable, declining low-amplitude population that “tails off.” (An asymptotic curve.) This sounds fancy; it really is not. It simply means that a very few things are very popular or topical, most everything else is not.

The following graph is a typical depiction of such a statistical distribution with the long tail shown in yellow. Such distributions often go by the names of power laws, Zipf distributions, Pareto distributions or general Lévy distributions. (Generally, such curves when plotted on a semi-logarithmic scale now show the curve to be straight, with the slope being an expression of its “power”.)

It is a common observation that virtually everything measurable on the Internet — site popularity, site traffic, ad revenues, tag frequencies on del.icio.us, open source downloads by title, Web sites chosen to be digg‘ed, Google search terms — follows such power laws or curves.

However, the real argument that Anderson made first in Wired magazine and then in his 2006 book, The Long Tail: Why the Future of Business is Selling Less of More, is that the Internet with either electronic or distributed fulfillment means that the cumulative provision of items in the long tail is now enabling the economics of some companies to move from “mass” commodities to “specialized” desires. Or, more simply put: There is money to be made in catering to individualized tastes.

I, too, agree with this argument, and it is a brilliant recognition of the fact that the Internet changes everything.

But Long Tails Have ‘Teeny Heads’

Yet what is amazing about this observation of long tails on the Internet has been the total lack of discussion of its natural reciprocal: namely, long tails have teeny heads, the red portion of the diagram. For, after all, what also is the curve above telling us? While Anderson’s point that Amazon can carry millions of book titles and still make a profit by only selling a few of each, what is going on at the other end of the curve — the head end of the curve?

Well, if we’re thinking about book sales, we can make the natural and expected observation that the head end of the curve represents sales of the best seller books; that is, all of those things in the old 80-20 world that is now being blown away with the long tail economics of the Internet. Given today’s understandings, this observation is pretty prosaic since it forms the basis of Anderson’s new long tail argument. Pre-Internet limits (it’s almost like saying before the Industrial Revolution) kept diversity low and choices few.

Okaaaay! Now that seems to make sense. But aren’t we still missing something? Indeed we are.

Social Collaboration Depends on ‘Teeny Heads’

So, when we look at many of those aspects that make up what is known as Web 2.0 or even the emerging semantic Web, we see that collaboration and user-submitted content stands at the fore. And our general power law curves then also affirm that it is a very few who supply most of that user-generated content — namely, those at the head end, the teeny heads. If those relative few individuals are not motivated, the engine that drives the social content stalls and stutters. Successful social collaboration sites are the ones that are able to marshal “large numbers of the small percentage.”

The natural question thus arises: What makes those “teeny heads” want to contribute? And what makes it so they want to contribute big — that is, frequently and with dedication? So, suddenly now, here’s a new success factor: to be successful as a collaboration site, you must appeal to the top 1% of users. They will drive your content generation. They are the ‘teeny heads’ at the top of your power curve.

Well, things just got really more difficult. We need tools, mindshare and other intangibles to attract the “1%” that will actually generate our site’s content. But we also need easy frameworks and interfaces for the general Internet population to live comfortably within the long tail.

So, heads or tails? Naahh, that’s the wrong question. Keep flipping until you get both!

This Friday brown bag leftover was first placed into the AI3 refrigerator on April 6, 2007. I remain convinced that only a few “movers” are required to start shaking the user-generated content tree. I made two changes from the original post: 1) I had to find a grainy replacement for the initial great graphic of the “teeny head”; and 2) I added the red portion to the graphic. No other changes have been made.

Main Links

Search

Author: Mike Bergman