Posted: June 6, 2006

Semantic mediation — that is, resolving semantic heterogeneities — must address more than 40 discrete categories of potential mismatch, ranging across units of measure, terminology, language and many others. These mismatches may derive from structure, domain, data or language.

Earlier postings in this series traced the progress in climbing the data federation pyramid to today’s emphasis on the semantic Web. This series is partly aimed at disabusing the notion that data extensibility can arise simply by using the XML (eXtensible Markup Language) data representation protocol. As Stonebraker and Hellerstein correctly observe:

XML is sometimes marketed as the solution to the semantic heterogeneity problem . . . . Nothing could be further from the truth. Just because two people tag a data element as a salary does not mean that the two data elements are comparable. One could be salary after taxes in French francs including a lunch allowance, while the other could be salary before taxes in US dollars. Furthermore, if you call them “rubber gloves” and I call them “latex hand protectors”, then XML will be useless in deciding that they are the same concept. Hence, the role of XML will be limited to providing the vocabulary in which common schemas can be constructed.[1]
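To make the quoted point concrete, here is a minimal Python sketch (my own illustration, not from the cited chapter; the element names, attribute values and figures are invented). A purely XML-level comparison sees two matching salary tags, yet the semantics attached to each value differ:

```python
# Two records both tag a value as <salary>, but the values are not comparable
# without agreement on currency, tax basis and what the figure includes.
import xml.etree.ElementTree as ET

record_a = ET.fromstring(
    '<employee><salary currency="FRF" basis="net" includes="lunch_allowance">152000</salary></employee>'
)
record_b = ET.fromstring(
    '<employee><salary currency="USD" basis="gross">52000</salary></employee>'
)

salary_a = record_a.find("salary")
salary_b = record_b.find("salary")

print(salary_a.tag == salary_b.tag)        # True -- the tags "agree"
print(salary_a.attrib == salary_b.attrib)  # False -- the meanings do not
```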

This series also covers the ontologies and the OWL language (written in XML) that now give us the means to understand and process these different domains and “world views” by machine. According to Natalya Noy, one of the principal researchers behind the Protégé development environment for ontologies and knowledge-based systems:

How are ontologies and the Semantic Web different from other forms of structured and semi-structured data, from database schemas to XML? Perhaps one of the main differences lies in their explicit formalization. If we make more of our assumptions explicit and able to be processed by machines, automatically or semi-automatically integrating the data will be easier. Here is another way to look at this: ontology languages have formal semantics, which makes building software agents that process them much easier, in the sense that their behavior is much more predictable (assuming they follow the specified explicit semantics–but at least there is something to follow). [2]

Again, however, simply because OWL (or similar) languages now give us the means to represent an ontology, we still have the vexing challenge of how to resolve the differences between different “world views,” even within the same domain. According to Alon Halevy:

When independent parties develop database schemas for the same domain, they will almost always be quite different from each other. These differences are referred to as semantic heterogeneity, which also appears in the presence of multiple XML documents, Web services, and ontologies–or more broadly, whenever there is more than one way to structure a body of data. The presence of semi-structured data exacerbates semantic heterogeneity, because semi-structured schemas are much more flexible to start with. For multiple data systems to cooperate with each other, they must understand each other’s schemas. Without such understanding, the multitude of data sources amounts to a digital version of the Tower of Babel. [3]

In the sections below, I describe the sources from which this heterogeneity arises and classify its many different types. I then describe some broad approaches to overcoming these heterogeneities, though a subsequent post looks at that topic in more detail.

Causes and Sources of Semantic Heterogeneity

There are many potential circumstances where semantic heterogeneity may arise (partially from Halevy [3]):

  • Enterprise information integration
  • Querying and indexing the deep Web (which is a classic data federation problem in that there are literally tens to hundreds of thousands of separate Web databases) [4]
  • Merchant catalog mapping
  • Schema v. data heterogeneity
  • Schema heterogeneity and semi-structured data.

Naturally, there will always be differences in how differing authors or sponsors create their own particular “world view,” which, if transmitted in XML or expressed through an ontology language such as OWL, may also result in differences based on expression or syntax. Indeed, the ease of conveying these schemas as semi-structured XML, RDF or OWL is itself a source of potential expression heterogeneities. There are also other sources, in simple schema use and versioning, that can create mismatches [3]. Thus, semantic mismatches can be driven by world view, perspective, syntax, structure, versioning and timing:

  • One schema may express a similar “world view” with different syntax, grammar or structure
  • One schema may be a new version of the other
  • Two or more schemas may be evolutions of the same original schema
  • There may be many sources modeling the same aspects of the underlying domain (“horizontal resolution” such as for competing trade associations or standards bodies), or
  • There may be many sources that cover different domains but overlap at the seams (“vertical resolution” such as between pharmaceuticals and basic medicine).

Regardless, the needs for semantic mediation are manifest, as are the ways in which semantic heterogeneities may arise.

Classification of Semantic Heterogeneities

The earliest classification scheme applied to data semantics that I am aware of is from William Kent nearly 20 years ago.[5] (If you know of earlier ones, please send me a note.) Kent’s approach dealt more with structural mapping issues (see below) than with differences in meaning, for which he pointed to data dictionaries as a potential solution.

The most comprehensive schema I have yet encountered is from Pluempitiwiriyawej and Hammer, “A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources.” [6] They classify heterogeneities into three broad classes:

  • Structural conflicts arise when the schemas of the sources representing related or overlapping data exhibit discrepancies. Structural conflicts can be detected when comparing the underlying DTDs. The class of structural conflicts includes generalization conflicts, aggregation conflicts, internal path discrepancies, missing items, element ordering, constraint and type mismatches, and naming conflicts between element types and attribute names.
  • Domain conflicts arise when the semantics of the data sources to be integrated exhibit discrepancies. Domain conflicts can be detected by looking at the information contained in the DTDs and using knowledge about the underlying data domains. The class of domain conflicts includes schematic discrepancies, scale or unit conflicts, precision conflicts, and data representation conflicts.
  • Data conflicts refer to discrepancies among similar or related data values across multiple sources. Data conflicts can only be detected by comparing the underlying documents. The class of data conflicts includes ID-value conflicts, missing data, incorrect spelling, and naming conflicts between element contents and attribute values.

Moreover, mismatches or conflicts can occur between set elements (a “population” mismatch) or attributes (a “description” mismatch); a toy sketch below illustrates how a couple of these conflict types might surface.
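To ground the classification, here is a toy Python sketch (my own illustration; the element definitions and unit labels are invented) showing how a structural naming conflict and a domain scale-or-units conflict might be flagged when comparing element definitions from two sources:

```python
# Compare two sources' element definitions and flag two kinds of conflict:
# a structural naming conflict (case sensitivity) and a domain units conflict.
schema_a = {"PartNo": {"type": "string"}, "weight": {"type": "float", "unit": "kg"}}
schema_b = {"partno": {"type": "string"}, "Weight": {"type": "float", "unit": "lb"}}

def compare(a, b):
    findings = []
    for name_a, spec_a in a.items():
        for name_b, spec_b in b.items():
            if name_a.lower() != name_b.lower():
                continue  # crudely treat case-insensitive matches as "the same" element
            if name_a != name_b:
                findings.append(("structural: naming (case sensitivity)", name_a, name_b))
            if spec_a.get("unit") and spec_b.get("unit") and spec_a["unit"] != spec_b["unit"]:
                findings.append(("domain: scale or units", spec_a["unit"], spec_b["unit"]))
    return findings

print(compare(schema_a, schema_b))
```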

The table below builds on Pluempitiwiriyawej and Hammer’s schema by adding a fourth major explicit category, language, leading to about 40 distinct potential sources of semantic heterogeneities:

| Class | Category | Subcategory |
|---|---|---|
| STRUCTURAL | Naming | Case Sensitivity |
| | | Synonyms |
| | | Acronyms |
| | | Homonyms |
| | Generalization / Specialization | |
| | Aggregation | Intra-aggregation |
| | | Inter-aggregation |
| | Internal Path Discrepancy | |
| | Missing Item | Content Discrepancy |
| | | Attribute List Discrepancy |
| | | Missing Attribute |
| | | Missing Content |
| | Element Ordering | |
| | Constraint Mismatch | |
| | Type Mismatch | |
| DOMAIN | Schematic Discrepancy | Element-value to Element-label Mapping |
| | | Attribute-value to Element-label Mapping |
| | | Element-value to Attribute-label Mapping |
| | | Attribute-value to Attribute-label Mapping |
| | Scale or Units | |
| | Precision | |
| | Data Representation | Primitive Data Type |
| | | Data Format |
| DATA | Naming | Case Sensitivity |
| | | Synonyms |
| | | Acronyms |
| | | Homonyms |
| | ID Mismatch or Missing ID | |
| | Missing Data | |
| | Incorrect Spelling | |
| LANGUAGE | Encoding | Ingest Encoding Mismatch |
| | | Ingest Encoding Lacking |
| | | Query Encoding Mismatch |
| | | Query Encoding Lacking |
| | Languages | Script Mismatches |
| | | Parsing / Morphological Analysis Errors (many) |
| | | Syntactical Errors (many) |
| | | Semantic Errors (many) |

Most of these line items are self-explanatory, but a few may not be:

  • Homonyms occur when the same name refers to more than one concept, such as Name referring to a person v. Name referring to a book
  • A generalization/specialization mismatch can occur when single items in one schema are related to multiple items in another schema, or vice versa. For example, one schema may refer to “phone” but the other schema has multiple elements such as “home phone,” “work phone” and “cell phone”
  • Intra-aggregation mismatches arise when the same population is divided differently by schema (Census v. Federal regions for states, or full person names v. first-middle-last, for example), whereas inter-aggregation mismatches can arise from sums or counts used as added values
  • Internal path discrepancies can arise from different source-target retrieval paths in two different schemas (for example, hierarchical structures where the elements are different levels of remove)
  • The four sub-types of schematic discrepancy refer to where attribute and element names may be interchanged between schemas
  • Under language, encoding mismatches can occur when either the import or export of data to XML assumes the wrong encoding type. While XML is based on Unicode, it is important that source retrievals and issued queries use the proper encoding of the source. For Web retrievals this is very important, because only about 4% of all documents are in Unicode, and BrightPlanet has earlier estimated that there may be on the order of 25,000 language-encoding pairs presently on the Internet (see the sketch after this list)
  • Even when the correct encoding is detected, different language sources differ significantly in parsing (white space, for example), syntax and semantics, which can also lead to many error types.
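As a concrete illustration of the encoding line items above, here is a minimal Python sketch (my own; the sample text and the candidate encoding list are invented for the example) of an ingest encoding mismatch and a crude fallback strategy:

```python
# The same bytes read under the wrong assumed encoding either fail outright or
# silently corrupt characters, so ingest must detect (or be told) the source encoding.
source_text = "Müller señaló: résumé"          # what the source actually says
raw_bytes = source_text.encode("iso-8859-1")   # the source was published in Latin-1

try:
    decoded = raw_bytes.decode("utf-8")        # naive ingest assumes UTF-8
except UnicodeDecodeError:
    decoded = None
print(decoded)                                 # None -- at least the mismatch was detectable

# A crude fallback: try a prioritized list of candidate encodings.
for candidate in ("utf-8", "iso-8859-1", "cp1252"):
    try:
        print(candidate, "->", raw_bytes.decode(candidate))
        break
    except UnicodeDecodeError:
        continue
```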

It should be noted that Sheth et al. take a different approach to classifying semantics and integration.[7] They split semantics into three forms: implicit, formal and powerful. Implicit semantics are those that are largely present already or can easily be extracted; formal semantics, though relatively scarce, occur in the form of ontologies or other descriptive logics; and powerful (soft) semantics are fuzzy and not limited to rigid set-based assignments. Sheth et al.’s main point is that first-order logic (FOL) or descriptive logic alone is inadequate to properly capture the needed semantics.

From my viewpoint, Pluempitiwiriyawej and Hammer’s [6] classification better lends itself to pragmatic tools and approaches, though the Sheth et al. approach also helps indicate what can be processed in situ from input data v. what must be inferred or matched probabilistically.

Importance of Reference Standards

An attractive and compelling vision — perhaps even a likely one — is that standard reference ontologies will become increasingly prevalent as semantic mediation comes to be seen as a mainstream problem. Certainly, a start has been made with the use of the Dublin Core metadata initiative, and increasingly other associations, organizations, and major buyers are busy developing “standardized” or reference ontologies.[8] Indeed, there are now more than 10,000 ontologies available on the Web.[9] Insofar as these gain acceptance, semantic mediation can become an effort mostly at the periphery and not the core.

But such is not the case today. Standards have had only limited success, and then only in targeted domains where incentives are strong. That acceptance and benefit threshold has yet to be reached on the Web. Until such time, a multiplicity of automated methods, semi-automated methods and gazetteers will all be required to help resolve these potential heterogeneities.


[1] Michael Stonebraker and Joey Hellerstein, “What Goes Around Comes Around,” in Joseph M. Hellerstein and Michael Stonebraker, editors, Readings in Database Systems, Fourth Edition, pp. 2-41, The MIT Press, Cambridge, MA, 2005. See http://mitpress.mit.edu/books/chapters/0262693143chapm1.pdf.

[2] Natalya Noy, “Order from Chaos,” ACM Queue vol. 3, no. 8, October 2005. See http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=341&page=1

[3] Alon Halevy, “Why Your Data Won’t Mix,” ACM Queue vol. 3, no. 8, October 2005. See http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=336.

[4] Michael K. Bergman, “The Deep Web: Surfacing Hidden Value,” BrightPlanet Corporation White Paper, June 2000. The most recent version of the study was published by the University of Michigan’s Journal of Electronic Publishing in July 2001. See http://www.press.umich.edu/jep/07-01/bergman.html.

[5] William Kent, “The Many Forms of a Single Fact”, Proceedings of the IEEE COMPCON, Feb. 27-Mar. 3, 1989, San Francisco. Also HPL-SAL-88-8, Hewlett-Packard Laboratories, Oct. 21, 1988. [13 pp]. See http://www.bkent.net/Doc/manyform.htm.

[6] Charnyote Pluempitiwiriyawej and Joachim Hammer, “A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources,” Technical Report TR00-004, University of Florida, Gainesville, FL, 36 pp., September 2000. See ftp.dbcenter.cise.ufl.edu/Pub/publications/tr00-004.pdf.

[7] Amit Sheth, Cartic Ramakrishnan and Christopher Thomas, “Semantics for the Semantic Web: The Implicit, the Formal and the Powerful,” in Int’l Journal on Semantic Web & Information Systems, 1(1), 1-18, Jan-March 2005. See http://www.informatik.uni-trier.de/~ley/db/journals/ijswis/ijswis1.html

[8] See, among scores of possible examples, the NIEM (National Information Exchange Model) agreed to between the US Departments of Justice and Homeland Security; see http://www.niem.gov/.

[9] OWL Ontologies: When Machine Readable is Not Good Enough

Posted: June 5, 2006

The previous posting in this series described climbing the data federation pyramid and the progress made in the last decade in overcoming seemingly intractable problems involving hardware, software and networks. Key enablers of that progress were the adoption of Internet protocols and formats (TCP/IP, HTML) and the adoption of the XML data representation standard.

Data Federation Wanes, Semantic Web Waxes

Through the late 1990s the focus of the data federation challenge matured toward overcoming differences in meaning. Today, we know this challenge as the semantic Web or semantic mediation. The intellectual trigger for this shift in emphasis came from Tim Berners-Lee, James Hendler, and Ora Lassila when they described their “grand vision” for the semantic Web in a Scientific American article in 2001. [1] The authors described the semantic Web as follows:

To date, the World Wide Web has developed most rapidly as a medium of documents for people rather than of information that can be manipulated automatically. By augmenting Web pages with data targeted at computers and by adding documents solely for computers, we will transform the Web into the Semantic Web. Computers will find the meaning of semantic data by following hyperlinks to definitions of key terms and rules for reasoning about them logically. The resulting infrastructure will spur the development of automated Web services such as highly functional agents. Ordinary users will compose Semantic Web pages and add new definitions and rules using off-the-shelf software that will assist with semantic markup.

Berners-Lee had already been proselytizing on this topic for a few years, notably in his Weaving the Web book in 1999.[2] But the Scientific American article really popularized the topic.

Researchers in the field came to rely on a diagram that Berners-Lee had also developed to explain the various protocols and challenges underlying semantic Web technologies. This diagram, often affectionately called the “birthday cake,” has gone through many iterations. Here is one of the most widely reproduced versions from a Berners-Lee talk given in 2000: [3]

The Berners-Lee Semantic Web ‘Birthday Cake’

The Layers

Note that this diagram expands on the four top layers of data representation, semantics, pragmatics and trust from the pyramid graphic in my previous climbing the data federation pyramid post. (Also, note that Ian Horrocks et al. have updated this “stack” and looked at it from the basis of current standards, including OWL and inclusion of encryption.[4])

The foundation of the “stack” is Unicode, an industry standard for digital representation of human languages, symbols and scripts, and URIs (uniform resource identifiers), which, like URLs, provide a unique and unambiguous basis for locating resources.

The next layer, as in the data federation pyramid, is XML.

The basic enabler for semantic representation comes from the next layer, RDF + Schema. RDF (Resource Description Framework) is a first-order description logic “triple” representation of subject – predicate – object. The subjects and objects are nouns or “things,” with the subject needing to be described via a URI (optional for the object). The predicate is a verb that describes the relationship between subject and object, and is often expressed in syntax such as isPartOf or hasSex or hasBirthplace. In terms of graph theory, RDF is a directed graph where the subjects and objects are nodes and the predicates are edges. RDF Schema extends the RDF “triple” by adding semantics that relate domains, relationships, subclasses and subproperties. RDF Schema provides very wide interoperability, but it is minimalist and unable to capture a complete semantic logic.
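To make the triple model concrete, here is a small, library-free Python sketch (my own illustration; the URIs and predicate names such as hasBirthplace are invented) that stores statements as (subject, predicate, object) tuples and queries them as a directed graph:

```python
# Each statement is a (subject, predicate, object) tuple; subjects and predicates
# are URIs, while objects may be URIs or literals.
EX = "http://example.org/"

triples = {
    (EX + "MarkTwain", EX + "hasBirthplace", EX + "Florida_Missouri"),
    (EX + "MarkTwain", EX + "isAuthorOf",    EX + "TomSawyer"),
    (EX + "TomSawyer", EX + "hasTitle",      "The Adventures of Tom Sawyer"),  # literal object
}

def objects(subject, predicate):
    """Return all objects linked to a subject by a predicate (a basic graph lookup)."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}

print(objects(EX + "MarkTwain", EX + "isAuthorOf"))
```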

The ontology layer provides more “meta” information, such as transitive, unique, unambiguous, cardinality or other properties. Based on RDF, ontology languages provide a means for conveying domain representations or “world views” electronically for machine processing. Today, the standard is OWL (Web Ontology Language), which grew out of the earlier OIL (EU) and DAML (US) incipient standards. (However, any internally consistent syntax and language for descriptive logic can also qualify as an ontology layer.) OWL itself comes in three levels — or sub-languages — of increasing expressiveness. OWL Lite supports classification hierarchies and simple constraints (for example, cardinality values of only 0 or 1). OWL DL is a computationally complete description logic (all statements can be computed and will finish in finite time). OWL Full provides the syntactic freedom of RDF with no computational guarantees. OWL Full may be necessary for a complete representation of an ontological domain, even though it cannot be guaranteed to be internally consistent. Each of these sublanguages is an extension of its simpler predecessor.
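As a rough illustration of what this extra “meta” information buys, here is a toy Python sketch (my own, not an OWL reasoner; the isPartOf facts are invented) of the inference licensed by declaring a property transitive:

```python
# If isPartOf is declared transitive, a reasoner may keep adding implied triples
# until no new ones follow (a transitive closure).
facts = {
    ("Bordeaux", "isPartOf", "France"),
    ("France",   "isPartOf", "EuropeanUnion"),
}

def transitive_closure(triples, predicate):
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(inferred):
            for (c, p2, d) in list(inferred):
                if p1 == p2 == predicate and b == c and (a, predicate, d) not in inferred:
                    inferred.add((a, predicate, d))
                    changed = True
    return inferred

print(transitive_closure(facts, "isPartOf") - facts)
# {('Bordeaux', 'isPartOf', 'EuropeanUnion')}
```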

Of course, the real rub arises when different world views need to be reconciled, or what is known as semantic mediation. In this instance, it is now necessary to invoke reconciliation logic. (Is my “glad” your “happy”? Are my countries expressed as two-letter acronyms and yours spelled out in French, and do yours include native lands in addition to nation-states?)

(In fact, the next posting in this series actually details about 40 different sources of semantic heterogeneity.)

So, even if multiple domain specifications are provided via OWL, federating them requires mediating these heterogeneities, and that requires some form of logic or rule-based expert system. Thus, in terms of standards, we have achieved the representational ways to express semantics, but the logics and rules for resolving them are open and not likely subject to standards. (Indeed, most view the semantic mediation step at best as lending itself to semi-automatic methods.)
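To give a flavor of that reconciliation logic, here is a highly simplified Python sketch (my own; the country-code and synonym lookup tables are invented stand-ins for the gazetteers or mapping rules a real mediator would maintain) that normalizes one source’s vocabulary into another’s:

```python
# Rule-based normalization of a source record into a target vocabulary:
# two-letter country codes become French country names, and "glad" maps to "happy".
COUNTRY_CODE_TO_FR = {"DE": "Allemagne", "US": "États-Unis", "ES": "Espagne"}
SENTIMENT_SYNONYMS = {"glad": "happy", "pleased": "happy"}

def mediate(record):
    out = dict(record)
    if out.get("country") in COUNTRY_CODE_TO_FR:
        out["country"] = COUNTRY_CODE_TO_FR[out["country"]]
    if "mood" in out:
        out["mood"] = SENTIMENT_SYNONYMS.get(out["mood"], out["mood"])
    return out

print(mediate({"country": "DE", "mood": "glad"}))
# {'country': 'Allemagne', 'mood': 'happy'}
```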

Finally, the ‘birthday cake’ shows that even with logics in place to resolve or mediate heterogeneities, the vexing challenge of what information to trust remains, the resolution of which is perhaps aided with digital signatures or certificates.

NOTE: This posting is part of an occasional series looking at a new category that I and BrightPlanet are terming the eXtensible Semantic Data Model (XSDM). Topics in this series cover all information related to extensible data models and engines applicable to documents, metadata, attributes, semi-structured data, or the processing, storing and indexing of XML, RDF, OWL, or SKOS data. A major white paper will be produced at the conclusion of the series.

[1] Tim Berners-Lee, James Hendler, and Ora Lassila, “The Semantic Web,” in Scientific American 284(5): pp 34-43, 2001. See http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2.

[2] Tim Berners-Lee and Mark Fischetti, Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor, Harper, San Francisco, 226 pp., 1999.

[3] Tim Berners-Lee, “Semantic Web on XML,” at XML 2000, December 6, Washington, DC. See http://www.w3.org/2000/Talks/1206-xml2k-tbl

[4] Ian Horrocks, Bijan Parsia, Peter Patel-Schneider, and James Hendler, “Semantic Web Architecture: Stack or Two Towers?,” in Francois Fages and Sylvain Soliman, editors, Principles and Practice of Semantic Web Reasoning (PPSWR 2005), No. 3703 in LNCS, pp 37-41, 2005. See http://www.cs.man.ac.uk/~horrocks/Publications/download/2005/HPPH05.pdf.

Posted: June 2, 2006

 

NOTE: This is an update of a 2005 post.

There has been a massive — but little noticed — shift in enterprise software expenditures and software company revenues in the past decade. A "typical" enterprise software vendor could expect to obtain 70% or more of its total revenues from software license fees a decade ago. Today, that percentage is about 35%, with statistically significant trends heading toward below 10% within the decade. These trends have significant implications for the business models necessary for software companies to be successful.

The Trends and Data

The figure below provides software license revenues as a percent of total revenues for about 120 different software companies over the past decade. No matter the sample, there has been a steady — and significantly strong — trend toward declining license revenues.

Software Licensing Trends

The three sources for this figure are:

  • Search – my own values for Autonomy and Convera from SEC filings
  • Top 100 – these are listings compiled by Culpepper & Associates [2]
  • MIT – these values are from the MIT Sloan School of Management, using eight leading companies as referenced by Michael Cusumano [3]

The trend lines indicate continued percentage declines in the importance of software licensing. Based on these ten-year trends, by 2008 conventional software licenses will account for less than 10% of total revenues for all software companies, and less than 20% for the leading enterprise search vendors (Verity [now part of Autonomy], Autonomy, Convera). These trends have very high R2 values. Seven-fold or greater drops from a position of dominance suggest a sea change is taking place in the revenue mix for software companies and the expenditure mix for enterprises.

These trends can vary significantly by software company, as the comparison table below, which I constructed from recent SEC filings, shows:


| Company | License Revenue % |
|---|---|
| Red Hat | 0.0% |
| salesforce.com | 0.0% |
| i2 | 15.0% |
| Compuware | 23.5% |
| PeopleSoft | 23.7% |
| IBM | 24.6% |
| SAP | 31.4% |
| Oracle | 34.9% |
| INDUSTRY AVERAGE | 35.4% |
| Siebel | 36.4% |
| Business Objects | 51.1% |
| Microsoft | 76.5% |
| Adobe | 90.0% |

These values are derived from the most recent SEC filings (10-Ks or 10-Qs). The table shows that companies that can truly "package" shrink-wrapped software maintain the highest percentages of software license revenues; vendors that rely on the subscription model have the lowest percentages, often going to zero. Large, traditional software vendors such as IBM or Oracle are below the industry average for the percentage of revenue derived from software licenses. This trend is remarkable given that these larger vendors obtained 70-80% or more of their total software revenues from license fees a mere decade ago.

The abiding trend appears to be the shift from software to services, but the picture is considerably more complicated than that.

Other Software Licensing Studies

At least two comprehensive studies have been issued in the past year or so regarding software licensing trends. The first, from IDC, involved Delphi interviews of 100 large customers and 100 major software vendors and was conducted with the support of 11 major vendors and the Software and Information Industry Association (SIIA). [1] This study sees subscription licenses taking on increasing importance in vendor revenues.

This study shows that companies today budget 20% for maintenance contracts and 32% for new licenses. IDC projects that maintenance expenditures are likely to increase and license expenditures to decline. With an increased reliance on a subscription model, maintenance in fact increases as a source of revenue to the vendor. IDC projects 34% of revenue to come from subscriptions by 2008. The worldwide software market was about $200 billion in 2003 and will continue to grow, but with a changing mix of revenue sources. Besides subscription models, maintenance fees and consulting and service fees are projected to increase while standard license fees decrease. Vendor drivers for these trends include the long lead times of traditional enterprise software license sales and the need for more predictable revenue streams. Customer drivers are demands for lower overall costs, a better alignment of value, and the requirement for smaller upfront costs.

The second study, from Macrovision, used a questionnaire directed to a larger group of software executives and a similar number of customers. [4] This study, too, was conducted in association with the SIIA. Completed in late 2004, it also sees subscription licenses increasing: vendors reported a trend toward subscription licensing growing to 67% of license revenues, though customers exhibited more reluctance to embrace the subscription model.

Both studies showed maintenance fees to average 20-22% of initial software license fees.

What Changing Business Models are Emerging?

While the trend away from standard software licenses is clear, what that means in terms of winning next-generation business models is less clear. The software industry thus appears to be in flux, with a period of experimentation with alternative business models prevalent. It may be a year or three before it is clear which of these alternatives will emerge as the winning business model.

So, what are these alternatives?

  • Services – many large traditional vendors, including Novell, IBM, HP and Oracle, have seen massive percentage shifts from software licenses to consulting and services revenues.  This trend is linked both with related open source trends and the increasing need to engineer and deploy interoperable systems from multiple software vendors
  • Open Source – after steady trends to Linux in the late 1990’s and dominance of open source for Internet servers, most recently there has been an increase in open source applications and interoperable systems.  The general importance of open source trends is documented in many of the current and pending AI3 blog posts
  • Outsourcing – the outsourcing of many traditional IT and backoffice functions is a well-documented phenomenon, and
  • Subscription – though the earlier buzz for application service providers (ASP) has waned in the past two years, a similar model has emerged under the subscription or Web services monikers.  As noted above, there may be a doubling in importance of this revenue model in relation to traditional software licensing within this decade but customer enthusiasm is questionable.

The heyday for complete, turnkey enterprise software systems and the highwater mark for enterprise software budgets appear to have passed. Both customers and vendors are trying to bring more rationality and predictability into the IT software cost equation. The specific mix and nature of these changing models is still unclear.

Some Venture Implications

The major casualty from these trends is the idea of the enterprise "killer app" and its ability to become a virtual money printing press. The dominance of this myth can cause some significant mis-steps and misunderstandings in putting together a successful venture:

  • Waiting to get packaging and configuration right delays time to market and incurs higher development costs in the absence of supporting revenues. There is a need to get customer exposure and input earlier with less developed solutions.  Shattering the myth of the packaged software printing press for money is important to change attitudes and immediate priorities
  • VC support may be deferable with lower needs for upfront development dollars, and, in any case, venture support should shift from packaged "products" to interoperable and modular technologies 
  • As Eric von Hippel points out in his recent book, Democratizing Innovation [5], early and constant involvement of the customers and the market are keys to innovation and suggest business models that are more experimental and open source, and
  • It appears the days — at least for the foreseeable future — of the "killer app" are over in the enterprise setting. Companies and enterprises are demanding more accountability and justification for expenditures; software vendors are realizing that at the enterprise level "cookie cutter" approaches work relatively infrequently.

It is seductive to think that with the right packaging, the right interface, and the right combination of features and functionality, it then becomes possible to turn the crank on the money printing press. After development and packaging are complete, after all, the cost of the next incremental unit for shipment is close to nil. But enterprises rarely can adopt commodity approaches to unique situations and problems. Customization is the rule and the environment is never the same.

Understanding these secular trends is important for software entrepreneurs and the angels and VCs that may back them.  The common theme returns: choice of business model in response to market conditions is likely more important than technology or innovation.


[1] A.M. Konary, S. Graham, and L.A. Seymour, The Future of Software Licensing: Software Licensing Under Siege, IDC White Paper, International Data Corporation, March 2004, 21 pp. See http://www.idc.com/groups/software_licensing/downloads/4046_rev6_idc_site.pdf (requires registration).

[2] Culpepper & Associates, "Software Revenues Continue to Shift from Licenses to Services," September 10, 2002. See http://www.culpepper.com/eBulletin/2002/SeptermberRatiosArticle.asp

[3] M. Cusumano, "Business Models that Last: Balancing Products and Services in Software and Other Industries," MIT Sloan School of Management Working Paper 197, December 2003, 22 pp. See http://ebusiness.mit.edu/research/papers/197_Cusumano_ProdSrvcsBusMod.pdf

[4] Macrovision Corporation, Key Trends in Software Pricing and Licensing, White Paper for various clients, October 2004, 12 pp.  See http://www.siia.net/software/pubs/SW_Pricing_Licensing_Report.pdf.

[5] E. von Hippel, Democratizing Innovation, MIT Press, Cambridge, MA, 2005, 220 pp.  Electronic version available via Creative Commons license, see http://web.mit.edu/evhippel/www/democ.htm.

Posted: May 31, 2006

An incredibly fascinating visualization tool by Sala on the Aharef blog is called the Website as a Graph. This posting links to the actual entry site, where you can enter a Web address and the system provides a visual analysis of that individual Web page (not an overall view of the site). The color coding applied is as follows (a rough sketch of this tag bucketing appears after the list):

blue: for links (the A tag)
red: for tables (TABLE, TR and TD tags)
green: for the DIV tag
violet: for images (the IMG tag)
yellow: for forms (FORM, INPUT, TEXTAREA, SELECT and OPTION tags)
orange: for linebreaks and blockquotes (BR, P, and BLOCKQUOTE tags)
black: the HTML tag, the root node
gray: all other tags
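For the curious, here is a rough Python sketch (my own, not the applet's actual code; the sample HTML is invented) of the tag bucketing that the color legend above describes:

```python
# Walk a page's start tags and count them by the color buckets listed above.
from collections import Counter
from html.parser import HTMLParser

BUCKETS = {
    "blue (links)":    {"a"},
    "red (tables)":    {"table", "tr", "td"},
    "green (div)":     {"div"},
    "violet (images)": {"img"},
    "yellow (forms)":  {"form", "input", "textarea", "select", "option"},
    "orange (breaks)": {"br", "p", "blockquote"},
}

class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        bucket = next((name for name, tags in BUCKETS.items() if tag in tags), "gray (other)")
        self.counts[bucket] += 1

counter = TagCounter()
counter.feed("<html><body><div><a href='#'>x</a><img src='y.png'></div></body></html>")
print(counter.counts)
```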

Here is the figure that is created based on my blog site:

Here is the figure from the BrightPlanet Web site:

Here is the figure from a new BrightPlanet Web site design, not yet publicly released:

Here is the figure from the CompletePlanet Web site. Note it uses the Deep Query Manager Publisher module:

Here is the figure from BrightPlanet’s Web and graphics design firm, Paulsen Marketing Communications, which is also a founder of the company. PMC uses a Flash design that does not render well with the applet:

And, finally, here is the figure from the QueryHorse equine search portal, built with the DQM Publisher:

These graphs are mostly fun; "websitesasgraphs" is currently the number one tag on the Flickr site, where you can see hundreds of examples. The graphs do indicate whether sites depend on tables or div tags, their use of images, their complexity and the like. But, mostly, they are fun, and perhaps even art.

Posted: May 25, 2006

It is truly amazing — and very commonly overlooked — how much progress has been made in the past decade in overcoming what had then been perceived as close-to-intractable data interoperability and federation issues.

It is easy to forget the recent past.

In various stages and in various ways, computer scientists and forward-looking users have focused on many issues related to how to maximize the resources being poured into computer hardware and software and data collection and analysis over the past two decades. Twenty years ago, some of the buzz words were portability, data warehousing and microcomputers (personal computers). Ten years ago, some of those buzz words were client-server, networking and interoperability. Five years ago, the buzz words had shifted to dot-com and e-commerce and interoperability (now called ‘plug-and-play’). Today, among many, the buzz words could arguably include semantic Web, Web 2.0 and interoperability or mashups.

Of course, the choice of which buzz words to highlight is from the author’s perspective, and other buzz words could be argued as more important. That is not the point. Nor is the point that fads or buzz words come and go.

But changing buzz words and trends can indeed mask real underlying trends and progress. So the real point is this: don’t blink, because some truly amazing progress has been made in overcoming data federation and interoperability barriers in the last 15 to 20 years.

The ‘Data Federation’ Imperative

“Data federation”  — the important recognition that value could be unlocked by connecting information from multiple, separate data stores  — first became a research emphasis within the biology and computer science communities in the 1980s. It also gained visibility as “data warehousing” within enterprises by the early-90s. However, within that period, extreme diversity in physical hardware, operating systems, databases, software and immature networking protocols hampered the sharing of data. It is easy to overlook the massive strides in overcoming these prior obstacles in the past decade.

It is instructive to turn back the clock and think about what issues were preoccupying buyers, users and thinkers in IT twenty years ago. While the PC had come on the scene, with IBM opening the floodgates in 1982, there were mainframes from weird 36-bit Data General systems to DEC PDP minicomputers to the PCs themselves. Even on PCs, there were multiple operating systems, and many then claimed that CP/M was likely to be ascendant, let alone the upstart MS-DOS or the gorilla threat of OS/2 (in development). Hardware differences were all over the map, operating systems were a laundry list two pages long, and nothing worked with anything else. Computing in that era was an Island State.

So, computer scientists or users interested in “data federation” at that time needed to first look to issues at the iron or silicon or OS level. Those problems were pretty daunting, though clever folks behind Ethernet or Novell with PCs were about to show one route around the traffic jam.

Client-server and all of the “N-tier” speak soon followed, and it was sort of an era of progress but still costly and proprietary answers to get things to talk to one another. Yet there was beginning to emerge a rationality, at least at the enterprise level, for how to link resources together from the mainframe to the desktop. Computing in that era was the Nation-state.

But still, it was incredibly difficult to talk with other nations. And that is where the Internet, specifically the Web protocol and the Mozilla (then commercially Netscape) browser came in. Within five years (actually less) from 1994 the Internet took off like a rocket, doubling in size every 3-6 months.

Climbing the ‘Data Federation’ Pyramid

So, the view of the “data federation” challenge, as then articulated in different ways, looked like a huge, imposing pyramid 20 years ago:

Rapid Progress in Climbing the Data Federation Pyramid


Data federation and the resolution of various heterogeneities have many of their intellectual roots in the intersection of biology and computer science. Issues of interoperability and data federation were particularly topical about a decade ago, in papers such as those from Markowitz and Ritter,[1] Benton,[2] and Davidson and Buneman.[3] [4] Interestingly, this very same community was also the most active in positing the importance of (indeed, first defining) “semi-structured” data and innovating various interoperable data transfer protocols, including XML and its various progenitors and siblings.

These issues of data federation and data representation first arose and received serious computer science study in the late 1970s and early 1980s. In the early years of trying to find standards and conventions for representing semi-structured data (though it was not yet called that), the major emphasis was on data transfer protocols. In the financial realm, one standard dating from the late 1970s was electronic data interchange (EDI). In science, there were literally tens of exchange forms proposed with varying degrees of acceptance, notably abstract syntax notation (ASN.1), TeX (a typesetting system created by Donald Knuth) and its variants such as LaTeX, the hierarchical data format (HDF), the common data format (CDF), and the like, as well as commercial formats such as PostScript, PDF (portable document format), and RTF (rich text format).

One of these proposed standards was the “standard generalized markup language” (SGML), first published in 1986. SGML was flexible enough to represent either formatting or data exchange. However, with its flexibility came complexity. Only when two simpler forms arose, namely HTML (HyperText Markup Language) for describing Web pages and (much later) XML (eXtensible Markup Language) for data exchange, did variants of the SGML form emerge as widely used common standards.[5]

The Internet Lops Off the Pyramid

Of course, midway into these data representation efforts came the shift to the Internet Age, blowing away many previous notions and limits. The Internet, with its TCP/IP protocols, and the XML standard for “semi-structured” data transfer and representation, in particular, have been major contributors to overcoming the physical, syntactical and data exchange heterogeneities shown in the data federation pyramid above.

The first recorded mentions of “semi-structured data” occurred in two academic papers from Quass et al.[6] and Tresch et al.[7] in 1995. However, the real popularization of the term “semi-structured data” occurred through the seminal 1997 papers from Abiteboul, “Querying semi-structured data,” [8] and Buneman, “Semistructured data.” [9]

One could thus argue that the emergence of the “semi-structured data” construct arose from the confluence of a number of factors:

  • The emergence of the Web
  • The desire for extremely flexible formats for data exchange between disparate databases (and therefore useful for data federation)
  • The usefulness of expressing structured data in a semi-structured way for the purposes of browsing, and
  • The growth of certain scientific databases, especially in biology (esp., ACeDB), where annotations, attribute extensibility resulting from new discoveries, or a broader mix of structural and text data was desired.[9]

Semi-structured data, like all other data structures, needs to be represented, transferred, stored, manipulated and analyzed, all possibly at scale and with efficiency. It is easy to confuse data representation with data use and manipulation. XML provides an excellent starting basis for representing semi-structured data, but XML says little or nothing about these other challenges in semi-structured data use.

Thus, we see in the pyramid figure above that in rapid-fire order the Internet and the Web quickly overcame:

  • Federation challenges in hardware and OSes and network protocols; namely the entire platform and interconnection base to the pyramid and the heretofore daunting limitations to interoperability
  • A data representation protocol — solved via XML — that was originally designed for extensibility but became a ubiquitous standard for data transfer
  • A shift in attention from the physical to the metaphysical.

Shifting from the Structure to the Meaning

With these nasty issues of data representation and interconnection now behind us, it is not surprising that today’s buzz has shifted to things like the semantic Web, interoperability, Web 2.0, “social” computing, and the like.

Thus, today’s challenge is to resolve differences in meaning, or semantics, between disparate data sources. For example, your ‘glad’ may be someone else’s ‘happy’ and you may organize the world into countries while others organize by regions or cultures.

Resolving semantic heterogeneities is also called semantic mediation or data mediation. Though it displays as a small portion of the pyramid above, resolving semantics is a complicated task and may involve structural conflicts (such as naming, generalization, aggregation), domain conflicts (such as schema or units), data conflicts (such as synonyms or missing values) or language differences (human and electronic encodings). Researchers have identified nearly 40 discrete possible types of semantic heterogeneities (this area is discussed in a later post).

Ontologies provide a means to define and describe these different “world views.” Referentially integral languages such as RDF (Resource Description Framework) and its schema implementation (RDF-S) or the Web ontological description language (OWL) are leading standards among other emerging ones for machine-readable means to communicate the semantics of data. These standards are being embraced by various communities of practice; today, for example, there are more than 15,000 OWL ontologies. Life sciences, physics, pharmaceuticals and the intelligence sector are notable leading communities.

The challenge of semantic mediation at scale thus requires recognition and adherence to the emerging RDF-S and OWL standards, plus an underlying data management foundation that can handle the subject-object-predicate triples basis of RDF.

Yet, as the pyramid shows, despite massive progress in scaling it, challenges remain even beyond the daunting ones in semantics. Matching alternative schemas (or ontologies or “world views”) will require much in the way of new rules and software. And, vexingly, at least in the open Internet environment, there will always be the issue of what data you can trust and with what authority.

These are topics for subsequent posts regarding the semantic Web ‘stack’ and challenges in resolving semantic heterogeneities.


[1] V.M. Markowitz and O. Ritter, “Characterizing Heterogeneous Molecular Biology Database Systems,” in Journal of Computational Biology 2(4): 547-546, 1995.

[2] D. Benton, “Integrated Access to Genomic and Other Bioinformation: An Essential Ingredient of the Drug Discovery Process,” in SAR and QSAR in Environmental Research 8: 121-155, 1998.

[3] S.B. Davidson, C. Overton, and P. Buneman, “Challenges in Integrating Biological Data Sources,” in Journal of Computational Biology 2(4): 557-572, 1995.

[4] S.B. Davidson, G.C. Overton, V. Tannen, and L. Wong, “BioKleisli: A Digital Library for Biomedical Researchers,” in International Journal on Digital Libraries 1: 36-53, 1997.

[5] A common distinction is to call HTML “human readable” while XML is “machine readable” data.

[6] D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman and J. Widom, “Querying Semistructured Heterogeneous Information,” presented at Deductive and Object-Oriented Databases (DOOD ’95), LNCS, No. 1013, pp. 319-344, Springer, 1995.

[7] M. Tresch, N. Palmer, and A. Luniewski, “Type Classification of Semi-structured Data,” in Proceedings of the International Conference on Very Large Data Bases (VLDB), 1995.

[8] Serge Abiteboul, “Querying Semi-structured data,” in International Conference on Data Base Theory (ICDT), pp. 1-18, Delphi, Greece, 1997. See http://dbpubs.stanford.edu:8090/pub/1996-19.

[9] Peter Buneman, “Semistructured Data,” in ACM Symposium on Principles of Database Systems (PODS), pp. 117-121, Tucson, Arizona, May 1997. See http://db.cis.upenn.edu/DL/97/Tutorial-Peter/tutorial-semi-pods.ps.gz.
