An incredibly fascinating visualization tool by Sala on the Aharef blog is called the Website as a Graph. This posting links to the actual entry site, where you can enter a Web address and the system provides a visual analysis of that individual Web page (not an overall view of the site). The color coding applied is as follows (a rough code sketch of the mapping appears after the list):
blue: for links (the A tag)
red: for tables (TABLE, TR and TD tags)
green: for the DIV tag
violet: for images (the IMG tag)
yellow: for forms (FORM, INPUT, TEXTAREA, SELECT and OPTION tags)
orange: for linebreaks and blockquotes (BR, P, and BLOCKQUOTE tags)
black: the HTML tag, the root node
gray: all other tags
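For the curious, here is a rough sketch in Python of how such a tag-to-color tally might be computed for a page. The groupings simply mirror the legend above; the parser choice, class name and sample markup are my own illustration, not the applet's actual code.

```python
# A minimal sketch of the tag-to-color scheme described above, assuming the
# applet simply walks the HTML elements and colors each node by tag name.
from html.parser import HTMLParser

TAG_COLORS = {
    "a": "blue",
    "table": "red", "tr": "red", "td": "red",
    "div": "green",
    "img": "violet",
    "form": "yellow", "input": "yellow", "textarea": "yellow",
    "select": "yellow", "option": "yellow",
    "br": "orange", "p": "orange", "blockquote": "orange",
    "html": "black",
}

class TagColorCounter(HTMLParser):
    """Tallies how many graph nodes of each color a page would produce."""
    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        color = TAG_COLORS.get(tag, "gray")  # all other tags are gray
        self.counts[color] = self.counts.get(color, 0) + 1

# Example: feed it some markup and inspect the color tallies.
parser = TagColorCounter()
parser.feed("<html><body><div><p>Hi <a href='#'>link</a></p><img src='x.png'></div></body></html>")
print(parser.counts)
```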
Here is the figure that is created based on my blog site:
Here is the figure from the BrightPlanet Web site:
Here is the figure from a new BrightPlanet Web site design, not yet publicly released:
Here is the figure from BrightPlanet’s Web and graphics design firm, Paulsen Marketing Communications, which is also a founder of the company. PMC uses a Flash design that does not render well with the applet:
And, finally, here is the figure from the QueryHorse equine search portal, built with the DQM Publisher:
These graphs are mostly fun; "websitesasgraphs" is currently the number one tag on the Flickr site, where you can see hundreds of examples. The graphs do indicate whether sites depend on tables or DIV tags, their use of images, their complexity and the like. But, mostly, they are fun, and perhaps even art.
It is truly amazing — and very commonly overlooked — to see how much progress has been made in the past decade in overcoming what had been perceived as close-to-intractable data interoperability and federation issues.
It is easy to forget the recent past.
In various stages and in various ways, computer scientists and forward-looking users have focused over the past two decades on how to maximize the resources being poured into computer hardware, software, and data collection and analysis. Twenty years ago, some of the buzz words were portability, data warehousing and microcomputers (personal computers). Ten years ago, some of the buzz words were client-server, networking and interoperability. Five years ago, the buzz words had shifted to dot-com, e-commerce and interoperability (now called ‘plug-and-play’). Today, among many, the buzz words could arguably include the semantic Web, Web 2.0 and interoperability or mashups.
Of course, the choice of which buzz words to highlight is from the author’s perspective, and other buzz words could be argued as more important. That is not the point. Nor is the point that fads or buzz words come and go.
But changing buzz words and trends can indeed mask real underlying trends and progress. So the real point is this: Don’t blink, because some truly amazing progress has been made in overcoming data federation and interoperability barriers in the last 15 to 20 years.
The ‘Data Federation’ Imperative
“Data federation” — the important recognition that value could be unlocked by connecting information from multiple, separate data stores — first became a research emphasis within the biology and computer science communities in the 1980s. It also gained visibility as “data warehousing” within enterprises by the early 1990s. Within that period, however, extreme diversity in physical hardware, operating systems, databases and software, plus immature networking protocols, hampered the sharing of data. It is easy to overlook the massive strides made in overcoming these obstacles in the past decade.
It is instructive to turn back the clock and think about what issues were preoccupying buyers, users and thinkers in IT twenty years ago. While the PC had come on the scene, with IBM opening the floodgates in 1982, hardware ranged from mainframes and oddball 36-bit architectures to Data General and DEC PDP minicomputers to the PCs themselves. Even on PCs there were multiple operating systems, and many then claimed that CP/M was likely to be ascendant, let alone the upstart MS-DOS or the gorilla threat of OS/2 (then in development). Hardware differences were all over the map, operating systems were a laundry list two pages long, and nothing worked with anything else. Computing in that era was an Island State.
So, computer scientists or users interested in “data federation” at that time needed to first look to issues at the iron or silicon or OS level. Those problems were pretty daunting, though clever folks behind Ethernet or Novell with PCs were about to show one route around the traffic jam.
Client-server and all of the “N-tier” talk soon followed, an era of progress but one of still-costly, proprietary answers for getting things to talk to one another. Yet a rationality was beginning to emerge, at least at the enterprise level, for how to link resources together from the mainframe to the desktop. Computing in that era was the Nation-state.
But still, it was incredibly difficult to talk with other nations. And that is where the Internet, specifically the Web protocol and the Mozilla (then commercially Netscape) browser, came in. Within five years (actually less) of 1994, the Internet took off like a rocket, doubling in size every three to six months.
Climbing the ‘Data Federation’ Pyramid
So, the view of the “data federation” challenge, as then articulated in different ways, looked like a huge, imposing pyramid 20 years ago:
Again, it is truly amazing — and very commonly overlooked — to see how much progress has been made in the past decade in overcoming what had been perceived, a mere decade or two ago, as close-to-intractable data interoperability and federation issues.
Data federation and the resolution of various heterogeneities have many of their intellectual roots in the intersection of biology and computer science. Issues of interoperability and data federation were particularly topical about a decade ago, in papers such as those from Markowitz and Ritter, Benton, and Davidson and Buneman. Interestingly, this very same community was also the most active in positing the importance of (indeed, first defining) “semi-structured” data and in innovating various interoperable data transfer protocols, including XML and its various progenitors and siblings.
These issues of data federation and data representation first arose and received serious computer science study in the late 1970s and early 1980s. In the early years of trying to find standards and conventions for representing semi-structured data (though it was not yet called that), the major emphasis was on data transfer protocols. In the financial realm, one standard dating from the late 1970s was electronic data interchange (EDI). In science, there were literally tens of exchange forms proposed with varying degrees of acceptance, notably abstract syntax notation (ASN.1), TeX (a typesetting system created by Donald Knuth) and its variants such as LaTeX, the hierarchical data format (HDF), the common data format (CDF), and the like, as well as commercial formats such as PostScript, PDF (portable document format) and RTF (rich text format).
One of these proposed standards was the “standard generalized markup language” (SGML), first published in 1986. SGML was flexible enough to represent either formatting or data exchange. However, with its flexibility came complexity. Only when two simpler forms arose, namely HTML (HyperText Markup Language) for describing Web pages and (much later) XML (eXtensible Markup Language) for data exchange, did variants of the SGML form emerge as widely used common standards.
The Internet Lops Off the Pyramid
Of course, midway into these data representation efforts came the shift to the Internet Age, blowing away many previous notions and limits. The Internet, with its TCP/IP protocols, and the XML standards for “semi-structured” data transfer and representation, in particular, have been major contributors to overcoming the physical, syntactical and data exchange heterogeneities shown in the data federation pyramid above.
The first recorded mentions of “semi-structured data” occurred in two academic papers from Quass et al. and Tresch et al. in 1995. However, the real popularization of the term “semi-structured data” occurred through the seminal 1997 papers from Abiteboul, “Querying Semi-structured Data,” and Buneman, “Semistructured Data.”
One could thus argue that the emergence of the “semi-structured data” construct arose from the confluence of a number of factors:
Semi-structured data, like all other data structures, needs to be represented, transferred, stored, manipulated and analyzed, possibly at scale and with efficiency. It is easy to confuse data representation with data use and manipulation. XML provides an excellent starting basis for representing semi-structured data, but XML says little or nothing about these other challenges in semi-structured data use.
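To make the distinction concrete, here is a small illustration: XML readily represents semi-structured records whose fields vary from one entry to the next, but querying, indexing or merging that data is a separate problem that XML itself does not address. The element names and sample data below are invented for illustration.

```python
# Two "person" records with different fields -- the essence of
# semi-structured data. XML represents them easily; what you then do with
# them (query, join, index) is a separate, harder problem.
import xml.etree.ElementTree as ET

doc = """
<contacts>
  <person><name>Ada</name><email>ada@example.org</email></person>
  <person><name>Grace</name><phone>555-0100</phone><title>RADM</title></person>
</contacts>
"""

root = ET.fromstring(doc)
for person in root.findall("person"):
    # Each record may or may not carry a given field.
    fields = {child.tag: child.text for child in person}
    print(fields)
```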
Thus, we see in the pyramid figure above that in rapid-fire order the Internet and the Web quickly overcame:
Shifting from the Structure to the Meaning
With these nasty issues of data representation and interconnection now behind us, it is not surprising that today’s buzz has shifted to things like the semantic Web, interoperability, Web 2.0, “social” computing, and the like.
Thus, today’s challenge is to resolve differences in meaning, or semantics, between disparate data sources. For example, your ‘glad’ may be someone else’s ‘happy’ and you may organize the world into countries while others organize by regions or cultures.
Resolving semantic heterogeneities is also called semantic mediation or data mediation. Though it displays as a small portion of the pyramid above, resolving semantics is a complicated task and may involve structural conflicts (such as naming, generalization, aggregation), domain conflicts (such as schema or units), data conflicts (such as synonyms or missing values) or language differences (human and electronic encodings). Researchers have identified nearly 40 discrete possible types of semantic heterogeneities (this area is discussed in a later post).
Ontologies provide a means to define and describe these different “world views.” Referentially integral languages such as RDF (Resource Description Framework), its schema implementation (RDF-S) and the Web Ontology Language (OWL) are the leading standards, among other emerging ones, for communicating the semantics of data in machine-readable form. These standards are being embraced by various communities of practice; today, for example, there are more than 15,000 OWL ontologies. Life sciences, physics, pharmaceuticals and the intelligence sector are notable leading communities.
The challenge of semantic mediation at scale thus requires recognition of and adherence to the emerging RDF-S and OWL standards, plus an underlying data management foundation that can handle the subject-predicate-object triples basis of RDF.
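As a minimal sketch of what that looks like in practice, the snippet below uses the rdflib library to assert a few subject-predicate-object triples, including an OWL equivalence that mediates between the ‘glad’ and ‘happy’ vocabularies from the example above. The namespaces and terms are invented for illustration and are not part of any published ontology.

```python
# Subject-predicate-object triples plus a simple OWL-based mediation,
# expressed with rdflib. The ex1/ex2 vocabularies are hypothetical.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import OWL

EX1 = Namespace("http://example.org/vocab1#")
EX2 = Namespace("http://example.org/vocab2#")

g = Graph()
# One source describes a mood as "glad" ...
g.add((EX1.alice, EX1.mood, Literal("glad")))
# ... while a second vocabulary uses a "happy" property for the same notion.
# An OWL assertion records that the two properties mean the same thing:
g.add((EX1.mood, OWL.equivalentProperty, EX2.happy))

print(g.serialize(format="turtle"))
```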
Yet, as the pyramid shows, despite massive progress in scaling it, challenges remain even beyond the daunting ones in semantics. Matching alternative schemas (or ontologies or “world views”) will require much in the way of new rules and software. And, vexingly, at least in the open Internet environment, there will always be the issue of what data you can trust and with what authority.
D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman and J. Widom, “Querying Semistructured Heterogeneous Information,” in Deductive and Object-Oriented Databases (DOOD ’95), LNCS No. 1013, pp. 319-344, Springer, 1995.
Serge Abiteboul, “Querying Semi-structured Data,” in International Conference on Database Theory (ICDT), pp. 1-18, Delphi, Greece, 1997. See http://dbpubs.stanford.edu:8090/pub/1996-19.
Peter Buneman, “Semistructured Data,” in ACM Symposium on Principles of Database Systems (PODS), pp. 117-121, Tucson, Arizona, May 1997. See http://db.cis.upenn.edu/DL/97/Tutorial-Peter/tutorial-semi-pods.ps.gz.
Katie Portwin, one of the Ingenta developers whose Jena paper stimulated my recent posting on semantic Web scalability, has expanded on the scalability theme in interesting ways in her recent “performance, triplestores, and going round in circles..” post.
In her post, Katie asks rhetorically, “Can industrial scale triplestores be made to perform? Is breaking the ‘triple table’ model the answer?” She then goes on to note that, in a related XTech paper, the Ingenta team showed that even a simple, bread-and-butter sample query takes 1.5 seconds on a 200-million-triple store. The post also contains interesting links to other speakers at the Jena User’s Conference last week, including clever ways to cluster triples in an RDBMS.
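The Ingenta paper’s actual query is not reproduced here, but a hypothetical bread-and-butter lookup of the kind being discussed (find the titles of everything a given author contributed to) might look like the following SPARQL, run here through rdflib. The Dublin Core properties are real; the data file, author value and query itself are assumptions for illustration.

```python
# A hypothetical "bread and butter" SPARQL query over article metadata,
# illustrating the class of query discussed above. The articles.rdf file
# is an assumed local RDF dump, not Ingenta's data.
from rdflib import Graph

g = Graph()
g.parse("articles.rdf")  # swap in your own RDF data

query = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title
WHERE {
  ?article dc:creator "K. Portwin" .
  ?article dc:title   ?title .
}
"""

for row in g.query(query):
    print(row.title)
```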
I asked Tom Tiahrt, BrightPlanet’s chief scientist and lead developer on our text and semantic engines, to review this post and give me his thoughts. Here are his comments:
I always like to see this: "re-modelling" or "modelling" instead of "modeling" because I abhor human-induced language entropy. Kudos to Katie Portwin (KP) for that alone.
Kevin Wilkinson (KW) defines a triple store as a three-column table in a relational system. This is unfortunate because a triple-store is not exclusive to RDB systems. It must be provided by any RDF system as part of its logical design, even if it does not use it for its physical design.
KW's patterns-identification aspect is likely true in many instances, and his 'breaking' the clean RDF format is what DBAs and RDB developers always do to improve performance (denormalizing the database). KP points out the problem with this, viz., that you must maintain a more complex schema, and duplicates raise data retrieval issues (though they are tractable). Moreover, KP writes "The great thing about the triplestore is that we don’t have to bake assumptions about the data into the database – we can have as many whatevers as we like."
The point is that to achieve acceptable performance you cannot simply rely on the triple store alone. At the same time, RDF requires triples, and to prevent assumption baking the user should not have to decide how to denormalize the triple-store. In addition, the transitive closure computation is the onerous query that the RDBMS cannot do within a reasonable amount of time.
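To make both points concrete, here is a small sketch: a triple store as a bare three-column relational table, and a transitive-closure query (following rdfs:subClassOf links to arbitrary depth) of the sort that becomes onerous for a plain RDBMS at scale. SQLite and the toy class hierarchy are stand-ins for illustration.

```python
# A triple store as a three-column relational table, plus a recursive query
# computing the transitive closure of subClassOf for one starting class.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:Poodle", "rdfs:subClassOf", "ex:Dog"),
        ("ex:Dog",    "rdfs:subClassOf", "ex:Mammal"),
        ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),
    ],
)

# Transitive closure of subClassOf: every ancestor class of ex:Poodle,
# however deep the chain. Open-ended recursion like this is what strains
# a generic RDBMS on hundreds of millions of triples.
closure = conn.execute("""
    WITH RECURSIVE ancestors(cls) AS (
        SELECT object FROM triples
         WHERE subject = 'ex:Poodle' AND predicate = 'rdfs:subClassOf'
        UNION
        SELECT t.object FROM triples t JOIN ancestors a ON t.subject = a.cls
         WHERE t.predicate = 'rdfs:subClassOf'
    )
    SELECT cls FROM ancestors
""").fetchall()
print(closure)  # [('ex:Dog',), ('ex:Mammal',), ('ex:Animal',)]
```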
Here are the parameters of the great problem. Static assumptions about what will happen directly oppose what RDF is supposed to provide. Open-ended dynamic processing cannot perform well enough to solve the problem.
Katie Portwin also points out that re-modelling is a real problem as well when the system is hosted by an RDBMS, though the triple stores can remain intact.
I’ll keep monitoring this topic and post other interesting perspectives on RDF, triple-store and semantic Web scalability as I encounter them.
Three related shifts in data use and management are intersecting to create a unique market opportunity. This opportunity represents a generational change from an era of structured data in stove-piped systems managed by relational data systems to one of semi-structured data in hugely scaled, interconnected and interoperable networks. How this next-generation system is to be managed is still unclear, and the answer represents the major market opportunity.
The specific shifts driving this change are:
And, of course, all of this is occurring within the context of explosive data access and growth, literally at Internet scales.
This opportunity is first presenting itself through the leadership and support of the federal intelligence, defense and homeland security agencies. Certain industries, notably financial services and pharmaceuticals, are the next likely to show enterprise applicability, and specific academic domains, particularly in biology, are also blazing this trail. The emerging leaders in this nascent market opportunity will likely be determined within the next two to three years.
I’ve apparently gotten into online video feeds and other supplements to the standard written word. For an interesting series of interviews with on-the-edge entrepreneurs, I recommend the Venture Voice podcasts.