| S | M | T | W | T | F | S |
|---|---|---|---|---|---|---|
| « Feb | ||||||
| 1 | 2 | |||||
| 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| 31 | ||||||
Nova Spivack has announced the imminent release of a semantic Web tool for the desktop, Open Iris:
Following in the footsteps of Douglas Engelbart’s pioneering work, SRI has announced the upcoming open-source (LGPL) release of Open IRIS — an experimental Semantic Web personal information manager that runs on the desktop. IRIS was developed for the DARPA CALO project and makes use of code libraries and ontology components developed at SRI, and my own startup, Radar Networks, as well as other participating research organizations.
IRIS is designed to help users make better sense of their information. It can run on it’s own, or can be connected to the CALO system which provides advanced machine learning capabilities to it. I am very proud to see IRIS go open source — I think it has potential to become a major platform for learning applications on the desktop.
IRIS is still in its early stages of evolution, and much work will be done this year to add further functionality, improve the GUI and make IRIS even more user-friendly. But already it is perhaps the most sophisticated and comprehensive semantic desktop PIM ever created. If you would like to read more about IRIS, this paper provides a good overview.
The first recorded mentions of “semi-structured data” occurred in two academic papers from Quass et al.[1] and Tresch et al.[2] in 1995. However, the real popularization of the term “semi-strucutred data” occurred through the seminal 1997 papers from Abiteboul, “Querying semi-structured data,” [3] and Buneman, “Semistructured data.” [4] Of course, semi-structured data had existed well before this time, only it had not been named as such.
What is Semi-structured Data?
Peter Wood, a professor of computer science at Birkbeck College at the University of London, provides succinct definitions of the “structure” of various types of data:[5]
Unlike structured or unstructured data, there is no accepted database engine specific to semi-structured data. Some systems attempt to use relational DBMS approaches from the structured end of the spectrum; other systems attempt to add some structure to standard unstructured search engines. (This topic is discussed in a later section.)

Semi-structured data models are sometimes called “self-describing” (or schema-less). These data models are often represented as labelled graphs, or sometimes labelled trees with the data stored at the leaves. The schema information is contained in the edge labels of the graph. Semi-structured representations also lend themselves well to data exchange or the integration of heterogeneous data sources.
A nice description by David Loshin[6] on Simple Semi-structured Data notes that structured data can be easily modeled, organized, formed and formatted in ways that are easy for us to manipulate and manage. In contrast, though we are all familiar with the unstructured text in documents, such as articles, slide presentations or the message components of emails, its lack of structure prevents the advantages of structured data management. Loshin goes on to describe the intermediate nature of semi-structured data:
There [are] sets of data in which there is some implicit structure that is generally followed, but not enough of a regular structure to “qualify” for the kinds of management and automation usually applied to structured data. We are bombarded by semi-structured data on a daily basis, both in technical and non-technical environments. For example, web pages follow certain typical forms, and content embedded within HTML often have some degree of metadata within the tags. This automatically implies certain details about the data being presented. A non-technical example would be traffic signs posted along highways. While different areas use their own local protocols, you will probably figure out which exit is yours after reviewing a few highway signs.
This is what makes semi-structured data interesting–while there is no strict formatting rule, there is enough regularity that some interesting information can be extracted. Often, the interesting knowledge involves entity identification and entity relationships. For example, consider this piece of semi-structured text (adapted from a real example):
John A. Smith of Salem, MA died Friday at Deaconess Medical Center in Boston after a bout with cancer. He was 67.
Born in Revere, he was raised and educated in Salem, MA. He was a member of St. Mary’s Church in Salem, MA, and is survived by his wife, Jane N., and two children, John A., Jr., and Lily C., both of Winchester, MA.
A memorial service will be held at 10:00 AM at St. Mary’s Church in Salem.
This death notice contains a great deal of information–names of people, names of places, relationships between people, affiliations between people and places, affiliations between people and organizations and timing of events related to those people. Realize that not only is this death notice much like others from the same newspaper, but that it is reasonably similar to death notices in any newspaper in the US.
Note in Loshin’s example that the “structure” added to the unstructured text (shown in yellow; my emphasis) to make this “semi-structured” data arises from adding informational attributes that further elaborate or describe the document. These attributes can be automatically found using “entity extraction” tools or similar information extraction (IE) techniques, or manually identified. [7] These attributes can be assigned to pre-defined record types for manipulation separate from a full-text seach of the document text. Generally, when such attributes are added to the core unstructured data it is done through “metatags” that a parser can structurally recognize, such as by using the common open and close angle brackets. For example:
<author=John Smith>
In semi-structured HTML, the tags that provide the semi-structure serve a different purpose in terms of either formatting instructions to a browser or providing reference links to internal anchors or external documents or pages. Note that HTML also uses the open and close angle brackets as the convention to convey the structural information in the document.
The Birth of the Semi-structured Data Construct
One could argue that the emergence of the “semi-structured data” construct arose from the confluence of a number of factors:
These issues first arose and received serious computer science study in the late 1970s and early 1980s. In the early years of trying to find standards and conventions for representing semi-structured data (though not yet called that), the major emphasis was on data transfer protocols.
In the financial realm, one proposed standard was electronic data interchange (EDI). In science, there were literally tens of exchange forms proposed with varying degrees of acceptance, notably abstract syntax notation (ASN.1), TeX (a typesetting system created by Donald Knuth and its variants such as LaTeX), hierarchical data format (HDF), CDF (common data format), and the like, as well as commercial formats such as Postscript, PDF (portable document format), RTF (rich text format), and the like.
One of these proposed standards was the “standard generalized markup language” (SGML), first published in 1986. SGML was flexible enough to represent either formatting or data exchange. However, with its flexibility came complexity. Only when two simpler forms arose, namely HTML (HyperText Markup Language) for describing Web pages and XML (eXtensible Markup Language) for data exchange, did variants of the SGML form emerge as widely used common standards.[9]
The XML standard was first published by the W3C in February 1998, rather late in this history and after the semi-structured data term had achieved some impact.[10] Dan Suciu was the first to publish on the linkage of XML to semi-structured data in 1998,[11] a reference that remains worth reading to this day.
In addition, the OEM (Object Exchange Model) has become the de facto model for semi-structured data. OEM is a graph-based, self-describing object instance model. It was originally introduced for the Tsimmis data integration project,[12] and provides the intellectual basis for object representation in a graph structure with objects either being atomic or complex.
How the attribute “metadata” is described and associated has itself been the attention of much standards work. Truly hundreds of description standards have been proposed from specific instances in medical terminology such as MESH to law to physics to engineering and to cross-discipline proposed standards such as the Dublin core. (Google these for a myriad of references.)
Challenges in Semi-structured Data
Semi-structured data, as for all other data structures, needs to be represented, transferred, stored, manipulated or analyzed, all possibly at scale and with efficiency. It is often easy to confuse data representation from data use and manipulation. XML provides an excellent starting basis for representing semi-structured data. But XML says little or nothing about these other challenges in semi-structured data use:
Generally, most academic, open source, or other attention to these problems has been at the superficial level of resolving schema or definitions or units. Totally lacking in the entire thrust for a semi-structured data paradigm has been the creation of adequate processing engines for effiicient and scalable storage and retrieval of semi-structured data. [13]
You know, it is very strange. Tremendous effort goes into data representations like XML, but when it comes to positing or designing engines for manipulating that data the approach is to clone kludgy workarounds on to existing relational DBMSs or text search engines. Neither meet the test.
Thus, as the semantic Web and its association to semi-structured data looks forward, two impediments stand like gatekeepers blocking forward progress: 1) efficient processing engines and 2) scalable systems and architectures.
[2] M. Tresch, N. Palmer, and A. Luniewski, “Type Classification of Semi-structured Data,” in Proceedings of the International Conference on Very Large Data Bases (VLDB), 1995.
[3] Serge Abiteboul, “Querying Semi-structured data,” in International Conference on Data Base Theory (ICDT), pp. 1–18, Delphi, Greece, 1997. See http://dbpubs.stanford.edu:8090/pub/1996-19.
[4] Peter Buneman, “Semistructured Data,” in ACM Symposium on Principles of Database Systems (PODS), pp. 117–121, Tucson, Arizona, May 1997. See http://db.cis.upenn.edu/DL/97/Tutorial-Peter/tutorial-semi-pods.ps.gz
[5] Peter Wood, School of Computer Science and Information Systems, Birkbeck College, the University of London. See http://www.dcs.bbk.ac.uk/~ptw/teaching/ssd/toc.html.
[6] David Loshin, “Simple Semi-structured Data,” Business Intelligence Network, October 17, 2005. See http://www.b-eye-network.com/view/1761.
[7] This example is actually quite complex and demonstrates the challenges facing “entity extraction” software. Extracted entities most often relate to the nouns or “things” within a document. Note also, for example, how many of the entities involve internal “co-referencing,” or the relation of subjects such as “he” to times such as “10 a.m” to specific dates. A good entity extraction engine helps resolve these so-called “within document co-references.”
[8] Peter Buneman, “Semistructured Data,” in ACM Symposium on Principles of Database Systems (PODS), pp. 117–121, Tucson, Arizona, May 1997. See http://db.cis.upenn.edu/DL/97/Tutorial-Peter/tutorial-semi-pods.ps.gz
[9] A common distinction is to call HTML “human readable” while XML is “machine readable” data.
[10] W3C, XML Development History. See http://www.w3.org/XML/hist2002.
[11] Dan Suciu, “Semistructured Data and XML,” in International Conference on Foundations of Data Organization (FODO), Kobe, Japan, November 1998. See PDF option from http://citeseer.ist.psu.edu/suciu98semistructured.html.
[12] Y. Papakonstantinou, H. Garcia-Molina and J. Widom, “Object Exchange Across Heterogeneous Information Sources,” in IEEE International Conference on Data Engineering, pp. 251-260, March 1995.
[13] Matteo Magnani and Danilo Montesi, “A Unified Approach to Structured, Semistructured and Unstructured Data,” Technical Report UBLCS-2004-9, Department of Computer Science, University of Bologna, 29 pp., May 29, 2004. See http://www.cs.unibo.it/pub/TR/UBLCS/2004/2004-09.pdf.
| NOTE: This posting is part of an occasional series looking at a new category that I and BrightPlanet are terming the eXtensible Semi-structured Data Model (XSDM). Topics in this series cover all information related to extensible data models and engines applicable to documents, metadata, attributes, semi-structured data, or the processing, storing and indexing of XML, RDF, OWL, or SKOS data. A major white paper will be produced at the conclusion of the series. Stay tuned! |
Naveen Balani has recently published a good intro primer on the Semantic Web through IBM’s Developer Works entitled, "The Future of the Web is Semantic". Highly recommended.
Also, a recent paper on information retrieval and the semantic web was selected as the best paper in the 2005 ICSS mini-track on The Semantic Web: The Goal of Web Intelligence. The paper, by Tim Finin, James Mayfield, Clay Fink, Anupam Joshi, and R. Scott Cost, Information Retrieval and the Semantic Web, Proceedings of the 38th International Conference on System Sciences, January 2005, is also available as a PDF download. I recommend this to anyone interested in the application of Semantic Web concepts to traditional Internet search engines.
From Broadcast Newsroom, BBN Technologies just released version 2.0 of its AVOKE STX speech-to-text software. According to BBN, this new version improves the relevance of multimedia search results by transforming audio into searchable text with unprecedented accuracy. Applications include enterprise search, business and government intelligence, consumer search, audio mining, video search, broadcast monitoring, and multimedia asset management.
BBN says AVOKE STX 2.0 separates speech from non-speech, such as music or laughter, and then processes the speech to identify additional characteristics. This information is captured, tagged with metadata, and indexed in an XML format for use by standard search engines or technology. Because each word in the metadata is time-stamped, users can navigate easily to any point in the transcript, listen to the original audio, or watch the corresponding video.
BBN’s legacy extends to playing a key role in pioneering the development of the ARPANET, the forerunner of the Internet. BBN supports both commercial and government clients. Its AVOKE speech technology translates Arabic and Chinese with additional foreign languages planned.
I’ve just finished reading a fascinating 228 pp transcript on the topic of peer production and open data architectures. This discussion, the first of the so-called Union Square Sessions, involved more than 40 prominent Web and other thinkers with a heavy sprinkling of VCs.
That being said, I was disappointed that neither interoperability nor extensibility directly entered into any of the discussions. I suspect this may be due to conjoining the important singular topic of open data architectures with the lens of peer production or social networks. For me, the quote closest to my interests among this disparate group was from Dick Costolo, who stated “the bottom line implication is that an open data architecture will be one that is purely API based and not destination based.”
Nonetheless, this is an interesting start. I’d like to humbly suggest open and extensible data architectures (including, importantly, database engines in addition to extensible exchange formats such as XML) for a future discussion topic.
Here is the link to the Union Square Sessions Transcript.