<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>AI3:::Adaptive Information &#187; Deep Web</title>
	<atom:link href="http://www.mkbergman.com/category/deep-web/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mkbergman.com</link>
	<description>Mike Bergman on the semantic Web and structured Web</description>
	<lastBuildDate>Mon, 26 Jul 2010 05:31:20 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Massive Muscle on the ABox at Google</title>
		<link>http://www.mkbergman.com/481/massive-muscle-on-the-abox-at-google/</link>
		<comments>http://www.mkbergman.com/481/massive-muscle-on-the-abox-at-google/#comments</comments>
		<pubDate>Fri, 27 Mar 2009 16:08:08 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Adaptive Information]]></category>
		<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Description Logics]]></category>
		<category><![CDATA[Searching]]></category>
		<category><![CDATA[Structured Web]]></category>
		<category><![CDATA[ABox]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[information extraction]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=481</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Massive Muscle on the ABox at Google&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Description Logics&amp;rft.subject=Searching&amp;rft.subject=Structured Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2009-03-27&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/481/massive-muscle-on-the-abox-at-google/&amp;rft.language=English"></span>

The Recent &#8216;The Unreasonable Effectiveness of Data&#8217; Provides Important Hints
To even the most casual Web searcher, it must now be evident that Google is constantly introducing new structure into its search results.  This past week three world-class computer scientists, all now research directors or scientists at Google, Alon Halevy, Peter Norvig and Fernando Pereira, [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Massive Muscle on the ABox at Google&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Description Logics&amp;rft.subject=Searching&amp;rft.subject=Structured Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2009-03-27&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/481/massive-muscle-on-the-abox-at-google/&amp;rft.language=English"></span>
<p><img style="width: 200px; height: 70px; float: left;" title="Google logo" src="../wp-content/themes/ai3/images/2009Posts/090327_google_logo200.png" alt="Google logo" /></p>
<h2>The Recent &#8216;<span style="font-style: italic">The Unreasonable Effectiveness of Data</span>&#8217; Provides Important Hints</h2>
<p>To even the most casual Web searcher, it must now be evident that <a href="http://www.google.com/">Google</a> is constantly introducing new structure into its search results.  This past week three world-class computer scientists, all now research directors or scientists at Google, <a href="http://www.cs.washington.edu/homes/alon/">Alon Halevy</a>, <a href="http://norvig.com/">Peter Norvig</a> and <a href="http://www.cis.upenn.edu/~pereira/">Fernando Pereira</a>, published an opinion piece in the March/April 2009 issue of <a style="font-style: italic" href="http://www2.computer.org/portal/web/csdl/abs/mags/ex/2009/02/mex200902toc.htm">IEEE Intelligent Systems</a> titled <a href="http://www.computer.org/portal/cms_docs_intelligent/intelligent/homepage/2009/x2exp.pdf">&#8216;The Unreasonable Effectiveness of Data.&#8217;</a> It provides important framing and hints about what may emerge next in semantics from the Google search engine.</p>
<p>I had earlier covered <a href="../?p=436">Halevy and Google&#8217;s work</a> on the <a href="http://en.wikipedia.org/wiki/Deep_web">deep Web</a>.  In this new piece, the authors describe the use of simple models working on very large amounts of data as a means to trump fancier and more complicated algorithms.</p>
<div>
<div class="boxGreenDotted" style="padding: 10px; width: 520px; text-align: center;"><big><span style="font-style: italic">&#8220;Unfortunately, the fact that the word &#8216;semantic&#8217; appears in both &#8216;Semantic Web&#8217; and &#8216;semantic interpretation&#8217; means that the two problems have often been conflated, causing needless and endless consternation and confusion. The &#8216;semantics&#8217; in Semantic Web services is embodied in the code that implements those services in accordance with the specifications expressed by the relevant ontologies and attached informal documentation.&#8221;</span></big></div>
</div>
<p>Some of the research they cite is related to WebTables<a href="#goog1"> [1]</a> and similar efforts to extract structure from Web-scale data.  The authors describe the use of such systems to create &#8216;schemata&#8217; of attributes related to various types of instance records &#8212; in essence, figuring out the structure of ABoxes <a href="#goog2">[2]</a>, for leading instance types such as companies or automobiles <a href="#goog3">[3]</a>.</p>
<p>The authors generalize these observations, which they call the <em>semantic interpretation problem</em> and contrast with the <em>Semantic Web</em>, as being amenable to a kind of simple, brute-force, Web-scale analysis:  &#8220;Relying on overt statistics of words and word co-occurrences has the further advantage that we can estimate models in an amount of time proportional to available data and can often parallelize them easily. So, learning from the Web becomes naturally scalable.&#8221;</p>
<p>Google had earlier released its <a href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13">database of n-grams</a>, drawn from roughly one trillion words of Web text, and I tend to agree that such large-scale incidence mining can lead to tremendous insights and advantages.  The authors also helpfully point out that certain scale thresholds occur for such analysis, so researchers need not have access to indexes at the scale of Google&#8217;s to do meaningful work or to make meaningful advances.  (Good news for the rest of us!)</p>
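<p>To make the idea of counting word co-occurrences concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the authors&#8217; method: the two tiny text &#8216;shards&#8217; and the helper names <code>count_bigrams</code> and <code>merge_counts</code> are hypothetical. Because each shard is counted on its own and the partial counts are merged by simple addition, the work grows in proportion to the data and could be spread across many machines.</p>

```python
from collections import Counter


def count_bigrams(text):
    """Count adjacent word pairs (bigrams) in one chunk of text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def merge_counts(partial_counts):
    """Sum per-shard counts; addition is order-independent, so shards can be counted in parallel."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


# Hypothetical shards standing in for slices of a Web-scale corpus.
shards = [
    "the deep web is large",
    "the deep web is structured",
]

totals = merge_counts(count_bigrams(shard) for shard in shards)
print(totals[("deep", "web")])  # 2
```

<p>The same merge step would work unchanged whether the partial counts come from two strings or from thousands of separate workers, which is the sense in which this kind of analysis scales naturally.</p>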
<p>As the authors challenge:</p>
<ol>
<li>Choose a representation that can use unsupervised learning on unlabeled data,</li>
<li>Represent the data with a non-parametric model, and</li>
<li>Trust that the important concepts will emerge from this analysis, because human language has already evolved words for them.</li>
</ol>
<p>My very strong suspicion is that much more structured data for instance types (the &#8216;ABox&#8217;) will emerge from Google in the coming weeks.  They have the insights and approaches down, and clearly they have the data to drive the analysis! I also suspect many of these structured additions will simply show up on the results listings with little fanfare.</p>
<p>The structured Web is growing all around us like stalagmites in a cave!</p>
<hr style="margin: 15px 0px" size="1" />
<div style="margin: 10px 0pt; font-size: 90%"><a title="goog1" name="goog1"></a>[1] Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu and Yang Zhang, 2008. &#8220;WebTables: Exploring the Power of Tables on the Web,&#8221; in the <span style="font-style: italic">34th International Conference on Very Large Data Bases (VLDB)</span>, Auckland, New Zealand, 2008.  See <a href="http://web.mit.edu/y_z/www/papers/webtables-vldb08.pdf">http://web.mit.edu/y_z/www/papers/webtables-vldb08.pdf</a>.</div>
<div style="margin: 10px 0pt; font-size: 90%"><a title="goog2" name="goog2"></a>[2] As per our standard use:
<div class="boxGrayDotted">&quot;<a href="http://en.wikipedia.org/wiki/Description_logics">Description logics</a> and their semantics traditionally split <span style="font-style: italic">concepts</span> and their relationships from the different treatment of <span style="font-style: italic">instances</span> and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for <em>terminological</em> knowledge, the basis for <span style="font-style: italic">T</span> in <span style="font-style: italic">TBox</span>) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for <span style="font-style: italic">assertions</span>, the basis for <span style="font-style: italic">A</span> in <span style="font-style: italic">ABox</span>) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.&quot;</div>
</div>
<div style="margin: 10px 0pt; font-size: 90%"><a title="goog3" name="goog3"></a>[3] I very much like the authors&#8217; use of &#8216;schemata&#8217; as the way to describe the attribute structure of various instance record types for the ABox, in contrast to the more appropriate &#8216;ontology&#8217; applied to the TBox.</div>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/481/massive-muscle-on-the-abox-at-google/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Multi-part Federated Search Interview</title>
		<link>http://www.mkbergman.com/467/multi-part-federated-search-interview/</link>
		<comments>http://www.mkbergman.com/467/multi-part-federated-search-interview/#comments</comments>
		<pubDate>Fri, 14 Nov 2008 22:32:37 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Adaptive Information]]></category>
		<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Linked Data]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Structured Web]]></category>
		<category><![CDATA[UMBEL]]></category>
		<category><![CDATA[BrightPlanet]]></category>
		<category><![CDATA[federated search]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[zitgist]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=467</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Multi-part Federated Search Interview&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Linked Data&amp;rft.subject=Semantic Web&amp;rft.subject=Structured Web&amp;rft.subject=UMBEL&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2008-11-14&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/467/multi-part-federated-search-interview/&amp;rft.language=English"></span>
Topics Range from the Deep Web to Semantic Web in this Search Luminaries Series
I&#8217;m pleased to wrap up a multi-part interview with the Federated Search Blog as part of their ongoing &#8216;Search Luminaries&#8217; series. Sol Lederman, editor of the blog, does a thorough and comprehensive job!  Over the past month on every Friday, I [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Multi-part Federated Search Interview&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Linked Data&amp;rft.subject=Semantic Web&amp;rft.subject=Structured Web&amp;rft.subject=UMBEL&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2008-11-14&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/467/multi-part-federated-search-interview/&amp;rft.language=English"></span>
<h2>Topics Range from the Deep Web to Semantic Web in this <span style="font-style: italic">Search Luminaries</span> Series</h2>
<p>I&#8217;m pleased to wrap up a multi-part interview with the <a href="http://federatedsearchblog.com/" style="font-style: italic">Federated Search Blog</a> as part of their ongoing &#8216;Search Luminaries&#8217; series. <a href="http://federatedsearchblog.com/about/">Sol Lederman</a>, editor of the blog, does a thorough job!  Every Friday over the past month, I have answered some 25 or so of his detailed questions.</p>
<p>Federated Search Blog was particularly interested in the <a href="http://en.wikipedia.org/wiki/Deep_web">deep Web</a>, its discovery and size.  Many of the early questions deal with those themes.  However, by <a href="http://federatedsearchblog.com/2008/11/14/michael-bergman-federated-search-luminary-part-iv/">Part 4</a> things get a bit more current, with the topics shifting to the <a href="http://en.wikipedia.org/wiki/Semantic_web">semantic Web</a>, <a href="http://en.wikipedia.org/wiki/Linked_Data">linked data</a> and <a href="http://www.zitgist.com/">Zitgist</a>.</p>
<p>Here are the links to the series:</p>
<ul>
<li><a href="http://federatedsearchblog.com/2008/10/17/luminary-interview-with-michael-bergman-a-preview/">Preview</a> (Oct. 17, 2008)</li>
<li><a href="http://federatedsearchblog.com/2008/10/24/michael-bergman-federated-search-luminary-part-i/">Part 1</a> (Oct. 24)</li>
<li><a href="http://federatedsearchblog.com/2008/10/31/michael-bergman-federated-search-luminary-part-ii/">Part 2</a> (Oct. 31)</li>
<li><a href="http://federatedsearchblog.com/2008/11/07/michael-bergman-federated-search-luminary-part-iii/">Part 3</a> (Nov. 7)</li>
<li><a href="http://federatedsearchblog.com/2008/11/14/michael-bergman-federated-search-luminary-part-iv/">Part 4</a> (Nov. 14)</li>
</ul>
<p>To give you a flavor of the interview, here is an example of one of the questions (and probably my favorite):</p>
<p style="font-style: italic"><span style="font-weight: bold">20.</span> Tim Berners-Lee, credited with inventing the World Wide Web, has been talking about the importance and value of the Semantic Web for years yet common folks don&apos;t see much evidence of the Semantic Web gaining traction. Is there substance to the Semantic Web? What&apos;s happening with it now and what does its future look like?</p>
<div class="boxGrayDotted"><p>Wow, in 10,000 words or less?</p>
<p>No, actually, this is a very good question. As things go, I am a relative newbie to the semantic Web, only having studied and followed it closely since about 2005. I&apos;m sure my perspective in coming later to the party may not be shared by those at the beginning, which dates to the mid-1990s as Berners-Lee&apos;s vision naturally progressed from a Web of documents, as most of us currently know the Web, to a Web of data.</p>
<p>I think there is indeed incredibly important substance to the semantic Web. But, as I have written elsewhere, the semantic Web is more of a vision than a discernible point in time or a milestone.</p>
<p>The basic idea of the semantic Web is to shift the focus from documents to data. Give data a unique Web address. Characterize that data with rich metadata. Describe how things are related to one another so that relationships and connections can be traced. Provide defined structures for what these things and relationships &quot;mean&quot;; this is what provides the semantics, with the structures and their defined vocabularies known as &quot;ontologies&quot; (which in one analog can be seen as akin to a relational database schema).</p>
<p>As these structures and definitions get put in place, the Web itself then becomes the infrastructure for relating information from everywhere and anywhere on any given topic or subject. While this vision may sound grandiose, just think back to what the Web itself has done for us and documents over the past decade or so. This same architecture and infrastructure can and should be extended to the actual information in those documents, the data. And, oh, by the way, conventional databases can now join this party as well. The vision is very powerful and very cool.</p>
<p>Progress has indeed been slow. Many advocates fairly point to how long it takes to get standards in place and for a while people spoke of the &quot;chicken-and-egg&quot; problem of getting over the threshold of having enough structured data to consume to make it worthwhile to create the tools and applications and showcases that consume that data.</p>
<p>From my perspective, the early visions of the semantic Web were too abstract, a bit off perhaps. First, there was the whole idea of artificial intelligence and machines using the data as opposed to better ways for humans to draw use from the data at hand. The fundamental and exciting engine underneath the semantic Web &#8212; the RDF (Resource Description Framework) data model &#8212; was not initially treated on its own. It got admixed with XML that made understanding difficult and distinctions vague. There is and remains too much academia and not enough pragmatics driving the bus.</p>
<p>But that is changing and fast.</p>
<p>There is now an immediate and practical &quot;flavor&quot; of the semantic Web called linked data. It has three simple bases:</p>
<p>(1) RDF as the simple but adaptable data model that can represent any information &#8212; structured or unstructured &#8212; as the basic &quot;triple&quot; statement of subject-predicate-object. That sounds fancy, but just substitute verb for predicate and noun for subject and object. In other words: Dick sees Jane; or the ball is round. It sounds like a kindergarten reader, but that is how data can be easily represented and built up into more complex structures and stories;</p>
<p>(2) Give all objects a unique Web identifier. Unique identifiers are common to any database; in linked data, we just make sure those identifiers conform to the same URIs we see constantly in the address bar of our Web browsers; and</p>
<p>(3) Post and expose this stuff as accessible on the Web (namely, HTTP).</p>
<p>My company adds some essential &quot;spice&quot; to these flavors with respect to reference structures and concepts to give the information context, but these simple bases remain the foundation.</p>
<p>These are really not complex steps. They are really no different than the early phases of posting documents on the Web. Only now, we are exposing data.</p>
<p>More importantly, we can forget the chicken-and-egg problem. Each new data link we make brings value, in much the same way that adding a node to a network brings value according to Metcalfe&apos;s Law. Only with linked data, we already have the nodes &#8212; the data &#8212; we are just establishing the link connections (the verbs, predicates or relations) to flesh out the network graph. Same principle, only our focus is now to connect what is there rather than to add more nodes. (Of course, adding more linked nodes helps as well!)</p>
<p>The absolutely amazing thing about our current circumstance as Web users is that we truly now have simple and readily deployable mechanisms available to finally overcome the decades of enterprise stovepipes. The whole answer is so simple it can be mistaken as snake oil when first presented and not inspected a bit.</p>
<p>Ours is an industry accustomed to hype and cynical about so much of this, so I only ask that your readers check out these assertions for themselves and suspend their normal and expected disbelief. For me, in a career of more than 30 years focusing on information and access, I feel like we finally now have the tools, data model and architecture at hand to actually achieve data interoperability.</p></div>
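<p>The three bases of linked data described in that answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the <code>example.org</code> URIs and the <code>Triple</code> and <code>objects_of</code> names are hypothetical, and plain tuples stand in for a real RDF library.</p>

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    object: str


# Hypothetical URIs; in linked data each identifier would resolve over HTTP.
BASE = "http://example.org/"

triples = [
    Triple(BASE + "Dick", BASE + "sees", BASE + "Jane"),
    Triple(BASE + "ball", BASE + "hasShape", BASE + "round"),
]


def objects_of(subject, predicate, statements):
    """Follow one kind of link from a subject to the things it points at."""
    return [t.object for t in statements
            if t.subject == subject and t.predicate == predicate]


print(objects_of(BASE + "Dick", BASE + "sees", triples))
# ['http://example.org/Jane']
```

<p>Each statement here is just a subject, a predicate and an object, every one named by a Web-style identifier. Adding a new link means appending one more triple, with nothing else in the structure needing to change.</p>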
<p>Thanks again to Sol and <span style="font-style: italic">Federated Search Blog</span> for this opportunity.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/467/multi-part-federated-search-interview/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New Currents in the &#8216;Deep Web&#8217;</title>
		<link>http://www.mkbergman.com/458/new-currents-in-the-deep-web/</link>
		<comments>http://www.mkbergman.com/458/new-currents-in-the-deep-web/#comments</comments>
		<pubDate>Fri, 10 Oct 2008 00:47:38 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[Structured Web]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=458</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=New Currents in the &#8216;Deep Web&#8217;&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.subject=Semantic Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2008-10-09&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/458/new-currents-in-the-deep-web/&amp;rft.language=English"></span>

Timelines, Semantics and Ontologies are Coming to the Fore
The past two weeks have seen an interesting emergence of new perspectives on the &#8216;deep Web&#8217;. The deep Web, a term Thane Paulsen and I coined for my oft-quoted study from 2000,  The Deep Web:  Surfacing Hidden Value [1], is the phenomenon of database-backed content [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=New Currents in the &#8216;Deep Web&#8217;&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.subject=Semantic Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2008-10-09&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/458/new-currents-in-the-deep-web/&amp;rft.language=English"></span>
<p><a href="../wp-content/themes/ai3/images/2008Posts/081009_DeepWebNew.jpg"><img src="../wp-content/themes/ai3/images/2008Posts/081009_DeepWebNew.jpg" style="border: 0px solid ; margin-right: 10px; width: 280px; height: 220px; float: left" alt="Trawling the Deep Web" vspace="5" hspace="5" /></a></p>
<h2>Timelines, Semantics and Ontologies are Coming to the Fore</h2>
<p>The past two weeks have seen an interesting emergence of new perspectives on the &#8216;<a href="http://en.wikipedia.org/wiki/Deep_web">deep Web</a>&#8217;. The deep Web, a term <a href="http://www.paulsenagribranding.com/main.cfm">Thane Paulsen</a> and I coined for my oft-quoted study from 2000, <a href="http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0007.104" style="font-style: italic">The Deep Web:  Surfacing Hidden Value</a> <a href="#deep1">[1]</a>, is the phenomenon of database-backed content served from interactive Web search forms.</p>
<p>Because deep Web content is dynamic and produced only on request, it has been difficult for traditional search engines to index. It is also huge and of high quality (though likely not the 100 to 500 times larger than the standard &#8216;surface&#8217; Web that I estimated in that first study).</p>
<h3>Deep Web Timeline</h3>
<p>This is the most recent of the three notable events over the past two weeks, and came out on Tuesday. Maureen  Flynn-Burhoe of the <a href="http://papergirls.wordpress.com/">oceanflynn @ Digg</a> blog has produced a very  informative and comprehensive <a href="http://papergirls.wordpress.com/2008/10/07/timeline-deep-web/">timeline of  deep Web</a> and related developments from 1980 to the present (database-backed content and early Web precursors,  of course, precede the Web itself and the term &#8216;deep Web&#8217;).</p>
<p>I have been directly involved in this field since 1994 and have not yet seen such a comprehensive treatment. She cites studies noting &#8220;hundreds of thousands&#8221; of deep Web sites and the faster growth of dynamic (database-served) as opposed to static (&#8216;surface&#8217;) content on the Web.</p>
<p>As someone directly involved in estimating the size of the deep Web, I appreciate the analytic difficulties and  take all of the estimates (my <em>own</em> older ones included!) with a grain of salt. Nonetheless, the deep Web is important,  its content is huge, often of unique and high quality, and it deserves serious attention by Web scientists.</p>
<p>Great job, Maureen! I always appreciate thorough researchers. (BTW, I suspect you might also like the <a href="../?page_id=327">Timeline of Information History</a>.)</p>
<h3>Trends and Role in the Semantic Web</h3>
<p><a href="http://mags.acm.org/communications/200810/?CFID=5461527&amp;CFTOKEN=11076271"><img src="../wp-content/themes/ai3/images/2008Posts/081009_CommunicationsACM.jpg" style="border: 0px solid ; margin-left: 10px; float: right" alt="Communications of the ACM" vspace="5" width="218" align="right" border="0" height="284" hspace="5" /></a>The next notable event was the publishing of <em><a href="http://mags.acm.org/communications/200810/?CFID=5461527&amp;CFTOKEN=11076271">Searching the Deep Web</a></em> by <a href="http://www.alexwright.org/">Alex Wright</a> in the <a href="http://www.acm.org/">Communications  of the ACM</a> (October 2008) <a href="#deep2">[2]</a>. Alex had first written about the deep Web for <a href="http://archive.salon.com/tech/feature/2004/03/09/deep_web/index_np.html">Salon magazine in 2004</a> and had given  nice attention to my company at that time, <a href="http://www.brightplanet.com/">BrightPlanet</a> <a href="#deep3">[3]</a>.</p>
<p>In this current update, Alex does an excellent job of characterizing the current status of, and research into, search techniques for the deep Web. I also liked the fact he used our fishing analogy of trawling for standard search crawlers versus direct angling in the deep Web (see our earlier figure at upper left).</p>
<p>As some may recall, Google has stepped up its activities in this area, an event I <a href="../?p=436">reported on</a> a few months back. Those perspectives, and others from some other  notable figures, are included in Alex&#8217;s piece as well.</p>
<p>My own contribution to the piece was to suggest that RDF and semantic Web approaches offered the next evolutionary stage in deep Web searching. Alex was able to take that theme and get some great perspectives on it. I also appreciate the accuracy of my quotes, which gives me confidence in the quality of the rest of the story.</p>
<p>Without a doubt there is high quality in the deep Web and bringing structure and semantic characterization to it  through metadata is a task of some consequence.</p>
<p>For myself, I chose to move beyond the deep Web when its focus seemed stuck in a document-level perspective on retrieval and analysis. However, there is much to be learned from the techniques used to select and access deep Web content, which could be readily transferable to linked data.</p>
<p>Thanks, Alex, for making these prospects clearer! Maybe it is time to dust off some of my old stuff!</p>
<h3>Getting Deeper into the Semantics</h3>
<p><a href="http://www.computer.org/portal/cms_docs_computer/computer/homepage/Sep08/r9itsys.pdf"><img src="../wp-content/themes/ai3/images/2008Posts/081009_Computer.jpg" style="border: 0px solid ; margin-right: 10px; float: left" alt="Computer magazine" vspace="5" width="218" align="left" border="0" height="282" hspace="5" /></a></p>
<p>This emerging convergence of the deep Web and semantics is actually taking place through the efforts of a number of academic researchers. Recent and prominent among them are James Geller from the New Jersey Institute of Technology and his colleagues Soon Ae Chun and Yoo Jung An <a href="#deep4">[4]</a>. Their recently published paper, <a href="http://www.computer.org/portal/cms_docs_computer/computer/homepage/Sep08/r9itsys.pdf" style="font-style: italic">Toward the Semantic Deep Web</a>, shows how ontologies and semantic Web constructs can be combined to more effectively extract information from the deep Web. They call this combination the &#8216;semantic deep Web.&#8217;</p>
<p>The authors posit that the structured roots of deep Web content lend themselves to better ontology learning  from the Web. They also point to the usefulness of deep Web structure to annotations.</p>
<p>That such confluences are occurring between the semantic and deep &#8220;Webs&#8221; is a function of focused academic attention and the growing maturity of both perspectives. This year, for example, saw the first Workshop on <a href="http://bis.kie.ae.poznan.pl/11th_bis/wscfp.php?ws=adw2008">Advances in Accessing Deep Web</a> (ADW 2008). Held as part of the International Conference on Business Information Systems (<a href="http://bis.kie.ae.poznan.pl/11th_bis/">BIS 2008</a>), this meeting saw a lot of elbow rubbing with semantic Web and enterprise topics.</p>
<p>It might seem strange (indeed, sometimes it does to me <img src='http://www.mkbergman.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  ) to envision structured database content being served  through a Web form and then converted via ontologies and other means to semantic Web formats. After all, why not go  direct to the data?</p>
<p>And, of course, direct conversion is less lossy and more efficient.</p>
<p>But, one interesting point is that semantic Web techniques are increasingly working as a structure-extraction  layer wrapping the standard Web. In that regard, starting with inherently structured source data &#8212; that is, the  deep Web &#8212; can lead to higher quality inputs across the distributed, heterogeneous content of the Web.</p>
<p>Given the impossibility of everyone starting with the same premises and speaking the same languages and  concepts, semantic Web mediation methods offer a way to overcome the Tower of Babel. And, when the starting content  itself is inherently structured and (generally) of higher quality &#8212; that is, the deep Web &#8212; the logic of the  combination becomes more obvious.</p>
<h3>For More Information</h3>
<p>Interested in learning more about the deep Web? I first recommend the resources posted at the bottom of Flynn-Burhoe&#8217;s <a href="http://papergirls.wordpress.com/2008/10/07/timeline-deep-web/">timeline</a>. And, for a very thorough treatment, I also recommend Denis Shestakov&#8217;s Ph.D. thesis from earlier this year <a href="#deep5">[5]</a>. It has a bibliography of some 115 references.</p>
<hr style="margin: 15px 0px" size="1" width="33%" align="left" />
<div style="margin: 10px 0pt; font-size: 90%">  <a title="deep1" name="deep1" id="deep1"></a> [1] Michael K. Bergman, 2001. <a href="http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0007.104">The Deep Web:  Surfacing Hidden Value</a>, <em>Journal of Electronic Publishing</em>. 7:1. Note, this publication was an update of an internal BrightPlanet study first published on July 26, 2000.</div>
<div style="margin: 10px 0pt; font-size: 90%"> <a title="deep2" name="deep2" id="deep2"></a> [2] Alex Wright, 2008. &#8220;Searching the Deep Web,&#8221; in <span style="font-style: italic">Communications of the ACM</span>, pp. 14-15, October 2008. See <a href="http://mags.acm.org/communications/200810/?CFID=5461527&amp;CFTOKEN=11076271">http://mags.acm.org/communications/200810/?CFID=5461527&amp;CFTOKEN=11076271</a>.</div>
<div style="margin: 10px 0pt; font-size: 90%"><a title="deep3" name="deep3" id="deep3"></a> [3] Alex is also the author of <a href="http://www.alexwright.org/book.html" style="font-style: italic">GLUT: Mastering Information Through the Ages</a> (<span class="date">Joseph  Henry Press, 296 pp., July 2007;</span> <span class="date">ISBN 0309102383</span><span class="date">). BTW, I had <a href="../?p=408">earlier reviewed</a> this book with some criticisms, which should go a long way to prove Alex&#8217;s fairness and chops as a journalist.</span></div>
<div style="margin: 10px 0pt; font-size: 90%"><a title="deep4" name="deep4" id="deep4"></a> [4] James Geller, Soon Ae Chun and Yoo Jung, 2008. <a href="http://www2.computer.org/portal/web/csdl/doi/10.1109/MC.2008.402">&#8220;Toward the Semantic Deep Web,&#8221;</a> in <span style="font-style: italic">Computer</span>, vol. 41, no. 9, pp. 95-97, Sept. 2008.</div>
<div style="margin: 10px 0pt; font-size: 90%"><a title="deep5" name="deep5" id="deep5"></a> [5] Denis Shestakov, 2008. <a href="https://oa.doria.fi/bitstream/handle/10024/38506/diss2008shestakov.pdf?sequence=3" style="font-style: italic" target="_blank">Search  Interfaces on the Web: Querying and Characterizing</a>, Ph.D. dissertation from the University of Turku Centre for  Computer Science, Finland, 153 pp., May 2008.</div>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/458/new-currents-in-the-deep-web/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Another Deep Web Barrier Falls</title>
		<link>http://www.mkbergman.com/436/another-deep-web-barrier-falls/</link>
		<comments>http://www.mkbergman.com/436/another-deep-web-barrier-falls/#comments</comments>
		<pubDate>Sat, 12 Apr 2008 04:51:52 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[search engines]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=436</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Another Deep Web Barrier Falls&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2008-04-11&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/436/another-deep-web-barrier-falls/&amp;rft.language=English"></span>
As late as 2002, no single search engine indexed the entire surface Web.  Much has been written about that time, but the emergence of Google (and indeed others; it was a key battle at the time) worked to extend full search coverage to the Web, ending the need for so-called desktop metasearchers, then [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Another Deep Web Barrier Falls&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2008-04-11&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/436/another-deep-web-barrier-falls/&amp;rft.language=English"></span>
<p>As late as 2002, no single search engine indexed the entire <a href="http://en.wikipedia.org/wiki/Surface_web">surface Web</a>.  Much has been written about that time, but the emergence of Google (and indeed others; it was a key battle at the time) worked to extend full search coverage to the Web, ending the need for so-called desktop <a href="http://en.wikipedia.org/wiki/Metasearch">metasearchers</a>, then the only option for getting full Web search coverage.</p>
<p>Strangely, though full coverage of document indexing had been conquered for the Web, dynamic Web sites and database-backed sites fronted by search forms were also emerging.  Estimates as of about 2001, made by <a href="http://www.press.umich.edu/jep/07-01/bergman.html">myself</a> and others, suggested such &#8216;<a href="http://en.wikipedia.org/wiki/Deep_web">deep Web</a>&#8217; content was many, many times larger than the indexable document Web and was found in literally hundreds of thousands of sites.</p>
<p>Standard Web crawling is a different technique and technology from &#8220;probing&#8221; the contents of searchable databases, which requires a query to be issued to a site&#8217;s search form.  A company I founded, <a href="http://www.brightplanet.com/">BrightPlanet</a>, and many others such as <a href="http://www.copernic.com/">Copernic</a> and <a href="http://web.archive.org/web/20060105033921/http://www.intelliseek.com/">Intelliseek</a>, many of which no longer exist, were formed with the specific aim of probing these thousands of valuable content sites.</p>
<p>From those companies&#8217; standpoints, mine at that time as well, there was always the threat that the major search engines would draw a bead on deep Web content and use their resources and clout to appropriate this market.  Yahoo, for example, struck arrangements with some publishers of deep content to index their content directly, but that still fell short of the different technology that deep Web retrieval requires.</p>
<p>It was always a bit surprising that this rich storehouse of deep Web content was being neglected.  In retrospect, perhaps it was understandable:  there was still the standard Web document content to index and conquer.</p>
<p>Today, however, Google announced its new deep Web search in a post on one of its developer blogs, <a href="http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html">Crawling through HTML forms</a>, written by Jayant Madhavan and Alon Halevy, a noted search and semantic Web researcher:</p>
<p class="boxGreenDotted">In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn&#8217;t find and index for users who search on Google. Specifically, when we encounter a &lt;FORM&gt; element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.</p>
<p>To be sure, there are differences and nuances to retrieval from the deep Web.  What is described here is neither truly directed nor comprehensive.  But, the barrier has fallen.  With time, and enough servers, the more inaccessible aspects of the deep Web will fall to the services of major engines such as Google.</p>
<p>And, this is a good thing for all consumers desiring full access to the Web of documents.</p>
<p>So, an era is coming to a close.  And this, too, is appropriate.  For we are also now transitioning into the complementary era of the Web of data.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/436/another-deep-web-barrier-falls/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>What is the Structured Web?</title>
		<link>http://www.mkbergman.com/390/what-is-the-structured-web/</link>
		<comments>http://www.mkbergman.com/390/what-is-the-structured-web/#comments</comments>
		<pubDate>Wed, 18 Jul 2007 17:05:58 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Adaptive Information]]></category>
		<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Information Automation]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Structured Web]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=390</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=What is the <em>Structured Web</em>?&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Information Automation&amp;rft.subject=Open Source&amp;rft.subject=Semantic Web&amp;rft.subject=Structured Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2007-07-18&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/390/what-is-the-structured-web/&amp;rft.language=English"></span>

The structured Web is object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance.
Over the past few months I have increasingly been writing about and referring to the structured Web.  I [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=What is the <em>Structured Web</em>?&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Information Automation&amp;rft.subject=Open Source&amp;rft.subject=Semantic Web&amp;rft.subject=Structured Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2007-07-18&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/390/what-is-the-structured-web/&amp;rft.language=English"></span>
<div style="float: left"><a href="http://www.chemicalgraphics.com/paul/"><img src="http://www.mkbergman.com/wp-content/themes/ai3/images/2007Posts/070718_dna_candy.jpg" alt="Image from Paul Thiessan" style="border: 0px solid ; margin-right: 10px; width: 160px; height: 294px" /></a></div>
<div style="border: 1px solid #820000; margin: 20px 30px 20px 210px; padding: 8px; background-color: #f9f9f9" align="center">The <span style="font-style: italic; font-weight: bold">structured Web</span> is object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance.</div>
<p>Over the past few months I have increasingly been writing about and referring to the <span style="font-style: italic">structured Web</span>.  I have done so purposefully, but, so far, with little background or explication.  With the inauguration of this occasional series, I hope to bring more color and depth to this topic [<a href="#struct1">1</a>].</p>
<p>Literally, over the past year, I have been learning and documenting on <span style="font-weight: bold; color: #820000">AI3</span> my attempts to understand the basis, concepts and tools of the emerging semantic Web.  In that process, I have come to define my own outlines of the Web past, present and future.  Within this world view, I see the <span style="font-style: italic">structured Web</span> as today&#8217;s current imperative and reality.</p>
<h3>Confusing Terminology Surrounding Obvious Change</h3>
<p>Some Web pundits have embraced a versioning terminology of Web 2.0 and Web 3.0 to describe one such world view.  I don&#8217;t personally agree with this silly versioning &#8212; indeed I poked fun in a tongue-in-cheek <a href="http://www.mkbergman.com//?p=248">posting about Web 98.6</a> more than a year ago &#8212; but such terminology has gotten some traction and serves a purpose.  I actually give my own definitions for such &#8220;versions&#8221; below if for no other reason than to close the gap with alternative world views.</p>
<p>We need not go back to the alternative early protocols of <a href="http://en.wikipedia.org/wiki/Usenet">Usenet</a> (and news groups), <a href="http://en.wikipedia.org/wiki/Gopher_protocol">Gopher</a> and <a href="http://en.wikipedia.org/wiki/File_Transfer_Protocol">FTP</a> and their search engines of <a href="http://en.wikipedia.org/wiki/Veronica_%28computer%29">Veronica</a>, <a href="http://en.wikipedia.org/wiki/Wide_area_information_server">WAIS</a>, <a href="http://en.wikipedia.org/wiki/Jughead_%28computer%29">Jughead</a> or <a href="http://en.wikipedia.org/wiki/Archie_search_engine">Archie</a> in 1991 [<a href="#struct2">2</a>] when <a href="http://en.wikipedia.org/wiki/Tim_Berners-Lee">Tim Berners-Lee</a> first publicly announced the World Wide Web and its combination of hypertext with the Internet.  More likely, the release of the <a href="http://en.wikipedia.org/wiki/Mosaic_%28web_browser%29">Mosaic browser</a> and <a href="http://en.wikipedia.org/wiki/CERN">CERN</a>&#8217;s decision to make access to the Web free in 1993 marked the true take-off point for the Web and the continued demise of the competing protocols.</p>
<p>Images and links in Web pages (&#8220;documents&#8221;) plus the HTML mark-up language to enable the styling and graphical design of those pages were very much in keeping with general trends, paralleling the <a href="http://en.wikipedia.org/wiki/History_of_the_graphical_user_interface">earlier transition of personal computers to graphical interfaces</a> and away from terminals.  Mosaic became the foundation for the <a href="http://en.wikipedia.org/wiki/Netscape">Netscape</a> browser, best links compilations became a big hit through sites like <a href="http://en.wikipedia.org/wiki/Yahoo">Yahoo!</a>, and the Lycos search engine, one of the first profitable Web ventures, indexed a mere 54,000 pages when it was publicly released in 1994 [<a href="#struct3">3</a>].</p>
<p>This initial start to the Web &#8212; now referred to by some as &#8216;Web 1.0&#8217; &#8212; can be squarely timed to 1993-1994.  By 1995, the Web was appearing on the covers of major news magazines and by 1996 the phenomenon was at full throttle.  But, since these early beginnings, the Web has gone through many different &#8220;versions&#8221; and transitions, most not fitting with version numbers, as some of these examples show:</p>
<ul>
<li>Academic <span style="font-style: italic">v</span>. Commercial Web &#8212; magazines like Wired, Red Herring, Business 2.0 and the mainstream press showered us with names such as <a href="http://en.wikipedia.org/wiki/E-commerce">e-commerce</a>,  <a href="http://en.wikipedia.org/wiki/Dot-com">dot-com</a> and the gold rush for companies to establish a Web presence, <a href="http://en.wikipedia.org/wiki/B2b">B2B</a>, etc. in the latter part of the 1990s.  In fact, for some early architects of the Web, this was a period of some trauma and handwringing, since the &#8220;pure&#8221; open and academic roots of the Internet and the Web were being taken over by mainstream use, commercialization and the monied dominance of venture capital.  This first major change in the Web, its first major new &#8216;version&#8217; if you will, came back down to earth as a result of the &#8220;dot-com bust&#8221; of the bubble in 2001 [<a href="#struct4">4</a>]</li>
<li>Static <span style="font-style: italic">v.</span> Dynamic Web &#8212; all initial Web content was based on documents created by hand and posted as individual and hyperlinked Web pages. The relatively few documents of the early Web meant that hand-compiled &#8220;best of&#8221; listings such as Yahoo! worked pretty well; &#8216;<a href="http://en.wikipedia.org/wiki/Metasearch">metasearchers</a>&#8217; also emerged to overcome the limited indexing coverage of early search engines.  These trends, however, were also masking another version sea-change for the Web.  With growth and more content, many larger sites were moving to dynamic page generation with retrieval via search forms.  This dynamic portion of the Web, called at times either the &#8216;<a href="http://en.wikipedia.org/wiki/Deep_web">deep Web</a>&#8217; or &#8216;invisible Web,&#8217; acted like standard search engines and therefore was generally overlooked until I first popularized this change in 2000 [<a href="#struct5">5</a>].  I would argue that the shift to dynamic content, with certainly hundreds of thousands of such database-backed sites now in existence &#8212; and content many times larger than what is indexed by standard search engines &#8212; was also a major version shift for the Web</li>
<li>Open Source and Open Data &#8212; the open source <a href="http://en.wikipedia.org/wiki/Linux">Linux</a> and the <a href="http://en.wikipedia.org/wiki/Apache_HTTP_Server">Apache Web server</a> have been two software foundations to the growth of the Web, and <a href="http://en.wikipedia.org/wiki/Mysql">MySQL</a> has had a leading role in supporting sites and software with database-backed designs [<a href="#struct6">6</a>].  It is beyond the scope of this piece, but I believe that the dot-com frenzy, the demise of Netscape at the hands of Internet Explorer and other tensions with commercial interests, plus the very empowering nature of the Internet itself are also leading to a version change of the Web from commercial software products to open source ones.  Further, proprietary publishers and data sources have only had limited success on the Web; we are now seeing strong trends to open data as well.  Additionally, the very nature of open source software lends itself to interoperability and modularity based on naturally selected building blocks.  This &#8220;open&#8221; infrastructural basis of the Web is more subtle and hard to see, but provides some powerful drivers for how more surface-oriented trends express themselves</li>
<li>Social Networking Web &#8212; the same early software that enabled dynamic Web pages and database-backed designs naturally lent itself to early <a href="http://en.wikipedia.org/wiki/Blog">blogs</a>, <a href="http://en.wikipedia.org/wiki/Wiki">wikis</a> and  <a href="http://en.wikipedia.org/wiki/Content_management_system">content-management systems</a>, many backed by MySQL, which in turn led to more community-oriented designs and services such as <a href="http://en.wikipedia.org/wiki/Del.icio.us">del.icio.us</a> for bookmarking, <a href="http://en.wikipedia.org/wiki/Flickr">Flickr</a> for photos, later <a href="http://en.wikipedia.org/wiki/Youtube">YouTube</a> for videos, and literally thousands of others.  This trend, resulting from changed practices and the use of different tools and ways to harness <a href="http://en.wikipedia.org/wiki/User-generated_content">user-generated content</a>, and not resulting from any changes to standards <span style="font-style: italic">per se</span>, was first called &#8216;<a href="http://en.wikipedia.org/wiki/Web_2.0">Web 2.0</a>&#8217; by Tim O&#8217;Reilly in about 2003</li>
<li>Ajax and Widgets &#8212; some would include <a href="http://en.wikipedia.org/wiki/Web_service">Web services</a>, APIs and &#8216;<a href="http://en.wikipedia.org/wiki/Mashup_%28web_application_hybrid%29">mashups</a>&#8217; in Web 2.0, often as expressed through embedded Web &#8216;<a href="http://en.wikipedia.org/wiki/Web_widget">widgets</a>&#8217; and the use of <a href="http://en.wikipedia.org/wiki/Ajax_%28programming%29">Ajax</a> or similar dynamic scripting approaches.  These considerations were not part of the original Web 2.0 term, but usage today likely embraces aspects of these in many definitions of Web 2.0.  In any case, there is certainly a change within the Web to more interactive, attractive, full-featured user interfaces, with interface updates no longer requiring a full Web page refresh</li>
<li>Document-centric Web <span style="font-style: italic">v.</span> Data-centric Web &#8212; in any event, portions of these trends and changes are more broadly combining to represent another version change in the Web from one solely focused on documents to one that is more data-centric; this topic, the basis for the term &#8216;<span style="font-style: italic">structured Web</span>,&#8217; is more fully discussed below</li>
<li>Web 3.0 &#8212; Wikipedia states, &#8220;Web 3.0 is a term that has been coined with different meanings to describe the evolution of Web usage and interaction among several separate paths. These include transforming the Web into a <a href="http://en.wikipedia.org/wiki/Database" title="Database">database</a>, a move towards making content accessible by multiple non-browser applications, the leveraging of <a href="http://en.wikipedia.org/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a> technologies, the <a href="http://en.wikipedia.org/wiki/Semantic_web" title="Semantic web">Semantic web</a>, or the <a href="http://en.wikipedia.org/wiki/Geoweb" title="Geoweb">Geospatial Web</a>.&#8221;  Of all current terms, this one is fully the silliest, since there is no consensus on what it represents nor its endpoints</li>
<li>Semantic Web &#8212;  the <a href="http://www.w3.org/2003/glossary/keyword/All/?keywords=semantic+web">glossary</a> at <a href="http://www.w3.org/">W3C</a> states that the semantic Web is &#8220;the Web of data with meaning in the sense that a computer program can learn enough about what the data means to process it.&#8221;  Elsewhere, the vision of the <a href="http://www.w3.org/2001/sw/SW-FAQ#What1">semantic Web</a> is described by the Education and Outreach working group (SWEO) of the W3C &#8220;to extend principles of the Web from documents to data. This extension will allow to fulfill more of the Web&apos;s potential, in that it will allow data to be shared effectively by wider communities, and to be processed automatically by tools as well as manually.&#8221;  Note the importance of computer processing and autonomy in these statements, not to mention the pivotal term of &#8216;semantics.&#8217;  This is an expansive and wide-embracing vision, some challenges of which I more fully describe below, and</li>
<li>Visions of the Web &#8212; the semantic Web vision is matched with other visions, including voice activation, autonomous agents doing our bidding in the background, wireless interlinked everything, and other versions of the Web that are sometimes portrayed in science fiction.  Whenever such transitions occur, they will all surely rely on the various &#8220;versions&#8221; of the Web that have occurred in the short 15 years of the Web&#8217;s existence.</li>
</ul>
<p>Despite these differences in viewpoint, language does matter.  Though some may view language as a contest in &#8220;branding,&#8221; which can legitimately apply in other venues, I think the issue here goes well beyond &#8220;branding.&#8221;  Language is also necessary to aid communication.</p>
<p>As I explain below and elaborate upon more fully throughout this series, I believe one of the correct terms for the current evolutionary state of the Web is the &#8216;<span style="font-style: italic">structured Web</span>.&#8217;</p>
<h3>A Clear Transition to a Data-centric Web</h3>
<p>As noted, portions of these trends and changes are more broadly combining to represent another transitional change in the Web from one solely focused on documents to one that is more object- or data-centric.  Evidence of this trend includes such factors as:</p>
<ul>
<li>Broad database-backed Web site designs, with content re-purposed and served up dynamically, the trend first noted as the &#8216;deep Web&#8217;</li>
<li>&#8216;Mashups&#8217; of data from multiple sources, such as in maps, timelines, etc.</li>
<li>The exposure of Web services and APIs.  The <a href="http://www.programmableweb.com/">programmableweb.com</a>, for example, documents a doubling of such sources in the past nine months via its listing (as of July 2007) of about 500 APIs and more than 2,100 mashups</li>
<li>Huge growth and availability of large, often public, <a href="http://www.cs.umd.edu/class/spring2006/cmsc838s/data_repositories/repository_us.html">data sources</a>, from US government and social sources like <a href="http://dbpedia.org/docs/">DBpedia</a>, an RDF data extraction from Wikipedia (and others)</li>
<li>The emergence of entire data-centric sources, services and mashup platforms such as <a href="http://www.freebase.com/signin/signin">Freebase</a>, Yahoo! <a href="http://pipes.yahoo.com/pipes/">Pipes</a>, Google <a href="http://base.google.com/">Base</a>, <a href="http://www.teqlo.com/">Teqlo</a>, <a href="http://services.alphaworks.ibm.com/qedwiki/">QEDwiki</a>,  <a href="http://www.ning.com/">Ning</a>, and <a href="http://openkapow.com/">OpenKapow</a></li>
<li>The rapid &#8212; and now almost universal &#8212; availability of data format converters (mostly to RDF) such as the listings of the W3C&#8217;s <a href="http://esw.w3.org/topic/ConverterToRdf">RDF Converters</a> and MIT&#8217;s &#8216;<a href="http://simile.mit.edu/wiki/RDFizers">RDFizers</a>,&#8217; the <a href="http://www.w3.org/TR/grddl/">GRDDL</a> initiative,  <a href="http://triplr.org/">Triplr</a>, and the like</li>
<li>Soon, other to-be-announced major data source look-up references, directories, and conversion and filtering services.</li>
</ul>
<p>One of the most popular series of presentations at this year&#8217;s <a href="http://www2007.org/">WWW2007</a> conference in Banff was from the <a href="http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData">Linked Open Data</a> project of the <a href="http://www.w3.org/2001/sw/sweo/">SWEO interest group</a>.  The members of this LOD project &#8212; comprised of accomplished advocates, developers and theorists &#8212; are providing the awareness, tools and example data that are showing how this emerging version may look.  In fact, the group has just announced crossing the threshold of 1 billion &#8216;triples&#8217; with 180,000 interlinks within its online DBpedia service, via these sources:</p>
<div align="center"> <img src="http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData?action=AttachFile&amp;do=get&amp;target=lod-datasetsV2.png" title="attachment:lod-datasetsV2.png" /></div>
<p>The LOD&#8217;s term for this effort is &#8216;<a href="http://en.wikipedia.org/wiki/Linked_Data">linked data</a>&#8217;, and a Web site has been established to promote it.  Others, harking back to Tim Berners-Lee&#8217;s original definition, refer to current efforts as a &#8216;Web of data&#8217; or the &#8216;Semantic Data Web.&#8217;  Kingsley Idehen has been promoting the idea of &#8216;<a href="http://www.openlinksw.com/blog/%7Ekidehen/?id=1185">data spaces</a>&#8217; &#8212; personal and collective &#8212; which is also a powerful metaphor.</p>
<p>Frankly, I think all of these terms are correct and useful.  Yet I prefer the term <span style="font-style: italic">structured Web</span> because it is both <span style="font-weight: bold; font-style: italic">more</span> and <span style="font-weight: bold; font-style: italic">less</span> than some of these other terms.</p>
<p>The <span style="font-style: italic">structured Web</span> is <span style="font-weight: bold; font-style: italic">more</span> in that it pertains to any data formalism in use on the Web and includes the notion of extracting structure from uncharacterized content, by far the largest repository of potentially useful information on the Web.  Yet the <span style="font-style: italic">structured Web</span> is also <span style="font-weight: bold; font-style: italic">less</span> because its ambition is solely to get that data into an interoperable framework and to forgo the full objectives of the &#8216;Semantic Web.&#8217;  In that regard, my concept of the <span style="font-style: italic">structured Web</span> is perhaps closest to the idea of <a href="http://en.wikipedia.org/wiki/Linked_Data">linked data</a>, though with less insistence on &#8220;correct&#8221; RDF and with specific attention to structure extraction from uncharacterized content.</p>
<h3>Remarkable Progress on a Still Incomplete Journey</h3>
<p>One of today&#8217;s realities is that we have accomplished much but still have a long way to go to achieve the grand vision of the &#8216;Semantic Web&#8217; (capitalized).</p>
<p>More than a year ago I wrote a piece on &#8220;<a href="http://www.mkbergman.com//?p=229">Climbing the Data Federation Pyramid</a>&#8221; that noted the tremendous progress that has been made in the last twenty years in overcoming many seemingly intractable issues in data interoperability, initially of a physical and hardware nature.  The Internet and Web standards have made enormous contributions to that progress.</p>
<p>The diagram I used in that piece is shown below [<a href="#struct7">7</a>].  Reaching the pyramid&#8217;s pinnacle could be argued as having achieved the grand vision of the Semantic Web.  With the adoption of the Internet and Web protocols, all layers up through data representation have largely been solved.  Data representation, data models, schema for different world views, and means for reconciling and mediating those different world views are much of the focus of today&#8217;s conceptual challenges.</p>
<p>Note that, as we discuss the <span style="font-style: italic">structured Web</span>, we are largely focusing on the layer dealing with <span style="font-style: italic">data representation</span>, with some minor portions (principally in disambiguation) dealing with <span style="font-style: italic">semantics</span>.  Getting data into a canonical data representation or model still leaves very crucial challenges in what the data means (its semantics), reasoning over that data (inference and pragmatics), and whether the data is authoritative or can be trusted.  These are the daunting &#8212; and largely remaining &#8212; challenges of the Semantic Web.</p>
<p align="center"><img src="http://www.mkbergman.com/wp-content/themes/ai3/images/2006Posts/060524a_DataFederationPyramid.gif" height="422" width="596" /></p>
<p>For example, let&#8217;s look solely at the layer of <span style="font-style: italic">semantics</span>, the immediate challenge after <span style="font-style: italic">data representation</span>.  By <span style="font-style: italic">semantics</span>, we are referring to whether different statements from different sources do or do not refer to the same entity or concept; in other words, whether they have the same <span style="font-style: italic">meaning</span>.  Such a determination is pivotal if we are to combine data from multiple sources.</p>
<p>The use of RDF, accurate name spaces and syntactically correct URIs aids this resolution, but does not completely solve it.  Ultimately, semantic mediation (such as my &#8220;glad&#8221; is equivalent to your &#8220;happy&#8221;) means resolving or mediating potential heterogeneities across on the order of <a href="http://www.mkbergman.com//?p=232">40 discrete categories of potential mismatches</a>, from units of measure to terminology, language, and many others.  These mismatches may derive from structure, domain, data or language, as shown in this table [<a href="#struct8">8</a>]:</p>
<div style="margin: 15px 0px" align="center">
<table border="1" cellpadding="3" cellspacing="0">
<tr>
<td style="width: 127px; background-color: #cccccc; vertical-align: middle; text-align: center; font-weight: bold">Class</td>
<td style="width: 142px; background-color: #cccccc; vertical-align: middle; text-align: center; font-weight: bold">Category</td>
<td style="width: 350px; background-color: #cccccc; vertical-align: middle; text-align: center; font-weight: bold">Sub-category</td>
</tr>
<tr>
<td rowspan="15" style="width: 127px"><strong>STRUCTURAL</strong></td>
<td rowspan="4" style="width: 142px">
<p align="center">Naming</p>
</td>
<td style="width: 350px">Case Sensitivity</td>
</tr>
<tr>
<td style="width: 350px">Synonyms</td>
</tr>
<tr>
<td style="width: 350px">Acronyms</td>
</tr>
<tr>
<td style="width: 350px">Homonyms</td>
</tr>
<tr>
<td colspan="2" style="width: 492px">Generalization / Specialization</td>
</tr>
<tr>
<td rowspan="2" style="width: 142px">Aggregation</td>
<td style="width: 350px">Intra-aggregation</td>
</tr>
<tr>
<td style="width: 350px">Inter-aggregation</td>
</tr>
<tr>
<td colspan="2" style="width: 492px">Internal Path Discrepancy</td>
</tr>
<tr>
<td rowspan="4" style="width: 142px">Missing Item</td>
<td style="width: 350px">Content Discrepancy</td>
</tr>
<tr>
<td style="width: 350px">Attribute List Discrepancy</td>
</tr>
<tr>
<td style="width: 350px">Missing Attribute</td>
</tr>
<tr>
<td style="width: 350px">Missing Content</td>
</tr>
<tr>
<td colspan="2" style="width: 492px">Element Ordering</td>
</tr>
<tr>
<td colspan="2" style="width: 492px">Constraint Mismatch</td>
</tr>
<tr>
<td colspan="2" style="width: 492px">Type Mismatch</td>
</tr>
<tr>
<td rowspan="8" style="width: 127px"><strong>DOMAIN</strong></td>
<td rowspan="4" style="width: 142px">Schematic Discrepancy</td>
<td style="width: 350px">Element-value to Element-label Mapping</td>
</tr>
<tr>
<td style="width: 350px">Attribute-value to Element-label Mapping</td>
</tr>
<tr>
<td style="width: 350px">Element-value to Attribute-label Mapping</td>
</tr>
<tr>
<td style="width: 350px">Attribute-value to Attribute-label Mapping</td>
</tr>
<tr>
<td colspan="2" style="width: 492px">Scale or Units</td>
</tr>
<tr>
<td colspan="2" style="width: 492px">Precision</td>
</tr>
<tr>
<td rowspan="2" style="width: 142px">Data Representation</td>
<td style="width: 350px">Primitive Data Type</td>
</tr>
<tr>
<td style="width: 350px">Data Format</td>
</tr>
<tr>
<td rowspan="7" style="width: 127px"><strong>DATA</strong></td>
<td rowspan="4" style="width: 142px">Naming</td>
<td style="width: 350px">Case Sensitivity</td>
</tr>
<tr>
<td style="width: 350px">Synonyms</td>
</tr>
<tr>
<td style="width: 350px">Acronyms</td>
</tr>
<tr>
<td style="width: 350px">Homonyms</td>
</tr>
<tr>
<td colspan="2" style="width: 492px">ID Mismatch or Missing ID</td>
</tr>
<tr>
<td colspan="2" style="width: 492px">Missing Data</td>
</tr>
<tr>
<td colspan="2" style="width: 492px">Incorrect Spelling</td>
</tr>
<tr>
<td rowspan="8" style="width: 127px"><strong>LANGUAGE</strong></td>
<td rowspan="4" style="width: 142px">Encoding</td>
<td style="width: 350px">Ingest Encoding Mismatch</td>
</tr>
<tr>
<td style="width: 350px">Ingest Encoding Lacking</td>
</tr>
<tr>
<td style="width: 350px">Query Encoding Mismatch</td>
</tr>
<tr>
<td style="width: 350px">Query Encoding Lacking</td>
</tr>
<tr>
<td rowspan="4" style="width: 142px">Languages</td>
<td style="width: 350px">Script Mismatches</td>
</tr>
<tr>
<td style="width: 350px">Parsing / Morphological Analysis Errors (many)</td>
</tr>
<tr>
<td style="width: 350px">Syntactical Errors (many)</td>
</tr>
<tr>
<td style="width: 350px">Semantic Errors (many)</td>
</tr>
</table>
</div>
<p>Using the same data model (say, RDF) or the same name spaces (say, Dublin Core or FOAF) helps somewhat to remove some of these sources of heterogeneity, but not all.  Undoubtedly, longer term, resolving these heterogeneities will prove tractable.  But they are not easily so today.</p>
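<p>As a rough illustration of what mediating even a few of these mismatches involves, the following Python sketch reconciles two records that differ in case, synonym choice and units.  The lookup tables, field names and sample records are hypothetical, invented only for this example; real semantic mediation must cope with far more of the categories in the table above.</p>

```python
# Hypothetical lookup tables, invented for this sketch only.
SYNONYMS = {"glad": "happy"}
TO_METERS = {"m": 1.0, "cm": 0.01}


def normalize_value(value):
    """Mediate one value: fold case and synonyms, or convert units."""
    if isinstance(value, str):
        key = value.strip().lower()
        return SYNONYMS.get(key, key)
    amount, unit = value  # assumed shape: (number, unit label)
    return round(amount * TO_METERS[unit.lower()], 6)


def normalize_record(record):
    """Fold attribute names to lower case and mediate every value."""
    return {name.strip().lower(): normalize_value(value)
            for name, value in record.items()}


# Two sources describing the same thing in different ways.
source_a = {"Mood": "glad", "Height": (180, "cm")}
source_b = {"mood": "Happy", "height": (1.8, "m")}

print(normalize_record(source_a) == normalize_record(source_b))  # True
```

<p>Even this toy case needs a hand-built synonym table and a unit table, which hints at why the remaining categories are not easily resolved today.</p>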
<p>This observation does not undercut the Semantic Web vision nor negate the massive labors undertaken to date in support of that vision.  But, hopefully, this observation may bring some perspective to the task ahead in attaining that vision.</p>
<h3>Lowering Our Sights</h3>
<p>If nothing else, the reality of the past 15 years shows us that the Web is a &#8220;dirty,&#8221; chaotic place.  If HTML coding can be screwed up, it will.  If loopholes in standards and protocols exist, they will be exploited.  If there is ambiguity, all interpretations become possible, with many passionately held.  Innovation and unintended uses occur everywhere.</p>
<p>This should not be surprising, and experienced Web designers, scientists and technologists should all know this by now.  There can be no disconnect between workable standards and approaches and actual use in the &#8220;wild.&#8221;  Nuanced arguments over the subtleties of standards and approaches are bound to fail.  Robustness, simplicity and forgiveness must take precedence over elegance and theoretical completeness.</p>
<p>While there has been obvious growth in the sophistication of Web sites and the underlying technologies that support them, we see continued use of obsolete approaches that clearly should have been abandoned long ago (such as Web-safe colors, small displays, older browser versions, Web pages parked on some servers that have not been modified or looked at by their original authors in a decade, etc.).  We also see slow uptake for clearly &#8220;better&#8221; new approaches.  And we also sometimes see explosive uptake of approaches and ideas that seemingly come out of nowhere.</p>
<p>We also see that those approaches that enjoy the greatest success &#8212; blogging, tagging, microformats, RSS, widgets, for example, come most recently to mind &#8212; are those that the &#8220;citizen&#8221; user can easily and readily embrace.  HTML was pretty foreign at first, but now millions of users modify their own code.  Millions of users of various CMS systems and Firefox have learned how to install plug-ins and add-ins and modify CSS themes and use administration consoles.</p>
<p>So, my observation and argument is not that we must always choose what is mindless and unchallenging.  But my argument is that we must accept real-world diversity and seek simplicity, robustness and clarity for what is new.</p>
<p>After nearly a decade of standards work, the basis for beginning the transition to the semantic Web is in place.  But that vision itself sometimes appears too demanding, too intimidating.  The vision at times appears all too unreachable.</p>
<p>Of course, this perception is wrong.  Measured over many years, perhaps some decades, the vision of the semantic Web <span style="font-style: italic; font-weight: bold">is</span> reachable.  Much remains to be worked on and understood regarding this vision in terms of mediating and resolving semantic heterogeneities, capturing and expressing world views through formal ontologies, making inferences between these views, and establishing trust and authoritativeness.  And those challenges do not yet address the even more-exciting prospects of intelligent and autonomous agents.</p>
<p>Rather, the rationale for the <span style="font-style: italic">structured Web</span> is to tone down the vision, stay with the here and now, focus on what is achievable today.  And what is achievable today is very great.</p>
<h3>Why This Series on the &#8216;<span style="font-style: italic">Structured Web</span>&#8217;?</h3>
<p>Though version numbers for the Web are silly, with &#8216;Web 3.0&#8217; for the semantic Web possibly being the silliest of all, such attempts do speak to the need for providing handles and language for capturing the dynamic change, diversity and complexity of the Web.</p>
<p>Today, right now, and all around us, a fundamental transition is taking place in the Web from a document-centric to a data-centric environment.  A confluence of standards, advocacies, and previous trends is fueling this transition.  Since the practical building blocks already exist, we will see this <span style="font-style: italic">structured Web</span> unfold before us at amazing speed.</p>
<p>The concept of the <span style="font-style: italic">structured Web</span> is thus narrower and less ambitious in scope than the &#8216;Semantic Web.&#8217;  The <span style="font-style: italic">structured Web</span> is merely a transitional step on the journey to the vision of the semantic Web, albeit one that can be fully realized today with current technologies and current understandings.</p>
<p>The purpose of this new series is thus to give prominence to this transition and to highlight the pragmatic, practical building blocks available to contribute to this transition.  By somewhat shifting boundary definitions, the idea of the <span style="font-style: italic">structured Web</span> also aims to give more prominence to the importance of usability and structure extraction from semi-structured and unstructured content.  These, too, are exciting areas with much potential.</p>
<p>So, as a way to provide a touchstone for continued discussion on this matter, here is one working definition of the <span style="font-style: italic">structured Web</span>:</p>
<div cellspacing="0" cellpadding="8" style="border: 1px solid #820000; margin: 20px 30px; padding: 8px 5px; background-color: #f9f9f9" align="center"> The <span style="font-style: italic; font-weight: bold">structured Web</span> is object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance.</div>
<h3>Anticipated Topics in this Series</h3>
<p>Some of the tentative topics that I plan to address in this series include discussion of what constitutes &#8216;structure&#8217; in content, why structure is important, the various existing forms of structure, human<span style="font-style: italic"> v.</span> machine bases for viewing and interpreting structure, the importance of finding &#8220;canonical&#8221; representation forms while also appreciating real-world diversity, the means to convert data forms and serializations, the means to extract structure from all types of content, transitioning to semantic understandings, and likely others.</p>
<p>Others may be added to this series over time and will be categorized under &#8216;<a href="http://www.mkbergman.com//?cat=23">Structured Web</a>&#8217; on the <span style="font-weight: bold; color: #820000">AI3</span> blog.</p>
<div cellspacing="0" cellpadding="8" style="border: 1px dotted #004182; margin: 20px 160px; padding: 8px; background-color: #fffff0; font-size: 90%" align="center"> This posting is the first part of a new, occasional series on the <a href="http://www.mkbergman.com//?cat=23" style="font-weight: bold">Structured Web</a>, which also has its own new category.  There are some additional prior topics in this series.</div>
<div style="margin: 15px 0px">
<hr style="width: 33%; height: 1px" align="left" /></div>
<p><a title="struct1" name="struct1"></a>[1] You will note a heavy emphasis on Wikipedia definitions and histories in this piece, in keeping with the general theme of versions and transitions on the Web.</p>
<p><a title="struct2" name="struct2"></a>[2] News groups really did not have a good search engine until the launch of <a href="http://en.wikipedia.org/wiki/Deja_News">Deja News</a> in 1995.</p>
<p><a title="struct3" name="struct3"></a>[3] Chris Sherman, &quot;Happy Birthday, Lycos!,&quot; <span style="font-style: italic">Search Engine Watch</span>, August 14, 2002. See <a href="http://searchenginewatch.com/showPage.html?page=2160551" class="liexternal">http://searchenginewatch.com/showPage.html?page=2160551</a>.</p>
<p><a title="struct4" name="struct4"></a>[4] A fairly good summary of the <a href="http://en.wikipedia.org/wiki/History_of_the_World_Wide_Web">History of the Web</a> can be found on Wikipedia.</p>
<p><a title="struct5" name="struct5"></a>[5] Michael K. Bergman (Aug 2001). &#8220;<a href="http://www.press.umich.edu/jep/07-01/bergman.html" class="external text" title="http://www.press.umich.edu/jep/07-01/bergman.html" rel="nofollow">The Deep Web: Surfacing Hidden Value</a>&#8221;. <em>The Journal of Electronic Publishing</em> <strong>7</strong> (1). An earlier version of this paper was published by BrightPlanet Corp. in July 2000.</p>
<p><a title="struct6" name="struct6"></a>[6] While there are variations, Linux, Apache, MySQL and the scripting languages of either Python, PHP, or Perl are often referred to as &#8216;<a href="http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29">LAMP</a>&#8217;, one central basis for much open source software and, more broadly, interoperable open-source application packages.</p>
<p><a title="struct7" name="struct7"></a>[7] I would make a few changes today, notably in deprecating XML somewhat.</p>
<p><a title="struct8" name="struct8"></a>[8] This table builds on Pluempitiwiriyawej and Hammer&apos;s schema by adding the fourth major category of language.  See Charnyote Pluempitiwiriyawej and Joachim Hammer, &quot;A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources,&quot; <em>Technical Report TR00-004</em>, University of Florida, Gainesville, FL, 36 pp., September 2000. See <a href="ftp://ftp.dbcenter.cise.ufl.edu/Pub/publications/tr00-004.pdf" class="lifilepdf">ftp.dbcenter.cise.ufl.edu/Pub/publications/tr00-004.pdf</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/390/what-is-the-structured-web/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>The Murky Depths of the &#8216;Deep Web&#8217;</title>
		<link>http://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/</link>
		<comments>http://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/#comments</comments>
		<pubDate>Wed, 21 Feb 2007 18:22:14 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Document Assets]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=343</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=The Murky Depths of the &#8216;Deep Web&#8217;&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.subject=Document Assets&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2007-02-21&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/&amp;rft.language=English"></span>
It&#8217;s Taken Too Many Years to Re-visit the &#8216;Deep Web&#8217; Analysis 
It&#8217;s been seven years since Thane Paulsen and I first coined the term &#8216;deep Web&#8217;, perhaps representing a couple of full generational cycles for the Internet.  What we knew then and what &#8220;Web surfers&#8221; did then has changed markedly.  And, of course, [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=The Murky Depths of the &#8216;Deep Web&#8217;&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.subject=Document Assets&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2007-02-21&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/&amp;rft.language=English"></span>
<p><img hspace="5" align="left" alt="Deep Web" title="Deep Web" src="http://www.mkbergman.com/wp-content/themes/ai3/images/2007Posts/070221a_DeepWeb.png" /><font color="#820000"><strong>It&#8217;s Taken Too Many Years to Re-visit the &#8216;Deep Web&#8217; Analysis </strong></font></p>
<p>It&#8217;s been seven years since <a title="Thane Paulsen" href="http://www.brightplanet.com/company/brightplanet/managers-and-directors.html">Thane Paulsen</a> and I first coined the term &#8216;<a title="Deep Web" href="http://en.wikipedia.org/wiki/Deep_web">deep Web</a>&#8217;, perhaps representing a couple of full generational cycles for the Internet.  What we knew then and what &#8220;Web surfers&#8221; did then have changed markedly.  And, of course, our coining of the term and <a title="BrightPlanet" href="http://www.brightplanet.com">BrightPlanet&#8217;s</a> publishing of the first quantitative study on the deep Web did nothing to create the phenomenon of dynamic content itself &#8212; we merely gave it a name and helped promote a bit of understanding within the general public of some powerful subterranean forces driving the nature and tectonics of the emerging Web.</p>
<p>The first public release of <em><a title="Original Deep Web Paper Release" href="http://web.archive.org/web/20000816013534/128.121.227.57/download/deepwebwhitepaper.pdf">The Deep Web:  Surfacing Hidden Value</a></em> (courtesy of the Internet Archive&#8217;s <a title="Internet Archive's Wayback Machine" href="http://www.archive.org/web/web.php">Wayback Machine</a>), in July 2000, opened with a bold claim:</p>
<blockquote><p><em>BrightPlanet has uncovered the &quot;deep&quot; Web &#8212; a vast reservoir of Internet content that is 500 times larger than the known &quot;surface&quot; World Wide Web.  What makes the discovery of the deep Web so significant is the quality of content found within.  There are literally hundreds of billions of highly valuable documents hidden in searchable databases that cannot be retrieved by conventional search engines.</em></p></blockquote>
<p>The day the study was released we needed to increase our servers nine-fold to meet news demand after <a title="CNN" href="http://www.cnn.com">CNN</a> and then 300 major news outlets eventually picked up the story.  By 2001 when the University of Michigan&#8217;s <em><a title="JEP Deep Web Version" href="http://www.press.umich.edu/jep/07-01/bergman.html">Journal of Electronic Publishing</a></em> and its wonderful editor, <a title="Judith Axler Turner" href="http://www.turner.net/employee-Judith_Axler_Turner.html">Judith A. Turner</a>, decided to give the topic renewed thrust, we were able to clean up the presentation and language quite a bit, but did little to actually update many of the statistics.  (That version, in fact, is the one mostly cited today.)</p>
<p>Over the years there have been some books published and other estimates put forward, more often citing lower amounts in the deep Web than my original estimates, but, with one exception (see below), none of these were backed by new analysis.  I was asked numerous times to update the study, and indeed had even begun collating new analysis at a couple of points, but the effort to complete the work was substantial and the effort always took a back seat to other duties and so was never completed.</p>
<p><strong>Recent Updates and Criticisms</strong></p>
<p>It was thus with some surprise and pleasure that I first found reference yesterday to Dirk Lewandowski and Philipp Mayr&#8217;s 2006 paper, <em><a title="Exploring the Academic Invisible Web" href="http://www.durchdenken.de/lewandowski/doc/LHT_Preprint.pdf">&#8220;Exploring the Academic Invisible Web&#8221;</a></em> [<em>Library Hi Tech</em> <strong>24</strong>(4), 529-539], which takes direct aim at the analysis in my original paper.  (Actually, they worked from the 2001 JEP version, but, as noted, the analysis is virtually identical to the original 2000 version.)  The authors pretty soundly criticize some of the methodology in my original paper and, for the most part, I agree with them.</p>
<p>My original analysis combined a manual evaluation of the &#8220;top 60&#8221; then-extant Web databases with an estimate of the total number of searchable databases (estimated at about 200,000, which they incorrectly cite as 100,000) and assessments of the mean size of each database based on a random sampling of those databases.  Lewandowski and Mayr note conceptual flaws in the analysis at these levels:</p>
<ul>
<li>First, by use of <u><em>mean</em></u> database size rather than <u><em>median</em></u> size, the size is overestimated,</li>
<li>Second, databases whose content is of questionable relevance to their interest in academic content (such as weather records from NOAA or Earth survey data by satellite) skewed my estimates upward, and</li>
<li>Third, my estimates were based on database size estimates (in GBs) and not internal record counts.</li>
</ul>
<p>On the other hand, the authors also argued that my definition of deep content was too narrow and overlooked certain content types, such as PDFs, that are now routinely indexed and retrieved on the surface Web.  We have also had uncertain but tangible growth in standard search engine content, with the last cited amounts at about 20 billion documents since Google and Yahoo! ceased their war of index numbers.</p>
<p>Though not really offering an alternative, full-blown analysis, the authors use the <a title="Gale Directory of Databases" href="http://library.dialog.com/bluesheets/html/bl0230.html">Gale Directory of Databases</a> to derive an alternative estimate of perhaps 20 billion to 100 billion documents on the deep Web of interest for academic purposes, which they later seem to imply also needs to be discounted by further percentages to get at &#8220;word-oriented&#8221; and &#8220;full-text or bibliographic&#8221; records that they deem appropriate.</p>
<p><strong>My Assessment of the Criticisms</strong></p>
<p>As noted, I generally agree with these criticisms.  For example, since the time of original publication, we have seen the power-law nature of most things on the Internet, including popularity and traffic.  Such highly skewed distributions will always result in overestimates when calculations are based on <u><em>means</em></u> rather than <u><em>medians</em></u>.  I also think that meaningful content types were both overused (more database-like records) and underused (PDF content that is now routinely indexed) in my original analysis.</p>
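<p>To see how much the choice of mean versus median can matter, here is a minimal Python sketch.  The ten record counts are made up for illustration and are not data from either study; only the figure of about 200,000 databases comes from the original analysis.</p>

```python
import statistics

# Hypothetical record counts for ten sampled databases:
# most are small, and one very large database dominates.
sizes = [1_000, 1_200, 1_500, 2_000, 2_500,
         3_000, 4_000, 5_000, 8_000, 5_000_000]

total_databases = 200_000  # count of databases cited in the original analysis

mean_size = statistics.mean(sizes)      # pulled far upward by the one outlier
median_size = statistics.median(sizes)  # reflects the typical database

# Scale each "typical size" up to the whole population of databases.
estimate_from_mean = total_databases * mean_size
estimate_from_median = total_databases * median_size

print(f"mean size:   {mean_size:,.0f}")
print(f"median size: {median_size:,.0f}")
print(f"ratio of estimates: {estimate_from_mean / estimate_from_median:.0f}x")
```

<p>With a sample this skewed, the two scaled-up totals differ by more than two orders of magnitude, which is why the choice of statistic drives the headline number.</p>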
<p>However, the authors&#8217; third criticism is patently wrong, since three different methods were used to estimate internal database record counts and the average sizes of each record they contained.  I would also have preferred a more careful reading by the authors of my actual paper, since there are numerous other citations in error and mis-characterizations.</p>
<p>On an epistemological level, I disagree with the authors&#8217; use of the term &#8220;invisible Web&#8221;, a label that we tried hard in the paper to overturn and that is fading as a current term of art. <a title="Internet Tutorials on the Deep Web" href="http://www.internettutorials.net/deepweb.html">Internet Tutorials</a> (initially, <a title="SUNY at Albany Libary" href="http://library.albany.edu/internet/deepweb.html">SUNY at Albany Library</a>) addresses this topic head-on, preferring &#8220;deep Web&#8221; on a number of compelling grounds, including that <em>&#8220;there is no such thing as recorded information that is invisible. Some information may be more of a challenge to find than others, but this is not the same as invisibility.&#8221;</em></p>
<p>Finally, I am not persuaded by the authors&#8217; simplistic, alternative partial estimate based solely on the Gale database, but they readily acknowledge not doing a full-blown analysis and having different objectives in mind.  I agree with the authors in calling for a full, alternative analysis.  I think we all agree that is a non-trivial undertaking and could itself be subject to newer methodological pitfalls.</p>
<p><strong>So, What is the Quantitative Update?</strong></p>
<p>Within a couple of years after the initial publication of my paper, I suspected the &#8220;500 times&#8221; claim for the greater size of the deep Web in comparison to what is discoverable by search engines may have been too high.  Indeed, in later corporate literature and PowerPoint presentations, I backed off the initial 2000-2001 claims and began speaking in ranges from a &#8220;few times&#8221; to as high as &#8220;100 times&#8221; greater for the size of the deep Web.</p>
<p>In the last seven years, the only other <a title="'Deep Web' on Google Scholar" href="http://scholar.google.com/scholar?q=%22deep+web%22&#038;hl=en&#038;lr=&#038;btnG=Search">quantitative study of its kind</a> of which I am aware is documented in the paper, <a title="Chang et al., Structured Databases on the WEb" href="http://eagle.cs.uiuc.edu/pubs/2004/dwsurvey-sigmodrecord-chlpz-aug04.pdf"><em>&#8220;Structured Databases on the Web:  Observations and Implications,&#8221;</em></a> conducted by Chang <em>et al.</em> in April 2004 and published in ACM SIGMOD Record, which estimated 330,000 deep Web sources with over 1.2 million query forms, reflecting a fast 3-7 times increase in 4 years from the date of my original paper.  Unlike the Lewandowski and Mayr partial analysis, this effort and others by that group suggest an even larger deep Web than my initial estimates!</p>
<p>The truth is, we didn&#8217;t know then &#8212; and we don&#8217;t know now &#8212; what the actual size of the dynamic Web truly is.  (And, aside from a sound bite, does it really matter?  It is huge by any measure.)  Heroic efforts such as these quantitative analyses or the still-more ambitious efforts of UC Berkeley&#8217;s SIM School on <a title="How Much Information?  2003" href="http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm"><em>How Much Information?</em></a> still have a role in helping to bound our understanding of information overload.  As long as such studies gain news traction, they will be pursued.  So, what might today&#8217;s story look like?</p>
<p>First, the methodological problems in my original analysis remain and (I believe today) resulted in overestimates.  Another factor today leading to a potential overestimate of the deep Web <em>v.</em> the surface Web would be the fact that much &#8220;deep&#8221; content is increasingly exposed to standard search engines, be it through Google Scholar, Yahoo!&#8217;s library relationships, individual site indexing and sharing such as through search appliances, and other &#8220;gray&#8221; factors we noted in our 2000-2001 studies.  These factors, and certainly more, act to narrow the difference between exposed search engine content (&#8220;surface Web&#8221;) and what we have termed the &#8220;deep Web.&#8221;</p>
<p>However, countering these factors are two newer trends.  First, foreign language content is growing at much higher rates and is often under-sampled.  Second, blogs and other democratized sources of content are exploding.  What these trends may be doing to content balances is, frankly, anyone&#8217;s guess.</p>
<p>So, while awareness of the qualitative nature of Web content has grown tremendously in the past near-decade, our quantitative understanding remains weak.  Improvements in technology and harvesting can now overcome earlier limits.</p>
<p>Perhaps there is another Ph.D. candidate or three out there that may want to tackle this question in a better (and more definitive) way.  According to Chang and Cho in their paper, <a title="Chang and Cho Paper" href="http://eagle.cs.uiuc.edu/pubs/2006/webaccesstutorial-sigmod06-cc-mar06.pdf"><em>&#8220;Accessing the Web: From Search to Integration,&#8221;</em></a> presented at the 2006 ACM SIGMOD International Conference on Management of Data in Chicago:</p>
<blockquote><p><em>On the other hand, for the deep Web, while the proliferation of structured sources has promised unlimited possibilities for more precise and aggregated access, it has also presented new challenges for realizing large scale and dynamic information integration. These issues are in essence related to data management, in a large scale, and thus present novel problems and interesting opportunities for our research community. </em></p></blockquote>
<p>Who knows?  For the right researcher with the right methodology, there may be a <a title="Science Magazine" href="http://www.sciencemag.org/"><em>Science</em></a> or <a title="Nature Journal" href="http://www.nature.com/nature/index.html"><em>Nature</em></a> paper in prospect!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Tutorial:  Internet Languages, Character Sets and Encodings</title>
		<link>http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/</link>
		<comments>http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/#comments</comments>
		<pubDate>Thu, 23 Mar 2006 15:42:29 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Adaptive Information]]></category>
		<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Information Automation]]></category>
		<category><![CDATA[OSINT (open source intel)]]></category>
		<category><![CDATA[Searching]]></category>
		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=195</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Tutorial:  Internet Languages, Character Sets and Encodings&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Information Automation&amp;rft.subject=OSINT (open source intel)&amp;rft.subject=Searching&amp;rft.subject=Semantic Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2006-03-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/&amp;rft.language=English"></span>
Author&#8217;s Note: This is an online version of a paper that Mike Bergman recently released under the auspices of BrightPlanet Corp. The citation for this effort is:
M.K. Bergman, &#8220;Tutorial:  Internet Languages, Character Sets and Encodings,&#8221; BrightPlanet Corporation Technical Documentation, March 2006, 13 pp.
 Click here to obtain a PDF copy of this posting (13 [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Tutorial:  Internet Languages, Character Sets and Encodings&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Information Automation&amp;rft.subject=OSINT (open source intel)&amp;rft.subject=Searching&amp;rft.subject=Semantic Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2006-03-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/&amp;rft.language=English"></span>
<p><strong><em>Author&#8217;s Note:</em></strong> This is an online version of a paper that Mike Bergman recently released under the auspices of <a title="BrightPlanet Corporation" href="http://www.brightplanet.com">BrightPlanet Corp</a>. The citation for this effort is:</p>
<p style="margin-left: 40px;"><em>M.K. Bergman, &#8220;Tutorial:  Internet Languages, Character Sets and Encodings,&#8221; BrightPlanet Corporation Technical Documentation, March 2006, 13 pp.</em></p>
<p><em><a href="wp-content/themes/ai3/files/2006Posts/InternationalizationTutorial060323.pdf"><img style="border: 0px solid ;" src="wp-content/themes/ai3/images/pdfdoc.gif" alt="Download PDF file" /></a> <a href="wp-content/themes/ai3/files/2006Posts/InternationalizationTutorial060323.pdf">Click here</a> to obtain a PDF copy of this posting (13 pp, 79 K)</em></p>
<p>Broad-scale, international open source harvesting from the Internet poses many challenges in use and translation of legacy encodings that have vexed academics and researchers for many years. Successfully addressing these challenges will only grow in importance as the relative percentage of international sites grows in relation to conventional English ones.</p>
<p>A major challenge in internationalization and foreign source support is &#8220;encoding.&#8221; Encodings specify the arbitrary assignment of numbers to the symbols (characters or ideograms) of the world&#8217;s written languages needed for electronic transfer and manipulation. One of the first encodings developed in the 1960s was ASCII (numerals, plus a-z; A-Z); others developed over time to deal with other unique characters and the many symbols of (particularly) the Asiatic languages.</p>
<p>Some languages have many character encodings, and some languages, for example Chinese and Japanese, require very complex encoding systems to handle their large number of unique characters. Two different encodings can be incompatible by assigning the same number to two distinct symbols, or vice versa. So-called Unicode set out to consolidate many different encodings, all using separate code plans, into a single system that could represent all written languages within the same character encoding. There are a few Unicode techniques and formats, the most common being UTF-8.</p>
<p>The Internet was originally developed via efforts in the United States funded by ARPA (later <a title="Freesoft's Internet History" href="http://www.freesoft.org/CIE/Topics/57.htm">DARPA</a>) and <a title="Freesoft's Internet History" href="http://www.freesoft.org/CIE/Topics/57.htm">NSF</a>, extending back to the 1960s. At the time of its commercial adoption in the early 1990s via the World Wide Web protocols, it was almost entirely dominated by English by virtue of this U.S. heritage and the emergence of English as the <em>lingua franca </em>of the technical and research community.</p>
<p>However, with the maturation of the Internet as a global information repository and means for instantaneous e-commerce, today&#8217;s online community now approaches 1 billion users from all existing countries. The Internet has become increasingly multi-lingual.</p>
<p>Efficient and automated means to discover, search, query, retrieve and harvest content from across the Internet thus require an understanding of the source human languages in use and the means to encode them for electronic transfer and manipulation. This Tutorial provides a brief introduction to these topics.</p>
<p><strong>Internet Language Use</strong></p>
<p>Yoshiki Mikami, who runs the UN&#8217;s Language Observatory, has an interesting way to summarize the languages of the world. His updated figures, plus some other BrightPlanet statistics are:<a name="_ednref1" href="#_edn1">[1]</a></p>
<table style="width: 604px;" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="background-color: #cccccc;" width="178" valign="bottom">
<p align="center"><strong>Category</strong></p>
</td>
<td style="background-color: #cccccc;" width="65" valign="bottom">
<p align="center"><strong>Number</strong></p>
</td>
<td style="background-color: #cccccc;" width="362" valign="bottom">
<p align="center"><strong>Source or Notes</strong></p>
</td>
</tr>
<tr>
<td width="178" valign="top">Active Human Languages</td>
<td width="65" valign="top">
<p align="right">6,912</p>
</td>
<td width="362" valign="top">from www.ethnologue.com</td>
</tr>
<tr>
<td width="178" valign="top">Language Identifiers</td>
<td width="65" valign="top">
<p align="right">440</p>
</td>
<td width="362" valign="top">based on ISO 639</td>
</tr>
<tr>
<td width="178" valign="top">Human Rights Translation</td>
<td width="65" valign="top">
<p align="right">327</p>
</td>
<td width="362" valign="top">UN&#8217;s Universal Declaration of Human Rights (UDHR)</td>
</tr>
<tr>
<td width="178" valign="top">Unicode Languages</td>
<td width="65" valign="top">
<p align="right">244</p>
</td>
<td width="362" valign="top">see text</td>
</tr>
<tr>
<td width="178" valign="top">DQM Languages</td>
<td width="65" valign="top">
<p align="right">140</p>
</td>
<td width="362" valign="top">estimate based on prevalence, BT input</td>
</tr>
<tr>
<td width="178" valign="top">Windows XP Languages</td>
<td width="65" valign="top">
<p align="right">123</p>
</td>
<td width="362" valign="top">from Microsoft</td>
</tr>
<tr>
<td width="178" valign="top">Basis Tech Languages</td>
<td width="65" valign="top">
<p align="right">40</p>
</td>
<td width="362" valign="top">based on Basis Tech&#8217;s Rosette Language Identifier (RLI)</td>
</tr>
<tr>
<td width="178" valign="top">Google Search Languages</td>
<td width="65" valign="top">
<p align="right">35</p>
</td>
<td width="362" valign="top">from Google</td>
</tr>
</tbody>
</table>
<p>There are nearly 7,000 living languages spoken today, though most have few speakers and many are becoming extinct. About 347 (or approximately 5%) of the world&#8217;s languages have at least one million speakers and account for 94% of the world&#8217;s population. Of these, 83 languages account for 80% of the world&#8217;s population, and just 8 languages with more than 100 million speakers each account for about 40% of total population. By contrast, the remaining 95% of languages are spoken by only 6% of the world&#8217;s people.<a name="_ednref2" href="#_edn2">[2]</a></p>
<p>This prevalence is shown by the fact that the UN&#8217;s Universal Declaration of Human Rights (UDHR) has only been translated into those languages generally with 1 million or more speakers.</p>
<p>The remaining items on the table above enumerate languages that can be represented electronically, or are &#8220;encoded.&#8221; More on this topic is provided below.</p>
<p>Of course, native language does not necessarily equate to Internet use, with English predominating because of multi-lingualism, plus the fact that richer countries or users within countries exhibit greater Internet access and use.</p>
<p>The most recent comprehensive figures for Internet language use and prevalence are from the Global Reach Web site for late 2004. For ease of reading, percentage figures are shown only for those languages with a value greater than 1.0%:<a name="_ednref3" href="#_edn3">[3]</a> <a name="_ednref4" href="#_edn4">[4]</a></p>
<table style="width: 604px;" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="background-color: #cccccc;" width="173" valign="bottom"><strong> </strong></td>
<td style="background-color: #cccccc;" width="100" valign="bottom">
<p align="center"><strong>Percent of</strong></p>
</td>
<td style="background-color: #cccccc;" colspan="2" width="165" valign="bottom">
<p align="center"><strong>2003 Internet Users</strong></p>
</td>
<td style="background-color: #cccccc;" colspan="2" width="165" valign="bottom">
<p align="center"><strong>Global Population</strong></p>
</td>
</tr>
<tr>
<td style="background-color: #cccccc;" width="173" valign="bottom"><strong> </strong></td>
<td style="background-color: #cccccc;" width="100" valign="bottom">
<p align="center"><strong>Web Pages</strong></p>
</td>
<td style="background-color: #cccccc;" width="83" valign="bottom">
<p align="center"><strong>Millions</strong></p>
</td>
<td style="background-color: #cccccc;" width="82" valign="bottom">
<p align="center"><strong>Percent</strong></p>
</td>
<td style="background-color: #cccccc;" width="83" valign="bottom">
<p align="center"><strong>Millions</strong></p>
</td>
<td style="background-color: #cccccc;" width="82" valign="bottom">
<p align="center"><strong>Percent</strong></p>
</td>
</tr>
<tr>
<td width="173" valign="bottom"></td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom"><strong>ENGLISH</strong></td>
<td width="100" valign="bottom">
<p align="right"><strong>68.4%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>287.5 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>35.6%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>508 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>8.0%</strong></p>
</td>
</tr>
<tr>
<td width="173" valign="bottom"><strong>NON-ENGLISH</strong></td>
<td width="100" valign="bottom">
<p align="right"><strong>31.6%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>519.6 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>64.4%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>5,822 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>92.0%</strong></p>
</td>
</tr>
<tr>
<td width="173" valign="bottom"></td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">EUROPEAN (non-English)</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Catalan</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2.9</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">7</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Czech</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">4.2</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">12</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Dutch</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">13.5</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.7%</p>
</td>
<td width="83" valign="bottom">
<p align="right">20</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Finnish</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2.8</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">6</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">French</td>
<td width="100" valign="bottom">
<p align="right">3.0%</p>
</td>
<td width="83" valign="bottom">
<p align="right">28.0</p>
</td>
<td width="82" valign="bottom">
<p align="right">3.5%</p>
</td>
<td width="83" valign="bottom">
<p align="right">77</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.2%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">German</td>
<td width="100" valign="bottom">
<p align="right">5.8%</p>
</td>
<td width="83" valign="bottom">
<p align="right">52.9</p>
</td>
<td width="82" valign="bottom">
<p align="right">6.6%</p>
</td>
<td width="83" valign="bottom">
<p align="right">100</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.6%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Greek</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2.7</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">12</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Hungarian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">1.7</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">10</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Italian</td>
<td width="100" valign="bottom">
<p align="right">1.6%</p>
</td>
<td width="83" valign="bottom">
<p align="right">24.3</p>
</td>
<td width="82" valign="bottom">
<p align="right">3.0%</p>
</td>
<td width="83" valign="bottom">
<p align="right">62</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.0%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Polish</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">9.5</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.2%</p>
</td>
<td width="83" valign="bottom">
<p align="right">44</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Portuguese</td>
<td width="100" valign="bottom">
<p align="right">1.4%</p>
</td>
<td width="83" valign="bottom">
<p align="right">25.7</p>
</td>
<td width="82" valign="bottom">
<p align="right">3.2%</p>
</td>
<td width="83" valign="bottom">
<p align="right">176</p>
</td>
<td width="82" valign="bottom">
<p align="right">2.8%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Romanian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2.4</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">26</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Russian</td>
<td width="100" valign="bottom">
<p align="right">1.9%</p>
</td>
<td width="83" valign="bottom">
<p align="right">18.5</p>
</td>
<td width="82" valign="bottom">
<p align="right">2.3%</p>
</td>
<td width="83" valign="bottom">
<p align="right">167</p>
</td>
<td width="82" valign="bottom">
<p align="right">2.6%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Scandinavian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">14.6</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.8%</p>
</td>
<td width="83" valign="bottom">
<p align="right">20</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Danish</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">3.5</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">5</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Icelandic</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">0.2</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">0</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Norwegian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2.9</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">5</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Swedish</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">7.9</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.0%</p>
</td>
<td width="83" valign="bottom">
<p align="right">9</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Serbo-Croatian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">1.0</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">20</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Slovak</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">1.2</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">6</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Slovenian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">0.8</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Spanish</td>
<td width="100" valign="bottom">
<p align="right">2.4%</p>
</td>
<td width="83" valign="bottom">
<p align="right">65.6</p>
</td>
<td width="82" valign="bottom">
<p align="right">8.1%</p>
</td>
<td width="83" valign="bottom">
<p align="right">350</p>
</td>
<td width="82" valign="bottom">
<p align="right">5.5%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Turkish</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">5.8</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">67</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.1%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Ukrainian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">0.9</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">47</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom"><strong>SUB-TOTAL</strong></td>
<td width="100" valign="bottom">
<p align="right"><strong>18.7%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>279.0 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>34.6%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>1,230 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>19.4%</strong></p>
</td>
</tr>
<tr>
<td width="173" valign="bottom"></td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">ASIAN LANGUAGES</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Arabic</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">10.5</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.3%</p>
</td>
<td width="83" valign="bottom">
<p align="right">300</p>
</td>
<td width="82" valign="bottom">
<p align="right">4.7%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Chinese</td>
<td width="100" valign="bottom">
<p align="right">3.9%</p>
</td>
<td width="83" valign="bottom">
<p align="right">102.6</p>
</td>
<td width="82" valign="bottom">
<p align="right">12.7%</p>
</td>
<td width="83" valign="bottom">
<p align="right">874</p>
</td>
<td width="82" valign="bottom">
<p align="right">13.8%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Farsi</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">3.4</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">64</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.0%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Hebrew</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">3.8</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">5</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Japanese</td>
<td width="100" valign="bottom">
<p align="right">5.9%</p>
</td>
<td width="83" valign="bottom">
<p align="right">69.7</p>
</td>
<td width="82" valign="bottom">
<p align="right">8.6%</p>
</td>
<td width="83" valign="bottom">
<p align="right">125</p>
</td>
<td width="82" valign="bottom">
<p align="right">2.0%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Korean</td>
<td width="100" valign="bottom">
<p align="right">1.3%</p>
</td>
<td width="83" valign="bottom">
<p align="right">29.9</p>
</td>
<td width="82" valign="bottom">
<p align="right">3.7%</p>
</td>
<td width="83" valign="bottom">
<p align="right">78</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.2%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Malay</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">13.6</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.7%</p>
</td>
<td width="83" valign="bottom">
<p align="right">229</p>
</td>
<td width="82" valign="bottom">
<p align="right">3.6%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Thai</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">4.9</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">46</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Vietnamese</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2.2</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">68</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.1%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom"><strong>SUB-TOTAL</strong></td>
<td width="100" valign="bottom">
<p align="right"><strong>12.9%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>240.6 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>29.8%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>1,789 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>28.3%</strong></p>
</td>
</tr>
<tr>
<td width="173" valign="bottom"></td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom"><strong>TOTAL WORLD </strong></td>
<td width="100" valign="bottom">
<p align="right"><strong>100.0%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>807.1 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>100.0%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>6,330 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>100.0%</strong></p>
</td>
</tr>
</tbody>
</table>
<p>English speakers are represented among Internet users at more than four times the rate that sheer population would suggest, and English Web pages at more than eight times that rate. However, various census efforts over time have shown a steady decrease in this English prevalence (data not shown).</p>
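<p>The over-representation of English can be checked directly from the ENGLISH row of the table above. The following is only a small arithmetic sketch; the dictionary keys and the helper name <code>over_representation</code> are illustrative choices, not part of the original data source.</p>

```python
# Percent shares copied from the ENGLISH row of the table above.
english = {"web_pages": 68.4, "internet_users": 35.6, "population": 8.0}


def over_representation(share, population_share):
    """Return how many times larger a share is than the population share."""
    return share / population_share


users_ratio = over_representation(english["internet_users"], english["population"])
pages_ratio = over_representation(english["web_pages"], english["population"])

print(f"Internet users: {users_ratio:.2f}x population share")
print(f"Web pages:      {pages_ratio:.2f}x population share")
```

<p>The same calculation applied to any other row gives a quick sense of whether a language is over- or under-represented online relative to its speakers.</p>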
<p>Virtually all European languages show higher Internet prevalence than actual population would suggest; Asian languages show the opposite. (African languages are even less represented than population would suggest; data not shown.)</p>
<p>Internet penetration appears to be about 20% of global population and growing rapidly. It is likely that the percentages of Web users and of Web pages by language will continue to converge toward real population percentages. Thus, over time and likely within the foreseeable future, users and pages should more closely approximate the percentage figures shown in the rightmost column in the table above.</p>
<p><strong>Script Families</strong></p>
<p>Another useful starting point for understanding languages and their relation to the Internet is a 2005 UN publication from a World Summit on the Information Society. This 113 pp. report can be found at <a href="http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf">http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf</a>.<a name="_ednref5" href="#_edn5">[5]</a></p>
<p>Languages have both a <em>representational form</em> and <em>meaning</em>. The representational form is captured by scripts, fonts or ideograms. The meaning is captured by semantics. In an electronic medium, it is the representational form that must be transmitted accurately. Without accurate transmittal of the form, it is impossible to manipulate that language or understand its meaning.</p>
<p>Representational forms fit within what might be termed <em>script families</em>. Script families are not strictly alphabets or even exact character or symbol matches. They represent similar written approaches and some shared characteristics.</p>
<p>For example, English and its Germanic and Romance language cousins share very similar, but not identical, alphabets. Similarly, the so-called CJK languages (Chinese, Japanese, Korean) share a similar approach to using ideograms without white space between tokens or punctuation.</p>
<p>At the highest level, the world&#8217;s languages may be clustered into the following script families:<a name="_Ref129331680"></a><a name="_ednref6" href="#_edn6">[6]</a></p>
<table style="width: 635px;" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="background-color: #cccccc;" width="99" valign="bottom">
<p align="center"><strong>Script</strong></p>
</td>
<td style="background-color: #cccccc;" width="89" valign="bottom">
<p align="center"><strong>Latin</strong></p>
</td>
<td style="background-color: #cccccc;" width="89" valign="bottom">
<p align="center"><strong>Cyrillic</strong></p>
</td>
<td style="background-color: #cccccc;" width="89" valign="bottom">
<p align="center"><strong>Arabic</strong></p>
</td>
<td style="background-color: #cccccc;" width="89" valign="bottom">
<p align="center"><strong>Hanzi</strong></p>
</td>
<td style="background-color: #cccccc;" width="89" valign="bottom">
<p align="center"><strong>Indic</strong></p>
</td>
<td style="background-color: #cccccc;" width="89" valign="bottom">
<p align="center"><strong>Others*</strong></p>
</td>
</tr>
<tr>
<td width="99" valign="bottom">Million users</td>
<td width="89" valign="bottom">
<p align="right">2,238</p>
</td>
<td width="89" valign="bottom">
<p align="right">451</p>
</td>
<td width="89" valign="bottom">
<p align="right">462</p>
</td>
<td width="89" valign="bottom">
<p align="right">1,085</p>
</td>
<td width="89" valign="bottom">
<p align="right">807</p>
</td>
<td width="89" valign="bottom">
<p align="right">129</p>
</td>
</tr>
<tr>
<td width="99" valign="bottom">% of Total</td>
<td width="89" valign="bottom">
<p align="right">43.3%</p>
</td>
<td width="89" valign="bottom">
<p align="right">8.7%</p>
</td>
<td width="89" valign="bottom">
<p align="right">8.9%</p>
</td>
<td width="89" valign="bottom">
<p align="right">21.0%</p>
</td>
<td width="89" valign="bottom">
<p align="right">15.6%</p>
</td>
<td width="89" valign="bottom">
<p align="right">2.5%</p>
</td>
</tr>
<tr>
<td width="99" valign="top">Key languages</td>
<td width="89" valign="top">Romance (European) Slavic (some) Vietnamese Malay Indonesian</td>
<td width="89" valign="top">Russian Slavic (some) Kazakh Uzbek</td>
<td width="89" valign="top">Arabic Urdu Persian Pashtu</td>
<td width="89" valign="top">Chinese Japanese Korean</td>
<td width="89" valign="top">Hindi Tamil Bengali Punjabi Sanskrit Thai</td>
<td width="89" valign="top">Greek Hebrew Georgian Assyrian Armenian</td>
</tr>
</tbody>
</table>
<p>Note that English and the Romance languages fall within the Latin script family, the CJK within Hanzi. The &#8220;Others&#8221; category is a large catch-all, including Greek, Hebrew, many African languages, and others. However, besides Greek and Hebrew, most specific languages of global importance are included in the other named families. Also note that, due to differences in sources, total user counts do not equal those in the earlier tables.</p>
<p><strong>Character Sets and Encodings</strong></p>
<p>In order to take advantage of the computer&#8217;s ability to manipulate text (<em>e.g.</em>, displaying, editing, sorting, searching and efficiently transmitting it), communications in a given language need to be represented in some kind of encoding. Encodings specify the arbitrary assignment of numbers to the symbols of the world&#8217;s written languages. Two different encodings can be incompatible by assigning the same number to two distinct symbols, or vice versa. Thus, much of what the Internet offers with respect to linguistic diversity comes down to the encodings available for text.</p>
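<p>The incompatibility described above is easy to demonstrate. The sketch below uses three legacy single-byte encodings that ship with Python (chosen here purely for illustration) to show one number mapping to three different symbols, and then one symbol mapping to two different numbers.</p>

```python
# Same number, different symbols: byte 0xE9 under three legacy encodings.
raw = bytes([0xE9])
decoded = {name: raw.decode(name) for name in ("latin-1", "cp1251", "iso8859-7")}

for name, char in decoded.items():
    print(f"{name:10} 0xE9 -> U+{ord(char):04X} {char}")

# Same symbol, different numbers: e with acute accent under two encodings.
e_acute = "\u00e9"
encoded = {name: e_acute.encode(name) for name in ("latin-1", "cp437")}

for name, data in encoded.items():
    print(f"{name:10} U+00E9 -> 0x{data.hex().upper()}")
```

<p>A program that assumes the wrong encoding does not fail; it silently produces the wrong characters, which is why correct identification matters so much for harvesting.</p>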
<p>The most widely used encoding is the American Standard Code for Information Interchange (ASCII), a code devised during the 1950s and 1960s under the auspices of the American National Standards Institute (ANSI) to standardize teletype technology. This encoding comprises 128 character assignments (7-bit) and is suitable primarily for North American English.<a name="_ednref6" href="#_edn6">[6]</a></p>
<p>Historically, languages that did not fit in the ASCII 7-bit character set (a-z; A-Z) largely created their own character sets, sometimes with local standards acceptance and sometimes not. Some languages have many character encodings, and some languages, particularly Chinese and Japanese, require very complex systems for handling their large number of unique characters. Another difficult group is Hindi and the Indic language family, with speakers that number in the hundreds of millions. According to one University of Southern California researcher, almost every Hindi-language Web site has its own encoding.<a name="_ednref7" href="#_edn7">[7]</a></p>
<p>The Internet Assigned Numbers Authority (IANA) maintains a master list of about 245 standard charset (&#8220;character set&#8221;) encodings and 550 associated aliases for them, used in one manner or another on the Internet.<a name="_ednref8" href="#_edn8">[8]</a> <a name="_ednref9" href="#_edn9">[9]</a> Some of these electronic encodings were created by large vendors with a stake in electronic transfer such as IBM, Microsoft, Apple and the like. Other standards result from recognized standards organizations such as ANSI, ISO, Unicode and the like. Many of these standards date back as far as the 1960s; many others are specific to certain countries.</p>
<p>Earlier estimates showed a range of 40 to 250 languages per named encoding type. While no known estimate exists, if one assumes 100 languages for each of the roughly 245 IANA-listed encodings, there could be on the order of 25,000 or so specific language-encoding combinations possible on the Internet based on these &#8220;standards.&#8221; There are perhaps thousands of other language-specific encodings also extant.</p>
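<p>The relationship between a charset and its aliases can be illustrated with the codec registry built into Python. This registry is used here only as a convenient stand-in; it is not the IANA list itself, and its canonical names do not always match IANA&#8217;s preferred names.</p>

```python
import codecs

# Several spellings that Python's registry treats as the same charset.
aliases = ["latin1", "iso-8859-1", "l1", "cp819"]

# codecs.lookup() resolves an alias to a CodecInfo whose .name is canonical.
canonical = {alias: codecs.lookup(alias).name for alias in aliases}

for alias, name in canonical.items():
    print(f"{alias:12} -> {name}")
```

<p>Normalizing every declared charset label to one canonical name in this way is a sensible first step before any per-encoding processing.</p>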
<p>Whatever the numbers, clearly it is critical to identify accurately the specific encoding and its associated language for any given Web page or database site. Without this accuracy, it is impossible to electronically query and understand the content.</p>
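<p>As a rough illustration of what encoding identification involves, the sketch below first honors a Unicode byte-order mark and otherwise returns the first candidate encoding that decodes without error. The function name <code>guess_encoding</code> and its candidate list are hypothetical; production detectors also consult HTTP headers and page metadata and use statistical models, none of which is attempted here.</p>

```python
import codecs


def guess_encoding(data, fallbacks=("utf-8", "shift_jis", "cp1251")):
    """Return a plausible encoding name for a byte string, or None.

    Hypothetical helper: a deliberately naive strategy for illustration.
    """
    # UTF-32 marks are tested before UTF-16 because the UTF-32 little-endian
    # mark begins with the same two bytes as the UTF-16 little-endian mark.
    boms = (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, name in boms:
        if data.startswith(bom):
            return name
    # No mark: accept the first candidate that decodes cleanly.
    for name in fallbacks:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


japanese = "\u65e5\u672c\u8a9e".encode("shift_jis")
print(guess_encoding(japanese))
```

<p>Because many single-byte encodings accept any byte sequence, the order of the candidates matters a great deal, which is one reason trial decoding alone is not sufficient in practice.</p>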
<p>As might be suspected, this topic too is very broad. For a very comprehensive starting point on all topics related to encodings and character sets, please see <strong>I18N</strong> (which stands for &#8220;internationalization&#8221;) <strong>Guy&#8217;s</strong> Web site at <a href="http://www.i18nguy.com/unicode/codepages.html">http://www.i18nguy.com/unicode/codepages.html</a>.</p>
<p><strong>Unicode</strong></p>
<p>In the late 1980s, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode Project organized by a consortium of (initially mostly US) manufacturers of multi-lingual software. Fortunately, the participants of both projects realized in 1991 that two different unified character sets did not make sense and they joined efforts to create a single code table, now referred to as Unicode. While both projects still exist and publish their respective standards independently, the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and closely coordinated.</p>
<p>Unicode sets out to consolidate many different encodings, all using separate code plans, into a single system that can represent all written languages within the same character encoding. Unicode is first a set of code tables that assign an integer number, called a code point, to each character. Unicode then has several methods for representing a sequence of such characters, or their respective integer values, as a sequence of bytes; the names of these methods are generally prefixed by &#8220;UTF.&#8221;</p>
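<p>The two layers described above, code points first and byte sequences second, can be seen in a few lines. This is a minimal sketch using Python&#8217;s built-in string handling; the three sample characters are arbitrary choices.</p>

```python
# LATIN CAPITAL A, e with acute accent, and a CJK ideograph.
text = "A\u00e9\u4e2d"

# Layer 1: the code table assigns each character an integer code point.
code_points = [ord(ch) for ch in text]

# Layer 2: a UTF method serializes those integers as bytes.
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

print([f"U+{cp:04X}" for cp in code_points])
for name, data in encoded.items():
    print(f"{name:10} {len(data):2} bytes  {data.hex()}")
```

<p>The code points are identical in every case; only the byte-level representation differs from one UTF method to the next.</p>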
<p>In UTF-8, the most common method, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3 or 4 bytes (the original design allowed sequences of up to 6 bytes, but the current standard limits them to 4). This method has the advantage that English text looks exactly the same in UTF-8 as it did in ASCII, so ASCII is a conforming subset. More unusual characters such as accented letters, Greek letters or CJK ideograms may need several bytes to store a single code point.</p>
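<p>The variable-length behavior described above can be checked directly. The following is a minimal Python sketch, not part of any BrightPlanet tooling; the helper name <code>utf8_length</code> and the four sample characters are illustrative assumptions.</p>
<pre>
```python
# Sample characters: ASCII letter, accented Latin letter, Greek letter, CJK ideogram.
samples = ["A", "\u00e9", "\u03a9", "\u4e2d"]


def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to store one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    # ord() gives the code point; the byte count grows with its value.
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s) in UTF-8")

# ASCII text is byte-for-byte identical in ASCII and in UTF-8.
text = "plain English text"
print(text.encode("ascii") == text.encode("utf-8"))
```
</pre>
<p>The ASCII letter needs one byte, the accented and Greek letters two, and the CJK ideogram three, while the final comparison confirms that plain English text is unchanged.</p>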
<p>The traditional store-it-in-two-bytes method for Unicode is called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits); strictly, UCS-2 is limited to code points that fit in two bytes, while UTF-16 reaches the rest by using pairs of 16-bit units. There&#8217;s something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero. There&#8217;s UCS-4, which stores each code point in 4 bytes and so has the nice property that every single code point can be stored in the same number of bytes; UTF-32 is the equivalent 32-bit form, at the cost of more storage. Regardless, UTF-7, -8, -16, and -32 all have the property of being able to store any code point correctly.</p>
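<p>To make the storage trade-offs among these methods concrete, here is a small Python sketch that encodes the same character several ways. The function name <code>encoded_sizes</code> is a hypothetical helper, and the big-endian codec variants are chosen only so that no byte-order mark inflates the counts.</p>
<pre>
```python
def encoded_sizes(text):
    """Map each Unicode encoding form to the byte count it needs for text."""
    forms = ["utf-7", "utf-8", "utf-16-be", "utf-32-be"]
    return {form: len(text.encode(form)) for form in forms}


ascii_letter = "A"
cjk = "\u4e2d"

print(encoded_sizes(ascii_letter))
print(encoded_sizes(cjk))

# UTF-7 output never sets the high bit, so the bytes are pure 7-bit ASCII.
print(cjk.encode("utf-7").isascii())

# Every form round-trips the same code point without loss.
for form in ["utf-7", "utf-8", "utf-16-be", "utf-32-be"]:
    assert cjk.encode(form).decode(form) == cjk
```
</pre>
<p>UTF-32 always spends four bytes per code point, which is the price of its fixed width; UTF-8 spends one byte on the ASCII letter and three on the ideogram.</p>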
<p>BrightPlanet, along with many others, has adopted UTF-8 as the standard Unicode method to process all string data. There are tools available to convert nearly any existing character encoding into a UTF-8 encoded string. Java supplies these tools, as does <a title="Basis Technology Corporation" href="http://www.basistech.com/">Basis Technology</a>, one of BrightPlanet&#8217;s partners in language processing.</p>
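<p>As an illustration of such a conversion, the sketch below uses Python&#8217;s built-in codecs as a stand-in for the Java and Basis Technology tools mentioned above; it is not BrightPlanet&#8217;s actual pipeline. The helper name <code>to_utf8</code> is an assumption, and the source encoding is taken as already known.</p>
<pre>
```python
def to_utf8(raw, source_encoding):
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# A German word as it would be stored on a Windows-1252 page (one byte per letter).
legacy = "Gr\u00fc\u00dfe".encode("cp1252")
converted = to_utf8(legacy, "cp1252")

# The two non-ASCII letters grow from one byte each to two bytes each.
print(len(legacy), len(converted))
print(converted.decode("utf-8"))
```
</pre>
<p>The same two-step pattern, decode from the legacy encoding and then encode as UTF-8, applies to multi-byte encodings such as Shift-JIS.</p>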
<p>As presently defined, Unicode supports about 245 common languages according to a variety of scripts (see notes at end of the table):<a name="_ednref10" href="#_edn10">[10]</a></p>
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="background-color: #cccccc;" valign="bottom">
<p align="center"><strong>Language </strong></p>
</td>
<td style="background-color: #cccccc;" valign="bottom">
<p align="center"><strong>Script(s) </strong></p>
</td>
<td style="background-color: #cccccc;" valign="bottom">
<p align="center"><strong>Some Country Notes</strong></p>
</td>
</tr>
<tr>
<td valign="bottom">Abaza</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Abkhaz</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Adygei</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Afrikaans</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Ainu</td>
<td valign="bottom">Katakana, Latin</td>
<td valign="bottom">Japan</td>
</tr>
<tr>
<td valign="bottom">Aisor</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Albanian</td>
<td valign="bottom">Latin [2]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Altai</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Amharic</td>
<td valign="bottom">Ethiopic</td>
<td valign="bottom">Ethiopia</td>
</tr>
<tr>
<td valign="bottom">Amo</td>
<td valign="bottom">Latin</td>
<td valign="bottom">Nigeria</td>
</tr>
<tr>
<td valign="bottom">Arabic</td>
<td valign="bottom">Arabic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Armenian</td>
<td valign="bottom">Armenian, Syriac [3]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Assamese</td>
<td valign="bottom">Bengali</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Assyrian (modern)</td>
<td valign="bottom">Syriac</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Avar</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Awadhi</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India, Nepal</td>
</tr>
<tr>
<td valign="bottom">Aymara</td>
<td valign="bottom">Latin</td>
<td valign="bottom">Peru</td>
</tr>
<tr>
<td valign="bottom">Azeri</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Azerbaijani</td>
<td valign="bottom">Arabic, Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Badaga</td>
<td valign="bottom">Tamil</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Bagheli</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India, Nepal</td>
</tr>
<tr>
<td valign="bottom">Balear</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Balkar</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Balti</td>
<td valign="bottom">Devanagari, Balti [2]</td>
<td valign="bottom">India, Pakistan</td>
</tr>
<tr>
<td valign="bottom">Bashkir</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Basque</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Batak</td>
<td valign="bottom">Batak [1], Latin</td>
<td valign="bottom">Philippines, Indonesia</td>
</tr>
<tr>
<td valign="bottom">Batak toba</td>
<td valign="bottom">Batak [1], Latin</td>
<td valign="bottom">Indonesia</td>
</tr>
<tr>
<td valign="bottom">Bateri</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">(aka Bhatneri) India, Pakistan</td>
</tr>
<tr>
<td valign="bottom">Belarusian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom">(aka Belorussian, Belarusan)</td>
</tr>
<tr>
<td valign="bottom">Bengali</td>
<td valign="bottom">Bengali</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Bhili</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Bhojpuri</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Bihari</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Bosnian</td>
<td valign="bottom">Latin</td>
<td valign="bottom">Bosnia-Herzegovina</td>
</tr>
<tr>
<td valign="bottom">Braj bhasha</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Breton</td>
<td valign="bottom">Latin</td>
<td valign="bottom">France</td>
</tr>
<tr>
<td valign="bottom">Bugis</td>
<td valign="bottom">Buginese [1]</td>
<td valign="bottom">Indonesia, Malaysia</td>
</tr>
<tr>
<td valign="bottom">Buhid</td>
<td valign="bottom">Buhid</td>
<td valign="bottom">Philippines</td>
</tr>
<tr>
<td valign="bottom">Bulgarian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Burmese</td>
<td valign="bottom">Myanmar</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Buryat</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Bahasa</td>
<td valign="bottom">Latin</td>
<td valign="bottom">(see Indonesian)</td>
</tr>
<tr>
<td valign="bottom">Catalan</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Chakma</td>
<td valign="bottom">Bengali, Chakma [1]</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Cham</td>
<td valign="bottom">Cham [1]</td>
<td valign="bottom">Cambodia, Thailand, Viet Nam</td>
</tr>
<tr>
<td valign="bottom">Chechen</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom">Georgia</td>
</tr>
<tr>
<td valign="bottom">Cherokee</td>
<td valign="bottom">Cherokee, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Chhattisgarhi</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Chinese</td>
<td valign="bottom">Han</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Chukchi</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Chuvash</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Coptic</td>
<td valign="bottom">Greek</td>
<td valign="bottom">Egypt</td>
</tr>
<tr>
<td valign="bottom">Cornish</td>
<td valign="bottom">Latin</td>
<td valign="bottom">United Kingdom</td>
</tr>
<tr>
<td valign="bottom">Corsican</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Cree</td>
<td valign="bottom">Canadian Aboriginal Syllabics, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Croatian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Czech</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Danish</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Dargwa</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Dhivehi</td>
<td valign="bottom">Thaana</td>
<td valign="bottom">Maldives</td>
</tr>
<tr>
<td valign="bottom">Dungan</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Dutch</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Dzongkha</td>
<td valign="bottom">Tibetan</td>
<td valign="bottom">Bhutan</td>
</tr>
<tr>
<td valign="bottom">Edo</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">English</td>
<td valign="bottom">Latin, Deseret [3], Shavian [3]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Esperanto</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Estonian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Evenki</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Faroese</td>
<td valign="bottom">Latin</td>
<td valign="bottom">Faroe Islands</td>
</tr>
<tr>
<td valign="bottom">Farsi</td>
<td valign="bottom">Arabic</td>
<td valign="bottom">(aka Persian)</td>
</tr>
<tr>
<td valign="bottom">Fijian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Finnish</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">French</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Frisian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Gaelic</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Gagauz</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Garhwali</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Garo</td>
<td valign="bottom">Bengali</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Gascon</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Ge&#8217;ez</td>
<td valign="bottom">Ethiopic</td>
<td valign="bottom">Eritrea, Ethiopia</td>
</tr>
<tr>
<td valign="bottom">Georgian</td>
<td valign="bottom">Georgian</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">German</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Gondi</td>
<td valign="bottom">Devanagari, Telugu</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Greek</td>
<td valign="bottom">Greek</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Guarani</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Gujarati</td>
<td valign="bottom">Gujarati</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Garshuni</td>
<td valign="bottom">Syriac</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Hanunóo</td>
<td valign="bottom">Latin, Hanunóo</td>
<td valign="bottom">Philippines</td>
</tr>
<tr>
<td valign="bottom">Harauti</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Hausa</td>
<td valign="bottom">Latin, Arabic [3]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Hawaiian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Hebrew</td>
<td valign="bottom">Hebrew</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Hindi</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Hmong</td>
<td valign="bottom">Latin, Hmong [1]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Ho</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Hopi</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Hungarian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Ibibio</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Icelandic</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Indonesian</td>
<td valign="bottom">Arabic [3], Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Ingush</td>
<td valign="bottom">Arabic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Inuktitut</td>
<td valign="bottom">Canadian Aboriginal Syllabics, Latin</td>
<td valign="bottom">Canada</td>
</tr>
<tr>
<td valign="bottom">Iñupiaq</td>
<td valign="bottom">Latin</td>
<td valign="bottom">Greenland</td>
</tr>
<tr>
<td valign="bottom">Irish</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Italian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Japanese</td>
<td valign="bottom">Han + Hiragana + Katakana</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Javanese</td>
<td valign="bottom">Latin, Javanese [1]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Judezmo</td>
<td valign="bottom">Hebrew</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Kabardian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Kachchi</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Kalmyk</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Kanauji</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Kankan</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Kannada</td>
<td valign="bottom">Kannada</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Kanuri</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Khanty</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Karachay</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Karakalpak</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Karelian</td>
<td valign="bottom">Latin, Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Kashmiri</td>
<td valign="bottom">Devanagari, Arabic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Kazakh</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Khakass</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Khamti</td>
<td valign="bottom">Myanmar</td>
<td valign="bottom">India, Myanmar</td>
</tr>
<tr>
<td valign="bottom">Khasi</td>
<td valign="bottom">Latin, Bengali</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Khmer</td>
<td valign="bottom">Khmer</td>
<td valign="bottom">Cambodia</td>
</tr>
<tr>
<td valign="bottom">Kirghiz</td>
<td valign="bottom">Arabic [3], Latin, Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Komi</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Konkan</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Korean</td>
<td valign="bottom">Hangul + Han</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Koryak</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Kurdish</td>
<td valign="bottom">Arabic, Cyrillic, Latin</td>
<td valign="bottom">Iran, Iraq</td>
</tr>
<tr>
<td valign="bottom">Kuy</td>
<td valign="bottom">Thai</td>
<td valign="bottom">Cambodia, Laos, Thailand</td>
</tr>
<tr>
<td valign="bottom">Ladino</td>
<td valign="bottom">Hebrew</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Lak</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Lambadi</td>
<td valign="bottom">Telugu</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Lao</td>
<td valign="bottom">Lao</td>
<td valign="bottom">Laos</td>
</tr>
<tr>
<td valign="bottom">Lapp</td>
<td valign="bottom">Latin</td>
<td valign="bottom">(see Sami)</td>
</tr>
<tr>
<td valign="bottom">Latin</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Latvian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Lawa, eastern</td>
<td valign="bottom">Thai</td>
<td valign="bottom">Thailand</td>
</tr>
<tr>
<td valign="bottom">Lawa, western</td>
<td valign="bottom">Thai</td>
<td valign="bottom">China, Thailand</td>
</tr>
<tr>
<td valign="bottom">Lepcha</td>
<td valign="bottom">Lepcha [1]</td>
<td valign="bottom">Bhutan, India, Nepal</td>
</tr>
<tr>
<td valign="bottom">Lezghian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Limbu</td>
<td valign="bottom">Devanagari, Limbu [1]</td>
<td valign="bottom">Bhutan, India, Nepal</td>
</tr>
<tr>
<td valign="bottom">Lisu</td>
<td valign="bottom">Lisu (Fraser) [1], Latin</td>
<td valign="bottom">China</td>
</tr>
<tr>
<td valign="bottom">Lithuanian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Lushootseed</td>
<td valign="bottom">Latin</td>
<td valign="bottom">USA</td>
</tr>
<tr>
<td valign="bottom">Luxemburgish</td>
<td valign="bottom">Latin</td>
<td valign="bottom">(aka Luxembourgeois)</td>
</tr>
<tr>
<td valign="bottom">Macedonian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Malay</td>
<td valign="bottom">Arabic [3], Latin</td>
<td valign="bottom">Brunei, Indonesia, Malaysia</td>
</tr>
<tr>
<td valign="bottom">Malayalam</td>
<td valign="bottom">Malayalam</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Maldivian</td>
<td valign="bottom">Thaana</td>
<td valign="bottom">Maldives (See Dhivehi)</td>
</tr>
<tr>
<td valign="bottom">Maltese</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Manchu</td>
<td valign="bottom">Mongolian</td>
<td valign="bottom">China</td>
</tr>
<tr>
<td valign="bottom">Mansi</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Marathi</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Mari</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Marwari</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Meitei</td>
<td valign="bottom">Meetai Mayek [1], Bengali</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Moldavian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Mon</td>
<td valign="bottom">Myanmar</td>
<td valign="bottom">Myanmar, Thailand</td>
</tr>
<tr>
<td valign="bottom">Mongolian</td>
<td valign="bottom">Mongolian, Cyrillic</td>
<td valign="bottom">China, Mongolia</td>
</tr>
<tr>
<td valign="bottom">Mordvin</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Mundari</td>
<td valign="bottom">Bengali, Devanagari</td>
<td valign="bottom">Bangladesh, India, Nepal</td>
</tr>
<tr>
<td valign="bottom">Naga</td>
<td valign="bottom">Latin, Bengali</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Nanai</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Navajo</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Naxi</td>
<td valign="bottom">Naxi [2]</td>
<td valign="bottom">China</td>
</tr>
<tr>
<td valign="bottom">Nenets</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Nepali</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Netets</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Newari</td>
<td valign="bottom">Devanagari, Ranjana, Parachalit</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Nogai</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Norwegian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Oriya</td>
<td valign="bottom">Oriya</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Oromo</td>
<td valign="bottom">Ethiopic</td>
<td valign="bottom">Egypt, Ethiopia, Somalia</td>
</tr>
<tr>
<td valign="bottom">Ossetic</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Pali</td>
<td valign="bottom">Sinhala, Devanagari, Thai</td>
<td valign="bottom">India, Myanmar, Sri Lanka</td>
</tr>
<tr>
<td valign="bottom">Panjabi</td>
<td valign="bottom">Gurmukhi</td>
<td valign="bottom">India (see Punjabi)</td>
</tr>
<tr>
<td valign="bottom">Parsi-dari</td>
<td valign="bottom">Arabic</td>
<td valign="bottom">Afghanistan, Iran</td>
</tr>
<tr>
<td valign="bottom">Pashto</td>
<td valign="bottom">Arabic</td>
<td valign="bottom">Afghanistan</td>
</tr>
<tr>
<td valign="bottom">Polish</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Portuguese</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Provençal</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Prussian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Punjabi</td>
<td valign="bottom">Gurmukhi</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Quechua</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Riang</td>
<td valign="bottom">Bengali</td>
<td valign="bottom">Bangladesh, China, India, Myanmar</td>
</tr>
<tr>
<td valign="bottom">Romanian</td>
<td valign="bottom">Latin, Cyrillic [3]</td>
<td valign="bottom">(aka Rumanian)</td>
</tr>
<tr>
<td valign="bottom">Romany</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Russian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Sami</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Samaritan</td>
<td valign="bottom">Hebrew, Samaritan [1]</td>
<td valign="bottom">Israel</td>
</tr>
<tr>
<td valign="bottom">Sanskrit</td>
<td valign="bottom">Sinhala, Devanagari, etc.</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Santali</td>
<td valign="bottom">Devanagari, Bengali, Oriya, Ol Cemet [1]</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Selkup</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Serbian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Shan</td>
<td valign="bottom">Myanmar</td>
<td valign="bottom">China, Myanmar, Thailand</td>
</tr>
<tr>
<td valign="bottom">Sherpa</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Shona</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Shor</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Sindhi</td>
<td valign="bottom">Arabic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Sinhala</td>
<td valign="bottom">Sinhala</td>
<td valign="bottom">(aka Sinhalese) Sri Lanka</td>
</tr>
<tr>
<td valign="bottom">Slovak</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Slovenian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Somali</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Spanish</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Swahili</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Swedish</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Sylhetti</td>
<td valign="bottom">Siloti Nagri [1], Bengali</td>
<td valign="bottom">Bangladesh</td>
</tr>
<tr>
<td valign="bottom">Syriac</td>
<td valign="bottom">Syriac</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Swadaya</td>
<td valign="bottom">Syriac</td>
<td valign="bottom">(see Syriac)</td>
</tr>
<tr>
<td valign="bottom">Tabasaran</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tagalog</td>
<td valign="bottom">Latin, Tagalog</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tagbanwa</td>
<td valign="bottom">Latin, Tagbanwa</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tahitian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tajik</td>
<td valign="bottom">Arabic [3], Latin, Cyrillic (? Latin)</td>
<td valign="bottom">(aka Tadzhik)</td>
</tr>
<tr>
<td valign="bottom">Tamazight</td>
<td valign="bottom">Tifinagh [1], Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tamil</td>
<td valign="bottom">Tamil</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tat</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tatar</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Telugu</td>
<td valign="bottom">Telugu</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Thai</td>
<td valign="bottom">Thai</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tibetan</td>
<td valign="bottom">Tibetan</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tigre</td>
<td valign="bottom">Ethiopic</td>
<td valign="bottom">Eritrea, Sudan</td>
</tr>
<tr>
<td valign="bottom">Tsalagi</td>
<td valign="bottom">(see Cherokee)</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tulu</td>
<td valign="bottom">Kannada</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Turkish</td>
<td valign="bottom">Arabic [3], Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Turkmen</td>
<td valign="bottom">Arabic [3], Latin, Cyrillic (? Latin)</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tuva</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Turoyo</td>
<td valign="bottom">Syriac</td>
<td valign="bottom">(see Syriac)</td>
</tr>
<tr>
<td valign="bottom">Udekhe</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Udmurt</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Uighur</td>
<td valign="bottom">Arabic, Latin, Cyrillic, Uighur [1]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Ukrainian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Urdu</td>
<td valign="bottom">Arabic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Uzbek</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Valencian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Vietnamese</td>
<td valign="bottom">Latin, Chu Nom</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Yakut</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Yi</td>
<td valign="bottom">Yi, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Yiddish</td>
<td valign="bottom">Hebrew</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Yoruba</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom"></td>
<td valign="bottom"></td>
<td valign="bottom"></td>
</tr>
<tr>
<td colspan="3" valign="bottom"><em>[1] = Not yet encoded in Unicode.</em></td>
</tr>
<tr>
<td colspan="3" valign="bottom"><em>[2] = Has one or more extinct or minor native script(s), not yet encoded.</em></td>
</tr>
<tr>
<td colspan="3" valign="bottom"><em>[3] = Formerly or historically used this script, now uses another.</em></td>
</tr>
</tbody>
</table>
<p>Notice most of these scripts fall into the seven broader script families such as Latin, Hanzi and Indic noted previously.</p>
<p>While more countries are adopting Unicode and sample results indicate increasing percentage use, it is by no means prevalent. In general, Europe has been slow to embrace Unicode, with many legacy encodings still in use; Arabic sites have perhaps reached the 50% level; and Asian use is problematic.<a name="_ednref11" href="#_edn11">[11]</a> Other samples suggest that UTF-8 encoding is limited to 8.35% of all Asian Web pages. Some countries, such as Nepal, Vietnam and Tajikistan, exceed 70% compliance, while others, such as Syria, Laos and Brunei, are below even 1%.<a name="_ednref12" href="#_edn12">[12]</a> According to the Archive Pass project, which also used Basis Tech&#8217;s RLI for encoding detection, Chinese sites are dominated by GB-2312 and Big 5 encodings, while Shift-JIS is most common for Japanese.<a name="_ednref13" href="#_edn13">[13]</a></p>
<p><strong>Detecting and Communicating with Legacy Encodings</strong></p>
<p>There are two primary problems when dealing with non-Unicode encodings: identifying what the encoding is, and converting that encoding to a Unicode string, usually UTF-8. Detecting the encoding is a difficult process, though Basis Tech&#8217;s RLI does an excellent job. Converting the non-Unicode string to a Unicode string can be done easily using tools available in the Java JDK, or using Basis Tech&#8217;s RCLU library.</p>
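<p>To show why detection is the harder of the two steps, here is a deliberately naive Python sketch that simply tries candidate encodings in order and keeps the first one that decodes without error. This is not how RLI works; the function name <code>guess_and_convert</code> and the candidate list are assumptions for illustration only.</p>
<pre>
```python
def guess_and_convert(raw, candidates=("utf-8", "shift_jis", "cp1252")):
    """Return (encoding, text) for the first candidate that decodes raw cleanly."""
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("no candidate encoding fits these bytes")


japanese = "\u65e5\u672c\u8a9e".encode("shift_jis")
german = "Gr\u00fc\u00dfe".encode("cp1252")

print(guess_and_convert(japanese))
print(guess_and_convert(german))
```
</pre>
<p>The order of candidates matters: a single-byte encoding such as Windows-1252 accepts almost any byte sequence, so it has to come last, and a clean decode does not prove the guess is right. Production detectors rely on statistical evidence about the language as well as byte validity.</p>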
<p>Basis Tech detects 96 language-encoding pairs involving 40 different languages and 30 unique encoding types:</p>
<table style="width: 583px;" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="background-color: #cccccc;" width="152" valign="bottom">
<p align="center"><strong>Language</strong></p>
</td>
<td style="background-color: #cccccc;" width="431" valign="bottom">
<p align="center"><strong>Encoding</strong></p>
</td>
</tr>
<tr>
<td width="152" valign="bottom">Albanian</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Arabic</td>
<td width="431" valign="bottom">UTF-8, Windows-1256, ISO-8859-6</td>
</tr>
<tr>
<td width="152" valign="bottom">Bahasa Indonesia</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Bahasa Malay</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Bulgarian</td>
<td width="431" valign="bottom">UTF-8, Windows-1251, ISO-8859-5, KOI8-R</td>
</tr>
<tr>
<td width="152" valign="bottom">Catalan</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Chinese</td>
<td width="431" valign="bottom">UTF-8, GB-2312, <span style="color: #ff0000;"><strong>HZ-GB-2312</strong></span>, ISO-2022-CN</td>
</tr>
<tr>
<td width="152" valign="bottom">Chinese</td>
<td width="431" valign="bottom">UTF-8, Big5</td>
</tr>
<tr>
<td width="152" valign="bottom">Croatian</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Czech</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Danish</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Dutch</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">English</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Estonian</td>
<td width="431" valign="bottom">UTF-8, Windows-1257</td>
</tr>
<tr>
<td width="152" valign="bottom">Farsi</td>
<td width="431" valign="bottom">UTF-8, Windows-1256</td>
</tr>
<tr>
<td width="152" valign="bottom">Finnish</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">French</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">German</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Greek</td>
<td width="431" valign="bottom">UTF-8, Windows-1253</td>
</tr>
<tr>
<td width="152" valign="bottom">Hebrew</td>
<td width="431" valign="bottom">UTF-8, Windows-1255</td>
</tr>
<tr>
<td width="152" valign="bottom">Hungarian</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Icelandic</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Italian</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Japanese</td>
<td width="431" valign="bottom">UTF-8, EUC-JP, ISO-2022-JP, Shift-JIS</td>
</tr>
<tr>
<td width="152" valign="bottom">Korean</td>
<td width="431" valign="bottom">UTF-8, EUC-KR, ISO-2022-KR</td>
</tr>
<tr>
<td width="152" valign="bottom">Latvian</td>
<td width="431" valign="bottom">UTF-8, Windows-1257</td>
</tr>
<tr>
<td width="152" valign="bottom">Lithuanian</td>
<td width="431" valign="bottom">UTF-8, Windows-1257</td>
</tr>
<tr>
<td width="152" valign="bottom">Norwegian</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Polish</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Portuguese</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Romanian</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Russian</td>
<td width="431" valign="bottom">UTF-8, Windows-1251, ISO-8859-5, IBM-866, KOI8-R, x-Mac-Cyrillic</td>
</tr>
<tr>
<td width="152" valign="bottom">Slovak</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Slovenian</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Spanish</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Swedish</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Tagalog</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Thai</td>
<td width="431" valign="bottom">UTF-8, <span style="color: #ff0000;"><strong>Windows-874</strong></span></td>
</tr>
<tr>
<td width="152" valign="bottom">Turkish</td>
<td width="431" valign="bottom">UTF-8, Windows-1254</td>
</tr>
<tr>
<td width="152" valign="bottom">Vietnamese</td>
<td width="431" valign="bottom">UTF-8, <span style="color: #ff0000;"><strong>VISCII</strong>, <strong>VPS</strong>, <strong>VIQR</strong>, <strong>TCVN</strong>, <strong>VNI</strong></span></td>
</tr>
</tbody>
</table>
<p>Java SDK encoding/decoding supports 22 basic European forms and 125 other international forms (mostly non-European), for 147 total. If an encoded form is not on this list, and is not already Unicode, software cannot talk to the site without special converters or adapters. See <a href="http://java.sun.com/j2se/1.5.0/docs/guide/intl/encoding.doc.html">http://java.sun.com/j2se/1.5.0/docs/guide/intl/encoding.doc.html</a>.</p>
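<p>Whether a given encoding is available can also be checked at run time. The sketch below assumes the standard Charset.isSupported call; the class name CharsetCheck and the helper isUsable are hypothetical, and the sample names are taken from the table above. The result for any particular name depends on the Java runtime in use, so no output is shown.</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper: asks the running JVM whether it has a converter.
final class CharsetCheck {

    // True if this JVM can convert to or from the named encoding.
    // Names that are not syntactically legal are treated as unsupported.
    static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Sample names from the language/encoding table; results vary by JVM.
        String[] detected = {"UTF-8", "windows-1252", "Shift_JIS", "VISCII"};
        for (String name : detected) {
            String status = isUsable(name) ? "supported" : "needs a special converter";
            System.out.println(name + ": " + status);
        }
    }
}
```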
<p>Of course, to avoid the classic &#8220;garbage in, garbage out&#8221; (GIGO) problem, accurate detection must be made of the source&#8217;s encoding type, there must be a converter for that type into a canonical, internal form (such as UTF-8), and another converter must exist for converting that canonical form back to the source&#8217;s original encoding. The combination of the existing Basis Tech RLI and the Java SDK produces 89 valid language/encoding pairs (with invalid combinations shown in <strong><span style="color: #ff0000;">Bold Red</span></strong> above).</p>
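<p>The return trip, from the canonical form back to the source&#8217;s encoding, can be tested before any text is sent. This is a rough sketch, assuming the standard Charset.canEncode and CharsetEncoder.canEncode calls; the class name RoundTripCheck and the method canSendIn are hypothetical.</p>

```java
import java.nio.charset.Charset;

// Hypothetical helper: can this text be converted back to the site's encoding?
final class RoundTripCheck {

    // True if every character of text can be represented in the named charset.
    static boolean canSendIn(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        if (!cs.canEncode()) {
            return false; // the JVM only offers a decoder for this charset
        }
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(canSendIn("caf\u00e9", "ISO-8859-1"));     // true
        System.out.println(canSendIn("\u65e5\u672c", "ISO-8859-1")); // false: no CJK in Latin-1
    }
}
```

<p>A false result here is the kind of gap that would call for a specialized converter or adapter.</p>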
<p>Fortunately, existing valid combinations appear to cover all prevalent languages and encoding types. Should gaps exist, specialized detectors and converters may be required. As events move forward, the family of Indic languages may be the most problematic for expansion with standard tools.</p>
<p><strong>Actual Language Processing</strong></p>
<p>Encoding detection, and the resulting proper storage and language identification, is but the first essential step in actual language processing. Additional tools in morphological analysis or machine translation may need to be applied to address actual analyst needs. These tools are beyond the scope of this Tutorial.</p>
<p>The key point, however, is that all foreign language processing and analysis begins with accurate encoding detection and communicating with the host site in its original encoding. These steps are the <em>sine qua non</em> of language processing.</p>
<p><strong>Exemplar Methodology for Internet Foreign Language Support</strong></p>
<p>We can now take the information in this Tutorial and present what might be termed an exemplar methodology for initial language detection and processing. A schematic of this methodology is provided in the following diagram:</p>
<p><img src="wp-content/themes/ai3/images/2006Posts/060323a_LanguageHarvests.gif" alt="" width="481" height="349" /></p>
<p>This diagram shows that the actual encoding of an original Web document or search form must be detected and then converted into a standard &#8220;canonical&#8221; form for internal storage, while the site itself must still be talked to in its native encoding when searching it. Encoding detection software and utilities within the Java SDK can aid this process greatly.</p>
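<p>To illustrate the &#8220;talk to the site in its native encoding&#8221; step, here is a small sketch that percent-encodes a search term in the site&#8217;s own encoding using the standard java.net.URLEncoder. The class name NativeQuery and the method encodeQuery are hypothetical, and the site&#8217;s charset is again assumed to come from the detection step.</p>

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: prepares a query string in a site's native encoding.
final class NativeQuery {

    // Percent-encode a Unicode query using the bytes of the site's charset.
    static String encodeQuery(String query, String siteCharset) {
        try {
            return URLEncoder.encode(query, siteCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("No converter for " + siteCharset, e);
        }
    }

    public static void main(String[] args) {
        // The same term yields different bytes on the wire per encoding.
        System.out.println(encodeQuery("caf\u00e9", "ISO-8859-1")); // caf%E9
        System.out.println(encodeQuery("caf\u00e9", "UTF-8"));      // caf%C3%A9
    }
}
```

<p>The two outputs differ even though the internal string is identical, which is why a query stored in canonical form must be re-encoded for each site before it is submitted.</p>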
<p>And, as languages and legacy forms continue to proliferate, we can expect such utilities to embrace an ever-widening set of encodings.</p>
<hr size="1" /><p><a name="_edn1" href="#_ednref1">[1]</a> Yoshiki Mikami, &#8220;Language Observatory: Scanning Cyberspace for Languages,&#8221; from The Second Language Observatory Workshop, February 21-25, 2005, 41 pp. See <a href="http://gii.nagaokaut.ac.jp/~zaidi/Proceedings%20Online/01_Mikami.pdf">http://gii.nagaokaut.ac.jp/~zaidi/Proceedings%20Online/01_Mikami.pdf</a>. This is a generally useful reference on Internet and language. Please note some of the figures have been updated with more recent data.</p>
<p><a name="_edn2" href="#_ednref2">[2]</a> See <a href="http://www.ethnologue.com/ethno_docs/distribution.asp?by=size">http://www.ethnologue.com/ethno_docs/distribution.asp?by=size</a>.</p>
<p><a name="_edn3" href="#_ednref3">[3]</a> See <a href="http://global-reach.biz/globstats/index.php3">http://global-reach.biz/globstats/index.php3</a>. Also, for useful specific notes by country as well as original references, see <a href="http://global-reach.biz/globstats/refs.php3">http://global-reach.biz/globstats/refs.php3</a>.</p>
<p><a name="_edn4" href="#_ednref4">[4]</a> Another interesting language source with an emphasis on Latin-family languages is FUNREDES&#8217; 2005 study of languages and cultures. See <a href="http://funredes.org/LC/english/index.html">http://funredes.org/LC/english/index.html</a>.</p>
<p><a name="_edn5" href="#_ednref5">[5]</a> John Paolillo, Daniel Pimienta, Daniel Prado, et al., <em>Measuring Linguistic Diversity on the Internet,</em> UNESCO Publications for the World Summit on the Information Society 2005, 113 pp. See <a href="http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf">http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf</a>.</p>
<p><a name="_edn6" href="#_ednref6">[6]</a> John Paolillo, &#8220;Language Diversity on the Internet,&#8221; pp. 43-89, in John Paolillo, Daniel Pimienta, Daniel Prado, et al.,<em> Measuring Linguistic Diversity on the Internet,</em> UNESCO Publications for the World Summit on the Information Society 2005, 113 pp. See <a href="http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf">http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf</a>.</p>
<p><a name="_edn7" href="#_ednref7">[7]</a> Information Sciences Institute press release, &#8220;USC Researchers Build Machine Translation System  &#8211;  and More &#8212; for Hindi in Less Than a Month,&#8221; June 30, 2003. See <a href="http://www.isi.edu/stories/60.html">http://www.isi.edu/stories/60.html</a>.</p>
<p><a name="_edn8" href="#_ednref8">[8]</a> <a href="http://www.iana.org/assignments/character-sets">http://www.iana.org/assignments/character-sets</a>.</p>
<p><a name="_edn9" href="#_ednref9">[9]</a> The actual values were calculated from Jukka &#8220;Yucca&#8221; Korpela&#8217;s informative Web site at <a href="http://www.cs.tut.fi/%7Ejkorpela/chars/sorted.html">http://www.cs.tut.fi/%7Ejkorpela/chars/sorted.html</a>.</p>
<p><a name="_edn10" href="#_ednref10">[10]</a> See <a href="http://www.unicode.org/onlinedat/languages-scripts.html">http://www.unicode.org/onlinedat/languages-scripts.html</a>.</p>
<p><a name="_edn11" href="#_ednref11">[11]</a> Pers. Comm., B. Margulies, Basis Technology, Inc., Feb. 27, 2006.</p>
<p><a name="_edn12" href="#_ednref12">[12]</a> Yoshiki Mikami et al., &#8220;Language Diversity on the Internet: An Asian View,&#8221; pp. 91-103, in John Paolillo, Daniel Pimienta, Daniel Prado, et al.,<em> Measuring Linguistic Diversity on the Internet,</em> UNESCO Publications for the World Summit on the Information Society 2005, 113 pp. See <a href="http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf">http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf</a>.</p>
<p><a name="_edn13" href="#_ednref13">[13]</a> Archive Pass Project; see <a href="http://crawler.archive.org/cgi-bin/wiki.pl?ArchivePassProject">http://crawler.archive.org/cgi-bin/wiki.pl?ArchivePassProject</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Major Upgrade to Deep Query Manager Released</title>
		<link>http://www.mkbergman.com/142/major-upgrade-to-deep-query-manager%e2%84%a2-released/</link>
		<comments>http://www.mkbergman.com/142/major-upgrade-to-deep-query-manager%e2%84%a2-released/#comments</comments>
		<pubDate>Tue, 11 Oct 2005 16:03:45 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Information Automation]]></category>
		<category><![CDATA[OSINT (open source intel)]]></category>
		<category><![CDATA[Searching]]></category>
		<category><![CDATA[Software and Venture Capital]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=142</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Major Upgrade to Deep Query Manager Released&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.subject=Information Automation&amp;rft.subject=OSINT (open source intel)&amp;rft.subject=Searching&amp;rft.subject=Software and Venture Capital&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2005-10-11&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/142/major-upgrade-to-deep-query-manager%e2%84%a2-released/&amp;rft.language=English"></span>
BrightPlanet has announced a major upgrade to its Deep Query Manager knowledge worker document platform.  According to its press release, the new  version achieves extreme scalability and broad internationalization and file format support, among other enhancements.  The DQM has added the ability to harvest and process up to 140 different foreign languages in more [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Major Upgrade to Deep Query Manager Released&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.subject=Information Automation&amp;rft.subject=OSINT (open source intel)&amp;rft.subject=Searching&amp;rft.subject=Software and Venture Capital&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2005-10-11&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/142/major-upgrade-to-deep-query-manager%e2%84%a2-released/&amp;rft.language=English"></span>
<p><a title="BrightPlanet Home Page" href="http://www.brightplanet.com">BrightPlanet</a> has announced a major upgrade to its <a title="DQM v5 Product Release" href="http://www.brightplanet.com/news/dqm5_0.asp">Deep Query Manager knowledge worker document platform</a>. According to its press release, the new version achieves extreme scalability and broad internationalization and file format support, among other enhancements. The DQM has added the ability to harvest and process up to 140 different foreign languages in more than 370 file formats, plus new content export and system administration features. The company also claims the new distributed architecture allows scalability to hundreds or thousands of users across multiple machines, with the ability to handle incremental growth and expansion.</p>
<p>According to the company:</p>
<blockquote><p><em>The Deep Query Manager is a content discovery, harvesting, management and analysis platform used by knowledge workers to collaborate across the enterprise. It can access any document content &#8212; inside or outside the enterprise &#8212; with strengths in deep content harvesting from more than 70,000 unique searchable databases and automated techniques for the analyst to add new ones at will. The DQM&#8217;s differencing engine supports monitoring and tracking, among the product&#8217;s other powerful project management, data mining, reporting and analysis capabilities.</em></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/142/major-upgrade-to-deep-query-manager%e2%84%a2-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
