<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>AI3:::Adaptive Information &#187; OSINT (open source intel)</title>
	<atom:link href="http://www.mkbergman.com/category/osint-open-source-intelligence/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mkbergman.com</link>
	<description>Mike Bergman on the semantic Web and structured Web</description>
	<lastBuildDate>Mon, 26 Jul 2010 05:31:20 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Open Source Center Breaking New Ground</title>
		<link>http://www.mkbergman.com/217/open-source-center-breaking-new-ground/</link>
		<comments>http://www.mkbergman.com/217/open-source-center-breaking-new-ground/#comments</comments>
		<pubDate>Sat, 29 Apr 2006 15:31:19 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[OSINT (open source intel)]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=217</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Open Source Center Breaking New Ground&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=OSINT (open source intel)&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2006-04-29&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/217/open-source-center-breaking-new-ground/&amp;rft.language=English"></span>
In a recent posting from the Signal online magazine, Robert Ackerman provides a fascinating overview of the mission and challenges of the new Open Source Center at the Office of the Director of National Intelligence.&#160; Signal is published by AFCEA, the Armed Forces Communications and Electronics Association.
]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Open Source Center Breaking New Ground&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=OSINT (open source intel)&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2006-04-29&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/217/open-source-center-breaking-new-ground/&amp;rft.language=English"></span>
<p>In a recent posting from the <em><a title="Signal Magazine" href="http://www.afcea.org/signal/default.asp">Signal</a></em> online magazine, Robert Ackerman provides a <a title="Intelligence Center Mines Open Sources" href="http://www.afcea.org/signal/articles/anmviewer.asp?a=1102&amp;z=115">fascinating overview</a> of the mission and challenges of the new Open Source Center at the Office of the Director of National Intelligence.&nbsp; <em>Signal</em> is published by <a href="http://www.afcea.org/">AFCEA</a>, the Armed Forces Communications and Electronics Association.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/217/open-source-center-breaking-new-ground/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tutorial:  Internet Languages, Character Sets and Encodings</title>
		<link>http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/</link>
		<comments>http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/#comments</comments>
		<pubDate>Thu, 23 Mar 2006 15:42:29 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Adaptive Information]]></category>
		<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Information Automation]]></category>
		<category><![CDATA[OSINT (open source intel)]]></category>
		<category><![CDATA[Searching]]></category>
		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=195</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Tutorial:  Internet Languages, Character Sets and Encodings&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Information Automation&amp;rft.subject=OSINT (open source intel)&amp;rft.subject=Searching&amp;rft.subject=Semantic Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2006-03-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/&amp;rft.language=English"></span>
Author&#8217;s Note: This is an online version of a paper that Mike Bergman recently released under the auspices of BrightPlanet Corp. The citation for this effort is:
M.K. Bergman, &#8220;Tutorial:  Internet Languages, Character Sets and Encodings,&#8221; BrightPlanet Corporation Technical Documentation, March 2006, 13 pp.
 Click here to obtain a PDF copy of this posting (13 [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Tutorial:  Internet Languages, Character Sets and Encodings&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Information Automation&amp;rft.subject=OSINT (open source intel)&amp;rft.subject=Searching&amp;rft.subject=Semantic Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2006-03-23&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/&amp;rft.language=English"></span>
<p><strong><em>Author&#8217;s Note:</em></strong> This is an online version of a paper that Mike Bergman recently released under the auspices of <a title="BrightPlanet Corporation" href="http://www.brightplanet.com">BrightPlanet Corp.</a> The citation for this effort is:</p>
<p style="margin-left: 40px;"><em>M.K. Bergman, &#8220;Tutorial:  Internet Languages, Character Sets and Encodings,&#8221; BrightPlanet Corporation Technical Documentation, March 2006, 13 pp.</em></p>
<p><em><a href="wp-content/themes/ai3/files/2006Posts/InternationalizationTutorial060323.pdf"><img style="border: 0px solid ;" src="wp-content/themes/ai3/images/pdfdoc.gif" alt="Download PDF file" /></a> <a href="wp-content/themes/ai3/files/2006Posts/InternationalizationTutorial060323.pdf">Click here</a> to obtain a PDF copy of this posting (13 pp, 79 K)</em></p>
<p>Broad-scale, international open source harvesting from the Internet poses many challenges in use and translation of legacy encodings that have vexed academics and researchers for many years. Successfully addressing these challenges will only grow in importance as the relative percentage of international sites grows in relation to conventional English ones.</p>
<p>A major challenge in internationalization and foreign source support is &#8220;encoding.&#8221; Encodings specify the arbitrary assignment of numbers to the symbols (characters or ideograms) of the world&#8217;s written languages needed for electronic transfer and manipulation. One of the first encodings developed in the 1960s was ASCII (numerals, plus a-z; A-Z); others developed over time to deal with other unique characters and the many symbols of (particularly) the Asiatic languages.</p>
<p>Some languages have many character encodings, and some languages, for example Chinese and Japanese, have very complex systems for handling their large number of unique characters. Two different encodings can be incompatible by assigning the same number to two distinct symbols, or by assigning different numbers to the same symbol. Unicode set out to consolidate many different encodings, all using separate code plans, into a single system that could represent all written languages within the same character encoding. There are a few Unicode techniques and formats, the most common being UTF-8.</p>
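<p>The incompatibility described above can be sketched in a few lines of Python. This is only an illustration, not part of the original paper: the byte value 0xE9 and the three legacy codecs are arbitrary choices made for the example.</p>

```python
# Illustrative sketch: the byte value and codec choices are assumptions.
raw = bytes([0xE9])  # a single byte with numeric value 233

# The same number maps to a different symbol under each legacy encoding.
as_latin1 = raw.decode("latin-1")    # Western European
as_cp1251 = raw.decode("cp1251")     # Cyrillic (Windows)
as_greek = raw.decode("iso8859_7")   # Greek

for name, text in [("latin-1", as_latin1),
                   ("cp1251", as_cp1251),
                   ("iso8859_7", as_greek)]:
    print(f"{name:10} 0xE9 -> {text!r} (U+{ord(text):04X})")

# The reverse case: one symbol, different numbers in two Cyrillic encodings.
cyrillic_letter = "\u0439"  # CYRILLIC SMALL LETTER SHORT I
in_cp1251 = cyrillic_letter.encode("cp1251")
in_koi8r = cyrillic_letter.encode("koi8_r")
print("cp1251:", in_cp1251.hex(), "koi8_r:", in_koi8r.hex())

# Unicode assigns each symbol its own code point, so all three characters
# can coexist in one text; UTF-8 serializes them to distinct byte sequences.
utf8_forms = {ch: ch.encode("utf-8") for ch in (as_latin1, as_cp1251, as_greek)}
for ch, encoded in utf8_forms.items():
    print(f"{ch!r} -> UTF-8 bytes {encoded.hex()}")
```

<p>A harvester that guesses the wrong legacy encoding therefore stores the wrong characters without raising any error, which is why the encoding must be known or detected before text is converted to Unicode.</p>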
<p>The Internet was originally developed via efforts in the United States funded by ARPA (later <a title="Freesoft's Internet History" href="http://www.freesoft.org/CIE/Topics/57.htm">DARPA</a>) and <a title="Freesoft's Internet History" href="http://www.freesoft.org/CIE/Topics/57.htm">NSF</a>, extending back to the 1960s. At the time of its commercial adoption in the early 1990s via the World Wide Web protocols, it was almost entirely dominated by English by virtue of this U.S. heritage and the emergence of English as the <em>lingua franca</em> of the technical and research community.</p>
<p>However, with the maturation of the Internet as a global information repository and means for instantaneous e-commerce, today&#8217;s online community now approaches 1 billion users from all existing countries. The Internet has become increasingly multi-lingual.</p>
<p>Efficient and automated means to discover, search, query, retrieve and harvest content from across the Internet thus require an understanding of the source human languages in use and the means to encode them for electronic transfer and manipulation. This Tutorial provides a brief introduction to these topics.</p>
<p><strong>Internet Language Use</strong></p>
<p>Yoshiki Mikami, who runs the UN&#8217;s Language Observatory, has an interesting way to summarize the languages of the world. His updated figures, plus some other BrightPlanet statistics, are:<a name="_ednref1" href="#_edn1">[1]</a></p>
<table style="width: 604px;" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="background-color: #cccccc;" width="178" valign="bottom">
<p align="center"><strong>Category</strong></p>
</td>
<td style="background-color: #cccccc;" width="65" valign="bottom">
<p align="center"><strong>Number</strong></p>
</td>
<td style="background-color: #cccccc;" width="362" valign="bottom">
<p align="center"><strong>Source or Notes</strong></p>
</td>
</tr>
<tr>
<td width="178" valign="top">Active Human Languages</td>
<td width="65" valign="top">
<p align="right">6,912</p>
</td>
<td width="362" valign="top">from www.ethnologue.com</td>
</tr>
<tr>
<td width="178" valign="top">Language Identifiers</td>
<td width="65" valign="top">
<p align="right">440</p>
</td>
<td width="362" valign="top">based on ISO 639</td>
</tr>
<tr>
<td width="178" valign="top">Human Rights Translation</td>
<td width="65" valign="top">
<p align="right">327</p>
</td>
<td width="362" valign="top">UN&#8217;s Universal Declaration of Human Rights (UDHR)</td>
</tr>
<tr>
<td width="178" valign="top">Unicode Languages</td>
<td width="65" valign="top">
<p align="right">244</p>
</td>
<td width="362" valign="top">see text</td>
</tr>
<tr>
<td width="178" valign="top">DQM Languages</td>
<td width="65" valign="top">
<p align="right">140</p>
</td>
<td width="362" valign="top">estimate based on prevalence, BT input</td>
</tr>
<tr>
<td width="178" valign="top">Windows XP Languages</td>
<td width="65" valign="top">
<p align="right">123</p>
</td>
<td width="362" valign="top">from Microsoft</td>
</tr>
<tr>
<td width="178" valign="top">Basis Tech Languages</td>
<td width="65" valign="top">
<p align="right">40</p>
</td>
<td width="362" valign="top">based on Basis Tech&#8217;s Rosette Language Identifier (RLI)</td>
</tr>
<tr>
<td width="178" valign="top">Google Search Languages</td>
<td width="65" valign="top">
<p align="right">35</p>
</td>
<td width="362" valign="top">from Google</td>
</tr>
</tbody>
</table>
<p>There are nearly 7,000 living languages spoken today, though most have few speakers and many are becoming extinct. About 347 (or approximately 5%) of the world&#8217;s languages have at least one million speakers and account for 94% of the world&#8217;s population. Of these, 83 languages account for 80% of the world&#8217;s population, and just 8 languages, each with more than 100 million speakers, account for about 40% of total population. By contrast, the remaining 95% of languages are spoken by only 6% of the world&#8217;s people.<a name="_ednref2" href="#_edn2">[2]</a></p>
<p>This prevalence is shown by the fact that the UN&#8217;s Universal Declaration of Human Rights (UDHR) has generally been translated only into those languages with 1 million or more speakers.</p>
<p>The remaining items on the table above enumerate languages that can be represented electronically, or are &#8220;encoded.&#8221; More on this topic is provided below.</p>
<p>Of course, native language does not necessarily equate to Internet use, with English predominating because of multi-lingualism, plus the fact that richer countries or users within countries exhibit greater Internet access and use.</p>
<p>The most recent comprehensive figures for Internet language use and prevalence are from the Global Reach Web site for late 2004. For ease of reading, percentage figures are shown only for those languages with a value of 1.0% or greater:<a name="_ednref3" href="#_edn3">[3]</a> <a name="_ednref4" href="#_edn4">[4]</a></p>
<table style="width: 604px;" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="background-color: #cccccc;" width="173" valign="bottom"><strong> </strong></td>
<td style="background-color: #cccccc;" width="100" valign="bottom">
<p align="center"><strong>Percent of</strong></p>
</td>
<td style="background-color: #cccccc;" colspan="2" width="165" valign="bottom">
<p align="center"><strong>2003 Internet Users</strong></p>
</td>
<td style="background-color: #cccccc;" colspan="2" width="165" valign="bottom">
<p align="center"><strong>Global Population</strong></p>
</td>
</tr>
<tr>
<td style="background-color: #cccccc;" width="173" valign="bottom"><strong> </strong></td>
<td style="background-color: #cccccc;" width="100" valign="bottom">
<p align="center"><strong>Web Pages</strong></p>
</td>
<td style="background-color: #cccccc;" width="83" valign="bottom">
<p align="center"><strong>Millions</strong></p>
</td>
<td style="background-color: #cccccc;" width="82" valign="bottom">
<p align="center"><strong>Percent</strong></p>
</td>
<td style="background-color: #cccccc;" width="83" valign="bottom">
<p align="center"><strong>Millions</strong></p>
</td>
<td style="background-color: #cccccc;" width="82" valign="bottom">
<p align="center"><strong>Percent</strong></p>
</td>
</tr>
<tr>
<td width="173" valign="bottom"></td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom"><strong>ENGLISH</strong></td>
<td width="100" valign="bottom">
<p align="right"><strong>68.4%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>287.5 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>35.6%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>508 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>8.0%</strong></p>
</td>
</tr>
<tr>
<td width="173" valign="bottom"><strong>NON-ENGLISH</strong></td>
<td width="100" valign="bottom">
<p align="right"><strong>31.6%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>519.6 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>64.4%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>5,822 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>92.0%</strong></p>
</td>
</tr>
<tr>
<td width="173" valign="bottom"></td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">EUROPEAN (non-English)</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Catalan</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2.9</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">7</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Czech</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">4.2</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">12</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Dutch</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">13.5</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.7%</p>
</td>
<td width="83" valign="bottom">
<p align="right">20</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Finnish</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2.8</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">6</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">French</td>
<td width="100" valign="bottom">
<p align="right">3.0%</p>
</td>
<td width="83" valign="bottom">
<p align="right">28.0</p>
</td>
<td width="82" valign="bottom">
<p align="right">3.5%</p>
</td>
<td width="83" valign="bottom">
<p align="right">77</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.2%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">German</td>
<td width="100" valign="bottom">
<p align="right">5.8%</p>
</td>
<td width="83" valign="bottom">
<p align="right">52.9</p>
</td>
<td width="82" valign="bottom">
<p align="right">6.6%</p>
</td>
<td width="83" valign="bottom">
<p align="right">100</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.6%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Greek</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2.7</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">12</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Hungarian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">1.7</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">10</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Italian</td>
<td width="100" valign="bottom">
<p align="right">1.6%</p>
</td>
<td width="83" valign="bottom">
<p align="right">24.3</p>
</td>
<td width="82" valign="bottom">
<p align="right">3.0%</p>
</td>
<td width="83" valign="bottom">
<p align="right">62</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.0%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Polish</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">9.5</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.2%</p>
</td>
<td width="83" valign="bottom">
<p align="right">44</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Portuguese</td>
<td width="100" valign="bottom">
<p align="right">1.4%</p>
</td>
<td width="83" valign="bottom">
<p align="right">25.7</p>
</td>
<td width="82" valign="bottom">
<p align="right">3.2%</p>
</td>
<td width="83" valign="bottom">
<p align="right">176</p>
</td>
<td width="82" valign="bottom">
<p align="right">2.8%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Romanian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2.4</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">26</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Russian</td>
<td width="100" valign="bottom">
<p align="right">1.9%</p>
</td>
<td width="83" valign="bottom">
<p align="right">18.5</p>
</td>
<td width="82" valign="bottom">
<p align="right">2.3%</p>
</td>
<td width="83" valign="bottom">
<p align="right">167</p>
</td>
<td width="82" valign="bottom">
<p align="right">2.6%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Scandinavian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">14.6</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.8%</p>
</td>
<td width="83" valign="bottom">
<p align="right">20</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Danish</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">3.5</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">5</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Icelandic</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">0.2</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">0</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Norwegian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2.9</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">5</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Swedish</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">7.9</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.0%</p>
</td>
<td width="83" valign="bottom">
<p align="right">9</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Serbo-Croatian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">1.0</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">20</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Slovak</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">1.2</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">6</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Slovenian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">0.8</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Spanish</td>
<td width="100" valign="bottom">
<p align="right">2.4%</p>
</td>
<td width="83" valign="bottom">
<p align="right">65.6</p>
</td>
<td width="82" valign="bottom">
<p align="right">8.1%</p>
</td>
<td width="83" valign="bottom">
<p align="right">350</p>
</td>
<td width="82" valign="bottom">
<p align="right">5.5%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Turkish</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">5.8</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">67</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.1%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Ukrainian</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">0.9</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">47</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom"><strong>SUB-TOTAL</strong></td>
<td width="100" valign="bottom">
<p align="right"><strong>18.7%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>279.0 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>34.6%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>1,230 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>19.4%</strong></p>
</td>
</tr>
<tr>
<td width="173" valign="bottom"></td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">ASIAN LANGUAGES</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Arabic</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">10.5</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.3%</p>
</td>
<td width="83" valign="bottom">
<p align="right">300</p>
</td>
<td width="82" valign="bottom">
<p align="right">4.7%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Chinese</td>
<td width="100" valign="bottom">
<p align="right">3.9%</p>
</td>
<td width="83" valign="bottom">
<p align="right">102.6</p>
</td>
<td width="82" valign="bottom">
<p align="right">12.7%</p>
</td>
<td width="83" valign="bottom">
<p align="right">874</p>
</td>
<td width="82" valign="bottom">
<p align="right">13.8%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Farsi</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">3.4</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">64</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.0%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Hebrew</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">3.8</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">5</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Japanese</td>
<td width="100" valign="bottom">
<p align="right">5.9%</p>
</td>
<td width="83" valign="bottom">
<p align="right">69.7</p>
</td>
<td width="82" valign="bottom">
<p align="right">8.6%</p>
</td>
<td width="83" valign="bottom">
<p align="right">125</p>
</td>
<td width="82" valign="bottom">
<p align="right">2.0%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Korean</td>
<td width="100" valign="bottom">
<p align="right">1.3%</p>
</td>
<td width="83" valign="bottom">
<p align="right">29.9</p>
</td>
<td width="82" valign="bottom">
<p align="right">3.7%</p>
</td>
<td width="83" valign="bottom">
<p align="right">78</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.2%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Malay</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">13.6</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.7%</p>
</td>
<td width="83" valign="bottom">
<p align="right">229</p>
</td>
<td width="82" valign="bottom">
<p align="right">3.6%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom">Thai</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">4.9</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">46</p>
</td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom">Vietnamese</td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">2.2</p>
</td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom">
<p align="right">68</p>
</td>
<td width="82" valign="bottom">
<p align="right">1.1%</p>
</td>
</tr>
<tr>
<td width="173" valign="bottom"><strong>SUB-TOTAL</strong></td>
<td width="100" valign="bottom">
<p align="right"><strong>12.9%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>240.6 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>29.8%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>1,789 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>28.3%</strong></p>
</td>
</tr>
<tr>
<td width="173" valign="bottom"></td>
<td width="100" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
<td width="83" valign="bottom"></td>
<td width="82" valign="bottom"></td>
</tr>
<tr>
<td width="173" valign="bottom"><strong>TOTAL WORLD </strong></td>
<td width="100" valign="bottom">
<p align="right"><strong>100.0%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>807.1 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>100.0%</strong></p>
</td>
<td width="83" valign="bottom">
<p align="right"><strong>6,330 </strong></p>
</td>
<td width="82" valign="bottom">
<p align="right"><strong>100.0%</strong></p>
</td>
</tr>
</tbody>
</table>
<p>English speakers account for about four and a half times the share of Internet users that population alone would suggest (35.6% of users versus 8.0% of population), and English Web pages for more than eight times that share (68.4% of pages). However, various census efforts over time have shown a steady decrease in this English prevalence (data not shown).</p>
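<p>The over- and under-representation of a language can be expressed as a simple ratio of its share of users (or pages) to its share of population. The sketch below uses a few rows from the table above; the function and variable names are hypothetical and chosen only for this illustration.</p>

```python
# Percent shares copied from the table above: Web pages, Internet users,
# and global population. Only four languages are included for brevity.
shares = {
    "English":  {"pages": 68.4, "users": 35.6, "population": 8.0},
    "Chinese":  {"pages": 3.9,  "users": 12.7, "population": 13.8},
    "Japanese": {"pages": 5.9,  "users": 8.6,  "population": 2.0},
    "Spanish":  {"pages": 2.4,  "users": 8.1,  "population": 5.5},
}


def representation_ratio(share, population_share):
    """Return how many times larger a share is than the population share.

    A value above 1.0 means the language is over-represented online
    relative to its speakers; below 1.0 means under-represented.
    """
    return share / population_share


ratios = {
    language: {
        "users": round(representation_ratio(row["users"], row["population"]), 2),
        "pages": round(representation_ratio(row["pages"], row["population"]), 2),
    }
    for language, row in shares.items()
}

for language, ratio in ratios.items():
    print(f"{language:9} users x{ratio['users']:.2f}  pages x{ratio['pages']:.2f}")
```

<p>If user and page shares converge toward population shares over time, these ratios should all drift toward 1.0.</p>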
<p>Virtually all European languages show higher Internet prevalence than actual population would suggest; Asian languages show the opposite. (African languages are even less represented than population would suggest; data not shown.)</p>
<p>Internet penetration appears to be about 20% of global population and growing rapidly. It is likely that the language percentages of Web users and of Web pages will continue to converge toward real population percentages. Thus, within the foreseeable future, users and pages should more closely approximate the percentage figures shown in the rightmost column of the table above.</p>
<p><strong>Script Families</strong></p>
<p>Another useful starting point for understanding languages and their relation to the Internet is a 2005 UN publication from a World Summit on the Information Society. This 113 pp. report can be found at <a href="http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf">http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf</a>.<a name="_ednref5" href="#_edn5">[5]</a></p>
<p>Languages have both a <em>representational form</em> and <em>meaning</em>. The representational form is captured by scripts, fonts or ideograms. The meaning is captured by semantics. In an electronic medium, it is the representational form that must be transmitted accurately. Without accurate transmittal of the form, it is impossible to manipulate that language or understand its meaning.</p>
<p>Representational forms fit within what might be termed <em>script families</em>. Script families are not strictly alphabets or even exact character or symbol matches. They represent similar written approaches and some shared characteristics.</p>
<p>For example, English and its Germanic and Romance language cousins share very similar, but not identical, alphabets. Similarly, the so-called CJK languages (Chinese, Japanese, Korean) share a similar approach to using ideograms without white space between tokens or punctuation.</p>
<p>At the highest level, the world&#8217;s languages may be clustered into the following script families:<a name="_Ref129331680"></a><a name="_ednref6" href="#_edn6">[6]</a></p>
<table style="width: 635px;" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="background-color: #cccccc;" width="99" valign="bottom">
<p align="center"><strong>Script</strong></p>
</td>
<td style="background-color: #cccccc;" width="89" valign="bottom">
<p align="center"><strong>Latin</strong></p>
</td>
<td style="background-color: #cccccc;" width="89" valign="bottom">
<p align="center"><strong>Cyrillic</strong></p>
</td>
<td style="background-color: #cccccc;" width="89" valign="bottom">
<p align="center"><strong>Arabic</strong></p>
</td>
<td style="background-color: #cccccc;" width="89" valign="bottom">
<p align="center"><strong>Hanzi</strong></p>
</td>
<td style="background-color: #cccccc;" width="89" valign="bottom">
<p align="center"><strong>Indic</strong></p>
</td>
<td style="background-color: #cccccc;" width="89" valign="bottom">
<p align="center"><strong>Others*</strong></p>
</td>
</tr>
<tr>
<td width="99" valign="bottom">Million users</td>
<td width="89" valign="bottom">
<p align="right">2,238</p>
</td>
<td width="89" valign="bottom">
<p align="right">451</p>
</td>
<td width="89" valign="bottom">
<p align="right">462</p>
</td>
<td width="89" valign="bottom">
<p align="right">1,085</p>
</td>
<td width="89" valign="bottom">
<p align="right">807</p>
</td>
<td width="89" valign="bottom">
<p align="right">129</p>
</td>
</tr>
<tr>
<td width="99" valign="bottom">% of Total</td>
<td width="89" valign="bottom">
<p align="right">43.3%</p>
</td>
<td width="89" valign="bottom">
<p align="right">8.7%</p>
</td>
<td width="89" valign="bottom">
<p align="right">8.9%</p>
</td>
<td width="89" valign="bottom">
<p align="right">21.0%</p>
</td>
<td width="89" valign="bottom">
<p align="right">15.6%</p>
</td>
<td width="89" valign="bottom">
<p align="right">2.5%</p>
</td>
</tr>
<tr>
<td width="99" valign="top">Key languages</td>
<td width="89" valign="top">Romance (European) Slavic (some) Vietnamese Malay Indonesian</td>
<td width="89" valign="top">Russian Slavic (some) Kazakh Uzbek</td>
<td width="89" valign="top">Arabic Urdu Persian Pashtu</td>
<td width="89" valign="top">Chinese Japanese Korean</td>
<td width="89" valign="top">Hindi Tamil Bengali Punjabi Sanskrit Thai</td>
<td width="89" valign="top">Greek Hebrew Georgian Assyrian Armenian</td>
</tr>
</tbody>
</table>
<p>Note that English and the Romance languages fall within the Latin script family, and the CJK languages within Hanzi. The &#8220;Others&#8221; category is a large catch-all, including Greek, Hebrew, many African languages, and others. However, besides Greek and Hebrew, most specific languages of global importance are included in the other named families. Also note that, due to differences in sources, total user counts do not equal those in earlier tables.</p>
<p><strong>Character Sets and Encodings</strong></p>
<p>In order to take advantage of the computer&#8217;s ability to manipulate text (<em>e.g.</em>, displaying, editing, sorting, searching and efficiently transmitting it), communications in a given language need to be represented in some kind of encoding. Encodings specify the arbitrary assignment of numbers to the symbols of the world&#8217;s written languages. Two different encodings can be incompatible because they assign the same number to two distinct symbols, or two different numbers to the same symbol. Thus, much of what the Internet offers with respect to linguistic diversity comes down to the encodings available for text.</p>
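<p>The incompatibility described above can be seen in a minimal sketch, written here in Python (a choice made for illustration, not a tool the article names). The byte value 0xE9 and the three codecs are arbitrary examples:</p>

```python
# One byte value read under three legacy single-byte encodings.
# The byte 0xE9 and the three codecs are arbitrary illustrative choices.
raw = bytes([0xE9])
decoded = {name: raw.decode(name) for name in ("latin-1", "cp1251", "iso8859_7")}
for name, symbol in decoded.items():
    print(f"{name:10s} 0x{raw[0]:02X} -> {symbol!r} (U+{ord(symbol):04X})")

# The reverse case: one symbol assigned two different numbers.
letter = "\u0439"  # CYRILLIC SMALL LETTER SHORT I
as_cp1251 = letter.encode("cp1251")
as_koi8r = letter.encode("koi8-r")
print(f"{letter!r}: cp1251={as_cp1251.hex()} koi8-r={as_koi8r.hex()}")
```

<p>The same byte comes back as a Latin, a Cyrillic and a Greek letter depending on which encoding the reader assumes, which is why text with an unknown or mislabeled encoding displays as gibberish.</p>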
<p>The most widely used encoding is the American Standard Code for Information Interchange (ASCII), a code devised during the 1950s and 1960s under the auspices of the American National Standards Institute (ANSI) to standardize teletype technology. This encoding comprises 128 character assignments (7-bit) and is suitable primarily for North American English.<a name="_ednref6" href="#_edn6">[6]</a></p>
<p>Historically, other languages that did not fit in the ASCII 7-bit character set (a-z; A-Z) pretty much created their own character sets, sometimes with local standards acceptance and sometimes not. Some languages have many character encodings, and some, particularly Chinese and Japanese, have very complex systems for handling the large number of unique characters. Another difficult group is Hindi and the Indic language family, with speakers that number into the hundreds of millions. According to one University of Southern California researcher, almost every Hindi-language Web site has its own encoding.<a name="_ednref7" href="#_edn7">[7]</a></p>
<p>The Internet Assigned Numbers Authority (IANA) maintains a master list of about 245 standard charset (&#8220;character set&#8221;) encodings and 550 associated aliases to the same used in one manner or another on the Internet.<a name="_ednref8" href="#_edn8">[8]</a> <a name="_ednref9" href="#_edn9">[9]</a> Some of these electronic encodings were created by large vendors with a stake in electronic transfer such as IBM, Microsoft, Apple and the like. Other standards result from recognized standards organizations such as ANSI, ISO, Unicode and the like. Many of these standards date back as far as the 1960s; many others are specific to certain countries.</p>
<p>Earlier estimates showed in the range of 40 to 250 languages per named encoding type. While no known estimate exists, if one assumes 100 languages for each of the roughly 245 IANA-listed encodings, there could be on the order of 25,000 or so specific language-encoding combinations possible on the Internet based on these &#8220;standards.&#8221; There are perhaps thousands of other, non-standard language encodings also extant.</p>
<p>Whatever the numbers, clearly it is critical to identify accurately the specific encoding and its associated language for any given Web page or database site. Without this accuracy, it is impossible to electronically query and understand the content.</p>
<p>As might be suspected, this topic too is very broad. For a very comprehensive starting point on all topics related to encodings and character sets, please see <strong>I18N</strong> (which stands for &#8220;internationalization&#8221;) <strong>Guy&#8217;s</strong> Web site at <a href="http://www.i18nguy.com/unicode/codepages.html">http://www.i18nguy.com/unicode/codepages.html</a>.</p>
<p><strong>Unicode</strong></p>
<p>In the late 1980s, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO), the other was the Unicode Project organized by a consortium of (initially mostly US) manufacturers of multi-lingual software. Fortunately, the participants of both projects realized in 1991 that two different unified character sets did not make sense and they joined efforts to create a single code table, now referred to as Unicode. While both projects still exist and publish their respective standards independently, the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and closely coordinated.</p>
<p>Unicode sets out to consolidate many different encodings, all using separate code plans, into a single system that can represent all written languages within the same character encoding. Unicode is first a set of code tables that assign an integer number, called a code point, to each character. Unicode then has several methods for representing a sequence of such characters, or their respective integer values, as a sequence of bytes; the names of these methods are generally prefixed by &#8220;UTF.&#8221;</p>
<p>In UTF-8, the most common method, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3 or 4 bytes (the original design allowed up to 6). This method has the advantage that English text looks exactly the same in UTF-8 as it did in ASCII, so ASCII is a conforming sub-set. More unusual characters such as accented letters, Greek letters or CJK ideograms may need several bytes to store a single code point.</p>
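<p>As a rough illustration of this variable-length behavior, the following Python sketch prints the code point and UTF-8 bytes for a handful of characters. The sample characters are arbitrary choices made here, not taken from the article:</p>

```python
# Code point and UTF-8 byte sequence for a few sample characters.
samples = ["A", "\u00e9", "\u03b1", "\u4e2d", "\U0001F600"]  # A, e-acute, alpha, a CJK ideogram, an emoji

utf8_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    utf8_lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")

# Plain English text is byte-for-byte identical in ASCII and UTF-8.
same_as_ascii = "English text".encode("utf-8") == "English text".encode("ascii")
print("ASCII-compatible:", same_as_ascii)
```

<p>Only the first sample fits in one byte; the accented and Greek letters take two, the ideogram three, and the emoji four.</p>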
<p>The traditional store-it-in-two-bytes method for Unicode is called UCS-2 (because it has two bytes); its successor, UTF-16 (because it uses 16-bit units), adds pairs of such units for code points that do not fit in two bytes. There&#8217;s something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero. There&#8217;s UCS-4, also known as UTF-32, which stores each code point in 4 bytes (32 bits); it has the nice property that every single code point is stored in the same number of bytes, at the cost of more storage. Regardless, UTF-7, -8, -16, and -32 all have the property of being able to store any code point correctly.</p>
<p>BrightPlanet, along with many others, has adopted UTF-8 as the standard Unicode method to process all string data. There are tools available to convert nearly any existing character encoding into a UTF-8 encoded string. Java supplies these tools as does <a title="Basis Technology Corporation" href="http://www.basistech.com/">Basis Technology</a>, one of BrightPlanet&#8217;s partners in language processing.</p>
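<p>The trade-offs among these methods can be compared with a short Python sketch. The three-character sample string is an assumption made here for illustration, and the codec names are those of Python&#8217;s standard library:</p>

```python
# Encode the same three code points with four Unicode encoding forms.
text = "A\u4e2d\U0001F600"  # ASCII letter, CJK ideogram, emoji beyond the two-byte range

encoded = {
    codec: text.encode(codec)
    for codec in ("utf-7", "utf-8", "utf-16-le", "utf-32-le")
}

for codec, data in encoded.items():
    round_trips = data.decode(codec) == text
    print(f"{codec:10s} {len(data):2d} bytes  round-trips: {round_trips}")

# UTF-7 never sets the high bit, so its output is pure 7-bit data.
print("UTF-7 output is 7-bit only:", encoded["utf-7"].isascii())
```

<p>Every form recovers the original string; they differ only in how many bytes they spend and in which byte values they allow.</p>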
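<p>A conversion of this kind might look like the following minimal sketch. It uses Python&#8217;s built-in codecs rather than the Java or Basis Technology tools mentioned above; the helper name <code>to_utf8</code> and the KOI8-R sample are hypothetical, and the source encoding is assumed to be known in advance:</p>

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8.

    Hypothetical helper for illustration; raises UnicodeDecodeError if the
    bytes are not valid in the stated source encoding.
    """
    return raw.decode(source_encoding).encode("utf-8")


# A Russian greeting as it would be stored in the legacy KOI8-R encoding.
greeting = "\u041f\u0440\u0438\u0432\u0435\u0442"
legacy = greeting.encode("koi8-r")

utf8 = to_utf8(legacy, "koi8-r")
print(len(legacy), "bytes in KOI8-R ->", len(utf8), "bytes in UTF-8")
print(utf8.decode("utf-8"))
```

<p>The conversion is only as good as the identification of the source encoding: passing the wrong encoding name either raises an error or silently produces the wrong characters.</p>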
<p>As presently defined, Unicode supports about 245 common languages according to a variety of scripts (see notes at end of the table):<a name="_ednref10" href="#_edn10">[10]</a></p>
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="background-color: #cccccc;" valign="bottom">
<p align="center"><strong>Language </strong></p>
</td>
<td style="background-color: #cccccc;" valign="bottom">
<p align="center"><strong>Script(s) </strong></p>
</td>
<td style="background-color: #cccccc;" valign="bottom">
<p align="center"><strong>Some Country Notes</strong></p>
</td>
</tr>
<tr>
<td valign="bottom">Abaza</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Abkhaz</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Adygei</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Afrikaans</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Ainu</td>
<td valign="bottom">Katakana, Latin</td>
<td valign="bottom">Japan</td>
</tr>
<tr>
<td valign="bottom">Aisor</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Albanian</td>
<td valign="bottom">Latin [2]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Altai</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Amharic</td>
<td valign="bottom">Ethiopic</td>
<td valign="bottom">Ethiopia</td>
</tr>
<tr>
<td valign="bottom">Amo</td>
<td valign="bottom">Latin</td>
<td valign="bottom">Nigeria</td>
</tr>
<tr>
<td valign="bottom">Arabic</td>
<td valign="bottom">Arabic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Armenian</td>
<td valign="bottom">Armenian, Syriac [3]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Assamese</td>
<td valign="bottom">Bengali</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Assyrian (modern)</td>
<td valign="bottom">Syriac</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Avar</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Awadhi</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India, Nepal</td>
</tr>
<tr>
<td valign="bottom">Aymara</td>
<td valign="bottom">Latin</td>
<td valign="bottom">Peru</td>
</tr>
<tr>
<td valign="bottom">Azeri</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Azerbaijani</td>
<td valign="bottom">Arabic, Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Badaga</td>
<td valign="bottom">Tamil</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Bagheli</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India, Nepal</td>
</tr>
<tr>
<td valign="bottom">Balear</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Balkar</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Balti</td>
<td valign="bottom">Devanagari, Balti [2]</td>
<td valign="bottom">India, Pakistan</td>
</tr>
<tr>
<td valign="bottom">Bashkir</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Basque</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Batak</td>
<td valign="bottom">Batak [1], Latin</td>
<td valign="bottom">Philippines, Indonesia</td>
</tr>
<tr>
<td valign="bottom">Batak toba</td>
<td valign="bottom">Batak [1], Latin</td>
<td valign="bottom">Indonesia</td>
</tr>
<tr>
<td valign="bottom">Bateri</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">(aka Bhatneri) India, Pakistan</td>
</tr>
<tr>
<td valign="bottom">Belarusian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom">(aka Belorussian, Belarusan)</td>
</tr>
<tr>
<td valign="bottom">Bengali</td>
<td valign="bottom">Bengali</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Bhili</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Bhojpuri</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Bihari</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Bosnian</td>
<td valign="bottom">Latin</td>
<td valign="bottom">Bosnia-Herzegovina</td>
</tr>
<tr>
<td valign="bottom">Braj bhasha</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Breton</td>
<td valign="bottom">Latin</td>
<td valign="bottom">France</td>
</tr>
<tr>
<td valign="bottom">Bugis</td>
<td valign="bottom">Buginese [1]</td>
<td valign="bottom">Indonesia, Malaysia</td>
</tr>
<tr>
<td valign="bottom">Buhid</td>
<td valign="bottom">Buhid</td>
<td valign="bottom">Philippines</td>
</tr>
<tr>
<td valign="bottom">Bulgarian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Burmese</td>
<td valign="bottom">Myanmar</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Buryat</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Bahasa</td>
<td valign="bottom">Latin</td>
<td valign="bottom">(see Indonesian)</td>
</tr>
<tr>
<td valign="bottom">Catalan</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Chakma</td>
<td valign="bottom">Bengali, Chakma [1]</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Cham</td>
<td valign="bottom">Cham [1]</td>
<td valign="bottom">Cambodia, Thailand, Viet Nam</td>
</tr>
<tr>
<td valign="bottom">Chechen</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom">Georgia</td>
</tr>
<tr>
<td valign="bottom">Cherokee</td>
<td valign="bottom">Cherokee, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Chhattisgarhi</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Chinese</td>
<td valign="bottom">Han</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Chukchi</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Chuvash</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Coptic</td>
<td valign="bottom">Greek</td>
<td valign="bottom">Egypt</td>
</tr>
<tr>
<td valign="bottom">Cornish</td>
<td valign="bottom">Latin</td>
<td valign="bottom">United Kingdom</td>
</tr>
<tr>
<td valign="bottom">Corsican</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Cree</td>
<td valign="bottom">Canadian Aboriginal Syllabics, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Croatian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Czech</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Danish</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Dargwa</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Dhivehi</td>
<td valign="bottom">Thaana</td>
<td valign="bottom">Maldives</td>
</tr>
<tr>
<td valign="bottom">Dungan</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Dutch</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Dzongkha</td>
<td valign="bottom">Tibetan</td>
<td valign="bottom">Bhutan</td>
</tr>
<tr>
<td valign="bottom">Edo</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">English</td>
<td valign="bottom">Latin, Deseret [3], Shavian [3]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Esperanto</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Estonian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Evenki</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Faroese</td>
<td valign="bottom">Latin</td>
<td valign="bottom">Faroe Islands</td>
</tr>
<tr>
<td valign="bottom">Farsi</td>
<td valign="bottom">Arabic</td>
<td valign="bottom">(aka Persian)</td>
</tr>
<tr>
<td valign="bottom">Fijian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Finnish</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">French</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Frisian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Gaelic</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Gagauz</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Garhwali</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Garo</td>
<td valign="bottom">Bengali</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Gascon</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Ge&#8217;ez</td>
<td valign="bottom">Ethiopic</td>
<td valign="bottom">Eritrea, Ethiopia</td>
</tr>
<tr>
<td valign="bottom">Georgian</td>
<td valign="bottom">Georgian</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">German</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Gondi</td>
<td valign="bottom">Devanagari, Telugu</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Greek</td>
<td valign="bottom">Greek</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Guarani</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Gujarati</td>
<td valign="bottom">Gujarati</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Garshuni</td>
<td valign="bottom">Syriac</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Hanunóo</td>
<td valign="bottom">Latin, Hanunóo</td>
<td valign="bottom">Philippines</td>
</tr>
<tr>
<td valign="bottom">Harauti</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Hausa</td>
<td valign="bottom">Latin, Arabic [3]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Hawaiian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Hebrew</td>
<td valign="bottom">Hebrew</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Hindi</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Hmong</td>
<td valign="bottom">Latin, Hmong [1]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Ho</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Hopi</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Hungarian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Ibibio</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Icelandic</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Indonesian</td>
<td valign="bottom">Arabic [3], Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Ingush</td>
<td valign="bottom">Arabic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Inuktitut</td>
<td valign="bottom">Canadian Aboriginal Syllabics, Latin</td>
<td valign="bottom">Canada</td>
</tr>
<tr>
<td valign="bottom">Iñupiaq</td>
<td valign="bottom">Latin</td>
<td valign="bottom">Greenland</td>
</tr>
<tr>
<td valign="bottom">Irish</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Italian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Japanese</td>
<td valign="bottom">Han + Hiragana + Katakana</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Javanese</td>
<td valign="bottom">Latin, Javanese [1]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Judezmo</td>
<td valign="bottom">Hebrew</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Kabardian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Kachchi</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Kalmyk</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Kanauji</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Kankan</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Kannada</td>
<td valign="bottom">Kannada</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Kanuri</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Khanty</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Karachay</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Karakalpak</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Karelian</td>
<td valign="bottom">Latin, Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Kashmiri</td>
<td valign="bottom">Devanagari, Arabic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Kazakh</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Khakass</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Khamti</td>
<td valign="bottom">Myanmar</td>
<td valign="bottom">India, Myanmar</td>
</tr>
<tr>
<td valign="bottom">Khasi</td>
<td valign="bottom">Latin, Bengali</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Khmer</td>
<td valign="bottom">Khmer</td>
<td valign="bottom">Cambodia</td>
</tr>
<tr>
<td valign="bottom">Kirghiz</td>
<td valign="bottom">Arabic [3], Latin, Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Komi</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Konkan</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Korean</td>
<td valign="bottom">Hangul + Han</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Koryak</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Kurdish</td>
<td valign="bottom">Arabic, Cyrillic, Latin</td>
<td valign="bottom">Iran, Iraq</td>
</tr>
<tr>
<td valign="bottom">Kuy</td>
<td valign="bottom">Thai</td>
<td valign="bottom">Cambodia, Laos, Thailand</td>
</tr>
<tr>
<td valign="bottom">Ladino</td>
<td valign="bottom">Hebrew</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Lak</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Lambadi</td>
<td valign="bottom">Telugu</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Lao</td>
<td valign="bottom">Lao</td>
<td valign="bottom">Laos</td>
</tr>
<tr>
<td valign="bottom">Lapp</td>
<td valign="bottom">Latin</td>
<td valign="bottom">(see Sami)</td>
</tr>
<tr>
<td valign="bottom">Latin</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Latvian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Lawa, eastern</td>
<td valign="bottom">Thai</td>
<td valign="bottom">Thailand</td>
</tr>
<tr>
<td valign="bottom">Lawa, western</td>
<td valign="bottom">Thai</td>
<td valign="bottom">China, Thailand</td>
</tr>
<tr>
<td valign="bottom">Lepcha</td>
<td valign="bottom">Lepcha [1]</td>
<td valign="bottom">Bhutan, India, Nepal</td>
</tr>
<tr>
<td valign="bottom">Lezghian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Limbu</td>
<td valign="bottom">Devanagari, Limbu [1]</td>
<td valign="bottom">Bhutan, India, Nepal</td>
</tr>
<tr>
<td valign="bottom">Lisu</td>
<td valign="bottom">Lisu (Fraser) [1], Latin</td>
<td valign="bottom">China</td>
</tr>
<tr>
<td valign="bottom">Lithuanian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Lushootseed</td>
<td valign="bottom">Latin</td>
<td valign="bottom">USA</td>
</tr>
<tr>
<td valign="bottom">Luxemburgish</td>
<td valign="bottom">Latin</td>
<td valign="bottom">(aka Luxembourgeois)</td>
</tr>
<tr>
<td valign="bottom">Macedonian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Malay</td>
<td valign="bottom">Arabic [3], Latin</td>
<td valign="bottom">Brunei, Indonesia, Malaysia</td>
</tr>
<tr>
<td valign="bottom">Malayalam</td>
<td valign="bottom">Malayalam</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Maldivian</td>
<td valign="bottom">Thaana</td>
<td valign="bottom">Maldives (See Dhivehi)</td>
</tr>
<tr>
<td valign="bottom">Maltese</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Manchu</td>
<td valign="bottom">Mongolian</td>
<td valign="bottom">China</td>
</tr>
<tr>
<td valign="bottom">Mansi</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Marathi</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Mari</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Marwari</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Meitei</td>
<td valign="bottom">Meetai Mayek [1], Bengali</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Moldavian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Mon</td>
<td valign="bottom">Myanmar</td>
<td valign="bottom">Myanmar, Thailand</td>
</tr>
<tr>
<td valign="bottom">Mongolian</td>
<td valign="bottom">Mongolian, Cyrillic</td>
<td valign="bottom">China, Mongolia</td>
</tr>
<tr>
<td valign="bottom">Mordvin</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Mundari</td>
<td valign="bottom">Bengali, Devanagari</td>
<td valign="bottom">Bangladesh, India, Nepal</td>
</tr>
<tr>
<td valign="bottom">Naga</td>
<td valign="bottom">Latin, Bengali</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Nanai</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Navajo</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Naxi</td>
<td valign="bottom">Naxi [2]</td>
<td valign="bottom">China</td>
</tr>
<tr>
<td valign="bottom">Nenets</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Nepali</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Netets</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Newari</td>
<td valign="bottom">Devanagari, Ranjana, Parachalit</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Nogai</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Norwegian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Oriya</td>
<td valign="bottom">Oriya</td>
<td valign="bottom">Bangladesh, India</td>
</tr>
<tr>
<td valign="bottom">Oromo</td>
<td valign="bottom">Ethiopic</td>
<td valign="bottom">Egypt, Ethiopia, Somalia</td>
</tr>
<tr>
<td valign="bottom">Ossetic</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Pali</td>
<td valign="bottom">Sinhala, Devanagari, Thai</td>
<td valign="bottom">India, Myanmar, Sri Lanka</td>
</tr>
<tr>
<td valign="bottom">Panjabi</td>
<td valign="bottom">Gurmukhi</td>
<td valign="bottom">India (see Punjabi)</td>
</tr>
<tr>
<td valign="bottom">Parsi-dari</td>
<td valign="bottom">Arabic</td>
<td valign="bottom">Afghanistan, Iran</td>
</tr>
<tr>
<td valign="bottom">Pashto</td>
<td valign="bottom">Arabic</td>
<td valign="bottom">Afghanistan</td>
</tr>
<tr>
<td valign="bottom">Polish</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Portuguese</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Provençal</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Prussian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Punjabi</td>
<td valign="bottom">Gurmukhi</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Quechua</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Riang</td>
<td valign="bottom">Bengali</td>
<td valign="bottom">Bangladesh, China, India, Myanmar</td>
</tr>
<tr>
<td valign="bottom">Romanian</td>
<td valign="bottom">Latin, Cyrillic [3]</td>
<td valign="bottom">(aka Rumanian)</td>
</tr>
<tr>
<td valign="bottom">Romany</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Russian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Sami</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Samaritan</td>
<td valign="bottom">Hebrew, Samaritan [1]</td>
<td valign="bottom">Israel</td>
</tr>
<tr>
<td valign="bottom">Sanskrit</td>
<td valign="bottom">Sinhala, Devanagari, etc.</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Santali</td>
<td valign="bottom">Devanagari, Bengali, Oriya, Ol Cemet [1]</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Selkup</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Serbian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Shan</td>
<td valign="bottom">Myanmar</td>
<td valign="bottom">China, Myanmar, Thailand</td>
</tr>
<tr>
<td valign="bottom">Sherpa</td>
<td valign="bottom">Devanagari</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Shona</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Shor</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Sindhi</td>
<td valign="bottom">Arabic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Sinhala</td>
<td valign="bottom">Sinhala</td>
<td valign="bottom">(aka Sinhalese) Sri Lanka</td>
</tr>
<tr>
<td valign="bottom">Slovak</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Slovenian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Somali</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Spanish</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Swahili</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Swedish</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Sylhetti</td>
<td valign="bottom">Siloti Nagri [1], Bengali</td>
<td valign="bottom">Bangladesh</td>
</tr>
<tr>
<td valign="bottom">Syriac</td>
<td valign="bottom">Syriac</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Swadaya</td>
<td valign="bottom">Syriac</td>
<td valign="bottom">(see Syriac)</td>
</tr>
<tr>
<td valign="bottom">Tabasaran</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tagalog</td>
<td valign="bottom">Latin, Tagalog</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tagbanwa</td>
<td valign="bottom">Latin, Tagbanwa</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tahitian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tajik</td>
<td valign="bottom">Arabic [3], Latin, Cyrillic (? Latin)</td>
<td valign="bottom">(aka Tadzhik)</td>
</tr>
<tr>
<td valign="bottom">Tamazight</td>
<td valign="bottom">Tifinagh [1], Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tamil</td>
<td valign="bottom">Tamil</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tat</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tatar</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Telugu</td>
<td valign="bottom">Telugu</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Thai</td>
<td valign="bottom">Thai</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tibetan</td>
<td valign="bottom">Tibetan</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tigre</td>
<td valign="bottom">Ethiopic</td>
<td valign="bottom">Eritrea, Sudan</td>
</tr>
<tr>
<td valign="bottom">Tsalagi</td>
<td valign="bottom">(see Cherokee)</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tulu</td>
<td valign="bottom">Kannada</td>
<td valign="bottom">India</td>
</tr>
<tr>
<td valign="bottom">Turkish</td>
<td valign="bottom">Arabic [3], Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Turkmen</td>
<td valign="bottom">Arabic [3], Latin, Cyrillic (? Latin)</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Tuva</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Turoyo</td>
<td valign="bottom">Syriac</td>
<td valign="bottom">(see Syriac)</td>
</tr>
<tr>
<td valign="bottom">Udekhe</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Udmurt</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Uighur</td>
<td valign="bottom">Arabic, Latin, Cyrillic, Uighur [1]</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Ukrainian</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Urdu</td>
<td valign="bottom">Arabic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Uzbek</td>
<td valign="bottom">Cyrillic, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Valencian</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Vietnamese</td>
<td valign="bottom">Latin, Chu Nom</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Yakut</td>
<td valign="bottom">Cyrillic</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Yi</td>
<td valign="bottom">Yi, Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Yiddish</td>
<td valign="bottom">Hebrew</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom">Yoruba</td>
<td valign="bottom">Latin</td>
<td valign="bottom"></td>
</tr>
<tr>
<td valign="bottom"></td>
<td valign="bottom"></td>
<td valign="bottom"></td>
</tr>
<tr>
<td colspan="2" valign="bottom"><em>[1] = Not yet encoded in Unicode.</em></td>
<td valign="bottom"></td>
</tr>
<tr>
<td colspan="3" valign="bottom"><em>[2] = Has one or more extinct or minor native script(s), not yet encoded.</em></td>
</tr>
<tr>
<td colspan="3" valign="bottom"><em>[3] = Formerly or historically used this script, now uses another.</em></td>
</tr>
</tbody>
</table>
<p>Notice that most of these scripts fall into the seven broader script families, such as Latin, Hanzi and Indic, noted previously.</p>
<p>While more countries are adopting Unicode and sample results indicate an increasing percentage of use, it is by no means prevalent. In general, Europe has been slow to embrace Unicode, with many legacy encodings still in use; Arabic sites have perhaps reached the 50% level; and Asian use is problematic.<a name="_ednref11" href="#_edn11">[11]</a> Other samples suggest that UTF-8 encoding is limited to 8.35% of all Asian Web pages. Some countries, such as Nepal, Vietnam and Tajikistan, exceed 70% compliance, while others, such as Syria, Laos and Brunei, are below even 1%.<a name="_ednref12" href="#_edn12">[12]</a> According to the Archive Pass project, which also used Basis Tech&#8217;s RLI for encoding detection, Chinese sites are dominated by GB-2312 and Big5 encodings, while Shift-JIS is most common for Japanese.<a name="_ednref13" href="#_edn13">[13]</a></p>
<p><strong>Detecting and Communicating with Legacy Encodings</strong></p>
<p>There are two primary problems when dealing with non-Unicode encodings: identifying what the encoding is, and converting text in that encoding to a Unicode form, usually UTF-8. Detecting the encoding is a difficult process, though Basis Tech&#8217;s RLI does an excellent job. Converting the non-Unicode string to a Unicode string can then be done easily using tools available in the Java JDK, or using Basis Tech&#8217;s RCLU library.</p>
<p>Basis Tech detects 96 language/encoding pairs, involving 40 different languages and 30 unique encoding types:</p>
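<p>As a minimal sketch of the conversion half of the problem, the Java class below decodes bytes from a legacy encoding and re-encodes them as UTF-8 using only standard JDK classes. It assumes the encoding name has already been supplied by a separate detection step, which is not shown; the class name <code>LegacyToUtf8</code> and method <code>toUtf8</code> are hypothetical names chosen for illustration, and ISO-8859-1 merely stands in for whatever legacy encoding was detected.</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any Basis Tech or JDK API.
final class LegacyToUtf8 {

    // Decode bytes from a known legacy encoding and re-encode them as UTF-8.
    // The encoding name is assumed to come from an earlier detection step.
    static byte[] toUtf8(byte[] legacyBytes, String detectedEncoding) {
        Charset source = Charset.forName(detectedEncoding);
        String text = new String(legacyBytes, source);   // legacy bytes to a Unicode string
        return text.getBytes(StandardCharsets.UTF_8);    // Unicode string to UTF-8 bytes
    }

    public static void main(String[] args) {
        // The letter e-acute is one byte (0xE9) in ISO-8859-1.
        byte[] latin1 = { (byte) 0xE9 };
        byte[] utf8 = toUtf8(latin1, "ISO-8859-1");
        // In UTF-8 the same letter takes two bytes (0xC3 0xA9).
        System.out.println(utf8.length + " bytes in UTF-8"); // prints "2 bytes in UTF-8"
    }
}
```

<p>Note that <code>new String(bytes, charset)</code> silently substitutes a replacement character for bytes it cannot decode, so a wrong detection result produces garbled text rather than an error.</p>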
<table style="width: 583px;" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="background-color: #cccccc;" width="152" valign="bottom">
<p align="center"><strong>Language</strong></p>
</td>
<td style="background-color: #cccccc;" width="431" valign="bottom">
<p align="center"><strong>Encoding</strong></p>
</td>
</tr>
<tr>
<td width="152" valign="bottom">Albanian</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Arabic</td>
<td width="431" valign="bottom">UTF-8, Windows-1256, ISO-8859-6</td>
</tr>
<tr>
<td width="152" valign="bottom">Bahasa Indonesia</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Bahasa Malay</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Bulgarian</td>
<td width="431" valign="bottom">UTF-8, Windows-1251, ISO-8859-5, KOI8-R</td>
</tr>
<tr>
<td width="152" valign="bottom">Catalan</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Chinese (Simplified)</td>
<td width="431" valign="bottom">UTF-8, GB-2312, <span style="color: #ff0000;"><strong>HZ-GB-2312</strong></span>, ISO-2022-CN</td>
</tr>
<tr>
<td width="152" valign="bottom">Chinese (Traditional)</td>
<td width="431" valign="bottom">UTF-8, Big5</td>
</tr>
<tr>
<td width="152" valign="bottom">Croatian</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Czech</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Danish</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Dutch</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">English</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Estonian</td>
<td width="431" valign="bottom">UTF-8, Windows-1257</td>
</tr>
<tr>
<td width="152" valign="bottom">Farsi</td>
<td width="431" valign="bottom">UTF-8, Windows-1256</td>
</tr>
<tr>
<td width="152" valign="bottom">Finnish</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">French</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">German</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Greek</td>
<td width="431" valign="bottom">UTF-8, Windows-1253</td>
</tr>
<tr>
<td width="152" valign="bottom">Hebrew</td>
<td width="431" valign="bottom">UTF-8, Windows-1255</td>
</tr>
<tr>
<td width="152" valign="bottom">Hungarian</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Icelandic</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Italian</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Japanese</td>
<td width="431" valign="bottom">UTF-8, EUC-JP, ISO-2022-JP, Shift-JIS</td>
</tr>
<tr>
<td width="152" valign="bottom">Korean</td>
<td width="431" valign="bottom">UTF-8, EUC-KR, ISO-2022-KR</td>
</tr>
<tr>
<td width="152" valign="bottom">Latvian</td>
<td width="431" valign="bottom">UTF-8, Windows-1257</td>
</tr>
<tr>
<td width="152" valign="bottom">Lithuanian</td>
<td width="431" valign="bottom">UTF-8, Windows-1257</td>
</tr>
<tr>
<td width="152" valign="bottom">Norwegian</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Polish</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Portuguese</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Romanian</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Russian</td>
<td width="431" valign="bottom">UTF-8, Windows-1251, ISO-8859-5, IBM-866, KOI8-R, x-Mac-Cyrillic</td>
</tr>
<tr>
<td width="152" valign="bottom">Slovak</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Slovenian</td>
<td width="431" valign="bottom">UTF-8, Windows-1250</td>
</tr>
<tr>
<td width="152" valign="bottom">Spanish</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Swedish</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Tagalog</td>
<td width="431" valign="bottom">UTF-8, Windows-1252</td>
</tr>
<tr>
<td width="152" valign="bottom">Thai</td>
<td width="431" valign="bottom">UTF-8, <span style="color: #ff0000;"><strong>Windows-874</strong></span></td>
</tr>
<tr>
<td width="152" valign="bottom">Turkish</td>
<td width="431" valign="bottom">UTF-8, Windows-1254</td>
</tr>
<tr>
<td width="152" valign="bottom">Vietnamese</td>
<td width="431" valign="bottom">UTF-8, <span style="color: #ff0000;"><strong>VISCII</strong>, <strong>VPS</strong>, <strong>VIQR</strong>, <strong>TCVN</strong>, <strong>VNI</strong></span></td>
</tr>
</tbody>
</table>
<p>Java SDK encoding/decoding supports 22 basic European encodings and 125 other international encodings (mostly non-European), for 147 in total. If an encoded form is not on this list and is not already Unicode, software cannot talk to the site without special converters or adapters. See <a href="http://java.sun.com/j2se/1.5.0/docs/guide/intl/encoding.doc.html">http://java.sun.com/j2se/1.5.0/docs/guide/intl/encoding.doc.html</a>.</p>
<p>Of course, to avoid the classic &#8220;garbage in, garbage out&#8221; (GIGO) problem, three conditions must be met: the source&#8217;s encoding type must be accurately detected; there must be a converter from that type into a canonical, internal form (such as UTF-8); and another converter must exist for converting that canonical form back to the source&#8217;s original encoding. The combination of the existing Basis Tech RLI and the Java SDK produces 89 valid language/encoding pairs (with invalid combinations shown in <strong><span style="color: #ff0000;">Bold Red</span></strong> above).</p>
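<p>Whether a given encoding has a converter can be checked at run time rather than looked up in a list. The sketch below uses the JDK&#8217;s <code>Charset.isSupported</code> for this; the class name <code>EncodingSupportCheck</code> and method <code>isConvertible</code> are hypothetical, and the answer for any particular name depends on the JVM and version in use, so no specific results are claimed here.</p>

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration.
final class EncodingSupportCheck {

    // True when this JVM ships a converter for the named encoding.
    static boolean isConvertible(String encodingName) {
        try {
            return Charset.isSupported(encodingName);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Sample names only; results vary by JVM and installed charset providers.
        String[] candidates = { "UTF-8", "Shift_JIS", "Big5", "KOI8-R", "VISCII" };
        for (String name : candidates) {
            String verdict = isConvertible(name) ? "supported" : "needs a custom converter";
            System.out.println(name + ": " + verdict);
        }
        System.out.println("Charsets on this JVM: " + Charset.availableCharsets().size());
    }
}
```

<p>Only a handful of charsets (including UTF-8 and ISO-8859-1) are guaranteed on every Java platform, which is why a check of this kind is worth running on the target machine.</p>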
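<p>One way to guard against GIGO is to test a detected encoding against the actual bytes before trusting it. The following is a sketch under stated assumptions, not the method used by RLI: it decodes strictly (reporting errors instead of substituting characters), re-encodes the result in the same encoding, and checks that the original bytes come back. The names <code>RoundTripCheck</code> and <code>roundTrips</code> are hypothetical.</p>

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;

// Hypothetical helper for illustration.
final class RoundTripCheck {

    // True when the bytes decode cleanly in the named encoding and the
    // canonical (Unicode) form converts back to exactly the same bytes.
    static boolean roundTrips(byte[] original, String encodingName) {
        Charset charset = Charset.forName(encodingName);
        try {
            // Source encoding to canonical form; fail on any bad byte sequence.
            CharBuffer canonical = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(original));
            // Canonical form back to the source encoding.
            ByteBuffer back = charset.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(canonical);
            byte[] result = new byte[back.remaining()];
            back.get(result);
            return Arrays.equals(original, result);
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute as a single ISO-8859-1 byte.
        byte[] sample = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        System.out.println(roundTrips(sample, "ISO-8859-1")); // prints true
        System.out.println(roundTrips(sample, "UTF-8"));      // prints false: 0xE9 alone is malformed UTF-8
    }
}
```

<p>A passing check does not prove the guess is right, since many single-byte encodings accept any byte sequence, and stateful encodings such as ISO-2022-JP may legitimately re-encode to different bytes. It is best read as a cheap filter for clearly wrong guesses.</p>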
<p>Fortunately, existing valid combinations appear to cover all prevalent languages and encoding types. Should gaps exist, specialized detectors and converters may be required. As events move forward, the family of Indic languages may be the most problematic for expansion with standard tools.</p>
<p><strong>Actual Language Processing</strong></p>
<p>Encoding detection, and the resulting proper storage and language identification, is but the first essential step in actual language processing. Additional tools in morphological analysis or machine translation may need to be applied to address actual analyst needs. These tools are beyond the scope of this Tutorial.</p>
<p>The key point, however, is that all foreign language processing and analysis begins with accurate encoding detection and communicating with the host site in its original encoding. These steps are the <em>sine qua non</em> of language processing.</p>
<p><strong>Exemplar Methodology for Internet Foreign Language Support</strong></p>
<p>We can now take the information in this Tutorial and present what might be termed an exemplar methodology for initial language detection and processing. A schematic of this methodology is provided in the following diagram:</p>
<p><img src="wp-content/themes/ai3/images/2006Posts/060323a_LanguageHarvests.gif" alt="" width="481" height="349" /></p>
<p>This diagram shows that the actual encoding of an original Web document or search form must be detected and then converted into a standard &#8220;canonical&#8221; form for internal storage, while the source itself must still be addressed in its native encoding when it is searched. Encoding detection software and utilities within the Java SDK can aid this process greatly.</p>
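<p>The last step of the methodology, talking to a site in its native encoding, can be sketched with the JDK&#8217;s <code>URLEncoder</code>. The example assumes the site expects query terms percent-encoded in its own page encoding, which is common for search forms but should be confirmed per site; the names <code>NativeQueryEncoder</code> and <code>encodeForSite</code> are hypothetical.</p>

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration.
final class NativeQueryEncoder {

    // Percent-encode a canonical (Unicode) query string using the
    // encoding that was detected for the target site.
    static String encodeForSite(String canonicalQuery, String siteEncoding) {
        try {
            return URLEncoder.encode(canonicalQuery, siteEncoding);
        } catch (UnsupportedEncodingException e) {
            // No converter on this JVM: a special adapter would be required.
            throw new IllegalArgumentException("No converter for " + siteEncoding, e);
        }
    }

    public static void main(String[] args) {
        String query = "caf\u00e9"; // "caf" plus e-acute, held internally as Unicode
        System.out.println(encodeForSite(query, "ISO-8859-1")); // prints caf%E9
        System.out.println(encodeForSite(query, "UTF-8"));      // prints caf%C3%A9
    }
}
```

<p>The same internal string yields different bytes on the wire depending on the target encoding, which is why the original encoding of each source has to be stored alongside its canonical text.</p>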
<p>And, as the proliferation of languages and legacy forms grows, we can expect such utilities to embrace an ever-widening set of encodings.</p>
<hr size="1" /><p><a name="_edn1" href="#_ednref1">[1]</a> Yoshiki Mikami, &#8220;Language Observatory: Scanning Cyberspace for Languages,&#8221; from The Second Language Observatory Workshop, February 21-25, 2005, 41 pp. See <a href="http://gii.nagaokaut.ac.jp/~zaidi/Proceedings%20Online/01_Mikami.pdf">http://gii.nagaokaut.ac.jp/~zaidi/Proceedings%20Online/01_Mikami.pdf</a>. This is a generally useful reference on Internet and language. Please note some of the figures have been updated with more recent data.</p>
<p><a name="_edn2" href="#_ednref2">[2]</a> See <a href="http://www.ethnologue.com/ethno_docs/distribution.asp?by=size">http://www.ethnologue.com/ethno_docs/distribution.asp?by=size</a>.</p>
<p><a name="_edn3" href="#_ednref3">[3]</a> See <a href="http://global-reach.biz/globstats/index.php3">http://global-reach.biz/globstats/index.php3</a>. Also, for useful specific notes by country as well as original references, see <a href="http://global-reach.biz/globstats/refs.php3">http://global-reach.biz/globstats/refs.php3</a>.</p>
<p><a name="_edn4" href="#_ednref4">[4]</a> Another interesting language source with an emphasis on Latin family languages is FUNREDES&#8217; 2005 study of languages and cultures. See <a href="http://funredes.org/LC/english/index.html">http://funredes.org/LC/english/index.html</a>.</p>
<p><a name="_edn5" href="#_ednref5">[5]</a> John Paolillo, Daniel Pimienta, Daniel Prado, et al., <em>Measuring Linguistic Diversity on the Internet,</em> UNESCO Publications for the World Summit on the Information Society 2005, 113 pp. See <a href="http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf">http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf</a>.</p>
<p><a name="_edn6" href="#_ednref6">[6]</a> John Paolillo, &#8220;Language Diversity on the Internet,&#8221; pp. 43-89, in John Paolillo, Daniel Pimienta, Daniel Prado, et al.,<em> Measuring Linguistic Diversity on the Internet,</em> UNESCO Publications for the World Summit on the Information Society 2005, 113 pp. See <a href="http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf">http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf</a>.</p>
<p><a name="_edn7" href="#_ednref7">[7]</a> Information Sciences Institute press release, &#8220;USC Researchers Build Machine Translation System  &#8211;  and More &#8212; for Hindi in Less Than a Month,&#8221; June 30, 2003. See <a href="http://www.isi.edu/stories/60.html">http://www.isi.edu/stories/60.html</a>.</p>
<p><a name="_edn8" href="#_ednref8">[8]</a> <a href="http://www.iana.org/assignments/character-sets">http://www.iana.org/assignments/character-sets</a>.</p>
<p><a name="_edn9" href="#_ednref9">[9]</a> The actual values were calculated from Jukka &#8220;Yucca&#8221; Korpela&#8217;s informative Web site at <a href="http://www.cs.tut.fi/%7Ejkorpela/chars/sorted.html">http://www.cs.tut.fi/%7Ejkorpela/chars/sorted.html</a>.</p>
<p><a name="_edn10" href="#_ednref10">[10]</a> See <a href="http://www.unicode.org/onlinedat/languages-scripts.html">http://www.unicode.org/onlinedat/languages-scripts.html</a>.</p>
<p><a name="_edn11" href="#_ednref11">[11]</a> Pers. Comm., B. Margulies, Basis Technology, Inc., Feb. 27, 2006.</p>
<p><a name="_edn12" href="#_ednref12">[12]</a> Yoshiki Mikami et al., &#8220;Language Diversity on the Internet: An Asian View,&#8221; pp. 91-103, in John Paolillo, Daniel Pimienta, Daniel Prado, et al.,<em> Measuring Linguistic Diversity on the Internet,</em> UNESCO Publications for the World Summit on the Information Society 2005, 113 pp. See <a href="http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf">http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf</a>.</p>
<p><a name="_edn13" href="#_ednref13">[13]</a> Archive Pass Project; see <a href="http://crawler.archive.org/cgi-bin/wiki.pl?ArchivePassProject">http://crawler.archive.org/cgi-bin/wiki.pl?ArchivePassProject</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Large-scale Intelligence Analysis</title>
		<link>http://www.mkbergman.com/171/large-scale-intelligence-analysis/</link>
		<comments>http://www.mkbergman.com/171/large-scale-intelligence-analysis/#comments</comments>
		<pubDate>Fri, 02 Dec 2005 15:05:24 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Adaptive Information]]></category>
		<category><![CDATA[OSINT (open source intel)]]></category>
		<category><![CDATA[Searching]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=171</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Large-scale Intelligence Analysis&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=OSINT (open source intel)&amp;rft.subject=Searching&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2005-12-02&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/171/large-scale-intelligence-analysis/&amp;rft.language=English"></span>
A recent article by Cheryl Gerber in the November 21 issue of Military Information Technology online on &#34;Smart Searching&#34;&#160; provides a useful overview of issues and leading vendors dealing with large-scale issues of content search and discovery.&#160; Some of the useful vendors covered in this article include:&#160;&#160; Endeca Technologies, Basis Technology, Inxight Software, Insightful, Attensity, [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Large-scale Intelligence Analysis&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=OSINT (open source intel)&amp;rft.subject=Searching&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2005-12-02&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/171/large-scale-intelligence-analysis/&amp;rft.language=English"></span>
<p>A recent article by Cheryl Gerber in the November 21 issue of Military Information Technology online on <a href="http://www.military-information-technology.com/article.cfm?DocID=1245">&quot;Smart Searching&quot;</a>&nbsp; provides a useful overview of issues and leading vendors dealing with large-scale issues of content search and discovery.&nbsp; Some of the useful vendors covered in this article include:&nbsp;&nbsp; <a title="Endeca Technologies" href="http://endeca.com/">Endeca Technologies</a>, <a title="Basis Technology" href="http://www.basistech.com/">Basis Technology</a>, <a title="Inxight Software" href="http://www.inxight.com/">Inxight Software</a>, <a title="Insightful Software" href="http://www.insightful.com/">Insightful</a>, <a title="Attensity Corporation" href="http://www.attensity.com/">Attensity</a>, <a title="Convera Corporation" href="http://www.convera.com/">Convera</a>, <a title="NetOwl from SRA International" href="http://www.netowl.com/">SRA International (NetOwl)</a>, <a title="ClearForest Corporation" href="http://www.clearforest.com/">ClearForest </a> and <a title="BrightPlanet Corporation" href="http://www.brightplanet.com">BrightPlanet</a>.</p>
<p>The focus of these efforts in the Defense Intelligence Agency is characterized by Gerber as:</p>
<blockquote><p>The unique requirements of defense intelligence analysts are refining search technology down from mass production, with its vast and sometimes trivial outcomes, to more guided, dynamic navigation able to produce results that are both inclusive and relevant. </p>
<p>As one of the largest collectors of information on the planet, the Defense Intelligence Agency (DIA) is responsible for amassing and analyzing all sources of human intelligence in the field from all information types in a multitude of languages. </p>
<p>&quot;This forces us to deal with huge volumes of data. It&apos;s an enormous challenge,&quot; said a senior DIA official.</p>
<p>The task is indeed a massive one. Sources of intelligence in the field include feeds from UAVs, intelligence, surveillance and reconnaissance data from a vast array of sensors and overhead platforms, signal intelligence, satellites, film and video, not to mention all the data from the open source world. &quot;We need to manage all that data and make it available as quickly as possible to analysts,&quot; the DIA official said. </p>
</blockquote>
<p>The intel community, like others forming in the commercial sector, is also relying on community standards for metadata transfer and management.&nbsp; In the case of the DIA, these standards are being provided by the <a title="Intelligence Community Metadata Working Group" href="https://www.icmwg.org/">Intelligence Community Metadata Working Group (ICMWG)</a>, which is charged with establishing standards for the tagging of all data used by DIA systems.</p>
<p>In the article, BrightPlanet&#8217;s Duncan Witte commented on the importance of having the abilities to &quot;organize, manage and distribute the huge volume of information as well. You need various specialties that allow collaboration with teammates and effective distribution of information.&quot; </p>
<p>This article again affirms that the federal intelligence community continues to assume the lead in large-scale content discovery and evaluation.</p>
<p>As the article notes, the  DIA maintains a steady push toward technology improvement.  &quot;We try to do the best we can with the volumes. In-house we have a lot of expertise on search algorithms and text analysis. But we need to do a better job of combing through the massive volumes of information to find that which is interesting and nontrivial in a way that leads to knowledge discovery. We need better information retrieval through machine understanding of the semantic meaning of text, regardless of language,&quot; the DIA official said.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/171/large-scale-intelligence-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Major Upgrade to Deep Query Manager Released</title>
		<link>http://www.mkbergman.com/142/major-upgrade-to-deep-query-manager%e2%84%a2-released/</link>
		<comments>http://www.mkbergman.com/142/major-upgrade-to-deep-query-manager%e2%84%a2-released/#comments</comments>
		<pubDate>Tue, 11 Oct 2005 16:03:45 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Information Automation]]></category>
		<category><![CDATA[OSINT (open source intel)]]></category>
		<category><![CDATA[Searching]]></category>
		<category><![CDATA[Software and Venture Capital]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=142</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Major Upgrade to Deep Query Manager Released&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.subject=Information Automation&amp;rft.subject=OSINT (open source intel)&amp;rft.subject=Searching&amp;rft.subject=Software and Venture Capital&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2005-10-11&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/142/major-upgrade-to-deep-query-manager%e2%84%a2-released/&amp;rft.language=English"></span>
BrightPlanet has announced a major upgrade to its Deep Query Manager knowledge worker document platform.  According to its press release, the new  version achieves extreme scalability and broad internationalization and file format support, among other enhancements.  The DQM has added the ability to harvest and process up to 140 different foreign languages in more [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Major Upgrade to Deep Query Manager Released&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.subject=Information Automation&amp;rft.subject=OSINT (open source intel)&amp;rft.subject=Searching&amp;rft.subject=Software and Venture Capital&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2005-10-11&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/142/major-upgrade-to-deep-query-manager%e2%84%a2-released/&amp;rft.language=English"></span>
<p><a title="BrightPlanet Home Page" href="http://www.brightplanet.com">BrightPlanet</a> has announced a major upgrade to its <a title="DQM v5 Product Release" href="http://www.brightplanet.com/news/dqm5_0.asp">Deep Query Manager knowledge worker document platform</a>.  According to its press release, the new  version achieves extreme scalability and broad internationalization and file format support, among other enhancements.  The DQM has added the ability to harvest and process up to 140 different foreign languages in more than 370 file formats plus new content export and system administration features.  The company also claims the new distributed architecture allows scalability into hundreds or thousands of users across multiple machines with the ability to handle incremental growth and expansions.</p>
<p>According to the company:</p>
<blockquote><p><em>The Deep Query Manager is a content discovery, harvesting, management and analysis platform used by knowledge workers to collaborate across the enterprise. It can access any document content &#8212; inside or outside the enterprise &#8212; with strengths in deep content harvesting from more than 70,000 unique searchable databases and automated techniques for the analyst to add new ones at will. The DQM&#8217;s differencing engine supports monitoring and tracking, among the product&#8217;s other powerful project management, data mining, reporting and analysis capabilities.</em></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/142/major-upgrade-to-deep-query-manager%e2%84%a2-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SOCom Awards New OSINT Center</title>
		<link>http://www.mkbergman.com/141/socom-awards-new-osint-center/</link>
		<comments>http://www.mkbergman.com/141/socom-awards-new-osint-center/#comments</comments>
		<pubDate>Tue, 11 Oct 2005 14:51:46 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[OSINT (open source intel)]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=141</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=SOCom Awards New OSINT Center&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=OSINT (open source intel)&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2005-10-11&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/141/socom-awards-new-osint-center/&amp;rft.language=English"></span>
According to Paul de la Garza of the St. Petersburg Times, the Special Operations Command (SOCom) based out of MacDill Air Force Base in Tampa Bay will be opening a new Joint Intelligence Operations Center (JIOC) in St. Petersburg to process open source intelligence (OSINT) in support of the global war on terrorism.
The Center was [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=SOCom Awards New OSINT Center&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=OSINT (open source intel)&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2005-10-11&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/141/socom-awards-new-osint-center/&amp;rft.language=English"></span>
<p>According to Paul de la Garza of the <a href="http://www.sptimes.com/2005/10/08/Tampabay/SOCom_to_open_site_in.shtml" title="St. Petersburg Times">St. Petersburg Times</a>, the Special Operations Command (SOCom) based out of MacDill Air Force Base in Tampa Bay will be opening a new Joint Intelligence Operations Center (JIOC) in St. Petersburg to process open source intelligence (<a href="http://en.wikipedia.org/wiki/OSINT" title="Open Source Intelligence">OSINT</a>) in support of the global war on terrorism.</p>
<p>The Center was announced by Rep. C.W. Bill Young, R-Indian Shores (FL) on October 7.&nbsp; Rep. Young said that <a href="http://www.blackbirdtech.com/" title="Blackbird Technologies">Blackbird Technologies</a> of Virginia was awarded the $27-million contract to operate the Center, which will house 60 people conducting OSINT.&nbsp; Young, chairman of the Defense Appropriations Subcommittee, said the center will open soon but declined to offer more details because of the classified nature of the facility.</p>
<p>According to de la Garza, SOCom has played a pivotal role in the war on terror since 9/11, with its budget increasing from $3.8-billion to $6.6-billion and its staff from 6,000 to 51,441. In March, President Bush signed a directive that puts SOCom in charge of &quot;synchronizing&quot; the war on terror.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/141/socom-awards-new-osint-center/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Recommended Intel Blog and Disappointing &#8220;Intelligence Search&#8221;</title>
		<link>http://www.mkbergman.com/131/recommended-intel-blog-and-disappointing-intelligence-search/</link>
		<comments>http://www.mkbergman.com/131/recommended-intel-blog-and-disappointing-intelligence-search/#comments</comments>
		<pubDate>Sat, 01 Oct 2005 16:36:34 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Blogs and Blogging]]></category>
		<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[OSINT (open source intel)]]></category>
		<category><![CDATA[Searching]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=131</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Recommended Intel Blog and Disappointing &#8220;Intelligence Search&#8221;&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Blogs and Blogging&amp;rft.subject=Deep Web&amp;rft.subject=OSINT (open source intel)&amp;rft.subject=Searching&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2005-10-01&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/131/recommended-intel-blog-and-disappointing-intelligence-search/&amp;rft.language=English"></span>
In my daytime life at BrightPlanet we do a lot of work for the intelligence community that we really can&#8217;t say anything about.  However, I recently came across a blog that I have been monitoring (and am still the only subscriber to on Bloglines) called Intelligence and Technology and National Security that I have been [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Recommended Intel Blog and Disappointing &#8220;Intelligence Search&#8221;&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Blogs and Blogging&amp;rft.subject=Deep Web&amp;rft.subject=OSINT (open source intel)&amp;rft.subject=Searching&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2005-10-01&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/131/recommended-intel-blog-and-disappointing-intelligence-search/&amp;rft.language=English"></span>
<p>In my daytime life at <a title="BrightPlanet Home Page" href="http://www.brightplanet.com/">BrightPlanet</a> we do a lot of work for the intelligence community that we really can&#8217;t say anything about.  However, I recently came across a blog called <a href="http://tstohk.blogspot.com/" target="_blank">Intelligence and Technology and National Security</a>, which I have been monitoring (I am still its only subscriber on <a title="Bloglines Home Page." href="http://www.bloglines.com">Bloglines</a>) and finding quite useful.  Recommended.</p>
<p>Thus, my interest was piqued when it referred to a Web site called <a title="Intelligence Search - The Web's Only Underground Intelligence Search Portal" href="http://www.intelligencesearch.com/about.html">Intelligence Search</a>.  That Web site claims:</p>
<blockquote><p><em>&#8220;Intelligence Search is the only totally free spy and intelligence web site that searches through creditable web sites to deliver quality information to its visitors. Intelligence Search is free of adware, spyware and pop-ups and does not ask its visitors for donations. Intelligence Search also allows freedom of speech in the written word, so visitors get the purest form of intelligence information possible.&#8221;</em></p></blockquote>
<p>Hmmm, sounds useful and interesting.  So, I tried &#8216;ODNI&#8217; as a search term (for Office of the Director of National Intelligence, the new intel oversight office created by the <em>Intelligence Reform and Terrorism Prevention Act of 2004</em>, with Ambassador John D. Negroponte as its first director) and got only one result (and that one not even the <a title="Office of the Director of National Intelligence" href="http://www.odni.gov/">ODNI&#8217;s</a> home page!).  A similar Yahoo! search turns up 84,200 hits, of which 200 are excellent results after applying further query refinements.  (Of course, Yahoo! does not include any deep Web content, so a truly useful compendium would likely have 1,000 documents or more.)  Numerous other searches I tried produced similarly meager results from Intelligence Search in comparison to what is available.</p>
<p>I think the intent of Intelligence Search is laudable and I like its clean interface.  However, I cannot recommend it until its content coverage actually becomes useful.  Perhaps the site&#8217;s developers need to consider better tools for harvesting and building content on their site.  I just may have some recommendations &#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/131/recommended-intel-blog-and-disappointing-intelligence-search/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
