<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Tutorial: Internet Languages, Character Sets and Encodings</title>
	<atom:link href="http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/</link>
	<description>Mike Bergman on the semantic Web and structured Web</description>
	<lastBuildDate>Wed, 01 Feb 2012 21:05:31 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: AI3 - Adaptive Information::: &#187; Blog Archive &#187; Guidance and Sample Code for Multi-Lingual Translations of Your Blog or Web Site</title>
		<link>http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/comment-page-1/#comment-10342</link>
		<dc:creator>AI3 - Adaptive Information::: &#187; Blog Archive &#187; Guidance and Sample Code for Multi-Lingual Translations of Your Blog or Web Site</dc:creator>
		<pubDate>Thu, 24 Aug 2006 00:17:16 +0000</pubDate>
		<guid isPermaLink="false">http://www.mkbergman.com/?p=195#comment-10342</guid>
		<description>[...] For those of you who follow BrightPlanet, we have been moving aggressively for some time now into international document harvesting and all that it implies for language and encoding detection and roundtripping.&#160; In fact, there is a fairly definitive tutorial post on my blog that deals with these so-called i18n (internationalization) issues and has become quite the reference on these matters.&#160; Through its partnership with Basis Tech, BrightPlanet can now harvest documents in about 140 different languages, with accurate encoding translation in multiple legacy forms for about 40 of them and morphological analysis for another 20 or so.&#160; There can be no doubt that the need for multi-lingual searching, harvesting and encoding support is an abiding trend of the evolving Internet. [...]</description>
		<content:encoded><![CDATA[<p>[...] For those of you who follow BrightPlanet, we have been moving aggressively for some time now into international document harvesting and all that it implies for language and encoding detection and roundtripping.&nbsp; In fact, there is a fairly definitive tutorial post on my blog that deals with these so-called i18n (internationalization) issues and has become quite the reference on these matters.&nbsp; Through its partnership with Basis Tech, BrightPlanet can now harvest documents in about 140 different languages, with accurate encoding translation in multiple legacy forms for about 40 of them and morphological analysis for another 20 or so.&nbsp; There can be no doubt that the need for multi-lingual searching, harvesting and encoding support is an abiding trend of the evolving Internet. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: AI3 - Adaptive Information::: &#187; Blog Archive &#187; Sources and Classification of Semantic Heterogeneities</title>
		<link>http://www.mkbergman.com/195/tutorial-internet-languages-character-sets-and-encodings/comment-page-1/#comment-3762</link>
		<dc:creator>AI3 - Adaptive Information::: &#187; Blog Archive &#187; Sources and Classification of Semantic Heterogeneities</dc:creator>
		<pubDate>Tue, 06 Jun 2006 23:20:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.mkbergman.com/?p=195#comment-3762</guid>
		<description>[...] Most of these line items are self-explanatory, but a few may not be: Homonyms arise when the same name refers to more than one concept, such as Name referring to a person v. Name referring to a book. A generalization/specialization mismatch can occur when single items in one schema are related to multiple items in another schema, or vice versa. For example, one schema may refer to &quot;phone&quot; but the other schema has multiple elements such as &quot;home phone,&quot; &quot;work phone&quot; and &quot;cell phone.&quot; Intra-aggregation mismatches come when the same population is divided differently by schema (Census v. Federal regions for states, or full person names v. first-middle-last, for example), whereas inter-aggregation mismatches can come from sums or counts as added values. Internal path discrepancies can arise from different source-target retrieval paths in two different schemas (for example, hierarchical structures where the elements are at different levels of remove). The four sub-types of schematic discrepancy refer to where attribute and element names may be interchanged between schemas. Under languages, encoding mismatches can occur when either the import or export of data to XML assumes the wrong encoding type. While XML is based on Unicode, it is important that source retrievals and issued queries be in the proper encoding of the source. For Web retrievals this is very important, because only about 4% of all documents are in Unicode, and BrightPlanet earlier estimated there may be on the order of 25,000 language-encoding pairs presently on the Internet. Even should the correct encoding be detected, there are significant differences among language sources in parsing (white space, for example), syntax and semantics that can also lead to many error types. [...]</description>
		<content:encoded><![CDATA[<p>[...] Most of these line items are self-explanatory, but a few may not be: Homonyms arise when the same name refers to more than one concept, such as Name referring to a person v. Name referring to a book. A generalization/specialization mismatch can occur when single items in one schema are related to multiple items in another schema, or vice versa. For example, one schema may refer to &quot;phone&quot; but the other schema has multiple elements such as &quot;home phone,&quot; &quot;work phone&quot; and &quot;cell phone.&quot; Intra-aggregation mismatches come when the same population is divided differently by schema (Census v. Federal regions for states, or full person names v. first-middle-last, for example), whereas inter-aggregation mismatches can come from sums or counts as added values. Internal path discrepancies can arise from different source-target retrieval paths in two different schemas (for example, hierarchical structures where the elements are at different levels of remove). The four sub-types of schematic discrepancy refer to where attribute and element names may be interchanged between schemas. Under languages, encoding mismatches can occur when either the import or export of data to XML assumes the wrong encoding type. While XML is based on Unicode, it is important that source retrievals and issued queries be in the proper encoding of the source. For Web retrievals this is very important, because only about 4% of all documents are in Unicode, and BrightPlanet earlier estimated there may be on the order of 25,000 language-encoding pairs presently on the Internet. Even should the correct encoding be detected, there are significant differences among language sources in parsing (white space, for example), syntax and semantics that can also lead to many error types. [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>

