Posted:April 3, 2006

One way to look at 40 sites trying to achieve Web 2.0 is that each site only contributes Web 0.05.

There’s a lot of stuff going on with Web 2.0 in "social" computing, some with implications for my own primary interests in the semantic Web.  Indeed, though all of us can link to Wikipedia for definitions, I doubt that most of us, without first checking that source, would agree with how Wikipedia defines Web 2.0.  That’s OK.

Nonetheless, we can see there IS something going on in the nexus of new interoperable Web standards with collaboration and application frameworks specifically geared to shared experiences and information.  I think we can all agree that Web 2.0 is meant to achieve that, and that "social bookmarking" is one of the foundational facets of the phenomenon.

Like other things that take you off on tangents while pursuing research over a weekend, I’m actually not sure what got me trying to track down and understand "social bookmarking."  But track it down I did, and this post is the result of my cruising through the byways of Web 2.0 driving a "social bookmarks" roadster.

Quick Intro to Social Bookmarks

According to Wikipedia, "social bookmarking" refers to:

. . . web-based services where shared lists of user-created Internet bookmarks are displayed. Social bookmarking sites generally organize their content using tags and are an increasingly popular way to locate, classify, rank, and share Internet resources . . . The concepts of social bookmarking and tagging took root with the launch of a web site called del.icio.us in approximately 2003.

Often, [social bookmark] lists are publicly accessible, although some social bookmarking systems allow for privacy on a bookmark-by-bookmark basis. They [may] also categorize their resources by the use of informally assigned, user-defined keywords or tags (a folksonomy). Most social bookmarking services allow users to search for bookmarks which are associated with given "tags", and rank the resources by the number of users which have bookmarked them. . . . As people bookmark resources that they find useful, resources that are of more use are bookmarked by more users. Thus, such a system [can] "rank" a resource based on its perceived utility.

Since the classification and ranking of resources is a continuously evolving process, many social bookmarking services [may also] allow users to subscribe to syndication feeds or collections of tag terms. This allows subscribers to become aware of new resources for a given topic, as they are noted, tagged, and classified by other users. There are drawbacks to such tag-based systems as well: no standard set of keywords, no standard for the structure of such tags, mistagging, etc. . . . . The separate (but related) tagging and social bookmarking services are, however, evolving rapidly, and these shortcomings will likely either be addressed in the near future or shown not to be relevant to these services.

The idea of experts and interested individuals sharing their discoveries and passions is clearly compelling.  What has been most interesting in the development of "social bookmarking" software and services on the Web is the set of assumptions underlying how those objectives can best be achieved.

Of course, the most powerful concept underlying all of this stuff has been the ideal of "community."  We now have the opportunity to form electronic tribes, with all that means for breaking the former bounds of space and circumstance.  Truly, the prospect of finding effective means for the identification, assembly, consensus-building, and sharing within meaningful communities is breathtaking.

Listing of Social Bookmarking Services

To get a handle on the state of the art, I began assembling a list of social bookmark and closely related services from various sources.  I’ve found about 40 of them, which may mean there are on the order of 50 or so extant.  The icons and links below show these 40 or so sites, with a bit of explanation on each:

43 — 43Things — this site is geared for individuals to share activity lists, ambitions or "things to do" with one another.

Backflip — Backflip — this is a bookmark recollection and personal search space and directory. It has been named a top-100 site by PC Magazine.

blinkbits — blinkbits — this is a social bookmarking site that has about 16,000 "blinks" or topic folders.

BlinkList — BlinkList — this site also allows bookmarks to be filtered by friends and collaborators.

Bloglines — Bloglines — beyond a simple social bookmark service, this site more importantly provides an RSS feeder and aggregator; owned by Ask Jeeves.

Blogmarks — Blogmarks — there is not much background info on this site; it is somewhat better designed but offers typical social bookmarking services.

CiteULike – this site is geared toward academics and the sharing of paper references and links. Many references are to subscription papers. Generally, all submissions have an edited abstract and pretty accurate tags provided.

Connotea — Connotea — while the functionality of this site is fairly standard for social bookmarking and activity is lower than some other sites, Connotea has a specific emphasis on technical, research, and academic topics that may make it more attractive to that audience.

del.icio.us — del.icio.us — this site is the granddaddy of social bookmark services, with tagging support, and it was the first to use a very innovative URL. Amongst all the sites herein, this one probably has the greatest activity and number of listings.

De.lirio.us — de.lirio.us — this site is now being combined with simpy.com.

Digg — Digg — the Digg service is similar to others on this listing by providing social bookmarking, voting and popularity, and user control of listings, etc. It has received some buzz in the blog community.

Fark — Fark — while this site has aspects of social bookmarking, it is definitely more inclined to be edgy and current.

Findory — Findory — geared toward news and blogs aggregation.

Flickr — Flickr — the largest and best known of the photo sharing and bookmarking sites; owned by Yahoo.

Furl — Furl — this site, part of LookSmart, has what you would expect from a bucks-backed site, but seems pretty vanilla with respect to social bookmarking capabilities.

Hyperlinkomatic — a beta service from the UK that has ceased accepting new users.

Jots — a small, and not notably distinguished, social bookmark site.

Kinja — Kinja — this is a blog bookmarking and aggregation service.

Linkroll — this is a relatively low-key service, modeled to a great extent on del.icio.us.

Lookmarks — this is a social bookmarking service with tags, sharing, search and popular lists, with images and music/video sharing as well.

Ma.gnolia — Ma.gnolia — this service is a fairly standard social bookmarking site.

Maple — this is a fairly standard social bookmarking service, small with about 5,500 users, that uses Ruby on Rails.

Netvouz — Netvouz — this service is a fairly standard social bookmarking service that also provides tags.

Oyax — this is another fairly standard online bookmarks manager.

RawSugar — RawSugar — this site has most of the standard social bookmarking features, but differentiates by adding various user-defined directory structures.

Reddit — Reddit — the site has recently gotten some buzz due to a voting feature that moves topic rankings up or down based on user feedback; other aspects of the site are fairly vanilla.

Rojo — Rojo — this is a very broad RSS feed reader with hundreds of sources, to which you may add your own. It allows you to organize feeds by tags, share your feeds via an address book, and tracks and ranks what you view most often.  This site has been getting quite a bit of buzz.

Scuttle — Scuttle — this is a fairly standard social bookmarking site with low traffic.

Shadows — Shadows — this social bookmark site is attractively designed and adds a different wrinkle by letting any given topic or document have its own community discussion page.

Shoutwire — Shoutwire — this site adds community feedback and collaboration to a "standard" RSS news feeder and aggregator.

Smarking — Smarking — this site is a fairly standard social bookmarking site.

Spurl — Spurl — this site is a fairly standard social bookmarking site.

Squidoo — Squidoo — this site is different from other social bookmarking services in that it lets you create a page on your topic of choice (called a lens) where you add links, text, pictures and other pieces of content. Each lens is tagged.

Start — an experimental Microsoft personalized home page service, powered by Ajax; capabilities and direction are still unclear.

TailRank — TailRank — this site allows about 50,000 blogs to be monitored in a fairly standard social bookmarking manner.

Unalog — Unalog — this is a fairly standard social bookmarking site.

Wink – this service is both a social bookmarker and a search engine to other online resources such as del.icio.us and digg.

Wists — Wists — this is a social bookmarking site geared to sharing shopping links and sites.

YahooMyWeb — Yahoo’s MyWeb — this is the personalized entry portal for Yahoo! including bookmarking and many specialty feeds and customization.

Zurpy — this social bookmark service is in its pre-launch phase.

General Observations

I personally participate in a couple of these services, notably Bloglines and Rojo.  Some of what I have discovered will compel me to try some others.

In testing out and assembling this list, however, I do have some general observations:

  • Most sites are repeats or knock-offs of the original del.icio.us.  While some offer prettier presentation and images, functionality is pretty much identical.  These are what I refer to as the "fairly vanilla" or standard sites above.
  • Systems that combine bookmarking with tagging and directory presentations seem most useful (at least to me) for the long haul.  Also of interest are those sites that focus on narrower and more technical communities (e.g., Connotea, CiteULike).
  • Virtually all sites had poor search capabilities, particularly in advanced search or operator support, and were not taking full advantage of the tagging structure in their listings.
  • Development of directory and hierarchical structures is generally poor, with little useful depth or specificity.  This may improve as use grows, as it has in Wikipedia, but limits real expert use at present.
  • Thus, paradoxically, while the sites and services themselves in their current implementation are very helpful for initial discovery, they are of little or no use for expert discovery or knowledge discovery.

I suspect most of these limitations will be overcome over time, and perhaps very shortly at that.  Technology certainly does not appear to be the limiting factor, but rather the need for scale of use and the network effect.

Can We Get to Web 2.0 by Adding Multiple 0.05s?

Another paradox is that while these sites help promote the concept of community, they seem to work to actually fragment communities.  There’s much competition at present for many of the same people trying to do the same social things and collaboration.  One way to look at 40 sites trying to achieve Web 2.0 is that each site only contributes Web 0.05.

Specific innovative communities on the Web such as biologists, physicists, librarians and the like will likely be among the most successful in leveraging these technologies for community growth and sharing.  In other communities, competition will certainly winnow the field to a few survivors.

The older, centrally imposed means for communities to determine "authoritativeness" — be it peer review, library purchasing decisions, societal recognition or reputation, publisher selection decisions, citation indexes, etc. — do not easily apply to the distributed, chaotic Internet.  What others in your community find of value, and thus choose to bookmark and share, is one promising mechanism to bring some semblance of authoritativeness to the medium.  Of course, for this truly to work, there must be trust and respect within the communities themselves.

I think we should see within the foreseeable future a standard set of functionalities — submitting, ranking, organizing, searching, commentating, collaborating, annotating, exporting, importing, and self-policing — that will allow these community sites to become go-to "infohubs" for their users.  These early social bookmarking services look to be the nuclei around which stronger and more diverse communities of interest will condense on the Web. Let the maturation begin.

Posted:April 1, 2006

I just came across a pretty neat site and service for creating vertical search engines of your choosing.  Called a ‘swicki,’ the service and capability is provided by Eurekster, a company founded about two years ago around the idea of personalized and social search.  The ‘swicki’ implementation was first released in November 2005.

[Embedded here is the SWISHer swicki search form by Michael K Bergman, powered by Eurekster: "a search engine that learns from the search behavior of your community."]

NOTE: As you conduct searches using the form above, you will be taken from my blog to http://swisher-swicki.eurekster.com. To return, simply use your browser back button.
 
What in Bloody Hell is a Swicki? 

According to the company:

Swickis are a new kind of search engine or search results aggregator. Swickis allow you to build specific searches tailored to your interests and that of your community and get constantly updated results from your web or blog page. Swickis scan all the data indexed in Yahoo Search, plus all additional sources you specify, and present the results in a dynamically updated, easy to use format that you can publish on your site – or use at swicki.com. We also collect and organize information about all public swickis in our Directory. Whether you have built a swicki or not, you can come to the swicki directory and find swicki search engines that interest you.

Swickis are like wikis in that they are collaborative. Not only does your swicki use Eurekster technology to weight searches based on the behavior of those who come to your site, in the future, your community – if you allow them – can actively collaborate to modify and focus the results of the search engine. . . . Every click refines the swicki’s search strings, creating a responsive, dynamic result that’s both customized and highly relevant.

A 10 Minute Set-up 

I first studied the set-up procedure and then gathered some information before I began my own swicki.  Overall the process was pretty straightforward and took me about 10 minutes.  You begin the process on the Eurekster swicki home page.

  • Step 1:  You begin by customizing how you want the swicki to look — wide or narrow, long or short, font sizes, and a choice of about twenty background and font color combinations. I thought these customization options were generally the most useful ones and the implementation pretty slick.
  • Step 2:  You "train" your search (actually, you just specify useful domains and URLs and excluded ones).  Importantly, you give the site some keywords or phrases to qualify the final results accepted for the site.  One nice feature is the option to include or exclude your blog content or the content of your existing Web site.
  • Step 3:  You then provide a short description for the site and assign it to existing subject categories.  Code is generated at this last step that is simple to insert into your Web site or blog, with some further explanations for different blog environments.

You are then ready to post the site and make it available to collaborative feedback and refinement.  You can also choose to include ads on the site or look to other means to monetize it should it become popular.

If a public site, your swicki is then listed on the Eurekster directory; as of this posting, there were about 2,100 listed swickis (more in a next post on that).

For business or larger site complexes, there are also paid versions building from this core functionality.

SWISHER:  Giving it My Own Test Drive

I have been working in the background for some time on an organized subject portal and directory for this blog called SWISHer — for Semantic Web, Interoperability, Standards and HTML.  (Much more is to be provided on this project at a later time.)  Since it is intended to be an expert’s repository of all relevant Web documents, the SWISHer acronym is apparent.

One of the things that you can do with the Eurekster swicki is run a direct head-to-head comparison of results with Google.  That caused me to think that it would also be interesting when I release my own SWISHer site to compare it with the swicki and with Google.  Thus, the subject of my test swicki was clear.

Since I know the semantic Web reference space pretty well, I chose about 75 key starting URLs to use as the starting "training" set for the swicki.

This first version of SWISHer as a swicki site, with its now-embedded generated code, is thus what appears above.  In use it indicates links to about 400,000 results, though the search function is pretty weak and it is difficult to use some of my standard tricks to ascertain the actual number of  documents in the available index.

To see the swicki site in action, either go to  http://swisher-swicki.eurekster.com, click on the SWISHer title, or enter your search in the form above and click search.

Now installed, I’m taking these capabilities for a longer road trip.  The test drive was fun; let’s see how it handles over rough terrain and covering real distances.  I’ll post impressions in a day or so. 

Posted:March 25, 2006

Henry Story, one of my favorite semantic Web bloggers and a Sun development guru, has produced a very useful video and PDF series on the semantic Web.  Here is the excerpt from his site with details about where to get the 30-minute presentation (62 MB for the QuickTime version, see below), highly useful to existing development staff:

. . . how could the SemWeb affect software development in an Open Source world, where there are not only many more developers, but also these are distributed around the world with no central coordinating organisation? Having presented the problem, I then introduce RDF and Ontologies, how this meshes with the Sparql query language, and then show how one could use these technologies to make distributed software development a lot more efficient.

Having given the presentation in November last year, I spent some time over Xmas putting together a video of it (in h.264 format). . . .  Then last week I thought it would be fun to put it online, and so I placed it on Google video, where you can still find it. But you will notice that Google video reduces the quality quite dramatically, so that you will really need to have the pdf side by side, if you wish to follow.

Time spent with this presentation will be well spent. I’d certainly like to hear more about OWL, or representing and resolving semantic heterogeneities, or efficient RDF storage databases at scale, or a host of other issues of personal interest. But, hey, perhaps there are more presentations to come!

Posted:March 23, 2006

Author’s Note: This is an online version of a paper that Mike Bergman recently released under the auspices of BrightPlanet Corp. The citation for this effort is:

M.K. Bergman, “Tutorial:  Internet Languages, Character Sets and Encodings,” BrightPlanet Corporation Technical Documentation, March 2006, 13 pp.

Click here to obtain a PDF copy of this posting (13 pp, 79 K).

Broad-scale, international open source harvesting from the Internet poses many challenges in use and translation of legacy encodings that have vexed academics and researchers for many years. Successfully addressing these challenges will only grow in importance as the relative percentage of international sites grows in relation to conventional English ones.

A major challenge in internationalization and foreign source support is “encoding.” Encodings specify the arbitrary assignment of numbers to the symbols (characters or ideograms) of the world’s written languages needed for electronic transfer and manipulation. One of the first encodings, developed in the 1960s, was ASCII (numerals, plus a-z; A-Z); others were developed over time to deal with other unique characters and the many symbols of (particularly) the Asiatic languages.

Some languages have many character encodings, and some encodings, for example Chinese and Japanese, have very complex systems for handling the large number of unique characters. Two different encodings can be incompatible by assigning the same number to two distinct symbols, or vice versa. Unicode set out to consolidate many different encodings, each using its own separate code pages, into a single system that could represent all written languages within the same character encoding. There are a few Unicode techniques and formats, the most common being UTF-8.

The Internet was originally developed via efforts in the United States funded by ARPA (later DARPA) and NSF, extending back to the 1960s. At the time of its commercial adoption in the early 1990s via the World Wide Web protocols, it was almost entirely dominated by English by virtue of this U.S. heritage and the emergence of English as the lingua franca of the technical and research community.

However, with the maturation of the Internet as a global information repository and means for instantaneous e-commerce, today’s online community now approaches 1 billion users from all existing countries. The Internet has become increasingly multi-lingual.

Efficient and automated means to discover, search, query, retrieve and harvest content from across the Internet thus require an understanding of the source human languages in use and the means to encode them for electronic transfer and manipulation. This Tutorial provides a brief introduction to these topics.

Internet Language Use

Yoshiki Mikami, who runs the UN’s Language Observatory, has an interesting way to summarize the languages of the world. His updated figures, plus some other BrightPlanet statistics are:[1]

| Category | Number | Source or Notes |
|---|---|---|
| Active Human Languages | 6,912 | from www.ethnologue.com |
| Language Identifiers | 440 | based on ISO 639 |
| Human Rights Translation | 327 | UN’s Universal Declaration of Human Rights (UDHR) |
| Unicode Languages | 244 | see text |
| DQM Languages | 140 | estimate based on prevalence, BT input |
| Windows XP Languages | 123 | from Microsoft |
| Basis Tech Languages | 40 | based on Basis Tech’s Rosette Language Identifier (RLI) |
| Google Search Languages | 35 | from Google |

There are nearly 7,000 living languages spoken today, though most have few speakers and many are becoming extinct. About 347 (or approximately 5%) of the world’s languages have at least one million speakers and account for 94% of the world’s population. Of this amount, 83 languages account for 80% of the world’s population, with just 8 languages with greater than 100 million speakers accounting for about 40% of total population. By contrast, the remaining 95% of languages are spoken by only 6% of the world’s people.[2]

This prevalence is shown by the fact that the UN’s Universal Declaration of Human Rights (UDHR) has only been translated into those languages generally with 1 million or more speakers.

The remaining items on the table above enumerate languages that can be represented electronically, or are “encoded.” More on this topic is provided below.

Of course, native language does not necessarily equate to Internet use, with English predominating because of multi-lingualism, plus the fact that richer countries or users within countries exhibit greater Internet access and use.

The most recent comprehensive figures for Internet language use and prevalence are from the Global Reach Web site for late 2004; for ease of reading, percentage figures are shown only for those languages with greater than a 1.0% value:[3] [4]

| Language | Web Pages (%) | 2003 Internet Users (Millions) | 2003 Internet Users (%) | Global Population (Millions) | Global Population (%) |
|---|---|---|---|---|---|
| ENGLISH | 68.4% | 287.5 | 35.6% | 508 | 8.0% |
| NON-ENGLISH | 31.6% | 519.6 | 64.4% | 5,822 | 92.0% |
| EUROPEAN (non-English) |  |  |  |  |  |
| Catalan |  | 2.9 |  | 7 |  |
| Czech |  | 4.2 |  | 12 |  |
| Dutch |  | 13.5 | 1.7% | 20 |  |
| Finnish |  | 2.8 |  | 6 |  |
| French | 3.0% | 28.0 | 3.5% | 77 | 1.2% |
| German | 5.8% | 52.9 | 6.6% | 100 | 1.6% |
| Greek |  | 2.7 |  | 12 |  |
| Hungarian |  | 1.7 |  | 10 |  |
| Italian | 1.6% | 24.3 | 3.0% | 62 | 1.0% |
| Polish |  | 9.5 | 1.2% | 44 |  |
| Portuguese | 1.4% | 25.7 | 3.2% | 176 | 2.8% |
| Romanian |  | 2.4 |  | 26 |  |
| Russian | 1.9% | 18.5 | 2.3% | 167 | 2.6% |
| Scandinavian |  | 14.6 | 1.8% | 20 |  |
| – Danish |  | 3.5 |  | 5 |  |
| – Icelandic |  | 0.2 |  | 0 |  |
| – Norwegian |  | 2.9 |  | 5 |  |
| – Swedish |  | 7.9 | 1.0% | 9 |  |
| Serbo-Croatian |  | 1.0 |  | 20 |  |
| Slovak |  | 1.2 |  | 6 |  |
| Slovenian |  | 0.8 |  | 2 |  |
| Spanish | 2.4% | 65.6 | 8.1% | 350 | 5.5% |
| Turkish |  | 5.8 |  | 67 | 1.1% |
| Ukrainian |  | 0.9 |  | 47 |  |
| SUB-TOTAL | 18.7% | 279.0 | 34.6% | 1,230 | 19.4% |
| ASIAN LANGUAGES |  |  |  |  |  |
| Arabic |  | 10.5 | 1.3% | 300 | 4.7% |
| Chinese | 3.9% | 102.6 | 12.7% | 874 | 13.8% |
| Farsi |  | 3.4 |  | 64 | 1.0% |
| Hebrew |  | 3.8 |  | 5 |  |
| Japanese | 5.9% | 69.7 | 8.6% | 125 | 2.0% |
| Korean | 1.3% | 29.9 | 3.7% | 78 | 1.2% |
| Malay |  | 13.6 | 1.7% | 229 | 3.6% |
| Thai |  | 4.9 |  | 46 |  |
| Vietnamese |  | 2.2 |  | 68 | 1.1% |
| SUB-TOTAL | 12.9% | 240.6 | 29.8% | 1,789 | 28.3% |
| TOTAL WORLD | 100.0% | 807.1 | 100.0% | 6,330 | 100.0% |

English speakers show nearly a five-fold greater share of Internet use than sheer population would suggest, and about an eight-fold greater share of Web pages in English. However, various census efforts over time have shown a steady decrease in this English prevalence (data not shown).

Virtually all European languages show higher Internet prevalence than actual population would suggest; Asian languages show the opposite. (African languages are even less represented than population would suggest; data not shown.)

Internet penetration appears to be about 20% of global population and growing rapidly. It is not unlikely that the percentages of Web users, and of the languages in which Web pages are written, will continue to converge toward real population percentages. Thus, over time and likely within the foreseeable future, users and pages should more closely approximate the percentage figures shown in the rightmost column in the table above.

Script Families

Another useful starting point for understanding languages and their relation to the Internet is a 2005 UN publication from a World Summit on the Information Society. This 113 pp. report can be found at http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf.[5]

Languages have both a representational form and meaning. The representational form is captured by scripts, fonts or ideograms. The meaning is captured by semantics. In an electronic medium, it is the representational form that must be transmitted accurately. Without accurate transmittal of the form, it is impossible to manipulate that language or understand its meaning.

Representational forms fit within what might be termed script families. Script families are not strictly alphabets or even exact character or symbol matches. They represent similar written approaches and some shared characteristics.

For example, English and its German and Romance language cousins share very similar, but not identical, alphabets. Similarly, the so-called CJK (Chinese, Japanese, Korean) share a similar approach to using ideograms without white space between tokens or punctuation.

At the highest level, the world’s languages may be clustered into these following script families:[6]

| Script | Latin | Cyrillic | Arabic | Hanzi | Indic | Others* |
|---|---|---|---|---|---|---|
| Million users | 2,238 | 451 | 462 | 1,085 | 807 | 129 |
| % of Total | 43.3% | 8.7% | 8.9% | 21.0% | 15.6% | 2.5% |
| Key languages | Romance (European), Slavic (some), Vietnamese, Malay, Indonesian | Russian, Slavic (some), Kazakh, Uzbek | Arabic, Urdu, Persian, Pashtu | Chinese, Japanese, Korean | Hindi, Tamil, Bengali, Punjabi, Sanskrit, Thai | Greek, Hebrew, Georgian, Assyrian, Armenian |

Note that English and the Romance languages fall within the Latin script family, and the CJK languages within Hanzi. The “Other” category is a large catch-all, including Greek, Hebrew, many African languages, and others. However, besides Greek and Hebrew, most specific languages of global importance are included in the other named families. Also note that, due to differences in sources, the total user counts do not match those in earlier tables.

Character Sets and Encodings

In order to take advantage of the computer’s ability to manipulate text (e.g., displaying, editing, sorting, searching and efficiently transmitting it), communications in a given language needs to be represented in some kind of encoding. Encodings specify the arbitrary assignment of numbers to the symbols of the world’s written languages. Two different encodings can be incompatible by assigning the same number to two distinct symbols, or vice versa. Thus, much of what the Internet offers with respect to linguistic diversity comes down to the encodings available for text.
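A small Java sketch illustrates this incompatibility (the byte value and charset choices here are simply illustrative, and it assumes these charsets are available in your JDK). The very same byte decodes to a different character under each encoding:

```java
import java.nio.charset.Charset;

public class EncodingClash {
    public static void main(String[] args) {
        byte[] data = { (byte) 0xE4 };  // one and the same byte value

        // The identical byte decodes to a different character under each legacy encoding
        System.out.println(new String(data, Charset.forName("ISO-8859-1")));  // ä (Latin small a with diaeresis)
        System.out.println(new String(data, Charset.forName("ISO-8859-7")));  // δ (Greek small delta)
        System.out.println(new String(data, Charset.forName("KOI8-R")));      // Д (Cyrillic capital De)
    }
}
```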

The most widely used encoding is the American Standard Code for Information Interchange (ASCII), a code devised during the 1960s under the auspices of what became the American National Standards Institute (ANSI) to standardize teletype technology. This encoding comprises 128 character assignments (7-bit) and is suitable primarily for North American English.[6]

Historically, other languages that did not fit in the ASCII 7-bit character set (a-z; A-Z) pretty much created their own character sets, sometimes with local standards acceptance and sometimes not. Some languages have many character encodings and some encodings, particularly Chinese and Japanese, have very complex systems for handling the large number of unique characters. Another difficult group is Hindi and the Indic language family, with speakers that number into the hundreds of millions. According to one University of Southern California researcher, almost every Hindi language web site has its own encoding.[7]

The Internet Assigned Numbers Authority (IANA) maintains a master list of about 245 standard charset (“character set”) encodings and about 550 associated aliases, used in one manner or another on the Internet.[8] [9] Some of these electronic encodings were created by large vendors with a stake in electronic transfer such as IBM, Microsoft, Apple and the like. Other standards result from recognized standards organizations such as ANSI, ISO, Unicode and the like. Many of these standards date back as far as the 1960s; many others are specific to certain countries.

Earlier estimates suggested on the order of 40 to 250 languages per named encoding type. While no definitive count exists, if one assumes 100 languages for each of the IANA-listed encodings, there could be on the order of 25,000 or so specific language-encoding combinations possible on the Internet based on these “standards.” There are perhaps thousands of other specific language encodings also extant.

Whatever the numbers, clearly it is critical to identify accurately the specific encoding and its associated language for any given Web page or database site. Without this accuracy, it is impossible to electronically query and understand the content.

As might be suspected, this topic too is very broad. For a very comprehensive starting point on all topics related to encodings and character sets, please see I18N (which stands for “internationalization”) Guy’s Web site at http://www.i18nguy.com/unicode/codepages.html.

Unicode

In the late 1980s, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO), the other was the Unicode Project organized by a consortium of (initially mostly US) manufacturers of multi-lingual software. Fortunately, the participants of both projects realized in 1991 that two different unified character sets did not make sense and they joined efforts to create a single code table, now referred to as Unicode. While both projects still exist and publish their respective standards independently, the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and closely coordinated.

Unicode sets out to consolidate many different encodings, each using its own separate code pages, into a single system that can represent all written languages within the same character encoding. Unicode is first a set of code tables that assign an integer number to each character, also called a code point. Unicode then has several methods for how a sequence of such characters or their respective integer values can be represented as a sequence of bytes, generally prefixed by “UTF.”

In UTF-8, the most common method, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3 or 4 bytes. This method has the advantage that English text looks exactly the same in UTF-8 as it did in ASCII, so ASCII is a conforming subset. More unusual characters such as accented letters, Greek letters or CJK ideograms may need several bytes to store a single code point.

The traditional store-it-in-two-bytes method for Unicode is called UCS-2 (because it uses two bytes) or UTF-16 (because it uses 16 bits; UTF-16 also extends UCS-2 with surrogate pairs for code points above 65,535). There is something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero. There is also UCS-4, better known as UTF-32, which stores each code point in 4 bytes; it has the nice property that every single code point is stored in the same number of bytes, but it requires more storage. Regardless, UTF-7, -8, -16, and -32 all have the property of being able to store any code point correctly.
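As a concrete illustration of these variable-length forms, here is a minimal Java sketch (the class name and sample characters are my own, illustrative choices) that prints how many bytes a few representative characters occupy in UTF-8 versus UTF-16:

```java
import java.nio.charset.StandardCharsets;

public class CodePointBytes {
    public static void main(String[] args) {
        // ASCII letter, accented Latin letter, CJK ideogram, and a character outside the BMP
        String[] samples = { "A", "é", "中", "𐐷" };
        for (String s : samples) {
            System.out.println(s
                    + "  UTF-8: "  + s.getBytes(StandardCharsets.UTF_8).length   + " bytes"
                    + "  UTF-16: " + s.getBytes(StandardCharsets.UTF_16BE).length + " bytes");
        }
        // Expected output: A = 1/2, é = 2/2, 中 = 3/2, 𐐷 = 4/4 bytes in UTF-8/UTF-16
        // (UTF_16BE is used so no byte-order mark is added to the counts)
    }
}
```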

BrightPlanet, along with many others, has adopted UTF-8 as the standard Unicode method to process all string data. There are tools available to convert nearly any existing character encoding into a UTF-8 encoded string. Java supplies these tools, as does Basis Technology, one of BrightPlanet’s partners in language processing.
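For instance, a legacy-encoded byte stream can be normalized with nothing more than the JDK's charset support; the sketch below (a simple illustration with made-up sample data, not BrightPlanet's or Basis Technology's actual tooling) decodes KOI8-R bytes and re-encodes them as UTF-8:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LegacyToUtf8 {
    // Decode bytes from a known legacy encoding, then re-encode the text as UTF-8
    static byte[] toUtf8(byte[] legacyBytes, String legacyCharsetName) {
        String text = new String(legacyBytes, Charset.forName(legacyCharsetName));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] koi8 = { (byte) 0xE4 };            // the character Д in KOI8-R
        byte[] utf8 = toUtf8(koi8, "KOI8-R");
        System.out.println(utf8.length + " UTF-8 bytes");  // 2 bytes (0xD0 0x94)
    }
}
```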

As presently defined, Unicode supports about 245 common languages across a variety of scripts (see the notes at the end of the table):[10]

| Language | Script(s) | Some Country Notes |
|---|---|---|
| Abaza | Cyrillic |  |
| Abkhaz | Cyrillic |  |
| Adygei | Cyrillic |  |
| Afrikaans | Latin |  |
| Ainu | Katakana, Latin | Japan |
| Aisor | Cyrillic |  |
| Albanian | Latin [2] |  |
| Altai | Cyrillic |  |
| Amharic | Ethiopic | Ethiopia |
| Amo | Latin | Nigeria |
| Arabic | Arabic |  |
| Armenian | Armenian, Syriac [3] |  |
| Assamese | Bengali | Bangladesh, India |
| Assyrian (modern) | Syriac |  |
| Avar | Cyrillic |  |
| Awadhi | Devanagari | India, Nepal |
| Aymara | Latin | Peru |
| Azeri | Cyrillic, Latin |  |
| Azerbaijani | Arabic, Cyrillic, Latin |  |
| Badaga | Tamil | India |
| Bagheli | Devanagari | India, Nepal |
| Balear | Latin |  |
| Balkar | Cyrillic |  |
| Balti | Devanagari, Balti [2] | India, Pakistan |
| Bashkir | Cyrillic |  |
| Basque | Latin |  |
| Batak | Batak [1], Latin | Philippines, Indonesia |
| Batak toba | Batak [1], Latin | Indonesia |
| Bateri | Devanagari | (aka Bhatneri) India, Pakistan |
| Belarusian | Cyrillic | (aka Belorussian, Belarusan) |
| Bengali | Bengali | Bangladesh, India |
| Bhili | Devanagari | India |
| Bhojpuri | Devanagari | India |
| Bihari | Devanagari | India |
| Bosnian | Latin | Bosnia-Herzegovina |
| Braj bhasha | Devanagari | India |
| Breton | Latin | France |
| Bugis | Buginese [1] | Indonesia, Malaysia |
| Buhid | Buhid | Philippines |
| Bulgarian | Cyrillic |  |
| Burmese | Myanmar |  |
| Buryat | Cyrillic |  |
| Bahasa | Latin | (see Indonesian) |
| Catalan | Latin |  |
| Chakma | Bengali, Chakma [1] | Bangladesh, India |
| Cham | Cham [1] | Cambodia, Thailand, Viet Nam |
| Chechen | Cyrillic | Georgia |
| Cherokee | Cherokee, Latin |  |
| Chhattisgarhi | Devanagari | India |
| Chinese | Han |  |
| Chukchi | Cyrillic |  |
| Chuvash | Cyrillic |  |
| Coptic | Greek | Egypt |
| Cornish | Latin | United Kingdom |
| Corsican | Latin |  |
| Cree | Canadian Aboriginal Syllabics, Latin |  |
| Croatian | Latin |  |
| Czech | Latin |  |
| Danish | Latin |  |
| Dargwa | Cyrillic |  |
| Dhivehi | Thaana | Maldives |
| Dungan | Cyrillic |  |
| Dutch | Latin |  |
| Dzongkha | Tibetan | Bhutan |
| Edo | Latin |  |
| English | Latin, Deseret [3], Shavian [3] |  |
| Esperanto | Latin |  |
| Estonian | Latin |  |
| Evenki | Cyrillic |  |
| Faroese | Latin | Faroe Islands |
| Farsi | Arabic | (aka Persian) |
| Fijian | Latin |  |
| Finnish | Latin |  |
| French | Latin |  |
| Frisian | Latin |  |
| Gaelic | Latin |  |
| Gagauz | Cyrillic |  |
| Garhwali | Devanagari | India |
| Garo | Bengali | Bangladesh, India |
| Gascon | Latin |  |
| Ge’ez | Ethiopic | Eritrea, Ethiopia |
| Georgian | Georgian |  |
| German | Latin |  |
| Gondi | Devanagari, Telugu | India |
| Greek | Greek |  |
| Guarani | Latin |  |
| Gujarati | Gujarati |  |
| Garshuni | Syriac |  |
| Hanunóo | Latin, Hanunóo | Philippines |
| Harauti | Devanagari | India |
| Hausa | Latin, Arabic [3] |  |
| Hawaiian | Latin |  |
| Hebrew | Hebrew |  |
| Hindi | Devanagari |  |
| Hmong | Latin, Hmong [1] |  |
| Ho | Devanagari | Bangladesh, India |
| Hopi | Latin |  |
| Hungarian | Latin |  |
| Ibibio | Latin |  |
| Icelandic | Latin |  |
| Indonesian | Arabic [3], Latin |  |
| Ingush | Arabic, Latin |  |
| Inuktitut | Canadian Aboriginal Syllabics, Latin | Canada |
| Iñupiaq | Latin | Greenland |
| Irish | Latin |  |
| Italian | Latin |  |
| Japanese | Han + Hiragana + Katakana |  |
| Javanese | Latin, Javanese [1] |  |
| Judezmo | Hebrew |  |
| Kabardian | Cyrillic |  |
| Kachchi | Devanagari | India |
| Kalmyk | Cyrillic |  |
| Kanauji | Devanagari | India |
| Kankan | Devanagari | India |
| Kannada | Kannada | India |
| Kanuri | Latin |  |
| Khanty | Cyrillic |  |
| Karachay | Cyrillic |  |
| Karakalpak | Cyrillic |  |
| Karelian | Latin, Cyrillic |  |
| Kashmiri | Devanagari, Arabic |  |
| Kazakh | Cyrillic |  |
| Khakass | Cyrillic |  |
| Khamti | Myanmar | India, Myanmar |
| Khasi | Latin, Bengali | Bangladesh, India |
| Khmer | Khmer | Cambodia |
| Kirghiz | Arabic [3], Latin, Cyrillic |  |
| Komi | Cyrillic, Latin |  |
| Konkan | Devanagari |  |
| Korean | Hangul + Han |  |
| Koryak | Cyrillic |  |
| Kurdish | Arabic, Cyrillic, Latin | Iran, Iraq |
| Kuy | Thai | Cambodia, Laos, Thailand |
| Ladino | Hebrew |  |
| Lak | Cyrillic |  |
| Lambadi | Telugu | India |
| Lao | Lao | Laos |
| Lapp | Latin | (see Sami) |
| Latin | Latin |  |
| Latvian | Latin |  |
| Lawa, eastern | Thai | Thailand |
| Lawa, western | Thai | China, Thailand |
| Lepcha | Lepcha [1] | Bhutan, India, Nepal |
| Lezghian | Cyrillic |  |
| Limbu | Devanagari, Limbu [1] | Bhutan, India, Nepal |
| Lisu | Lisu (Fraser) [1], Latin | China |
| Lithuanian | Latin |  |
| Lushootseed | Latin | USA |
| Luxemburgish | Latin | (aka Luxembourgeois) |
| Macedonian | Cyrillic |  |
| Malay | Arabic [3], Latin | Brunei, Indonesia, Malaysia |
| Malayalam | Malayalam |  |
| Maldivian | Thaana | Maldives (see Dhivehi) |
| Maltese | Latin |  |
| Manchu | Mongolian | China |
| Mansi | Cyrillic |  |
| Marathi | Devanagari | India |
| Mari | Cyrillic, Latin |  |
| Marwari | Devanagari |  |
| Meitei | Meetai Mayek [1], Bengali | Bangladesh, India |
| Moldavian | Cyrillic |  |
| Mon | Myanmar | Myanmar, Thailand |
| Mongolian | Mongolian, Cyrillic | China, Mongolia |
| Mordvin | Cyrillic |  |
| Mundari | Bengali, Devanagari | Bangladesh, India, Nepal |
| Naga | Latin, Bengali | India |
| Nanai | Cyrillic |  |
| Navajo | Latin |  |
| Naxi | Naxi [2] | China |
| Nenets | Cyrillic |  |
| Nepali | Devanagari |  |
| Netets | Cyrillic |  |
| Newari | Devanagari, Ranjana, Parachalit |  |
| Nogai | Cyrillic |  |
| Norwegian | Latin |  |
| Oriya | Oriya | Bangladesh, India |
| Oromo | Ethiopic | Egypt, Ethiopia, Somalia |
| Ossetic | Cyrillic |  |
| Pali | Sinhala, Devanagari, Thai | India, Myanmar, Sri Lanka |
| Panjabi | Gurmukhi | India (see Punjabi) |
| Parsi-dari | Arabic | Afghanistan, Iran |
| Pashto | Arabic | Afghanistan |
| Polish | Latin |  |
| Portuguese | Latin |  |
| Provençal | Latin |  |
| Prussian | Latin |  |
| Punjabi | Gurmukhi | India |
| Quechua | Latin |  |
| Riang | Bengali | Bangladesh, China, India, Myanmar |
| Romanian | Latin, Cyrillic [3] | (aka Rumanian) |
| Romany | Cyrillic, Latin |  |
| Russian | Cyrillic |  |
| Sami | Cyrillic, Latin |  |
| Samaritan | Hebrew, Samaritan [1] | Israel |
| Sanskrit | Sinhala, Devanagari, etc. | India |
| Santali | Devanagari, Bengali, Oriya, Ol Cemet [1] | India |
| Selkup | Cyrillic |  |
| Serbian | Cyrillic |  |
| Shan | Myanmar | China, Myanmar, Thailand |
| Sherpa | Devanagari |  |
| Shona | Latin |  |
| Shor | Cyrillic |  |
| Sindhi | Arabic |  |
| Sinhala | Sinhala | (aka Sinhalese) Sri Lanka |
| Slovak | Latin |  |
| Slovenian | Latin |  |
| Somali | Latin |  |
| Spanish | Latin |  |
| Swahili | Latin |  |
| Swedish | Latin |  |
| Sylhetti | Siloti Nagri [1], Bengali | Bangladesh |
| Syriac | Syriac |  |
| Swadaya | Syriac | (see Syriac) |
| Tabasaran | Cyrillic |  |
| Tagalog | Latin, Tagalog |  |
| Tagbanwa | Latin, Tagbanwa |  |
| Tahitian | Latin |  |
| Tajik | Arabic [3], Latin, Cyrillic (? Latin) | (aka Tadzhik) |
| Tamazight | Tifinagh [1], Latin |  |
| Tamil | Tamil |  |
| Tat | Cyrillic |  |
| Tatar | Cyrillic |  |
| Telugu | Telugu |  |
| Thai | Thai |  |
| Tibetan | Tibetan |  |
| Tigre | Ethiopic | Eritrea, Sudan |
| Tsalagi |  | (see Cherokee) |
| Tulu | Kannada | India |
| Turkish | Arabic [3], Latin |  |
| Turkmen | Arabic [3], Latin, Cyrillic (? Latin) |  |
| Tuva | Cyrillic |  |
| Turoyo | Syriac | (see Syriac) |
| Udekhe | Cyrillic |  |
| Udmurt | Cyrillic, Latin |  |
| Uighur | Arabic, Latin, Cyrillic, Uighur [1] |  |
| Ukranian | Cyrillic |  |
| Urdu | Arabic |  |
| Uzbek | Cyrillic, Latin |  |
| Valencian | Latin |  |
| Vietnamese | Latin, Chu Nom |  |
| Yakut | Cyrillic |  |
| Yi | Yi, Latin |  |
| Yiddish | Hebrew |  |
| Yoruba | Latin |  |
[1] = Not yet encoded in Unicode.
[2] = Has one or more extinct or minor native script(s), not yet encoded.
[3] = Formerly or historically used this script, now uses another.

Notice that most of these scripts fall into the broader script families, such as Latin, Hanzi and Indic, noted previously.

While more countries are adopting Unicode and sample results indicate increasing percentage use, it is by no means prevalent. In general, Europe has been slow to embrace Unicode, with many legacy encodings still in use; perhaps Arabic sites have reached the 50% level, and Asian use is problematic.[11] Other samples suggest that UTF-8 encoding is limited to 8.35% of all Asian Web pages. Some countries, such as Nepal, Vietnam and Tajikistan, exceed 70% compliance, while others such as Syria, Laos and Brunei are below even 1%.[12] According to the Archive Pass project, which also used Basis Tech’s RLI for encoding detection, Chinese sites are dominated by GB-2312 and Big 5 encodings, while Shift-JIS is most common for Japanese.[13]

Detecting and Communicating with Legacy Encodings

There are two primary problems when dealing with non-Unicode encodings: identifying what the encoding is, and converting that encoding to a Unicode string, usually UTF-8. Detecting the encoding is a difficult process; BasisTech’s RLI does an excellent job. Converting the non-Unicode string to a Unicode string can be easily done using tools available in the Java JDK, or using BasisTech’s RCLU library.
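As a rough illustration of why detection is hard, the following Java sketch (my own naive heuristic, not Basis Tech's RLI, which relies on statistical models) simply reports the first candidate charset that decodes a byte stream without errors. Many byte sequences decode "cleanly" under several encodings, which is exactly why real detectors need language models:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class NaiveDetector {
    // Return the first candidate charset that decodes the bytes without any
    // malformed or unmappable sequences; null if none decodes cleanly.
    static String guessCharset(byte[] data, String... candidates) {
        for (String name : candidates) {
            CharsetDecoder decoder = Charset.forName(name).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(data));
                return name;                     // decoded cleanly under this charset
            } catch (CharacterCodingException e) {
                // fall through and try the next candidate
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xE3, (byte) 0x81, (byte) 0x82 };  // "あ" encoded in UTF-8
        System.out.println(guessCharset(bytes, "UTF-8", "Shift_JIS", "EUC-JP"));
    }
}
```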

Basis Tech detects a combination of 96 language/encoding pairs involving 40 different languages and 30 unique encoding types:

| Language | Encoding |
|---|---|
| Albanian | UTF-8, Windows-1252 |
| Arabic | UTF-8, Windows-1256, ISO-8859-6 |
| Bahasa Indonesia | UTF-8, Windows-1252 |
| Bahasa Malay | UTF-8, Windows-1252 |
| Bulgarian | UTF-8, Windows-1251, ISO-8859-5, KOI8-R |
| Catalan | UTF-8, Windows-1252 |
| Chinese | UTF-8, GB-2312, HZ-GB-2312, ISO-2022-CN |
| Chinese | UTF-8, Big5 |
| Croatian | UTF-8, Windows-1250 |
| Czech | UTF-8, Windows-1250 |
| Danish | UTF-8, Windows-1252 |
| Dutch | UTF-8, Windows-1252 |
| English | UTF-8, Windows-1252 |
| Estonian | UTF-8, Windows-1257 |
| Farsi | UTF-8, Windows-1256 |
| Finnish | UTF-8, Windows-1252 |
| French | UTF-8, Windows-1252 |
| German | UTF-8, Windows-1252 |
| Greek | UTF-8, Windows-1253 |
| Hebrew | UTF-8, Windows-1255 |
| Hungarian | UTF-8, Windows-1250 |
| Icelandic | UTF-8, Windows-1252 |
| Italian | UTF-8, Windows-1252 |
| Japanese | UTF-8, EUC-JP, ISO-2022-JP, Shift-JIS |
| Korean | UTF-8, EUC-KR, ISO-2022-KR |
| Latvian | UTF-8, Windows-1257 |
| Lithuanian | UTF-8, Windows-1257 |
| Norwegian | UTF-8, Windows-1252 |
| Polish | UTF-8, Windows-1250 |
| Portuguese | UTF-8, Windows-1252 |
| Romanian | UTF-8, Windows-1250 |
| Russian | UTF-8, Windows-1251, ISO-8859-5, IBM-866, KOI8-R, x-Mac-Cyrillic |
| Slovak | UTF-8, Windows-1250 |
| Slovenian | UTF-8, Windows-1250 |
| Spanish | UTF-8, Windows-1252 |
| Swedish | UTF-8, Windows-1252 |
| Tagalog | UTF-8, Windows-1252 |
| Thai | UTF-8, Windows-874 |
| Turkish | UTF-8, Windows-1254 |
| Vietnamese | UTF-8, VISCII, VPS, VIQR, TCVN, VNI |

The Java SDK's encoding/decoding support covers 22 basic European forms and 125 other international forms (mostly non-European), for 147 in total. If an encoded form is not on this list, and is not already Unicode, software cannot talk to the site without special converters or adapters. See http://java.sun.com/j2se/1.5.0/docs/guide/intl/encoding.doc.html
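A quick way to see what the running JDK can handle is to ask the java.nio charset registry directly; this small sketch (the charset names were chosen for illustration, and results will vary by JDK distribution) checks a few names from the tables above:

```java
import java.nio.charset.Charset;

public class CharsetSupport {
    public static void main(String[] args) {
        // Names and aliases to check against the running JVM's charset registry
        String[] names = { "UTF-8", "Shift_JIS", "KOI8-R", "windows-1256", "VISCII" };
        for (String name : names) {
            System.out.println(name + " supported: " + Charset.isSupported(name));
        }
        // Total number of charsets this JVM knows about
        System.out.println(Charset.availableCharsets().size() + " charsets available in this JVM");
    }
}
```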

Of course, to avoid the classic “garbage in, garbage out” (GIGO) problem, accurate detection must be made of the source’s encoding type, there must be a converter for that type into a canonical, internal form (such as UTF-8), and another converter must exist for converting that canonical form back to the source’s original encoding. The combination of the existing Basis Tech RLI and the Java SDK produces 89 valid language/encoding pairs; the remaining combinations require specialized converters.

Fortunately, existing valid combinations appear to cover all prevalent languages and encoding types. Should gaps exist, specialized detectors and converters may be required. As events move forward, the family of Indic languages may be the most problematic for expansion with standard tools.

Actual Language Processing

Encoding detection, and the resulting proper storage and language identification, is but the first essential step in actual language processing. Additional tools in morphological analysis or machine translation may need to be applied to address actual analyst needs. These tools are beyond the scope of this Tutorial.

The key point, however, is that all foreign language processing and analysis begins with accurate encoding detection and communicating with the host site in its original encoding. These steps are the sine qua non of language processing.

Exemplar Methodology for Internet Foreign Language Support

We can now take the information in this Tutorial and present what might be termed an exemplar methodology for initial language detection and processing. A schematic of this methodology is provided in the following diagram:

[Diagram: exemplar methodology for initial language detection and processing]

This diagram shows that the actual encoding for an original Web document or search form must be detected, converted into a standard “canonical” form for internal storage, but addressed in its actual native encoding when searching it. Encoding detection software and utilities within the Java SDK can aid this process greatly.
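A hypothetical Java sketch of that round trip might look as follows (the detected charset, sample bytes and class name are illustrative assumptions; in practice the detection step would come from a tool such as Basis Tech's RLI):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {
    public static void main(String[] args) {
        // Assume a detector (e.g., an RLI-style tool) has identified the site's encoding
        Charset siteCharset = Charset.forName("Shift_JIS");

        // 1. Decode the harvested bytes into Java's internal Unicode strings
        byte[] rawPage = { (byte) 0x93, (byte) 0xFA };        // the character 日 in Shift_JIS
        String document = new String(rawPage, siteCharset);

        // 2. Store the canonical internal form as UTF-8
        byte[] canonical = document.getBytes(StandardCharsets.UTF_8);

        // 3. Re-encode any outbound query in the site's own native encoding
        String query = "\u65E5\u672C";                        // "日本"
        byte[] outbound = query.getBytes(siteCharset);

        System.out.println(canonical.length + " canonical UTF-8 bytes, "
                + outbound.length + " native outbound bytes");
    }
}
```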

And, as the proliferation of languages and legacy forms grows, we can expect such utilities to embrace an ever-widening set of encodings.


[1] Yoshiki Mikami, “Language Observatory: Scanning Cyberspace for Languages,” from The Second Language Observatory Workshop, February 21-25, 2005, 41 pp. See http://gii.nagaokaut.ac.jp/~zaidi/Proceedings%20Online/01_Mikami.pdf. This is a generally useful reference on Internet and language. Please note some of the figures have been updated with more recent data.

[2] See http://www.ethnologue.com/ethno_docs/distribution.asp?by=size.

[3] See http://global-reach.biz/globstats/index.php3. Also, for useful specific notes by country as well as original references, see http://global-reach.biz/globstats/refs.php3.

[4] Another interesting language source, with an emphasis on Latin-family languages, is FUNREDES’ 2005 study of languages and cultures. See http://funredes.org/LC/english/index.html.

[5] John Paolillo, Daniel Pimienta, Daniel Prado, et al. Measuring Linguistic Diversity on the Internet, a UNESCO Publications for the World Summit on the Information Society 2005, 113 pp. See http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf

[6] John Paolillo, “Language Diversity on the Internet,” pp. 43-89, in John Paolillo, Daniel Pimienta, Daniel Prado, et al., Measuring Linguistic Diversity on the Internet, UNESCO Publications for the World Summit on the Information Society 2005, 113 pp. See http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf.

[7] Information Sciences Institute press release, “USC Researchers Build Machine Translation System  –  and More — for Hindi in Less Than a Month,” June 30, 2003. See http://www.isi.edu/stories/60.html.

[8] http://www.iana.org/assignments/character-sets.

[9] The actual values were calculated from Jukka “Yucca” Korpela’s informative Web site at http://www.cs.tut.fi/%7Ejkorpela/chars/sorted.html.

[10] See http://www.unicode.org/onlinedat/languages-scripts.html.

[11] Pers. Comm., B. Margulies, Basis Technology, Inc., Feb. 27, 2006.

[12] Yoshika Mikami et al., “Language Diversity on the Internet: An Asian View,” pp. 91-103, in John Paolillo, Daniel Pimienta, Daniel Prado, et al., Measuring Linguistic Diversity on the Internet, UNESCO Publications for the World Summit on the Information Society 2005, 113 pp. See http://www.uis.unesco.org/template/pdf/cscl/MeasuringLinguisticDiversity_En.pdf.

[13] Archive Pass Project; see http://crawler.archive.org/cgi-bin/wiki.pl?ArchivePassProject

Posted:March 22, 2006

The ePrécis Web site showcases technology that creates abstracts from any text document. In its Web site search, sites relevant to your search requests are analyzed by ePrécis and results are returned in a typical search format.

Richard MacManus provides a background description in ZDNet about this technology, with more focus on its comparison to Google as a search engine or in relation to OWL semantic Web approaches.

According to the ePrécis white paper by James Matthewson:

ePrécis is not a program per se, but a C++ language application programmer interface (API) that can be embedded in any number of applications to return relevant outputs given a wide variety of natural language inputs. In addition to plugging into Web browsers or search engines, it could plug into word processing programs to automatically provide abstracts, executive summaries, back-of-the book indexes, and writing or translation support.”

You can get this white paper from the ePrécis site or download a macro to embed within MS Word to create your own abstracts and indexes.  (You will also need the Microsoft SOAP 3.0 package installed.)  Check it out; it’s kinda fun, and generally pretty impressive in creating useful abstracts.  You should also try the searches from the ePrécis Web site.  Hint: For best performance, use long or technical queries (more context).
