Question

Possible Duplicate:
Javascript: Unicode string split by chars

I have a JavaScript string which contains some Tamil characters. I need to split it into individual Unicode characters. The split method does not understand complex text layout (http://en.wikipedia.org/wiki/Complex_text_layout).

For example:

Calling split("") on "கதிரவன்" returns:

க,த,ி,ர,வ,ன,்

when I expected:

க,தி,ர,வ,ன்

What should be done to split a string into Unicode characters properly?

Edit: I can navigate these letters just fine in the browser (Chrome). I am trying to use this JS in a Chrome extension, so a Chrome-specific solution is fine too.


Solution

This is totally doable.

First off, create a set/dictionary whose keys are all the diacritic-like characters. We could name it diacritics and implement it with a plain object literal:

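// Keys are the combining marks; the list is elided here and must cover every mark you need.
var diacritics = {'\u0bbf': true, '\u0bcd': true /* , ... */};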

Then do this:

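var tempList = "கதிரவன்".split('');
var targetList = [];
// Use an index loop: for...in on an array also visits inherited enumerable properties.
for (var idx = 0; idx < tempList.length; idx++) {
  if (diacritics[tempList[idx]] && targetList.length > 0)
    targetList[targetList.length - 1] += tempList[idx]; // attach the mark to the previous character
  else
    targetList.push(tempList[idx]);
}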

We don't even need tempList; looping over the string character by character does the same job:

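var str = "கதிரவன்", targetList = [];
for (var i = 0; i < str.length; ++i) {
  var ch = str[i];
  // Append a mark to the previous entry; start a new entry otherwise.
  if (diacritics[ch] && targetList.length) targetList[targetList.length - 1] += ch;
  else targetList.push(ch);
}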
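The loop above can be wrapped in a reusable function. This is only a sketch: the names splitWithMarks and sampleMarks are made up for illustration, and the mark set deliberately contains just the two code points named in the answer (U+0BBF and U+0BCD), which happen to be enough for the example word. A real table would need every combining mark of the script.

```javascript
// Hypothetical helper: glue each combining mark onto the character before it.
function splitWithMarks(str, marks) {
  const result = [];
  for (let i = 0; i < str.length; i++) {
    const ch = str[i];
    if (marks.has(ch) && result.length > 0) {
      result[result.length - 1] += ch; // mark: extend the previous entry
    } else {
      result.push(ch);                 // base character (or a stray leading mark)
    }
  }
  return result;
}

// Assumption: only the two marks from the answer; extend for real use.
const sampleMarks = new Set(['\u0bbf', '\u0bcd']);

console.log(splitWithMarks('கதிரவன்', sampleMarks).join(','));
// க,தி,ர,வ,ன்
```

Passing the set as a parameter keeps the splitting logic separate from the script-specific data, so the same function could be reused with a different table.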

Other tips

Have you tried a Unicode library such as https://github.com/reyesr/javascript-unicode? It provides methods related to Unicode character types, for instance testing for punctuation or separator characters, so you could split according to them (though I guess you won't be able to use String.split() directly). Alternatively, build one big regex containing all the separator characters from the Unicode table and use it to split your text. You are not short of options, although you're right about the lack of native support.
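In engines that support Unicode property escapes in regular expressions (ES2018 and later), the "big regex" does not have to be written out by hand. The sketch below is a different technique from the library mentioned above: it matches one non-mark code point followed by any combining marks (general category M). The names CLUSTER and splitByMarks are made up for illustration, and this only approximates user-perceived characters; it does not handle cases such as emoji joined with zero-width joiners.

```javascript
// \P{M}\p{M}* : one code point that is not a combining mark, plus the marks after it.
// \p{M}+      : fallback so that marks with no base character are not silently dropped.
const CLUSTER = /\P{M}\p{M}*|\p{M}+/gu;

// Hypothetical helper: return the matched clusters, or an empty array for "".
function splitByMarks(str) {
  return str.match(CLUSTER) || [];
}

console.log(splitByMarks('கதிரவன்').join(','));
// க,தி,ர,வ,ன்
```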

I fear that your best solution will be to build and use a web service to do the job. Porting the necessary data and algorithm into JavaScript would be a daunting project.

This would be quite bulky to do manually in JavaScript, because JavaScript strings, although Unicode, are indexed by 16-bit code units rather than by the characters a reader sees. For information on why this is not an option and a possible workaround, see this post.
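As a possible alternative, newer runtimes (recent Chrome and Node.js releases, as far as I know) ship Intl.Segmenter, which segments text into grapheme clusters using the engine's own Unicode data. The sketch below assumes that API is present and returns null otherwise; splitGraphemes is a made-up name, and the 'ta' locale tag is just a choice for Tamil text.

```javascript
// Hypothetical helper: split into grapheme clusters, or return null if unsupported.
function splitGraphemes(str) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return null; // this runtime does not provide Intl.Segmenter
  }
  const segmenter = new Intl.Segmenter('ta', { granularity: 'grapheme' });
  // segment() returns an iterable of objects; keep only the text of each segment.
  return Array.from(segmenter.segment(str), (item) => item.segment);
}

const word = 'கதிரவன்';
console.log(word.length);          // number of UTF-16 code units, not visible letters
console.log(splitGraphemes(word)); // array of grapheme clusters, or null
```

Since the question allows a Chrome-specific solution, this may avoid maintaining a mark table at all, at the cost of depending on the runtime's version.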

License: CC-BY-SA with attribution
Not affiliated with StackOverflow