Word Segmentation using ICU

https://stackoverflow.com/questions/13494910

01-12-2021

Question

I am using ICU4C to transliterate CJK. I am wondering whether it is possible to have word segmentation in ICU, to split Chinese text into a sequence of words, defined according to some word segmentation standard.

When I try transliterating for example:

直接输出html代码而不是作为函数返回值代后处理

using

UErrorCode err = U_ZERO_ERROR;
Transliterator* myTrans =
    Transliterator::createInstance("zh-Latin", UTRANS_FORWARD, err);
UnicodeString str;
str.setTo("直接输出html代码而不是作为函数返回值代后处理");
// Transliterate in place, then convert to UTF-8 for printing
myTrans->transliterate(str);
std::string st;
str.toUTF8String(st);
std::cout << st << std::endl;

I get the following output:

zhí jiē shū chū html dài mǎ ér bù shì zuò wèi hán shù fǎn huí zhí dài hòu chù lǐ

This looks correct when checked against online pinyin tools, but my problem is that ICU transliterates the characters one by one. What I'm looking for is something more like the text below (I don't know any Chinese, so the text below probably doesn't mean anything, but it should demonstrate the kind of output I'm interested in):

zhíjiē shūchū html dàimǎér bùshì zuò wèihán shùfǎn huízhídài hòu chùlǐ

I have been told that ICU 50 is capable of word segmentation, but I couldn't find any documentation on their website or elsewhere on the web. Has anyone worked with word segmentation in ICU, or can anyone point me to a good link on how to do it?

Solution

"Dictionary Based Iterator" isn't a different API. Just create an ICU word break iterator with the appropriate locale ID.

There's a C/C++ sample that comes with ICU in icu/source/samples/break.

The following sample code also shows word breaking: http://source.icu-project.org/repos/icu/icuapps/trunk/iucsamples/c/s24_brkw/s24_brkw.cpp http://source.icu-project.org/repos/icu/icuapps/trunk/iucsamples/c/s23_brki/

Probably something like this:

UErrorCode status = U_ZERO_ERROR;
// Word break iterator for the Chinese locale
BreakIterator *wordIterator = BreakIterator::createWordInstance(Locale("zh"), status);
UnicodeString text = "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.";
wordIterator->setText(text);
int32_t breakCount = 0;
int32_t start = wordIterator->first();
for (int32_t end = wordIterator->next();
     end != BreakIterator::DONE;
     start = end, end = wordIterator->next())
{
    // [start, end) delimits one segment of the text
    breakCount++;
}
delete wordIterator;

Other suggestions

This is the reply I got from ICU's mailing list:

"There's a brand new online demo in progress also, that does the segmentation and splits your text as the following - when Chinese is selected. hope this helps."

直接
输出
html
代码
而不是
作为
函数
返回
值
代
后
处理

This would solve my problem: I need to transliterate this segmented output to get what I'm looking for.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow