Question

I want to get the word count of a String. It's as simple as that. The catch is that the string can be in an unpredictable language.

So, I need a function of signature int getWordCount(String) with the following sample output -

getWordCount("供应商代发发货") => 7
getWordCount("This is a sentence") => 4

Any help on how to proceed would be appreciated :)


Solution 2

The concept of a "word" may be trivial or complex. The Apache Stanbol documentation describes the problem as follows:

Word Tokenization: The detection of single words is required by the Stanbol Enhancer to process text. While this is trivial for most languages it is a rather complex task for some eastern languages, e.g. Chinese, Japanese, Korean. If not otherwise configured, Stanbol will use whitespaces to tokenize words.

So if your notion of a word is linguistic rather than syntactic, you should use an NLP toolkit.

My preferred Java solution is Apache OpenNLP.

NOTE: I used http://www.mdbg.net/chindict/chindict.php?page=worddict to tokenize your example. It finds 4 words, not 7 (供应商, 代, 发, 发货). The dictionary output, cut and pasted and tidied up, with the two readings it lists for 发:

供应商 (gōngyìngshāng, traditional 供應商): supplier
代 (dài): to substitute / to act on behalf of others / to replace / generation / dynasty / age / period / (historical) era / (geological) eon
发 (fā, traditional 發, HSK 4): to send out / to show (one's feeling) / to issue / to develop / classifier for gunshots (rounds)
发 (fà, traditional 髮): hair / Taiwan pr. [fa3]
发货 (fāhuò, traditional 發貨): to dispatch / to send out goods

These first three characters appear to form a single word.

OTHER TIPS

The standard API provides java.text.BreakIterator for this sort of boundary analysis, but with the Oracle Java 7 locale support it doesn't break the sample string.

When I used the ICU4J v51.1 BreakIterator it broke the sample into [供应, 商代, 发, 发, 货].

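// Requires ICU4J on the classpath:
// import com.ibm.icu.text.BreakIterator;
// import java.util.ArrayList;
// import java.util.List;
// import java.util.Locale;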
String sentence = "\u4f9b\u5e94\u5546\u4ee3\u53d1\u53d1\u8d27";
BreakIterator iterator = BreakIterator.getWordInstance(Locale.CHINESE);
iterator.setText(sentence);

List<String> words = new ArrayList<>();
int start = iterator.first();
int end = iterator.next();
while (end != BreakIterator.DONE) {
  words.add(sentence.substring(start, end));
  start = end;
  end = iterator.next();
}
System.out.println(words);

Note: I used Google Translate to guess that "供应商代发发货" was Chinese. Obviously, I don't speak the language so can't comment on the correctness of the output.
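
For comparison, the same loop can be run against the JDK's own java.text.BreakIterator, which needs no extra library. This is only a minimal sketch: the class name JdkWordBreaks and the letter-or-digit filter (used to skip the whitespace and punctuation pieces the iterator also reports) are assumptions added for illustration, and the result for Chinese text depends on the JDK's locale data.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class JdkWordBreaks {
    // Splits text at the word boundaries reported by the JDK BreakIterator and
    // keeps only the pieces that contain at least one letter or digit.
    static List<String> words(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);

        List<String> words = new ArrayList<>();
        int start = iterator.first();
        int end = iterator.next();
        while (end != BreakIterator.DONE) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(piece);
            }
            start = end;
            end = iterator.next();
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("This is a sentence", Locale.ENGLISH));
        // No particular segmentation is assumed for the Chinese sample.
        System.out.println(words("\u4f9b\u5e94\u5546\u4ee3\u53d1\u53d1\u8d27", Locale.CHINESE));
    }
}
```

The word count is then simply the size of the returned list.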

If we assume that every language has one or more word separators, and that you can build a regex for those separators, then the problem can be solved like this:

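    public String separatorForLanguage(char unicodeChar) {
        // Find out which language unicodeChar belongs to
        return ""; // placeholder: return the separator regex for that language
    }

    public int wordCount(String sentence) {
        if (sentence.isEmpty()) {
            return 0; // charAt(0) would throw on an empty string
        }
        // Guess the language from the first character
        String separator = separatorForLanguage(sentence.charAt(0));

        if (separator.isEmpty()) {
            // No separator: every character counts as a word. Counting code points
            // avoids split(""), which returns a leading empty string on Java 7 but not on Java 8+.
            return sentence.codePointCount(0, sentence.length());
        }

        return sentence.split(separator).length;
    }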
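
One possible way to fill in the separatorForLanguage placeholder is the JDK's Character.UnicodeScript, which reports the script a code point belongs to. The sketch below is an assumption about how that could look, not the answerer's implementation: the class name ScriptAwareCounter is made up, and the choice of HAN, HIRAGANA and KATAKANA as "scripts without separators" is a simplification that treats every character in those scripts as one word.

```java
class ScriptAwareCounter {
    // Hypothetical separatorForLanguage: scripts normally written without
    // spaces get an empty separator, everything else is split on whitespace.
    static String separatorForLanguage(int codePoint) {
        switch (Character.UnicodeScript.of(codePoint)) {
            case HAN:
            case HIRAGANA:
            case KATAKANA:
                return "";
            default:
                return "\\s+";
        }
    }

    static int wordCount(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        // Guess the script from the first code point of the text.
        String separator = separatorForLanguage(trimmed.codePointAt(0));
        if (separator.isEmpty()) {
            // No separator: count each code point as one word.
            return trimmed.codePointCount(0, trimmed.length());
        }
        return trimmed.split(separator).length;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("This is a sentence"));
        System.out.println(wordCount("\u4f9b\u5e94\u5546\u4ee3\u53d1\u53d1\u8d27"));
    }
}
```

Like the original idea, this looks only at the first character, so text that mixes scripts is counted by the rule for whichever script comes first.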

Here is a snippet in Java:

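// import java.util.regex.Matcher;
// import java.util.regex.Pattern;

public static int getWordCount(String string)
{
    // A run of ASCII word characters and apostrophes counts as one word;
    // each CJK ideograph in the listed ranges counts as one word on its own.
    Pattern pattern = Pattern.compile("[\\w']+|[\\u3400-\\u4DB5\\u4E00-\\u9FCC]");
    Matcher matcher = pattern.matcher(string);
    int count = 0;
    while(matcher.find())
        count++;
    return count;
}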

Example

// count is 5: "this", "is", "popcorny's", and one for each of the two ideographs
int wordCount = getWordCount("this is popcorny's 電腦");
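
Note that in Java, \w matches only ASCII letters, digits and underscore by default, and the two hard-coded ranges cover only part of the CJK ideographs. A variant built on Unicode properties could look like the sketch below. It is an illustration under assumptions, not the answerer's code: the class name UnicodeWordCount is made up, and "one Han character = one word" is kept from the snippet above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordCount {
    // Either a single Han character, or a run of letters, digits and
    // apostrophes in which no character is Han (enforced by the lookahead).
    private static final Pattern WORD =
            Pattern.compile("\\p{IsHan}|(?:(?!\\p{IsHan})[\\p{L}\\p{N}'])+");

    static int getWordCount(String text) {
        Matcher matcher = WORD.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(getWordCount("this is popcorny's \u96fb\u8166"));
        System.out.println(getWordCount("caf\u00e9 au lait"));
    }
}
```

The lookahead matters because \p{L} also matches Han characters; without it, Latin text directly followed by ideographs would be swallowed as a single word.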

English version

For the English version, a rather simple regex will do. I may have missed some separators, but:

public static int getWordCount(String str) {
    return str.split("[\\s,;-]+").length;
}

Regex explanation:

Split on any character in the group []:

[
\\s any whitespace character, or
, a comma, or
; a semicolon, or
- a hyphen
]
+ one or more consecutive characters from the group
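
One caveat with counting via split: an empty string yields a count of 1, and text that starts with a separator yields one extra, empty element. A minimal sketch that guards against both is shown below; the class name EnglishWordCount and the SEPARATORS constant are made up for illustration, and the separator set is the same one used above.

```java
class EnglishWordCount {
    // Same separator group as above: whitespace, comma, semicolon, hyphen.
    private static final String SEPARATORS = "[\\s,;-]+";

    static int getWordCount(String str) {
        // Drop leading separators so split() does not return an empty first element.
        String stripped = str.replaceFirst("^" + SEPARATORS, "");
        if (stripped.isEmpty()) {
            return 0; // nothing but separators, or an empty string
        }
        // Trailing empty strings are discarded by split() itself.
        return stripped.split(SEPARATORS).length;
    }

    public static void main(String[] args) {
        System.out.println(getWordCount("  This is a sentence"));
        System.out.println(getWordCount(""));
    }
}
```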

Chinese version

For the Chinese version, you need to identify what the separators are. If you add the Unicode code points of the Chinese separators to the regex above, you will get the desired results.

Tests

System.out.println(getWordCount("This is a sentence"));// 4
System.out.println(getWordCount("This is a     ,,sentence")); // 4
Licensed under: CC-BY-SA with attribution