Javascript regular expression for searching word boundaries in Unicode string

https://stackoverflow.com/questions/7927659

15-02-2021

Question

Is there a way to find word boundaries in a Japanese string (e.g. "私はマーケットに行きました。") with JavaScript regular expressions? (The "xregexp" JS library can be used.)

E.g.:

var xr = RegExp("\\bst","g");
xr.test("The string") // --> true

I need the same logic for Japanese strings.

Solution

The actual problem of separating a Japanese sentence into words is more complicated than it appears, since words are not separated by spaces as they are, for example, in English.

For example, the sentence 私はマーケットに行きました。 ("I went to the market") has the following words:

私 - watakushi
は - wa
マーケット - maaketto
に - ni
行きました - ikimashita
。 - (period)

A reliable parser of Japanese sentences would, among other things, have to find where the particles (wa and ni) lie in the sentence, in order to find the remaining words.
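As a rough illustration of dictionary-based segmentation (not something the original answer proposes), newer JavaScript runtimes expose Intl.Segmenter, which can split text into word-like segments for a given locale. The sketch below assumes the runtime ships Intl.Segmenter with Japanese data; segmentWords is a hypothetical helper name, and the exact split depends on the runtime's dictionary, so it may not match the word list above.

```javascript
// Sketch only: assumes Intl.Segmenter is available in this runtime.
// segmentWords is a hypothetical helper, not part of any library.
function segmentWords(text, locale = "ja") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Each item yielded by segment() has { segment, index, input, isWordLike }.
  return Array.from(segmenter.segment(text), (s) => ({
    text: s.segment,
    index: s.index,
    isWordLike: s.isWordLike, // false for punctuation and whitespace
  }));
}

const parts = segmentWords("私はマーケットに行きました。");
// Print the segments separated by " | "; the split is runtime-dependent.
console.log(parts.map((p) => p.text).join(" | "));
```

The index of each segment gives the boundary positions, which is the information \b would otherwise provide.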

OTHER TIPS

\b, as well as \w and \W, aren't Unicode-aware in JavaScript. You have to define your word boundaries as a specific character set, such as (^|$|[\s.,:\u3002]+) or similar.

\u3002 comes from ('。'.charCodeAt(0)).toString(16); it is the ideographic full stop (。), which serves as the period in Japanese.
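A minimal sketch of both points: \b does not fire between Japanese characters, while an explicit delimiter set does. The helper name startsAfterDelimiter and the sample strings are made up for illustration; the delimiter set is the one suggested above and would need extending for real text.

```javascript
// \b depends on \w, which only covers [A-Za-z0-9_].
const asciiBoundary = /\bst/;
console.log(asciiBoundary.test("The string")); // true

// Neither は nor 行 is a \w character, so there is no \b between them.
const jaBoundary = /\b行/;
console.log(jaBoundary.test("私は行きました")); // false

// Hypothetical helper: does `prefix` occur at the start of the text
// or right after one of the explicitly listed delimiter characters?
const DELIMITERS = "\\s.,:\\u3002";
function startsAfterDelimiter(text, prefix) {
  // Escape regex metacharacters in the literal prefix.
  const escaped = prefix.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const re = new RegExp("(?:^|[" + DELIMITERS + "])" + escaped);
  return re.test(text);
}

console.log(startsAfterDelimiter("東京。大阪", "大阪")); // true
console.log(startsAfterDelimiter("東京大阪", "大阪"));   // false
```

This only detects boundaries at listed punctuation; it cannot tell where one word ends and the next begins inside an unpunctuated run of characters.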

Or, a contrario, define a Unicode range of word-constructing letters and negate it:

var boundaries = /(^|$|\s+|[^\u30A0-\u30FA]+)/g;

The example katakana range is taken from http://www.unicode.org/charts/PDF/U30A0.pdf.
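To see how the choice of range affects the result, here is a small sketch that splits the sample sentence on runs of non-katakana characters. One caveat, stated as my own observation rather than the answer's: a range ending at \u30FA leaves out the prolonged sound mark ー (U+30FC), so マーケット gets cut in two, whereas the full Katakana block (U+30A0 to U+30FF) keeps it together. The variable names are illustrative.

```javascript
const text = "私はマーケットに行きました。";

// Range from the answer: stops at U+30FA, so ー (U+30FC) counts as a boundary.
const narrow = text.split(/[^\u30A0-\u30FA]+/).filter(Boolean);
console.log(narrow); // [ 'マ', 'ケット' ]

// Full Katakana block: ー is treated as part of the word.
const wide = text.split(/[^\u30A0-\u30FF]+/).filter(Boolean);
console.log(wide); // [ 'マーケット' ]
```

Either way, this only isolates katakana runs; kanji and hiragana words would need their own ranges, and particles still cannot be separated by character class alone.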

Licensed under: CC-BY-SA with attribution
