Javascript regular expression for searching word boundaries in Unicode string

https://stackoverflow.com/questions/7927659

15-02-2021
|

题

Is there solution to find word boundaries in Japanese string (E.g.: "私はマーケットに行きました。") via JavaScript regular expressions("xregexp" JS library cab be used)?

E.g.:

var xr = RegExp("\\bst","g");
xr.test("The string") // --> true

I need the same logic for Japanese strings.

解决方案

However, the actual problem of separating the Japanese sentence into words is more complicated than it appears, since words are not separated into spaces as is the case, for example, in English.

For example, the sentence 私はマーケットに行きました。 ("I went to the market") has the following words:

私 - watakushi
は - wa
マーケット - maaketto
に - ni
行きました - ikimashita
。 - (period)

A reliable parser of Japanese sentences would, among other things, have to find where the particles (wa and ni) lie in the sentence, in order to find the remaining words.

其他提示

\b, as well as \w and \W, isn't Unicode-aware in JavaScript. You have to define your word boundaries as a specific character set. Like (^|$|[\s.,:\u3002]+) or similar.

\u3002 is from ('。'.charCodeAt(0)).toString(16). Is it a punctuation symbol in Japanese?

Or, a contrario, define a Unicode range of word-constructing letters and negate it:

var boundaries = /(^|$|\s+|[^\u30A0–\u30FA]+)/g;

The example katakana range taken from http://www.unicode.org/charts/PDF/U30A0.pdf.

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow