Question

I am writing regular expressions for Unicode text in Java. However, for the script I am using, Devanagari (U+0900 to U+097F), there is a problem with word boundaries: \b matches next to dependent vowel signs (such as U+093E to U+094C), because they are treated as non-word characters, much like spaces.

Example: suppose I have the string "कमल कमाल कम्हल कम्हाल". Note that 'मा' in the second word is formed by combining म and ा (which is treated like a space character). The same happens in the last word. As a result, the regular expression \b\w\b matches the 'ल' in 'कमाल', which is not correct for the language.

I hope the example helps.

Can I write a regular expression that behaves like \b except that it doesn't treat certain characters as boundaries? Any feedback would be appreciated.


Solution

You should be able to accomplish what you want with the following regex operators:

(?=X)   X, via zero-width positive lookahead
(?!X)   X, via zero-width negative lookahead
(?<=X)  X, via zero-width positive lookbehind
(?<!X)  X, via zero-width negative lookbehind

(The above is quoted from the Java 6 Pattern API docs.)

Use (?<![foo])(?=[foo]) in place of \b before a word, and (?<=[foo])(?![foo]) in place of \b after a word, where "[foo]" is your set of "word characters" (for Devanagari, that set should include the dependent vowel signs).
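As a rough sketch of how this could look in Java: the class below builds the two replacement boundaries from a word-character set and runs them against the sample string from the question. Treating the whole Devanagari block (U+0900 to U+097F) as "word characters" is an assumption made here for illustration, and the names DevanagariBoundaries, WORD, START, END and findAll are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DevanagariBoundaries {
    // Assumption: every code point in the Devanagari block is a word character.
    static final String WORD = "[\\u0900-\\u097F]";
    // Replacement for \b before a word: no word char behind, a word char ahead.
    static final String START = "(?<!" + WORD + ")(?=" + WORD + ")";
    // Replacement for \b after a word: a word char behind, no word char ahead.
    static final String END = "(?<=" + WORD + ")(?!" + WORD + ")";

    // Collects every match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // The sample string from the question, written with Unicode escapes
        // so the result does not depend on the source file's encoding.
        String text = "\u0915\u092E\u0932 \u0915\u092E\u093E\u0932 "
                    + "\u0915\u092E\u094D\u0939\u0932 \u0915\u092E\u094D\u0939\u093E\u0932";

        // Whole words: one or more word characters between the custom boundaries.
        System.out.println(findAll(START + WORD + "+" + END, text));

        // Counterpart of \b\w\b: a single word character standing alone.
        System.out.println(findAll(START + WORD + END, text));
    }
}
```

With this set, the first search returns the four words of the sample string, and the second returns an empty list: the 'ल' in 'कमाल' is no longer reported, because the vowel sign before it now counts as part of the word.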

OTHER TIPS

The equivalent of word boundaries (if the built-in boundaries are not what you expect) would be:

 (?<![x-y])(?=[x-y])...(?<=[x-y])(?![x-y])

That is because a "word boundary" means "a location where there is a word character on one side and not on the other".

So with look-behind and look-ahead expressions, you can define your own class of characters [x-y] to check when you want to isolate a "word boundary".
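To make the "define your own class" idea concrete, here is a small sketch of a helper that takes the class as a parameter and wraps any pattern in the two lookaround boundaries. The helper names (CustomBoundary, wholeWord, count) are hypothetical, and the class [\p{L}\p{M}\p{Nd}] (letters, combining marks, decimal digits) is only one possible choice of [x-y], not something prescribed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CustomBoundary {
    // Hypothetical helper: surrounds `body` with boundaries derived from
    // `wordClass`, which must be a single-character class such as "[x-y]".
    static Pattern wholeWord(String wordClass, String body) {
        String start = "(?<!" + wordClass + ")(?=" + wordClass + ")";
        String end = "(?<=" + wordClass + ")(?!" + wordClass + ")";
        return Pattern.compile(start + "(?:" + body + ")" + end);
    }

    // Counts the non-overlapping matches of p in text.
    static int count(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Assumption: letters, combining marks and decimal digits form words.
        String wordClass = "[\\p{L}\\p{M}\\p{Nd}]";

        // Two words: U+0915 U+092E, then U+0915 U+092E U+093E U+0932
        // followed by a danda (U+0964, a punctuation character).
        String text = "\u0915\u092E \u0915\u092E\u093E\u0932\u0964";

        // The short word matches only where it stands alone, not as the
        // prefix of the longer word.
        System.out.println(count(wholeWord(wordClass, "\u0915\u092E"), text));

        // The longer word matches even though a danda follows it directly.
        System.out.println(count(wholeWord(wordClass, "\u0915\u092E\u093E\u0932"), text));
    }
}
```

Both searches print 1. The choice of class matters: with the whole block [\u0900-\u097F] as the class, the danda would itself count as a word character and the second search would find nothing.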

Licensed under: CC-BY-SA with attribution