Word boundaries with an extended set of characters

https://stackoverflow.com/questions/8539122

18-03-2021
Question

It seems a little strange to me that \w matches [a-zA-Z0-9_]. I wonder why 0-9 and _ are counted among word characters and why - is not.

If I want to split the sentence:

This is counter-example.

with (\w*\b), the word counter-example is split into two parts. Similarly, (count.*?\b) matches only counter.
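The behaviour described above can be reproduced with a short Java sketch. The class and method names are made up for illustration, and empty matches are filtered out because \w*\b also succeeds on an empty string at every boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original question.
class DefaultBoundaryDemo {
    // Collects every non-empty match of regex in input, in order of appearance.
    static List<String> nonEmptyMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (!m.group().isEmpty()) { // skip the empty matches \w*\b produces
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "This is counter-example.";
        // The hyphen is not a word character, so \b fires on both sides of it.
        System.out.println(nonEmptyMatches("\\w*\\b", sentence));     // [This, is, counter, example]
        System.out.println(nonEmptyMatches("count.*?\\b", sentence)); // [counter]
    }
}
```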

Would it be possible to have something like \b that treats - as a word character, as if it were part of \w?

Or did I misunderstand the usage of \b? Are there some examples of its standard usage?

Solution

The fact that \w matches the underscore along with uppercase and lowercase letters is historical: it is due to the fact that it was first introduced to match C identifiers.

Well, this is true for Java's \w (yes, by default \w will not match accented characters in Java).
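A quick way to check this is the sketch below. It compares the default \w with the same pattern compiled under Pattern.UNICODE_CHARACTER_CLASS, a standard java.util.regex flag that switches the predefined classes to their Unicode definitions. The class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex.Pattern is a real API here.
class UnicodeWordDemo {
    // True if the whole input is matched by a single \w under the given flags.
    static boolean isWordChar(String s, int flags) {
        return Pattern.compile("\\w", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // e with an acute accent
        System.out.println(isWordChar(eAcute, 0));                               // false
        System.out.println(isWordChar(eAcute, Pattern.UNICODE_CHARACTER_CLASS)); // true
    }
}
```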

\b, however, is an anchor, and it is not always defined as the frontier between a word character and a non-word character; its exact definition is implementation-dependent.

There is not really an anchor that does what you want, but if you want to match words that may contain dashes, your best bet is \w*(-\w*)*.

Again, the normal* (special normal*)* pattern!
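As a rough illustration, the pattern can be tried in Java as follows. Note that \w*(-\w*)* can also match the empty string or a lone dash, so the sketch compares it with a stricter variant, \w+(?:-\w+)*, which is an assumption of this example rather than something the answer proposes. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the normal* (special normal*)* idea.
class HyphenatedWords {
    // The answer's pattern, verbatim.
    static final Pattern ANSWER = Pattern.compile("\\w*(-\\w*)*");
    // Assumed stricter variant: at least one word character on each side of a dash.
    static final Pattern STRICT = Pattern.compile("\\w+(?:-\\w+)*");

    // Returns all non-empty matches of p in input, in order of appearance.
    static List<String> words(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            if (!m.group().isEmpty()) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "This is counter-example.";
        System.out.println(words(ANSWER, sentence)); // [This, is, counter-example]
        System.out.println(words(STRICT, sentence)); // [This, is, counter-example]
        // A free-standing dash shows where the two patterns differ.
        System.out.println(words(ANSWER, "a - b"));  // [a, -, b]
        System.out.println(words(STRICT, "a - b"));  // [a, b]
    }
}
```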

(And by the way, \b is a "word anchor" in some dialects only; other implementations instead define \< and \> as the beginning-of-word and end-of-word anchors respectively.)
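Java has neither \< nor \>, but separate start-of-word and end-of-word anchors can be approximated with lookarounds. This is a different technique from the one in the answer. The sketch below assumes that a "word" means \w plus the hyphen; the constants START and END and the class name are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of custom word anchors built from lookarounds.
class HyphenBoundary {
    // Assumption: word characters are \w plus the hyphen.
    static final String WORD = "[\\w-]";
    // Start of word: no word character behind, one ahead.
    static final String START = "(?<!" + WORD + ")(?=" + WORD + ")";
    // End of word: a word character behind, none ahead.
    static final String END = "(?<=" + WORD + ")(?!" + WORD + ")";

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is counter-example.";
        // With the custom end anchor, the lazy match runs past the hyphen.
        System.out.println(firstMatch("count.*?" + END, sentence)); // counter-example
        // "example" no longer starts a word, because a hyphen precedes it.
        System.out.println(firstMatch(START + "example", sentence)); // null
    }
}
```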

[edit for a gross error]

OTHER TIPS

Use this: [\w-]*

For example, suppose you want to match something that starts with co and ends with e:

String:

This is counter-example.

Regex:

co[\w-]*e

Match:

counter-example
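The same example as a minimal Java sketch (the class and method names are invented for illustration). Inside the character class the trailing hyphen is a literal dash, so [\w-] means "a word character or a dash".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the co[\w-]*e tip.
class CoToE {
    static final Pattern CO_TO_E = Pattern.compile("co[\\w-]*e");

    // Returns the first run that starts with "co" and ends with "e", or null.
    static String firstMatch(String input) {
        Matcher m = CO_TO_E.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("This is counter-example.")); // counter-example
    }
}
```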

Licensed under: CC-BY-SA with attribution
