Should I be able to quote a leading or trailing dollar sign ($) inside a word boundary in Java Regular Expression?

StackOverflow https://stackoverflow.com/questions/3320077

Question

I'm having trouble getting regular expressions with leading / trailing $'s to match in Java (1.6.20).

From this code:

System.out.println( "$40".matches("\\b\\Q$40\\E\\b") );
System.out.println( "$40".matches(".*\\Q$40\\E.*") );
System.out.println( "$40".matches("\\Q$40\\E") );
System.out.println( " ------ " );
System.out.println( "40$".matches("\\b\\Q40$\\E\\b") );
System.out.println( "40$".matches(".*\\Q40$\\E.*") );
System.out.println( "40$".matches("\\Q40$\\E") );
System.out.println( " ------ " );
System.out.println( "4$0".matches("\\b\\Q4$0\\E\\b") );
System.out.println( "40".matches("\\b\\Q40\\E\\b") );

I get these results:

false
true
true
 ------ 
false
true
true
 ------ 
true
true

The leading false in each of the first two blocks seems to be the problem. That is, a leading or trailing $ (dollar sign) is not picked up properly next to the \b (word boundary) marker.

The true results in those blocks show that the quoted dollar sign itself is not at fault, since replacing the \b with .* or removing it altogether gives the desired result.

The last two "true" results show that the issue is neither with a quoted $ in the middle of the term nor with word boundaries (\b) around a quoted expression "\Q ... \E".

Is this a Java bug or am I missing something?


Solution

This is because \b matches word boundaries, and the position immediately before or after a $ character does not necessarily count as a word boundary.

A word boundary is the position between a \w character and a \W character (or the start or end of the input), and $ is not part of \w. In the string "bla$", for example, the word boundaries are:

" b l a $ "
 ^----------- here

" b l a $ "
       ^----- here

" b l a $ "
         ^--- but not here
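
To see the same thing in code, here is a minimal sketch that lists every index at which \b matches. The class name WordBoundaryDemo and the helper boundaries are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {

    // Hypothetical helper: returns every index in the input where \b matches.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("bla$")); // [0, 3]: nothing after the $
        System.out.println(boundaries("$40"));  // [1, 3]: nothing before the $
        System.out.println(boundaries("40$"));  // [0, 2]: nothing after the $
        System.out.println(boundaries("4$0"));  // [0, 1, 2, 3]
    }
}
```

Since "$40" has no boundary at index 0 and "40$" has none at its end, a pattern that demands \b on both sides of the quoted term cannot match them, while "4$0" starts and ends with a word character and so does match.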

OTHER TIPS

Tomalak nailed it: it's about word boundary matching. I had figured it out and deleted the question, but Will's advice to keep it open for others is sound.

The \b was, in fact, the culprit.

One conclusion could be that for anything but the most rudimentary (i.e. ASCII) uses, the built-in convenience expressions in Java are effectively useless. E.g., by default \w matches only the ASCII characters [a-zA-Z_0-9], and \b depends on a similar notion of what counts as a word character.
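
A quick way to check the \w claim is sketched below. The class and method names are made up for illustration. The second method assumes Java 7 or later, because it uses the Pattern.UNICODE_CHARACTER_CLASS flag, which is not available on the 1.6 runtime from the question.

```java
import java.util.regex.Pattern;

public class WordClassDemo {

    // Default behaviour: \w is the ASCII set [a-zA-Z_0-9].
    static boolean isWordCharDefault(String s) {
        return Pattern.compile("\\w").matcher(s).matches();
    }

    // Assumes Java 7+: the flag switches \w to its Unicode definition.
    static boolean isWordCharUnicode(String s) {
        return Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(isWordCharDefault("a"));    // true
        System.out.println(isWordCharDefault(eAcute)); // false
        System.out.println(isWordCharUnicode(eAcute)); // true
        System.out.println(eAcute.matches("\\p{L}"));  // true
    }
}
```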

FWIW, my RegExp ended up being:

   (?:^|[\p{P}\p{Z}])(\QThe $earch Term\E)(?:[\p{P}\p{Z}]|$)

where The $earch Term is the text I'm trying to match.

The \p{} constructs are Unicode categories. Basically, I'm breaking words on any character in the Punctuation (P) or Separator (Z) Unicode categories. The start and end of the input are also accepted as boundaries (with ^ and $). The boundary markers are wrapped in non-capturing groups (the (?:...) bits), while the actual search term is quoted with \Q and \E and placed in a capturing group.
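
A minimal sketch of that pattern in use follows. The class DelimitedTermSearch and its helpers forTerm and containsTerm are hypothetical names. The sketch also assumes Pattern.quote is an acceptable stand-in for writing \Q and \E by hand, since it wraps the term the same way.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimitedTermSearch {

    // Delimiters: any Unicode punctuation (P) or separator (Z) character.
    private static final String DELIM = "[\\p{P}\\p{Z}]";

    // Hypothetical helper: builds the pattern above for a literal search term.
    static Pattern forTerm(String term) {
        return Pattern.compile(
            "(?:^|" + DELIM + ")(" + Pattern.quote(term) + ")(?:" + DELIM + "|$)");
    }

    // Hypothetical helper: true if the term occurs as a delimited "word".
    static boolean containsTerm(String text, String term) {
        return forTerm(term).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTerm("$40", "$40"));           // true
        System.out.println(containsTerm("It costs $40.", "$40")); // true
        System.out.println(containsTerm("40$ each", "40$"));      // true
        System.out.println(containsTerm("US$400", "$40"));        // false

        Matcher m = forTerm("$40").matcher("It costs $40.");
        if (m.find()) {
            System.out.println(m.group(1)); // $40, without the delimiters
        }
    }
}
```

Note that tab and newline are control characters (category Cc), not separators, so this pattern does not treat them as delimiters unless they are added to the character class.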

Licensed under: CC-BY-SA with attribution