Question

I've got this regex that comes from some Groovy code:

(?:[^\p{Alnum}äöü~D~V~\~_]|^)

(?:sometext|s\.t\.)

(?:[^\p{Alnum}äöü~D~V~\~_]|$$)

The only thing I do not understand is this part:

**~D~V~\~_**

What does the tilde do in there? Is that an error? Or just some switch for the character class?

My understanding is that the first and the third lines match word boundaries, while the second matches the text in question (in long and short form).

I tried googling this (and searched here of course), but unfortunately the tilde is part of the "match this" operator in Groovy, so all I found was general information on how to regex something.


Solution

The tilde doesn't have any special meaning in Groovy or Java regular expressions. Groovy doesn't change the Java interpretation of regexes at all. All the special characters are listed on the API reference page for java.util.regex.Pattern.
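To make the distinction concrete, here is a minimal sketch (the strings and the variable name p are made up for illustration): outside a string literal, ~ is Groovy's pattern operator, while inside the pattern it is just a literal character, escaped or not.

```groovy
import java.util.regex.Pattern

// Outside a string, ~ is a Groovy operator: it compiles the string into a Pattern.
def p = ~"a~b"
assert p instanceof Pattern

// Inside the pattern, ~ is an ordinary literal character.
assert "a~b" ==~ p
assert !("ab" ==~ p)

// Escaping it is allowed (it is not alphabetic) but changes nothing.
assert "~" ==~ /\~/
assert "~" ==~ /~/
```

So the \~ in the original character class matches the same thing as a plain ~.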

If you remove the \p{Alnum} character class and the escaped tilde, you can more easily see that ~ isn't being treated specially:

assert ("D" ==~ "(?:[^äöü~D~V~_])") == false
assert ("V" ==~ "(?:[^äöü~D~V~_])") == false
assert ("~" ==~ "(?:[^äöü~D~V~_])") == false
assert (" " ==~ "(?:[^äöü~D~V~_])") == true

I'd throw away these regexes. They're clearly wrong and obfuscated with extra characters. Word boundaries can be matched with \b, and the \p{Alnum}äöü should almost certainly be \p{IsAlphabetic}\p{IsDigit} (Java spells Unicode binary properties with the Is prefix) to handle Unicode properly.
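As a rough sketch of that suggestion (not a drop-in replacement for the original code), the pattern below uses \b together with the (?U) flag, which asks java.util.regex for Unicode-aware \b and \w. The variable name and sample strings are made up. One assumption of mine beyond the answer: \b only matches next to a word character, so the s\.t\. alternative, which ends in a dot, uses a (?!\w) lookahead instead of a trailing \b.

```groovy
// (?U) enables UNICODE_CHARACTER_CLASS, so \b and \w treat letters like ä as word characters.
def pattern = /(?U)\b(?:sometext\b|s\.t\.(?!\w))/

// Matches when the text stands on its own, in long or short form.
assert ("this is sometext, really" =~ pattern)
assert ("see s.t. for details" =~ pattern)

// No match when the text is embedded in a longer word.
assert !("sometextual" =~ pattern)
assert !("s.t.x" =~ pattern)

// With (?U), a preceding umlaut counts as part of the word, so this does not match either.
assert !("\u00e4sometext" =~ pattern)
```

Unlike the original groups, \b and the lookahead are zero-width, so the match contains only the text itself and not the neighbouring characters.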

Licensed under: CC-BY-SA with attribution