Unicode Regex in Scala REPL

Question 1

The interpreter is probably using an incompatible character encoding. For comparison, here's my output:

scala> System.getProperty("file.encoding")
res0: String = UTF-8

scala> java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()
res1: Boolean = true

So the solution is to run scala with -Dfile.encoding=UTF-8. Note, however, what this (somewhat old) blog post says:

The only reliable way we've found for setting the default character encoding for Scala is to set $JAVA_OPTS before running your application:

$ JAVA_OPTS="-Dfile.encoding=utf8" scala [...] Just trying to set scala -Dfile.encoding=utf8 doesn't seem to do it. [...]
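To check whether the encoding really is the culprit, it can help to look at the code points the REPL actually received for a literal. The following is only a rough sketch, and codePointsHex is a hypothetical helper name of mine, not something from the standard library:

```scala
// Hypothetical helper (not a library function): render each code point
// of a string as U+XXXX so invisible differences become visible.
def codePointsHex(s: String): List[String] =
  s.codePoints().toArray.toList.map(cp => f"U+$cp%04X")

// Escapes are used here so the example does not depend on the
// encoding of the terminal or source file.
val precomposed = codePointsHex("\u00e4")   // single code point for "ä"
val decomposed  = codePointsHex("a\u0308")  // "a" plus combining diaeresis
```

If a typed "ä" shows up as something other than U+00E4 (or U+0061 followed by U+0308), the input was most likely decoded with the wrong charset.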

That wasn't the case here, but it can also happen that your "ä" is actually an "a" followed by a combining diaeresis (umlaut) mark, U+0308, e.g.:

scala> println("a\u0308")
ä

scala> java.util.regex.Pattern.compile("\\p{L}").matcher("a\u0308").matches()
res1: Boolean = false

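The match fails because matches() must consume the whole input, and here the input is two code points: the letter "a" and the combining mark U+0308, which is in category Mn rather than L. This can be a problem on systems that produce diacritics as Unicode combining characters (I think OS X is one, at least in some versions). For more info, see Paul's question.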
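If combining characters turn out to be the cause, two possible workarounds are to normalize the input to NFC before matching, or to let the pattern accept trailing marks. This is just a sketch of both ideas using java.text.Normalizer; the value names are mine:

```scala
import java.text.Normalizer
import java.util.regex.Pattern

val letter = Pattern.compile("\\p{L}")

// "a" followed by COMBINING DIAERESIS (two code points)
val decomposed = "a\u0308"
// NFC composes the pair into the single code point U+00E4
val composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC)

val matchesDecomposed = letter.matcher(decomposed).matches()
val matchesComposed   = letter.matcher(composed).matches()

// Alternative: allow any number of combining marks after the letter
val matchesWithMarks =
  Pattern.compile("\\p{L}\\p{M}*").matcher(decomposed).matches()
```

Normalizing is probably the simpler choice when you control the input. The \p{M}* variant avoids touching the data but has to be repeated in every pattern.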

Question 2

You can also "Enable the Unicode version of Predefined character classes and POSIX character classes", as described in the documentation for java.util.regex.Pattern and its UNICODE_CHARACTER_CLASS flag.

This means you can use character classes such as '\w' to match Unicode characters like this:

"(?U)\\w+".r.findFirstIn("pässi")

In the regexp above, the '(?U)' bit is an embedded flag expression that turns on the UNICODE_CHARACTER_CLASS flag for the regexp.

This flag is supported starting from Java 7.
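To see the effect of the flag, here is a small comparison sketch. It assumes a precomposed "ä" (written as an escape so the terminal encoding doesn't matter) and also shows passing the flag as a constant instead of embedding it; the value names are mine:

```scala
import java.util.regex.Pattern

val word = "p\u00e4ssi" // "pässi" with a precomposed ä

// Without the flag, \w only covers [a-zA-Z_0-9], so matching stops before "ä"
val asciiOnly = "\\w+".r.findFirstIn(word)

// With the embedded flag, \w follows the Unicode definition
val embedded = "(?U)\\w+".r.findFirstIn(word)

// Same flag, passed as a constant to Pattern.compile
val viaConstant = {
  val m = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS).matcher(word)
  if (m.find()) Some(m.group()) else None
}
```

Here asciiOnly should come out as Some(p), while the other two should give the whole word.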