I want to detect words made up of Unicode letters (\p{L}).

Scala's REPL returns false for the following statement, while Java returns true (which is the right behaviour):

java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()

Both Java and Scala are running in JRE 1.7:

System.getProperty("java.version") gives back "1.7.0_60-ea"

What could be the reason for that?


Solution

The interpreter is probably using an incompatible character encoding. For example, here's my output:

scala> System.getProperty("file.encoding")
res0: String = UTF-8

scala> java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()
res1: Boolean = true

So the solution is to run scala with -Dfile.encoding=UTF-8. Note, however, what this (somewhat old) blog post says:

The only reliable way we've found for setting the default character encoding for Scala is to set $JAVA_OPTS before running your application:

$ JAVA_OPTS="-Dfile.encoding=utf8" scala [...]

Just trying to set scala -Dfile.encoding=utf8 doesn't seem to do it. [...]
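To check whether the encoding setting actually took effect, it may help to look at what the JVM reports and at how a typed character was decoded. The following is only a sketch: the object name EncodingCheck and the helper charCodes are made up for illustration. If the interpreter decodes input with the wrong charset, a typed "ä" will typically show up as more than one code unit instead of the single U+00E4.

```scala
import java.nio.charset.Charset

// Hypothetical helper, not part of any library.
object EncodingCheck {
  // Hex value of every UTF-16 code unit in the string,
  // e.g. a precomposed "ä" gives a single entry, U+00E4.
  def charCodes(s: String): Seq[String] =
    s.map(c => f"U+${c.toInt}%04X")

  def main(args: Array[String]): Unit = {
    // What the JVM was told, and what it actually uses by default.
    println("file.encoding   = " + System.getProperty("file.encoding"))
    println("default charset = " + Charset.defaultCharset())

    // Pass the character you typed as an argument (or paste it here)
    // to see how it was decoded.
    args.foreach(a => println(charCodes(a).mkString(" ")))
  }
}
```

In the REPL, calling EncodingCheck.charCodes("ä") directly gives the same information for a literal you typed.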


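This wasn't the case here, but it can also happen that your "ä" is not a single precomposed character but an "a" followed by a combining diaeresis (U+0308). The string then holds two code points, and a pattern that must match exactly one letter fails, e.g.: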

scala> println("a\u0308")
ä

scala> java.util.regex.Pattern.compile("\\p{L}").matcher("a\u0308").matches()
res1: Boolean = false

This can be a problem on systems that build diacritics from Unicode combining characters (I think OS X is one, at least in some versions). For more info, see Paul's question.
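If combining characters turn out to be the cause, two possible workarounds are to normalize the input to NFC before matching, or to let the pattern accept trailing combining marks. This is a rough sketch using java.text.Normalizer and the \p{M} (mark) category; the object and method names are invented for the example.

```scala
import java.text.Normalizer
import java.util.regex.Pattern

// Illustrative names only, not from the original answer.
object CombiningMarks {
  private val singleLetter    = Pattern.compile("\\p{L}")
  // One letter followed by any number of combining marks.
  private val letterWithMarks = Pattern.compile("\\p{L}\\p{M}*")

  def isSingleLetter(s: String): Boolean =
    singleLetter.matcher(s).matches()

  def isLetterWithMarks(s: String): Boolean =
    letterWithMarks.matcher(s).matches()

  // Compose "a" + U+0308 into the precomposed U+00E4 where possible.
  def toNfc(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFC)

  def main(args: Array[String]): Unit = {
    val decomposed = "a\u0308"
    println(isSingleLetter(decomposed))         // false
    println(isSingleLetter(toNfc(decomposed)))  // true
    println(isLetterWithMarks(decomposed))      // true
  }
}
```

Normalizing is the simpler choice when a precomposed form exists; the \p{M}* variant also covers letter and mark combinations that have no precomposed equivalent.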

Other tips

You can also "Enable the Unicode version of Predefined character classes and POSIX character classes", as described in the documentation for java.util.regex.Pattern and UNICODE_CHARACTER_CLASS.

This means you can use character classes such as '\w' to match Unicode characters like this:

"(?U)\\w+".r.findFirstIn("pässi")

In the regexp above, the '(?U)' bit is an embedded flag expression that turns on the UNICODE_CHARACTER_CLASS flag for the regexp.

This flag is supported starting from Java 7.
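To make the effect of the flag visible, here is a small comparison of '\w+' with and without it, plus the equivalent form that passes Pattern.UNICODE_CHARACTER_CLASS to Pattern.compile instead of embedding '(?U)'. It is only a sketch; the object and method names are hypothetical, and the test word is written with a \u escape so that the example does not itself depend on the source encoding.

```scala
import java.util.regex.Pattern

// Illustrative names only.
object UnicodeWordClass {
  // Default \w is ASCII-only: [a-zA-Z_0-9].
  def firstWordAscii(s: String): Option[String] =
    "\\w+".r.findFirstIn(s)

  // (?U) switches \w to its Unicode-aware definition.
  def firstWordUnicode(s: String): Option[String] =
    "(?U)\\w+".r.findFirstIn(s)

  // Same flag, passed as a compile option rather than embedded.
  private val flagged =
    Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)

  def isWord(s: String): Boolean =
    flagged.matcher(s).matches()

  def main(args: Array[String]): Unit = {
    val word = "p\u00e4ssi" // "pässi" with a precomposed ä
    println(firstWordAscii(word))   // Some(p)
    println(firstWordUnicode(word)) // Some(pässi)
    println(isWord(word))           // true
  }
}
```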

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow