How to decode Java strings with Unicode escapes etc. from Scala JavaTokenParsers into unescaped strings?

https://stackoverflow.com/questions/11286191

18-06-2021

Question

JavaTokenParsers in Scala provides convenient regexps for matching integer and floating-point numbers, and double-quoted strings. But that's ALL it does. How do I do the obvious thing of converting the matched strings back into the values they represent? This is pretty easy to do for numbers, using toDouble or toInt, etc. But how do you do the equivalent for strings? E.g., if I type the string

"Unicode \u20ac is a Euro sign, which I would write \\u20ac in a string. \243 is a pound sign.\n\r And \f is a \"form feed\", with embedded quotes.\n\r"

and then run it through JavaTokenParsers, I duly get a string back that correctly parses the embedded quotes, but it still has a double-quote character as its first and last characters, and lots of backslash sequences. How do I get the equivalent Java string with the escape sequences processed? I can't believe there's no library function to do this, but I can't find one.

Solution

It seems that there is no such function in the Scala standard library; at least, none is used in the Scala compiler. That's not a conclusive answer, though: a library function may have been introduced afterwards.

In case you want to read (or copy-n-paste) this code, here's the related code I found. The tokenization logic of the Scala compiler is distributed among different files. The top level method seems to be fetchToken in src/compiler/scala/tools/nsc/ast/parser/Scanners.scala, which in turn delegates to logic in src/compiler/scala/tools/nsc/util/CharArrayReader.scala (one of its ancestors), in particular nextChar and potentialUnicode. Other escapes are handled in getLitChar, again in Scanners.scala.
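If copying compiler internals is not appealing, a small hand-rolled decoder is another option. The following is only a minimal sketch, not the compiler's code: the names JavaUnescape, unquote and unescape are made up for illustration, and it assumes the input uses only the standard Java escapes (\b \t \n \f \r \" \' \\, octal escapes, and \uXXXX with one or more u's).

```scala
object JavaUnescape {
  /** Strips the surrounding double quotes, then decodes the escapes inside. */
  def unquote(literal: String): String = {
    require(literal.length >= 2 && literal.head == '"' && literal.last == '"',
      "expected a double-quoted literal")
    unescape(literal.substring(1, literal.length - 1))
  }

  /** Decodes Java escape sequences in a single left-to-right pass. */
  def unescape(text: String): String = {
    val out = new StringBuilder(text.length)
    var i = 0
    while (i < text.length) {
      val c = text.charAt(i)
      if (c != '\\' || i + 1 >= text.length) {
        // Ordinary character (a trailing lone backslash is kept as-is).
        out.append(c); i += 1
      } else {
        text.charAt(i + 1) match {
          case 'b'  => out.append('\b'); i += 2
          case 't'  => out.append('\t'); i += 2
          case 'n'  => out.append('\n'); i += 2
          case 'f'  => out.append('\f'); i += 2
          case 'r'  => out.append('\r'); i += 2
          case '"'  => out.append('"'); i += 2
          case '\'' => out.append('\''); i += 2
          case '\\' => out.append('\\'); i += 2
          case 'u' =>
            // Java permits several u's, e.g. \uu20ac; skip them all.
            var j = i + 1
            while (j < text.length && text.charAt(j) == 'u') j += 1
            val hex = if (j + 4 <= text.length) text.substring(j, j + 4) else ""
            if (hex.length != 4 || !hex.forall(ch => Character.digit(ch, 16) >= 0))
              throw new IllegalArgumentException(s"bad Unicode escape at index $i")
            out.append(Integer.parseInt(hex, 16).toChar)
            i = j + 4
          case d if d >= '0' && d <= '7' =>
            // Octal: up to three digits if the first is 0-3, otherwise up to two.
            val maxLen = if (d <= '3') 3 else 2
            var j = i + 1
            while (j < text.length && j < i + 1 + maxLen &&
                   text.charAt(j) >= '0' && text.charAt(j) <= '7') j += 1
            out.append(Integer.parseInt(text.substring(i + 1, j), 8).toChar)
            i = j
          case other =>
            throw new IllegalArgumentException(s"invalid escape \\$other at index $i")
        }
      }
    }
    out.toString
  }
}
```

With this sketch, the text matched by stringLiteral could be passed to JavaUnescape.unquote, for example via `stringLiteral ^^ JavaUnescape.unquote` in a parser. Because the pass is left to right, an escaped backslash is consumed before a following u is seen, so \\u20ac decodes to a backslash followed by the text u20ac rather than to a Euro sign.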

OTHER TIPS

OK, I looked around a bit. Another question on stackoverflow seems to address a related issue:

How to unescape a Java string literal in Java?

There's some source code there to do this from Tom Christiansen.

I also found that Apache Commons has a function to do this:

org.apache.commons.lang3.StringEscapeUtils.unescapeJava()

You need to use the Commons Lang 3 version if you want octal escapes handled. The version above by Christiansen has more functionality, in that it also handles common escape sequences seen elsewhere (e.g. Java regexps, Perl and Python escapes, C escapes) that aren't part of Java string literals:

\a for bell (\007), \e for ESC
\UXXXXXXXX for UCS-4 full Unicode codepoints (including those not in the BMP)
\xXX for hexadecimal escapes
\cX for control escapes, e.g. \cH = ^H = \b = \010
\0 for NULL (\000)
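To illustrate how the extra escapes in the list above could be decoded, here is a rough sketch. It is not Christiansen's code; the object name ExtendedEscapes and the method decodeAt are hypothetical, and the control-escape rule (upper-case the character and XOR with 0x40) is an assumption based on the \cH example. \0 needs no special case, since it is already a valid octal escape.

```scala
object ExtendedEscapes {
  /** Tries to decode a non-Java escape whose backslash is at index i.
    * Returns the decoded text and the index just past the escape,
    * or None if the escape is not one of the extended forms. */
  def decodeAt(text: String, i: Int): Option[(String, Int)] = {
    // Parses exactly `len` hex digits starting at `from`, if present.
    def hex(from: Int, len: Int): Option[Int] = {
      val ok = from + len <= text.length &&
        text.substring(from, from + len).forall(ch => Character.digit(ch, 16) >= 0)
      // parseUnsignedInt avoids an overflow exception on 8-digit values.
      if (ok) Some(Integer.parseUnsignedInt(text.substring(from, from + len), 16))
      else None
    }

    if (i + 1 >= text.length || text.charAt(i) != '\\') None
    else text.charAt(i + 1) match {
      case 'a' => Some(("\u0007", i + 2))                 // bell
      case 'e' => Some(("\u001b", i + 2))                 // ESC
      case 'x' => hex(i + 2, 2).map(v => (v.toChar.toString, i + 4))
      case 'U' =>
        // Full code point; may expand to a surrogate pair outside the BMP.
        hex(i + 2, 8)
          .filter(cp => Character.isValidCodePoint(cp))
          .map(cp => (new String(Character.toChars(cp)), i + 10))
      case 'c' if i + 2 < text.length =>
        // Control escape: \cH becomes 0x48 ^ 0x40 = 0x08.
        Some(((text.charAt(i + 2).toUpper ^ 0x40).toChar.toString, i + 3))
      case _ => None
    }
  }
}
```

A decoder for plain Java escapes could call decodeAt first at each backslash and fall back to the standard escapes when it returns None.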

Licensed under: CC-BY-SA with attribution
