List of Unicode characters that should be filtered in output?

https://stackoverflow.com/questions/10556875

07-06-2021

Question

Recently I hit a data-quality bug related to browser support, and I am looking for a safe rule for escaping strings that does not double their size unless required.

The UTF-8 byte sequence "E2-80-A8" (U+2028, LINE SEPARATOR) is a perfectly valid character in the Unicode database. However, that sequence represents a line separator (yes, one other than "0A").

Worse, many browsers (including Chrome, Firefox, and Safari; I didn't test others) failed to process a JSONP callback whose string argument contained that Unicode character. The JSONP was included by a non-Unicode HTML page over which I had no control.

The browsers simply reported an invalid code/syntax error for JavaScript that looked valid in the debug tools and in every text editor. My guess is that the browser tried to convert "E2-80-A8" to BIG-5 and broke the JS syntax.

The above is only one example of how Unicode can break your system unexpectedly. As far as I know, attackers can exploit RTL and other control characters for their own purposes. And there are many "quotes", "spaces", "symbols" and "controls" in the Unicode specification.

QUESTION:

Is there a list of Unicode characters that every programmer should know about, covering hidden features (and bugs) that we might not want to take effect in our applications? (For example, Windows disables RTL in filenames.)

EDIT:

I am not asking about JSON or JavaScript. I am asking for general best practice for Unicode handling in all programs.

Solution

There's a database of character properties and a report describing it, the UNICODE CHARACTER DATABASE, that gives a good idea of how browsers "should" treat a code point. I love that word, "should". The safest approach is going to be a whitelist; you could probably go with L|M|N|S, that is, the general categories Letter, Mark, Number and Symbol.
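
A minimal sketch of that whitelist in JavaScript, assuming an engine that supports Unicode property escapes in regular expressions (ES2018 or later). The name keepWhitelisted is made up for illustration. Note that L|M|N|S excludes separators and punctuation, so spaces, colons and quotes are dropped as well; a practical whitelist would probably need a few more categories.

```javascript
// Hypothetical helper: keep only code points whose general category is
// Letter (L), Mark (M), Number (N) or Symbol (S); drop everything else.
const NOT_WHITELISTED = /[^\p{L}\p{M}\p{N}\p{S}]/gu;

function keepWhitelisted(text) {
  return text.replace(NOT_WHITELISTED, "");
}

// The colon, the space and U+2028 are removed; the digit and the euro sign stay.
console.log(keepWhitelisted("price: 5\u20AC\u2028ok")); // "price5€ok"
```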

Have a look at the ICU project for a library that exposes these character properties.

OTHER TIPS

It breaks JavaScript because string literals cannot contain raw newlines:

var myString = "

";

//SyntaxError: Unexpected token ILLEGAL

Now, the UTF-8 sequence "E2-80-A8" decodes to Unicode code point U+2028, which JavaScript treats as a line terminator, much like a newline (engines older than ES2019 reject it unescaped inside a string literal):

var myString = " "; // the character between the quotes is a literal U+2028

//Syntax Error

It is, however, safe to write:

var myString = "\u2028";
//you can now log myString in console and get real representation of this character

which is what JSON from an encoder that escapes these characters will contain. I'd look into encoding the JSON properly instead of keeping a blacklist of unsafe characters (which are U+2028 and U+2029, as far as I know).
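
One caveat worth checking on the JavaScript side: JSON.stringify leaves U+2028 and U+2029 unescaped, because JSON itself permits them raw inside strings. A minimal sketch of escaping them after serialization, so the output can be placed in a JSONP response; the helper name toScriptSafeJson is hypothetical, and the sketch assumes the value is JSON-serializable.

```javascript
// Hypothetical helper: serialize a value to JSON, then escape the two
// line terminators that JSON allows raw but older JS parsers reject
// inside string literals.
function toScriptSafeJson(value) {
  return JSON.stringify(value)
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

const payload = toScriptSafeJson({ text: "line one\u2028line two" });
console.log(payload); // {"text":"line one\u2028line two"} with the escape spelled out
```

The escaped form still parses back to the same string, so consumers of the JSON see no difference.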

In PHP:

echo json_encode(chr(0xE2) . chr(0x80) . chr(0xA8));
//"\u2028"

Look at the Unicode charts. There's a list of non-printing characters. These are the ones that'd be potential troublemakers. Your friend U+2028 has a bunch of friends: http://www.unicode.org/charts/PDF/U2000.pdf And they are not confined to the U+2000 block.

You could either nuke them all, or sort them into categories and handle each one differently (for example, the separator characters like U+2028 become \n or get escaped properly).
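
A rough sketch of the second option in JavaScript, assuming Unicode property escapes are available. The function name normalizeInvisible and the choice of which characters to keep are illustrative assumptions, not a recommendation for every application. In particular, stripping every format character (category Cf) also removes U+200D ZERO WIDTH JOINER, which some scripts and emoji sequences rely on.

```javascript
// Hypothetical helper: treat each class of invisible character differently.
function normalizeInvisible(text) {
  return text
    // Line and paragraph separators (U+2028, U+2029) become a plain newline.
    .replace(/[\p{Zl}\p{Zp}]/gu, "\n")
    // Control (Cc) and format (Cf) characters are dropped,
    // except newline and tab, which are kept as they are.
    .replace(/[\p{Cc}\p{Cf}]/gu, (ch) => (ch === "\n" || ch === "\t" ? ch : ""));
}

console.log(JSON.stringify(normalizeInvisible("a\u2028b\u202Ec"))); // "a\nbc"
```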

HTH

A-Z, a-z and 0-9 are generally safe. Outside those 62 characters, you will run into problems with some system. There's no other answer anyone can give you.

For example, you mention domain names. The only way to handle Unicode domain names is to follow RFC 3454 and RFCs 5890-5893, and process the data that way and only that way. Filenames on most Unix filesystems are arbitrary strings of bytes that don't include / or \0. Treating a Unix filename as a Unicode string without breaking anything is a question in itself. Note that Windows filenames are not even A-Z safe; names like NUL and PRN are reserved. Each domain is going to have its own little issues and quirks, and no simple summary is going to suffice everywhere.
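
To illustrate the Windows point, here is a small sketch that flags the classic reserved device names. The function name isReservedWindowsName is hypothetical, and the list (CON, PRN, AUX, NUL, COM1-COM9, LPT1-LPT9) is the commonly documented set rather than a guaranteed complete one; other Windows rules, such as trailing dots and spaces, are not handled here.

```javascript
// Hypothetical helper: true if the base name (ignoring any extension and
// letter case) is one of the commonly documented reserved device names.
const RESERVED_NAME = /^(con|prn|aux|nul|com[1-9]|lpt[1-9])(\..*)?$/i;

function isReservedWindowsName(fileName) {
  return RESERVED_NAME.test(fileName);
}

console.log(isReservedWindowsName("NUL"));        // true
console.log(isReservedWindowsName("prn.txt"));    // true
console.log(isReservedWindowsName("readme.txt")); // false
```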

Licensed under: CC-BY-SA with attribution
