Java: Detecting control characters that are invalid in JSON

https://stackoverflow.com/questions/6051509

15-11-2019

Question

I am reinventing the wheel and writing my own JSON parsing methods in Java.

I am working through the (very nice!) documentation on json.org. The only part I am unsure about is where it says "or control character".

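Since the documentation is so clear and JSON is so simple to implement, I figured I would go ahead and follow the specification strictly rather than loosely.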

How do I correctly strip control characters in Java? Perhaps there is a Unicode range?

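Edit: a (commonly?) missing piece of the puzzle
I have been informed that there are characters outside the defined range ¹ ² that can be troublesome inside <script> tags.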

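Most notably, the characters U+2028 and U+2029, the line separator and paragraph separator, act as newlines. Injecting a newline into the middle of a string literal will most likely cause a syntax error (unterminated string literal). ³ Although I believe this does not pose an XSS threat, it is still a good idea to add extra rules for <script> tags.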

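Keep it simple and encode every character that is not "ASCII printable" using \u notation. Those characters are rare to begin with. You can add characters to the whitelist if you like, but I do recommend a whitelist approach. In case you are not aware, do not forget about </script (case-insensitive), which can cause HTML script injection. None of these characters is escaped in JSON by default.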
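The whitelist approach described above can be sketched roughly as follows. This is only an illustration, not a complete JSON writer: the class name ScriptSafeJsonEscaper and the method escape are hypothetical, and escaping "/" only when it directly follows "<" is one assumed way of keeping a literal </script out of the output. Everything outside printable ASCII (U+0020 through U+007E) is written in \u notation, which also covers U+2028 and U+2029.

```java
final class ScriptSafeJsonEscaper {
    private ScriptSafeJsonEscaper() {}

    /** Returns the body of a JSON string literal, without the surrounding quotes. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        // Iterate over UTF-16 code units; a supplementary character is
        // emitted as two escaped surrogates, which JSON permits.
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':
                    out.append("\\\"");
                    break;
                case '\\':
                    out.append("\\\\");
                    break;
                case '/':
                    // Escape the slash only after '<' so that "</script"
                    // never appears literally, whatever its letter case.
                    if (i > 0 && s.charAt(i - 1) == '<') {
                        out.append("\\/");
                    } else {
                        out.append('/');
                    }
                    break;
                default:
                    if (c >= 0x20 && c <= 0x7E) {
                        out.append(c); // whitelisted: printable ASCII
                    } else {
                        // Everything else becomes a four-digit hex escape.
                        out.append(String.format("\\u%04x", (int) c));
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println('"' + escape("line\nbreak </SCRIPT>") + '"');
    }
}
```

A caller would wrap the result in double quotes to form the complete string literal, as main does here.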

Solution

Will Character.isISOControl(...) do? Incidentally, UTF-16 is an encoding of Unicode codepoints... Are you going to be operating at the byte level, or at the character/codepoint level? I recommend leaving the mapping from UTF-16 to character streams to Java's core APIs...

Other tips

Even if it's not very specific, I would assume that they refer to the "control" character category from the Unicode specification.

In Java, you can check if a character c is a Unicode control character with the following expression: Character.getType(c) == Character.CONTROL.

I believe the Unicode definition of a control character is:

The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F.

That's their definition of a control code, but the above is followed by the sentence "Also known as control characters.", so...
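One way to check the 65-character figure against Java's own tables is to count the code points whose general category is Character.CONTROL. The class and method names below are made up for this sketch.

```java
final class ControlCategoryCount {
    private ControlCategoryCount() {}

    /** Counts every code point that Java classifies as a control character (category Cc). */
    static int countControlCodePoints() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.getType(cp) == Character.CONTROL) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 32 code points in U+0000..U+001F plus 33 in U+007F..U+009F
        System.out.println(countControlCodePoints()); // prints 65
    }
}
```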

I know the question was asked a couple of years ago, but I am replying anyway because the accepted answer is not correct.

Character.isISOControl(int codePoint)

does the following check:

(codePoint >= 0x00 && codePoint <= 0x1F) || (codePoint >= 0x7F && codePoint <= 0x9F);

The JSON specification (RFC 7159, https://tools.ietf.org/html/rfc7159) defines strings as follows:

Strings

The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Character.isISOControl(int codePoint)

will flag all characters that need to be escaped (U+0000-U+001F), though it will also flag characters that do not need to be escaped (U+007F-U+009F). It is not required to escape the characters (U+007F-U+009F).
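The difference can be made concrete with a small comparison. The helper mustEscape below is a hypothetical name; it is a sketch of the rule quoted from RFC 7159 above (quotation mark, reverse solidus, and U+0000 through U+001F), not an API from any library.

```java
final class JsonControlCheck {
    private JsonControlCheck() {}

    /** True for code points that the quoted RFC 7159 rule requires to be escaped inside a string. */
    static boolean mustEscape(int codePoint) {
        return (codePoint >= 0x00 && codePoint <= 0x1F)
                || codePoint == '"'
                || codePoint == '\\';
    }

    public static void main(String[] args) {
        int[] samples = {0x00, 0x1F, 0x20, 0x7F, 0x9F, '"', '\\'};
        for (int cp : samples) {
            // The two columns disagree for U+007F..U+009F, the quotation mark and the backslash.
            System.out.printf("U+%04X mustEscape=%b isISOControl=%b%n",
                    cp, mustEscape(cp), Character.isISOControl(cp));
        }
    }
}
```

Note that Character.isISOControl alone also misses the quotation mark and the reverse solidus, so a writer still has to handle those two separately.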

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow