Java: Detect control characters which are not correct for JSON

https://stackoverflow.com/questions/6051509

15-11-2019

Question

I am reinventing the wheel and writing my own JSON parsing methods in Java.

I am going by the (very nice) documentation on json.org. The only part I am unsure about is where it says "or control character".

Since the documentation is so clear, and JSON is so simple and easy to implement, I thought I would go ahead and enforce the specification instead of being loose.

How would I correctly strip control characters in Java? Perhaps there is a Unicode range?

Edit: a (commonly?) missing piece of the puzzle
I have been informed that there are other control characters outside the defined range ¹ ² that can be troublesome in <script> tags.
Most notably the characters U+2028 and U+2029, Line Separator and Paragraph Separator, which act as newlines. Injecting a newline into the middle of a string literal will most likely cause a syntax error (unterminated string literal). ³
Although I believe this does not pose an XSS threat, it is still a good idea to add extra rules for use in <script> tags.

Keep it simple and encode all characters that are not "printable ASCII" with \u notation. Those characters are uncommon to begin with. If you like, you can add to the whitelist, but I do recommend a whitelist approach.
In case you are not aware, do not forget about the sequence </script (not case-sensitive), which could cause HTML script injection into your page. None of those characters is escaped by default in JSON.
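A minimal sketch of the whitelist idea described above, assuming the output is the body of a JSON string literal that will be embedded in a <script> tag. The class and method names are hypothetical, and escaping '<' is one possible way to neutralise </script, not the only one.

```java
final class ScriptSafeJsonString {

    // Hypothetical helper, not part of any library. Returns the body of a JSON
    // string literal (without the surrounding quotes) made of printable ASCII only.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i); // UTF-16 code unit; surrogate halves are escaped one by one
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);   // escapes that JSON always requires
            } else if (c >= 0x20 && c <= 0x7E && c != '<') {
                out.append(c);                // whitelist: printable ASCII except '<'
            } else {
                // Everything else (control characters, U+2028, U+2029, '<', non-ASCII)
                // is written as a backslash, 'u' and four hex digits.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a\u2028b</script>"));
    }
}
```

Because the whitelist is defined by what is allowed rather than what is forbidden, characters nobody thought about in advance are escaped by default.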

Solution

Will Character.isISOControl(...) do? Incidentally, UTF-16 is an encoding of Unicode codepoints... Are you going to be operating at the byte level, or at the character/codepoint level? I recommend leaving the mapping from UTF-16 to character streams to Java's core APIs...

Other tips

Even if it's not very specific, I would assume that they refer to the "control" character category from the Unicode specification.

In Java, you can check if a character c is a Unicode control character with the following expression: Character.getType(c) == Character.CONTROL.

I believe the Unicode definition of a control character is:

The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F.

That's their definition of a control code, but the above is followed by the sentence "Also known as control characters.", so...
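As a rough check of the definition quoted above, the following sketch counts every code point whose general category is reported as CONTROL by Character.getType. The class and method names are made up for illustration.

```java
final class UnicodeControlCount {

    // True when the code point belongs to the Unicode general category Cc (control).
    static boolean isUnicodeControl(int codePoint) {
        return Character.getType(codePoint) == Character.CONTROL;
    }

    // Walks the whole Unicode range and counts the Cc code points.
    static int countControls() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isUnicodeControl(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countControls()); // prints 65
    }
}
```

Note that U+2028 and U+2029 are not in this category (Java reports them as LINE_SEPARATOR and PARAGRAPH_SEPARATOR), so this check alone does not catch them.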

I know the question was asked a couple of years ago, but I am replying anyway, because the accepted answer is not correct.

Character.isISOControl(int codePoint)

does the following check:

(codePoint >= 0x00 && codePoint <= 0x1F) || (codePoint >= 0x7F && codePoint <= 0x9F);

The JSON specification defines at https://tools.ietf.org/html/rfc7159:

Strings

The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Character.isISOControl(int codePoint)

will flag all characters that need to be escaped (U+0000-U+001F), but it will also flag the characters U+007F-U+009F, which JSON does not require to be escaped.
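To make the difference concrete, here is a small sketch that compares Character.isISOControl with a predicate written directly from the RFC 7159 wording quoted above. The name mustEscapeInJson is hypothetical, and the method assumes it is given a valid (non-negative) code point.

```java
final class JsonControlCheck {

    // Per the quoted RFC text: the control characters U+0000 through U+001F,
    // the quotation mark and the reverse solidus must be escaped in a JSON string.
    static boolean mustEscapeInJson(int codePoint) {
        return (codePoint >= 0x00 && codePoint <= 0x1F)
                || codePoint == '"'
                || codePoint == '\\';
    }

    public static void main(String[] args) {
        int[] samples = {0x0A, 0x1F, 0x20, 0x7F, 0x9F};
        for (int cp : samples) {
            // For 0x7F and 0x9F the two checks disagree.
            System.out.printf("U+%04X isISOControl=%b mustEscapeInJson=%b%n",
                    cp, Character.isISOControl(cp), mustEscapeInJson(cp));
        }
    }
}
```

A strict parser could reject any raw character inside a string for which mustEscapeInJson returns true, while still accepting U+007F-U+009F.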

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow