Is there a good single-byte delimiter for use with UTF-8 strings that isn't a null terminator?

https://softwareengineering.stackexchange.com/questions/342688

07-01-2021

Question

I'm looking for a quick way to split strings containing individual JSON payloads. Currently, I'm using newlines and searching for the newline ASCII character, but I figure if I start using utf-8 this could easily break.

Is there a single-byte character, other than a null terminator, that I can use to split strings and that won't be thrown off by UTF-8 or appear in the JSON payload?

Solution

UTF-8 was specifically designed to be forwards- and backwards-compatible with ASCII. In particular, it has these two properties:

the encoding of characters within the ASCII character set is the same in UTF-8 as it is in ASCII
all other code points are encoded as a sequence of 2-4 octets (up to 6 in the original design, before RFC 3629 restricted it), all of which have their high-order bit (8th bit) set; since ASCII only uses 7 bits and always has the 8th bit unset, a single-octet ASCII character can never be mistaken for part of a multi-octet sequence, and vice versa

So, assuming that newlines work reliably for you using ASCII, they will also work reliably using UTF-8.
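A minimal sketch of that guarantee, using only Python's standard library. The sample payloads are made up for illustration; the point is only that splitting the raw bytes on 0x0A never cuts a multi-byte character in half.

```python
import json

# Hypothetical sample payloads containing non-ASCII text.
payloads = [{"name": "Jörg"}, {"city": "東京"}, {"emoji": "😀"}]

# Serialize without escaping non-ASCII characters, so the stream really
# contains multi-byte UTF-8 sequences, then join with a single LF byte.
stream = b"\n".join(
    json.dumps(p, ensure_ascii=False).encode("utf-8") for p in payloads
)

# Split on the raw byte first, decode each chunk afterwards.
decoded = [json.loads(chunk.decode("utf-8")) for chunk in stream.split(b"\n")]
```

Splitting before decoding is a deliberate choice here: it works on the byte stream directly and relies only on the high-bit property described above.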

You will have to deal with different newline conventions of different operating systems, either by accepting all of \r\n (DOS, Windows), \r (Classic MacOS), and \n (Unix), or by specifying one and only one (the Internet Standards all use \r\n, because they are treated as a newline by all OSs, with maybe some additional garbage attached). And this is not even taking into account the various non-ASCII newline characters defined in Unicode.
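As a rough illustration of the "accept all of them" option, the sketch below splits a byte stream on any of the three ASCII newline conventions. The helper name `split_lines` and the sample data are assumptions made for this example, and empty chunks are dropped on the assumption that an empty payload is never meaningful.

```python
import re

# Try CRLF first so that "\r\n" counts as one separator, not two.
NEWLINE = re.compile(rb"\r\n|\r|\n")

def split_lines(data: bytes) -> list[bytes]:
    """Split on \\r\\n, \\r, or \\n and drop empty chunks."""
    return [chunk for chunk in NEWLINE.split(data) if chunk]

# Hypothetical stream mixing all three conventions.
mixed = b'{"a": 1}\r\n{"b": 2}\r{"c": 3}\n'
lines = split_lines(mixed)
```

For byte strings, Python's built-in `bytes.splitlines()` recognizes the same three conventions, so it may be enough on its own if empty lines are not a concern.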

However, there is a problem: newlines are valid characters in JSON. They can appear between any two tokens and are ignored as whitespace.

AFAICS, it is not that easy to find a character that is guaranteed not to appear in JSON. The json.org description is a bit vague: it says that "whitespace" is allowed, but it does not spell out what "whitespace" actually means. RFC 8259 is more precise and limits insignificant whitespace to space, horizontal tab, line feed, and carriage return.
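To make the problem concrete, here is a small sketch using Python's `json` module. The payload is hypothetical. It shows that the same document may or may not contain raw newlines depending on how it was serialized, so a newline delimiter is only safe if you control the producer and know it emits compact output; that is an assumption about your setup, not something JSON itself promises.

```python
import json

# Hypothetical payload whose string value contains a line feed.
doc = {"msg": "line one\nline two", "n": 1}

pretty = json.dumps(doc, indent=2)  # newlines inserted between tokens
compact = json.dumps(doc)           # one line; the LF inside the string is escaped

pretty_has_raw_newline = "\n" in pretty
compact_has_raw_newline = "\n" in compact
```

With the pretty-printed form, a naive split on newlines would cut the document into fragments that are not valid JSON on their own.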

One way to get around this is to enclose the JSON documents in an outer JSON array, so that the individual JSON objects simply become its elements.
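A minimal sketch of the array approach in Python, with made-up payloads. No delimiter is needed at all, because the JSON parser itself finds the element boundaries.

```python
import json

# Hypothetical payloads; one of them contains a newline inside a string.
payloads = [{"id": 1}, {"id": 2, "note": "multi\nline"}]

# Producer: wrap everything in one outer array and encode as UTF-8.
wire = json.dumps(payloads).encode("utf-8")

# Consumer: a single parse yields the individual documents again.
received = json.loads(wire.decode("utf-8"))
```

The trade-off in this sketch is that `json.loads` needs the complete array before it can return anything, so it suits batches better than an open-ended stream.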

Another way would be to switch to a different language: as of version 1.2, YAML is a proper superset of JSON, meaning that every valid JSON document is also a valid YAML document. One of the features YAML has that JSON doesn't is a document end marker (a line consisting of three dots, "...") that allows you to put multiple documents into the same bytestream. So, if you just insert a YAML document end marker in between your JSON documents, you have a valid stream consisting of multiple YAML documents.

Other tips

If it doesn't appear in your payload, any single-byte ASCII character is a valid separator: the byte values 0-127 are used only for the ASCII code points themselves, so no byte of a multi-byte sequence can ever match them.

See Wikipedia on UTF-8.

Single-byte (ASCII) code points will always be encoded as 0xxxxxxx bits, whereas all bytes of multi-byte sequences will be encoded as 1xxxxxxx bits.

So, your line break byte (0x0A / dec 10 / bin 00001010) can only appear if you actually put a line feed there.
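The bit patterns are easy to inspect directly. The helper below (`utf8_bits` is just a name chosen for this sketch) prints each byte of a character's UTF-8 encoding in binary.

```python
def utf8_bits(ch: str) -> list[str]:
    """Return the UTF-8 bytes of a string as 8-digit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

line_feed = utf8_bits("\n")   # ['00001010'], a single byte with the high bit clear
e_acute = utf8_bits("é")      # ['11000011', '10101001'], both bytes have the high bit set

# Every byte of every non-ASCII character starts with a 1 bit.
high_bit_always_set = all(
    bits.startswith("1") for ch in "é€😀" for bits in utf8_bits(ch)
)
```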

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange