Should a BOM (byte order mark) be added for empty strings (UTF-16 and UTF-32)?

https://stackoverflow.com/questions/23378604

12-07-2023

Question

Excluding UTF-8, is there a general understanding, or an unspoken convention, that an encoder can (or should) safely omit the BOM when the string is empty?

It seems like it would be a waste for empty strings, especially when sending to a server. Encoding type and byte order would be irrelevant in such a case.

Is there an RFC that specifically addresses BOM for empty strings?

Thank you.

Solution

A BOM is typically used only when there is no other external information about the string's encoding. It makes sense for text files, where the data has to be self-describing, but much less so for transmission protocols, unless no other encoding information is available. That information can come from the Content-Type header in HTTP, the <meta> tag in HTML, an encoding hard-coded by the protocol spec or its extensions, etc.
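To illustrate the distinction, here is a minimal Python sketch, not a definitive implementation. The helper names `encode_utf16le` and `decode_utf16` and the `declared` parameter are hypothetical, made up for this example; `declared` stands in for whatever out-of-band information the protocol provides (for instance a charset taken from a Content-Type header). Treating a zero-length payload as an empty string when no BOM is present is also an assumption of this sketch, not something a standard requires.

```python
import codecs
from typing import Optional


def encode_utf16le(text: str, with_bom: bool) -> bytes:
    """Encode text as UTF-16LE, optionally prefixed with a BOM."""
    # The "utf-16-le" codec never writes a BOM itself, so we add it explicitly.
    body = text.encode("utf-16-le")
    return codecs.BOM_UTF16_LE + body if with_bom else body


def decode_utf16(data: bytes, declared: Optional[str] = None) -> str:
    """Decode UTF-16 bytes using a declared encoding, or a BOM if none is declared."""
    if declared is not None:
        # External information is available: no BOM is needed.
        return data.decode(declared)
    if not data:
        # Assumption of this sketch: zero bytes carry no text to misinterpret.
        return ""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    raise ValueError("no BOM and no declared encoding; byte order is ambiguous")


# Empty string over a protocol that declares the encoding: nothing to send.
print(encode_utf16le("", with_bom=False))   # b''
# Empty string in a self-describing form: the BOM alone (2 bytes).
print(encode_utf16le("", with_bom=True))    # b'\xff\xfe'
```

With a declared encoding, the empty string costs zero bytes on the wire. In the self-describing case, the BOM is the only thing that records the byte order, which matters if more text is appended to the same data later.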

For simply storing a string in memory, a BOM is useless as long as you track the string's encoding properly. Also, depending on the particular string type you are using, an empty string may or may not be implemented as a NULL pointer, in which case there is no buffer to hold a BOM anyway.

And no, there is no RFC about general BOM usage.

Licensed under: CC-BY-SA with attribution
