Question

A client and a server communicate over TCP, sending each other UTF-8 encoded messages.

With UTF-8, the number of bytes per character is variable: a single character can take anywhere from one to four bytes.

Let's say I am reading a UTF-8 encoded message from the network stream and it is a huge message; in my case it was about 145k bytes. Creating a buffer of that size to read from the network stream could lead to an OutOfMemoryException, since the byte array needs that amount of contiguous memory.

It would be better, then, to read from the network stream in a loop until the entire message has been read, reading the pieces into a smaller buffer (probably 4 KB), then decoding each piece to a string and concatenating the results.
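Something along these lines is what I have in mind (a minimal sketch, assuming .NET, a Stream/NetworkStream to read from, and that the end of the message is signalled by the sender closing the connection; ChunkedReader and ReadMessage are just placeholder names):

```csharp
using System.IO;
using System.Text;

class ChunkedReader
{
    // Read the whole message in small chunks and decode each chunk as it arrives.
    public static string ReadMessage(Stream stream)
    {
        var buffer = new byte[4096];        // small, fixed-size read buffer
        var builder = new StringBuilder();

        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Decode only the bytes actually read in this pass and append them.
            // If the chunk ends in the middle of a multi-byte character, this
            // call sees an incomplete sequence, which is exactly my concern.
            builder.Append(Encoding.UTF8.GetString(buffer, 0, bytesRead));
        }

        return builder.ToString();
    }
}
```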

What I am wondering is what happens when the very last byte of the read buffer is actually one of the bytes of a character that is represented by multiple bytes. When I decode that read buffer, the trailing byte and the leading bytes of the next read would each come out as invalid or wrong characters (a small example follows below). The quickest way to solve this, in my mind, would be to switch to a fixed-width encoding (like UTF-16) and make the buffer size a multiple of the number of bytes per character (a multiple of 2 for UTF-16, of 4 for UTF-32).
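To make the concern concrete, here is a tiny demonstration of what I mean, splitting the two UTF-8 bytes of the character é (0xC3 0xA9) across two chunks and decoding each chunk on its own (the exact substitute characters depend on the decoder fallback in use; this just illustrates the effect):

```csharp
using System;
using System.Text;

class SplitCharacterDemo
{
    static void Main()
    {
        // "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9.
        byte[] firstChunk  = { 0x48, 0x69, 0xC3 };  // "Hi" plus the first byte of "é"
        byte[] secondChunk = { 0xA9, 0x21 };        // the second byte of "é" plus "!"

        // Decoding each chunk independently: the incomplete sequence at the end
        // of the first chunk and the stray continuation byte at the start of the
        // second one are both turned into replacement characters rather than "é",
        // so the concatenated result is garbled instead of "Hié!".
        string naive = Encoding.UTF8.GetString(firstChunk)
                     + Encoding.UTF8.GetString(secondChunk);
        Console.WriteLine(naive);
    }
}
```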

But UTF-8 seems to be a very common encoding, which leads me to believe this is a solved problem. Is there another way to address my concern other than changing the encoding? Perhaps using a linked-list-type object to store the bytes would be the way to handle this, since it would not require contiguous memory.
