Reading chars from a stream of ByteArrays where boundary alignment may be imperfect

StackOverflow https://stackoverflow.com/questions/22917492

  •  29-06-2023

Question

I'm working with asynchronous IO on the JVM, where I'm occasionally handed a byte array from an incoming socket. Concatenated, these byte arrays form a stream, and my overall goal is to split that stream into strings at each occurrence of a given character, be it newline, NUL, or something more esoteric.

I have no guarantee that the boundaries between consecutive byte arrays do not fall partway through a multi-byte character.

Reading through the documentation for java.nio.CharBuffer, I don't see any explicit semantics given as to how trailing partial multibyte characters are handled.

Given a series of ByteBuffers, what's the best way to get (complete) characters out of them, understanding that a character may span the gap between two sequential ByteBuffers?


Solution

Use a CharsetDecoder:

final Charset charset = ...
final CharsetDecoder decoder = charset.newDecoder()
    .onUnmappableCharacter(CodingErrorAction.REPORT)
    .onMalformedInput(CodingErrorAction.REPORT);

I have this problem in one of my projects, and here is how I deal with it (the line numbers below refer to that project's source).

Note line 258: if the result is a malformed input sequence then it may be an incomplete read; in that case, I set the last good offset to the last decoded byte, and start again from that offset.

If, on the next read, I fail to read again and the byte offset is the same, then this is a permanent failure (line 215).

Your case is a little different, however, since you cannot "backtrack": you'd need to fill a new ByteBuffer with the rest of the previous buffer followed by the new one, and start from there (allocate oldBuf.remaining() + bufsize bytes and .put() oldBuf into the new buffer). In my case the backend is a file, so I can .map() from wherever I want.
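
The buffer-refill step described above could be sketched roughly as follows. This is a minimal sketch, not the code from the project mentioned in the answer: ChunkDecoder and feed are hypothetical names, and it assumes the three-argument CharsetDecoder.decode(in, out, endOfInput) is called with endOfInput = false. In that mode a truncated trailing sequence is left unconsumed in the input buffer (the result is UNDERFLOW) rather than reported as malformed, so the leftover bytes can simply be copied in front of the next chunk.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes a stream chunk by chunk, carrying any
// undecoded trailing bytes over into the next call.
class ChunkDecoder {
    private final CharsetDecoder decoder;
    // Bytes left over from the previous chunk, in read mode.
    private ByteBuffer leftover = ByteBuffer.allocate(0);

    ChunkDecoder(Charset charset) {
        this.decoder = charset.newDecoder()
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .onMalformedInput(CodingErrorAction.REPORT);
    }

    String feed(byte[] chunk) {
        // New buffer = rest of the previous buffer + the new chunk.
        ByteBuffer in = ByteBuffer.allocate(leftover.remaining() + chunk.length);
        in.put(leftover).put(chunk);
        in.flip();

        // Sized so that the output cannot overflow for this input.
        int maxChars = (int) Math.ceil(in.remaining() * (double) decoder.maxCharsPerByte()) + 1;
        CharBuffer out = CharBuffer.allocate(maxChars);

        // endOfInput = false: more bytes may follow, so an incomplete
        // trailing sequence stays in 'in' instead of being an error.
        CoderResult result = decoder.decode(in, out, false);
        if (result.isError()) {
            throw new IllegalStateException("Undecodable input: " + result);
        }
        leftover = in; // position is now at the first undecoded byte
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        ChunkDecoder d = new ChunkDecoder(StandardCharsets.UTF_8);
        // U+00E9 is 0xC3 0xA9 in UTF-8; split it across two chunks.
        System.out.println(d.feed(new byte[] {'a', (byte) 0xC3}));
        System.out.println(d.feed(new byte[] {(byte) 0xA9, 'b'}));
    }
}
```

Splitting on the delimiter character can then be done on the returned strings, since each call only ever returns complete characters.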

So, basically:

  • if you have an unmappable character, this is a permanent failure (your encoding just cannot handle your byte sequence);
  • if you have read the full byte sequence successfully, your CharBuffer will have buf.position() characters in it;
  • if you have a malformed input, it may mean that you have an incomplete byte sequence (for instance, using UTF-8, you have only one byte of a three-byte sequence), but you need to confirm that with the next iteration.
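
The three cases above map onto the CoderResult returned by the three-argument decode. The following is a minimal sketch under stated assumptions: the whole input is decoded in one call with endOfInput = true, so a truncated trailing sequence shows up as malformed input, as the answer describes. DecodeOutcome, Kind and lastGoodOffset are hypothetical names introduced only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical result type: what happened, how far we got, what was decoded.
class DecodeOutcome {
    enum Kind { COMPLETE, MALFORMED_MAYBE_INCOMPLETE, UNMAPPABLE }

    final Kind kind;
    final int lastGoodOffset; // number of bytes consumed before stopping
    final String text;        // characters decoded up to that point

    private DecodeOutcome(Kind kind, int lastGoodOffset, String text) {
        this.kind = kind;
        this.lastGoodOffset = lastGoodOffset;
        this.text = text;
    }

    static DecodeOutcome decode(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .onMalformedInput(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        int maxChars = (int) Math.ceil(bytes.length * (double) decoder.maxCharsPerByte()) + 1;
        CharBuffer out = CharBuffer.allocate(maxChars);

        CoderResult result = decoder.decode(in, out, true);
        Kind kind;
        if (result.isUnmappable()) {
            kind = Kind.UNMAPPABLE;                 // permanent failure
        } else if (result.isMalformed()) {
            kind = Kind.MALFORMED_MAYBE_INCOMPLETE; // retry with more bytes
        } else {
            decoder.flush(out);                     // finish the decoding operation
            kind = Kind.COMPLETE;
        }
        out.flip();
        // On an error, in.position() is where the offending bytes begin.
        return new DecodeOutcome(kind, in.position(), out.toString());
    }

    public static void main(String[] args) {
        // 0xE2 0x82 are the first two bytes of a three-byte UTF-8 sequence.
        byte[] truncated = {'a', (byte) 0xE2, (byte) 0x82};
        DecodeOutcome o = decode(truncated, StandardCharsets.UTF_8);
        System.out.println(o.kind + " at byte " + o.lastGoodOffset);
    }
}
```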

Feel free to salvage any code you deem necessary! It's free ;)


FINAL NOTE, since I believe this is important: String's .getBytes(*) methods and constructors from byte arrays have a default CodingErrorAction of REPLACE!
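
To illustrate the difference, here is a small sketch contrasting the String convenience methods with a decoder left at its default REPORT actions. ReplaceVsReport and its method names are hypothetical, and the byte values are chosen only for demonstration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: REPLACE (String methods) vs REPORT (new decoder).
class ReplaceVsReport {
    // String's byte[] constructor substitutes U+FFFD for undecodable input.
    static String lenientDecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // getBytes substitutes the charset's replacement ('?' for US-ASCII)
    // for characters it cannot encode.
    static byte[] lenientEncode(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // A freshly created decoder reports errors; the one-argument decode
    // treats the buffer as the complete input and throws on bad bytes.
    static boolean strictDecodeFails(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] truncated = {'a', (byte) 0xC3}; // 0xC3 starts a two-byte sequence
        System.out.println(lenientDecode(truncated).length()); // 'a' plus one replacement char
        System.out.println(strictDecodeFails(truncated));
    }
}
```

With REPLACE, a character cut in half at a buffer boundary is silently turned into a replacement character, which is why the String constructors are a poor fit for decoding individual chunks.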

License: CC-BY-SA with attribution
Not affiliated with StackOverflow