Which charset should I use to encode and decode 8 bit values?

https://stackoverflow.com/questions/23491827

16-07-2023

Question

I have a problem with encoding and decoding specific byte values. I'm implementing an application in which I need to take String data, perform some bit manipulation on it, and return another String.

I'm currently getting the byte[] values with String.getBytes(), doing the manipulation, and then building the result with the constructor String(byte[] data). The issue is that when some of the bytes have specific values, e.g. -120, -127, etc., decoding in the constructor yields a ? character, that is, byte value 63. As far as I know, these are values that can't be printed on Windows, given that -120 in Java is 10001000, which is the \b character according to the ASCII table.

Is there any charset that I could use to properly encode and decode every byte value (from -128 to 127)?

EDIT: I should also say that the ISO-8859-1 charset works fine, but it cannot encode language-specific characters such as ąęćśńźżół

Solution

You seem to have some confusion regarding encodings, not specific to Java, so I'll try to help clear some of that up.

No charset or encoding uses code points from -128 to -1. If you treat the byte as an unsigned integer, you get the range 0-255, which is the range that the single-byte cp-* and ISO-8859-* charsets work with.

ASCII characters are in the range 0-127, so they have the same value whether you treat the byte as signed or unsigned.
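
The signed versus unsigned point can be checked directly in Java. The following is only a minimal sketch; the class name `UnsignedBytes` and the helper `toUnsigned` are made up for illustration, while `Byte.toUnsignedInt` is the standard library equivalent (Java 8 and later).

```java
class UnsignedBytes {
    // Hypothetical helper: reinterpret a signed Java byte (-128..127)
    // as an unsigned value (0..255) by masking off the sign extension.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) -120;
        System.out.println(toUnsigned(b));                          // 136
        System.out.println(Integer.toBinaryString(toUnsigned(b)));  // 10001000
        System.out.println(Byte.toUnsignedInt(b));                  // 136, same result via the JDK
        System.out.println(toUnsigned((byte) 65));                  // 65, ASCII 'A' is unchanged
    }
}
```

So the byte that Java prints as -120 is the unsigned value 136, which lies outside the ASCII range.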

In UTF-8, code points 0-127 are encoded as a single byte with that same value. Every other character is encoded as a sequence of two to four bytes, all of which are in the range 128-255.
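
As a rough illustration of those byte ranges, the sketch below prints the unsigned UTF-8 bytes of three sample characters. The class name `Utf8Bytes` and the helper `utf8Bytes` are hypothetical names chosen for this example; the sample characters are written as Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Bytes {
    // Hypothetical helper: encode a string as UTF-8 and return each byte
    // as an unsigned value (0..255) so the ranges are easy to read.
    static int[] utf8Bytes(String s) {
        byte[] raw = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(utf8Bytes("a")));       // [97]
        System.out.println(Arrays.toString(utf8Bytes("\u0105")));  // a with ogonek: [196, 133]
        System.out.println(Arrays.toString(utf8Bytes("\u20AC")));  // euro sign: [226, 130, 172]
    }
}
```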

You mention some Polish characters, so instead of ISO-8859-1 you should encode as ISO-8859-2 or (preferably) UTF-8.
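
To see how this plays out, here is a hedged sketch that passes an explicit charset to `getBytes` and to the `String` constructor instead of relying on the platform default. The class `CharsetRoundTrip` and its two helpers are invented for this example, and it assumes the runtime provides ISO-8859-2 (common JDKs do, but only ISO-8859-1 and the UTF encodings are guaranteed).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRoundTrip {
    // Hypothetical helper: does text survive String -> bytes -> String?
    static boolean textRoundTrips(String text, Charset cs) {
        return text.equals(new String(text.getBytes(cs), cs));
    }

    // Hypothetical helper: do raw bytes survive bytes -> String -> bytes?
    static boolean bytesRoundTrip(byte[] data, Charset cs) {
        return Arrays.equals(data, new String(data, cs).getBytes(cs));
    }

    public static void main(String[] args) {
        // The Polish letters from the question, written as Unicode escapes.
        String polish = "\u0105\u0119\u0107\u015B\u0144\u017A\u017C\u00F3\u0142";

        // All 256 possible byte values.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }

        System.out.println(textRoundTrips(polish, StandardCharsets.UTF_8));       // true
        System.out.println(textRoundTrips(polish, StandardCharsets.ISO_8859_1));  // false
        // Assumes ISO-8859-2 is available in this runtime.
        System.out.println(textRoundTrips(polish, Charset.forName("ISO-8859-2")));

        System.out.println(bytesRoundTrip(all, StandardCharsets.ISO_8859_1));     // true
        System.out.println(bytesRoundTrip(all, StandardCharsets.UTF_8));          // false
    }
}
```

The last two lines suggest why the manipulated bytes get damaged: an arbitrary byte sequence is usually not valid UTF-8, so decoding it replaces the malformed parts. One possible approach is to encode the text with UTF-8 but avoid turning the manipulated bytes back into a String with a text charset.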

Licensed under: CC-BY-SA with attribution
