Java strings are always sequences of UTF-16 code units (that is what the char values of a String are), regardless of where the string data comes from.
It is important that you correctly identify the charset of the source data when converting it to a Java string. new String(query.getBytes(), "UTF-8") will work fine only if the byte[] array is actually UTF-8 encoded; note that query.getBytes() with no argument encodes using the platform default charset, so this round trip is only correct when that default happens to be UTF-8. You will get an UnsupportedEncodingException only if you specify a charset name that Java does not support. If the charset is supported but the bytes do not actually match it (typically because you specified the wrong charset for the data), the String constructor will not raise an error at all: malformed/unmappable bytes are simply converted to the Unicode U+FFFD replacement character. If you need the decode to fail loudly instead, or want any other control over error handling during the conversion, use the CharsetDecoder class, which can be configured (via CodingErrorAction.REPORT) to throw MalformedInputException or UnmappableCharacterException on bad input.
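For example, here is a minimal sketch of strict decoding with CharsetDecoder (the class and method names below are my own, not from any library): bytes that are not valid UTF-8 make the decode throw instead of silently producing U+FFFD.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {

    // Decode bytes as UTF-8, throwing CharacterCodingException
    // (MalformedInputException / UnmappableCharacterException)
    // instead of substituting U+FFFD. REPORT is spelled out here
    // for clarity even though it is the decoder's default action.
    static String decodeUtf8Strict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = decoder.decode(ByteBuffer.wrap(bytes));
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] good = {(byte) 0xC3, (byte) 0xA9}; // "é" in UTF-8
        byte[] bad  = {(byte) 0xC3};              // truncated multi-byte sequence
        try {
            System.out.println(decodeUtf8Strict(good)); // prints é
            System.out.println(decodeUtf8Strict(bad));  // never reached
        } catch (CharacterCodingException e) {
            System.out.println("decode failed: " + e);  // MalformedInputException
        }
    }
}
```

Contrast this with new String(bad, StandardCharsets.UTF_8), which would quietly return a string containing U+FFFD.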
Sometimes UTF-encoded files will have a BOM (byte order mark) at the front, so you can check for that. ANSI files, however, do not use BOMs. If a UTF BOM is not present in the file, then you have to either analyze the raw data and take a guess (which will lead to problems if you guess wrong), or simply ask the user which charset to use.
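A BOM check can be sketched like this (class and method names are my own; for brevity this only recognizes the UTF-8 and UTF-16 BOMs, not UTF-32):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class BomSniffer {

    // Return the charset implied by a leading BOM, if any.
    // Simplification: the UTF-32 BOMs (00 00 FE FF / FF FE 00 00) are
    // not handled; a full check would test those 4-byte marks first,
    // since FF FE 00 00 also begins with the UTF-16LE mark.
    static Optional<Charset> charsetFromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty(); // no BOM: ask the user rather than guess
    }
}
```

Remember to skip the BOM bytes themselves before decoding the rest of the file, or the BOM will show up as a U+FEFF character at the start of your string.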
Always know the charset of your data. If you don't know, ask. Avoid guessing.