Question

I read a list for my Android app from a CSV or TXT file.

If the file is saved as UTF-8 with Notepad++, the list displays correctly, but I can't find strings in it with .equals.

If the file is saved on Windows as ANSI, characters such as ä, ö, ü don't display, but now I can find strings.

So my question: how can I find out which charset my string has?

I compare the first string (from the file) with another string that is entered in the app through a SearchView.

I "THINK" my SearchView string from the app is ANSI too. How can I change it to UTF-8 so that the comparison works again?

Android 4.4.2

Thank you

The following doesn't work:

String s = null;
try
{
    s = new String(query.getBytes(), "UTF-8");
}
catch (UnsupportedEncodingException e)
{
    Log.e("utf8", "conversion", e);
}

Solution

Java strings are always encoded as UTF-16, regardless of where the string data comes from.
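To illustrate the point, here is a minimal sketch showing that a Java String carries no charset of its own: the charset matters only at the moment bytes are decoded or encoded. The class and method names (StringCharsetDemo, roundTrip, sameTextFromDifferentEncodings) are hypothetical and exist only for this example.

```java
import java.nio.charset.StandardCharsets;

class StringCharsetDemo {

    // Encoding a String to bytes and decoding it again with the SAME charset
    // gives back an equal String, so it cannot "convert" anything.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // The same text stored as UTF-8 bytes and as ISO-8859-1 bytes produces
    // equal Strings, provided each byte array is decoded with its own charset.
    static boolean sameTextFromDifferentEncodings() {
        String text = "\u00e4\u00f6\u00fc"; // a-umlaut, o-umlaut, u-umlaut
        String fromUtf8 = new String(
                text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        String fromLatin1 = new String(
                text.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1);
        return fromUtf8.equals(fromLatin1);
    }
}
```

Under these assumptions, a failing equals usually points to a file that was decoded with the wrong charset when it was read, rather than to the string typed into the app.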

It is important to correctly identify the charset of the source data when converting it to a Java string. new String(query.getBytes(), "UTF-8") works only if the byte[] array is actually UTF-8 encoded. Note that query.getBytes() without an argument encodes the string with the platform default charset, which on Android is always UTF-8, so this particular round trip returns the same string and changes nothing. An UnsupportedEncodingException is thrown only if you name a charset that Java does not support. If you name a supported charset that is wrong for the data, the outcome depends on the API: a decoder configured to report errors throws MalformedInputException or UnmappableCharacterException, whereas the String constructor reports nothing and silently converts malformed bytes to the Unicode replacement character U+FFFD. If you need control over error handling during the conversion, use the CharsetDecoder class instead.
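As a rough sketch of what strict decoding with CharsetDecoder could look like, assuming the data is expected to be UTF-8: the decoder below is configured to report bad input instead of replacing it. The names StrictDecode, decodeUtf8Strict and looksLikeUtf8 are made up for this example, and StandardCharsets requires Java 7 or Android API level 19.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {

    // Decodes the bytes as UTF-8 and throws a CharacterCodingException
    // (for example MalformedInputException) instead of inserting U+FFFD.
    static String decodeUtf8Strict(byte[] data) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(data)).toString();
    }

    // Convenience check: true if the bytes are well-formed UTF-8.
    static boolean looksLikeUtf8(byte[] data) {
        try {
            decodeUtf8Strict(data);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For comparison, passing the ISO-8859-1 bytes 0xE4 0xF6 0xFC to new String(bytes, StandardCharsets.UTF_8) raises no error and yields a string of U+FFFD characters, while decodeUtf8Strict rejects the same bytes. A successful strict decode does not prove the data was meant to be UTF-8; it only shows that the bytes are valid UTF-8.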

UTF-encoded files sometimes start with a BOM (byte order mark), so you can check for that; ANSI files never have one. If no BOM is present, you have to either analyze the raw data and guess (which causes problems when the guess is wrong) or simply ask the user which charset to use.
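The BOM check could be sketched roughly as follows. This is only an illustration under stated assumptions: it recognizes the UTF-8 BOM (bytes EF BB BF) and nothing else, it does not handle UTF-16 BOMs, and the caller must supply the fallback charset for files without a BOM. BomAwareReader and readLines are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class BomAwareReader {

    // Reads all lines. If the stream starts with a UTF-8 BOM, the BOM is
    // skipped and UTF-8 is used; otherwise the caller's fallback is used.
    static List<String> readLines(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) { // read up to 3 bytes, tolerating short reads
            int r = pin.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        Charset charset = fallback;
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (hasBom) {
            charset = StandardCharsets.UTF_8; // BOM consumed, not pushed back
        } else if (n > 0) {
            pin.unread(head, 0, n); // no BOM: put the bytes back
        }
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(pin, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Skipping the BOM matters for comparisons: Java's UTF-8 decoder does not remove it, so a file saved with a BOM and read without this step yields a first string that begins with an invisible U+FEFF and therefore never matches with equals. For a file saved as ANSI on a Western European Windows system, the fallback would probably be Charset.forName("windows-1252"), but that is an assumption about the user's machine, not something the file itself states.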

Always know the charset of your data. If you don't know, ask. Avoid guessing.

Licensed under: CC-BY-SA with attribution