How to detect charset in Java?

https://stackoverflow.com/questions/9935854

27-05-2021

Question

Half a year ago I ran into an annoying problem that I still haven't been able to fix. It involves log4j logging, where the default charset is UTF-8.

Sometimes I receive messages in a different encoding, CP1252, and there is no way to change this. Logging them as UTF-8 makes the text unreadable. I can convert the encoding so that the text becomes readable in the log.

But if I apply that "encoding fix" to a normal message, it gets garbled instead. I need to know whether the conversion is really needed, and unfortunately I have no idea how to tell.

Solution

As deceze commented, there is no reliable way to automatically detect the encoding of a text.

Most encodings try to use one byte per character; as a result, the same sequence of bytes can mean a completely different string in different encodings. Pretty much the only thing you can reliably say is "this is not a valid UTF-8 string"; other frequently used encodings do not even have strict rules about which byte sequences are or are not valid.
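The "not a valid UTF-8 string" check can be sketched with the JDK's own strict decoder. This is a minimal sketch, not the only way to do it; the class name `Utf8Check` and the method name `isValidUtf8` are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration.
final class Utf8Check {
    private Utf8Check() {}

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of silently
                    // substituting U+FFFD for bad input.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Note that `new String(bytes, StandardCharsets.UTF_8)` would not work for this purpose, because it replaces malformed input rather than reporting it. A `true` result only means the bytes are well-formed UTF-8, not that UTF-8 was what the sender intended.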

Your best option is to know the encoding of the message. The next best option is to preserve the original text as a byte array alongside the "UTF-8 string".

If you only have to accept a very limited set of encodings (UTF-8/UTF-16/CP1252), you can try some heuristics to detect them: for example, most English strings in UTF-16 will have 0 as every other byte, and you can then check whether the string is valid UTF-8; if it is not, it is likely the remaining encoding.
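The heuristic described above might look roughly like the following. It is a sketch under several assumptions: the input is limited to UTF-8, UTF-16 and CP1252, the JVM provides the `windows-1252` charset, and the 80% zero-byte threshold is an arbitrary value chosen for illustration. The class and method names (`EncodingGuess`, `guess`, `decode`) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; only meaningful when input is UTF-8, UTF-16 or CP1252.
final class EncodingGuess {
    // Assumption: the running JVM ships the windows-1252 (CP1252) charset.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    private EncodingGuess() {}

    static Charset guess(byte[] bytes) {
        int pairs = bytes.length / 2;
        if (pairs > 0 && bytes.length % 2 == 0) {
            int zerosEven = 0;
            int zerosOdd = 0;
            for (int i = 0; i < bytes.length; i++) {
                if (bytes[i] == 0) {
                    if (i % 2 == 0) zerosEven++; else zerosOdd++;
                }
            }
            // Mostly-English UTF-16 has a zero in every other byte.
            // The 80% threshold is an illustrative guess, not a standard value.
            if (zerosOdd * 5 >= pairs * 4) return StandardCharsets.UTF_16LE;
            if (zerosEven * 5 >= pairs * 4) return StandardCharsets.UTF_16BE;
        }
        // Strict UTF-8 validation; anything that fails falls through to CP1252.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return CP1252;
        }
    }

    // Decodes the bytes using whichever charset the heuristic picked.
    static String decode(byte[] bytes) {
        return new String(bytes, guess(bytes));
    }
}
```

For the logging case in the question, the message bytes would be passed through `EncodingGuess.decode(...)` before being handed to the logger. The guess can still be wrong: a CP1252 message whose bytes happen to form valid UTF-8 will be treated as UTF-8, and non-English UTF-16 text may not trip the zero-byte check.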

Other tips

Apache Tika includes an open source encoding detector.

There are also commercial alternatives.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow