Question

I have a string that is read as UTF-8 (it does not come from a file, so I can't check for a BOM). The problem is that sometimes the original text was written in another encoding and then converted to UTF-8, so the string is unreadable, essentially gibberish.

Is it possible to detect that such a string is not really UTF-8?
Thanks!

Solution

No. They're just bytes, and nothing in the bytes records which encoding the text originally used. You could try to guess by applying different conversions and checking whether the result contains valid dictionary words, but in a theoretical sense it's impossible without knowing something about the data itself: that it never uses certain characters, that it always uses certain characters, or that it consists mostly of words found in a given dictionary. The string might look like gibberish to a person, but the computer has no built-in way of quantifying "gibberish".
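The "try different conversions" idea could be sketched roughly as follows. This is only an illustration, not a reliable detector: the helper name `candidate_repairs` and the list of candidate encodings are assumptions made for the example, and the function only enumerates possible repairs. Choosing the right one still needs outside knowledge about the data, such as a dictionary check.

```python
from itertools import permutations

# Assumed candidate encodings; pick ones that are plausible for your data.
CANDIDATE_ENCODINGS = ["utf-8", "cp1252", "cp1251"]


def candidate_repairs(text, encodings=CANDIDATE_ENCODINGS):
    """Return (wrong, right, repaired) tuples for every round trip that succeeds.

    Each tuple assumes the bytes were originally in `right` but were
    mistakenly decoded as `wrong`; re-encoding with `wrong` and decoding
    with `right` undoes that mistake if the assumption holds.
    """
    results = []
    for wrong, right in permutations(encodings, 2):
        try:
            repaired = text.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the string
        if repaired != text:
            results.append((wrong, right, repaired))
    return results


# Simulate the problem: UTF-8 bytes mistakenly decoded as cp1252.
mojibake = "café".encode("utf-8").decode("cp1252")
for wrong, right, repaired in candidate_repairs(mojibake):
    print(f"decoded as {wrong}, really {right}: {repaired!r}")
```

Several pairs usually succeed for the same input, which is the point of the answer above: the code can list candidates, but it cannot tell which one is the readable text without some extra criterion.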

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow