"Raw" conversion from double-UTF-8 to UTF-8 (or from UTF-8 to ANSI)

Question 1

The following code uses Ruby's low-level encoding functions to rewrite double-encoded UTF-8 (text that was misread as CP1252 and then re-encoded as UTF-8) back into normal UTF-8.

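#!/usr/bin/env ruby

# Converting UTF-8 -> CP1252 turns each double-encoded character back into
# the single byte it originally stood for.
ec = Encoding::Converter.new(Encoding::UTF_8, Encoding::CP1252)

# Lead byte of the UTF-8 form of U+0080..U+00BF, built as a binary string so
# that it compares equal to the bytes read from STDIN.
C2 = 0xC2.chr

prev_b = nil

orig_bytes = STDIN.read.force_encoding(Encoding::BINARY).bytes.to_a
real_utf8_bytes = String.new
real_utf8_bytes.force_encoding(Encoding::BINARY)

orig_bytes.each do |b|
    b = b.chr

    # Feed one byte at a time; PARTIAL_INPUT lets the converter wait for the
    # remaining bytes of a multi-byte character.
    situation = ec.primitive_convert(b.dup, real_utf8_bytes, nil, nil, Encoding::Converter::PARTIAL_INPUT)

    if situation == :undefined_conversion
        # CP1252 has no byte for this character. Only C2 xx sequences are
        # accepted here; anything else aborts.
        if prev_b != C2
            $stderr.puts "ERROR found byte #{b.dump} in stream (prev #{(prev_b||'').dump})"
            exit 1
        end

        # Pass the trailing byte through unchanged.
        real_utf8_bytes.force_encoding(Encoding::BINARY)
        real_utf8_bytes << b
        real_utf8_bytes.force_encoding(Encoding::CP1252)
    end

    prev_b = b
end

real_utf8_bytes.force_encoding(Encoding::BINARY)
# write (not puts) so that no extra newline is appended to the output
$stdout.write(real_utf8_bytes)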

It is meant to be used in a pipeline:

cat "$PROBLEMATIC_FILE" | ./fix-double-utf8-encoding.rb > "$CORRECTED_FILE"
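
For comparison, the same repair can be sketched on an in-memory string with String#encode instead of the streaming converter. This is only a minimal sketch: the helper name undo_double_utf8 is hypothetical, and it assumes that bytes CP1252 leaves unassigned were passed through as the C1 control characters U+0080..U+009F.

```ruby
# Hypothetical helper (not part of the script above): undo one round of
# "UTF-8 read as CP1252 and re-encoded as UTF-8" on an in-memory string.
def undo_double_utf8(double_encoded)
  text = double_encoded.dup.force_encoding(Encoding::UTF_8)
  bytes = text.each_char.map do |ch|
    begin
      # Normal case: the character's CP1252 byte is the original byte.
      ch.encode(Encoding::CP1252).force_encoding(Encoding::BINARY)
    rescue Encoding::UndefinedConversionError
      # Assumption: unassigned CP1252 bytes (such as 0x81) arrived as
      # C1 control characters; anything else is re-raised.
      raise unless (0x80..0x9F).cover?(ch.ord)
      ch.ord.chr # Integer#chr gives a one-byte binary string for 128..255
    end
  end
  bytes.join.force_encoding(Encoding::UTF_8)
end

fixed = undo_double_utf8("\xC3\x8F\xC2\x81")
p fixed.bytes # => [207, 129]
```

The two resulting bytes, CF 81, are the UTF-8 encoding of U+03C1 (Greek small letter rho). Unlike the streaming script, this version holds the whole input in memory.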

Question 2

echo -e -n '\xc3\x8f\xc2\x81' | iconv -f UTF-8 -t ISO-8859-1
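
The same round trip can be sketched in Ruby, assuming (as the iconv command does) that the intermediate charset was ISO 8859-1. The variable names are illustrative only.

```ruby
# The four bytes from the command above: UTF-8 for U+00CF followed by U+0081.
double_encoded = "\xC3\x8F\xC2\x81"

# Encoding to ISO 8859-1 maps each character back to one byte; those bytes
# are then relabelled (not converted) as UTF-8.
repaired = double_encoded.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)

p repaired.bytes           # => [207, 129]
p repaired.valid_encoding? # => true
```

The bytes CF 81 form a single valid UTF-8 character, U+03C1.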

Windows-1252 differs from ISO 8859-1 only in the 0x80-0x9F range. In your case, for example, byte 0x81 maps to U+0081 in ISO 8859-1 but is unassigned in Windows-1252.
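
A small Ruby sketch makes the difference visible by decoding the same byte under both charsets. The helper decode_byte is hypothetical, and the nil result for 0x81 relies on Ruby's Windows-1252 table treating that byte as unassigned.

```ruby
# Hypothetical helper: decode one byte under the given charset and return it
# as a UTF-8 string, or nil if the charset has no character for that byte.
def decode_byte(byte, charset)
  byte.chr.force_encoding(charset).encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  nil
end

p decode_byte(0x80, Encoding::ISO_8859_1).ord.to_s(16)   # => "80"
p decode_byte(0x80, Encoding::Windows_1252).ord.to_s(16) # => "20ac"
p decode_byte(0x81, Encoding::ISO_8859_1).ord.to_s(16)   # => "81"
p decode_byte(0x81, Encoding::Windows_1252)              # => nil
```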

Check whether the rest of your data was misinterpreted as Windows-1252 or as ISO 8859-1. ISO 8859-1 is usually the more common of the two.
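
One way to run that check is a rough heuristic over the double-encoded text, sketched below in Ruby. The names CP1252_SPECIFIC and guess_intermediate_charset are hypothetical, and the result is only a hint. Characters such as U+20AC can only come from a Windows-1252 misreading, whereas C1 control characters (U+0080..U+009F) point to ISO 8859-1, or to a Windows-1252 decoder that passed unassigned bytes through.

```ruby
# Characters that Windows-1252 assigns to bytes 0x80..0x9F (for example
# U+20AC for 0x80). Bytes without a mapping are skipped.
CP1252_SPECIFIC = (0x80..0x9F).map do |byte|
  begin
    byte.chr.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
  rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
    nil
  end
end.compact

# Hypothetical heuristic: guess which charset the text was misread as.
# Expects a valid UTF-8 string that is still double-encoded.
def guess_intermediate_charset(double_encoded)
  chars = double_encoded.each_char.to_a
  return :windows_1252 if chars.any? { |c| CP1252_SPECIFIC.include?(c) }
  return :iso_8859_1 if chars.any? { |c| (0x80..0x9F).cover?(c.ord) }
  :undetermined # no character from the 0x80-0x9F range was seen
end

p guess_intermediate_charset("\u00E2\u201A\u00AC") # => :windows_1252
p guess_intermediate_charset("\u00CF\u0081")       # => :iso_8859_1
p guess_intermediate_charset("\u00C3\u00A9")       # => :undetermined
```

When the result is :undetermined, both charsets decode the sample identically, so either conversion repairs it.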