Question

I am reading files in various formats and languages, and I am currently using a small encoding library to attempt to detect the proper encoding (http://www.codeproject.com/KB/recipes/DetectEncoding.aspx).

It's pretty good, but it still misses occasionally (on multilingual files).

Most of my potential users have very little understanding of encoding (the best I can hope for is "it has something to do with characters") and are very unlikely to be able to choose the right encoding from a list, so I would like to let them cycle through different encodings until the right one is found just by clicking on a button.

Display problems? Click here to try a different encoding! (Well that's the concept anyway)

What would be the best way to implement something like that?


Edit: Looks like I didn't express myself clearly enough. By "cycling through the encoding", I don't mean "how to loop through encodings?"

What I meant was "how to let the user try different encodings in sequence without reloading the file?"

The idea is more like this: Let's say the file is loaded with the wrong encoding. Some strange characters are displayed. The user would click a "Next encoding" or "Previous encoding" button, and the string would be converted to a different encoding. The user just needs to keep clicking until the right encoding is found (whatever encoding looks good to the user will do fine). As long as the user can click "next", he has a reasonable chance of solving his problem.

What I have found so far involves converting the string to bytes using the current encoding, then converting the bytes to the next encoding, converting those bytes into chars, then converting the char into a string... Doable, but I wonder if there isn't an easier way to do that.

For instance, if there was a method that would read a string and return it using a different encoding, something like "render(string, encoding)".


Thanks a lot for the answers!


Solution

Read the file as bytes, then use the Encoding.GetString method.

        byte[] data = System.IO.File.ReadAllBytes(path);

        Console.WriteLine(Encoding.UTF8.GetString(data));
        Console.WriteLine(Encoding.UTF7.GetString(data));
        Console.WriteLine(Encoding.ASCII.GetString(data));

So you only have to load the file once. You can try any encoding against the original bytes of the file. The user can select the correct one, and you can use the result of Encoding.GetEncoding(...).GetString(data) for further processing.
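To tie this back to the "Next encoding" / "Previous encoding" buttons from the question, here is a minimal sketch of a helper that keeps the original bytes and re-decodes them on each click. EncodingCycler and its members are hypothetical names (nothing like it ships with .NET), and which candidate encodings to offer is left to the caller.

```csharp
using System;
using System.Text;

// Hypothetical helper: holds the file's original bytes and re-decodes
// them with one candidate encoding at a time.
public sealed class EncodingCycler
{
    private readonly byte[] data;
    private readonly Encoding[] candidates;
    private int index;

    public EncodingCycler(byte[] data, params Encoding[] candidates)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (candidates == null || candidates.Length == 0)
            throw new ArgumentException("At least one encoding is required.", nameof(candidates));
        this.data = data;
        this.candidates = candidates;
    }

    // The encoding currently used for display.
    public Encoding Current => candidates[index];

    // Always decoded from the untouched original bytes.
    public string Text => Current.GetString(data);

    // Moves to the next candidate, wrapping around at the end.
    public string Next()
    {
        index = (index + 1) % candidates.Length;
        return Text;
    }

    // Moves to the previous candidate, wrapping around at the start.
    public string Previous()
    {
        index = (index + candidates.Length - 1) % candidates.Length;
        return Text;
    }
}
```

A button handler would then only need something like view.Text = cycler.Next(); where cycler was built once from File.ReadAllBytes(path) and the encodings you want to offer.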

OTHER TIPS

(removed original answer following question update)

For instance, if there was a method that would read a string and return it using a different encoding, something like "render(string, encoding)".

I don't think you can re-use the string data. The fact is: if the encoding was wrong, this string can be considered corrupt. It may very easily contain gibberish among the likely looking characters. In particular, many encodings may forgive the presence/absence of a BOM/preamble, but would you re-encode with it? without it?
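To make the corruption concrete, here is a small sketch that checks whether bytes survive a decode/re-encode round trip. RoundTripDemo and SurvivesRoundTrip are hypothetical names used only for illustration. For example, the Latin-1 bytes for "café" (63 61 66 E9) are not valid UTF-8, so decoding them as UTF-8 replaces the last byte with a substitute character and the original byte is gone for good.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RoundTripDemo
{
    // Decodes the bytes with the given encoding, re-encodes the resulting
    // string with the same encoding, and reports whether the bytes match.
    public static bool SurvivesRoundTrip(byte[] original, Encoding encoding)
    {
        string text = encoding.GetString(original);
        byte[] roundTripped = encoding.GetBytes(text);
        return roundTripped.SequenceEqual(original);
    }
}
```

If this returns false for the encoding that was used to load the file, the string can no longer be converted back to the original bytes, which is why keeping the byte[] around is the safer option.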

If you are happy to risk it (I wouldn't be), you could just re-encode your local string with the last encoding:

// I DON'T RECOMMEND THIS!!!!
byte[] preamble = lastEncoding.GetPreamble(),
    content = lastEncoding.GetBytes(text);
byte[] raw = new byte[preamble.Length + content.Length];
Buffer.BlockCopy(preamble, 0, raw, 0, preamble.Length);
Buffer.BlockCopy(content, 0, raw, preamble.Length, content.Length);
text = nextEncoding.GetString(raw);

In reality, I believe the best you can do is to keep the original byte[] - keep offering different renderings (via different encodings) until they like one. Something like:

using System;
using System.IO;
using System.Text;
using System.Windows.Forms;
class MyForm : Form {
    [STAThread]
    static void Main() {
        Application.EnableVisualStyles();
        Application.Run(new MyForm());
    }
    ComboBox encodings;
    TextBox view;
    Button load, next;
    byte[] data = null;

    void ShowData() {
        if (data != null && encodings.SelectedIndex >= 0) {
            try {
                Encoding enc = Encoding.GetEncoding(
                    (string)encodings.SelectedValue);
                view.Text = enc.GetString(data);
            } catch (Exception ex) {
                view.Text = ex.ToString();
            }
        }
    }
    public MyForm() {
        load = new Button();
        load.Text = "Open...";
        load.Dock = DockStyle.Bottom;
        Controls.Add(load);

        next = new Button();
        next.Text = "Next...";
        next.Dock = DockStyle.Bottom;
        Controls.Add(next);

        view = new TextBox();
        view.ReadOnly = true;
        view.Dock = DockStyle.Fill;
        view.Multiline = true;
        Controls.Add(view);

        encodings = new ComboBox();
        encodings.Dock = DockStyle.Bottom;
        encodings.DropDownStyle = ComboBoxStyle.DropDown;
        encodings.DataSource = Encoding.GetEncodings();
        encodings.DisplayMember = "DisplayName";
        encodings.ValueMember = "Name";
        Controls.Add(encodings);

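        // Wrap around at the end of the list instead of throwing.
        next.Click += delegate {
            if (encodings.Items.Count > 0)
                encodings.SelectedIndex =
                    (encodings.SelectedIndex + 1) % encodings.Items.Count;
        };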

        encodings.SelectedValueChanged += delegate { ShowData(); };

        load.Click += delegate {
            using (OpenFileDialog dlg = new OpenFileDialog()) {
                if (dlg.ShowDialog(this)==DialogResult.OK) {
                    data = File.ReadAllBytes(dlg.FileName);
                    Text = dlg.FileName;
                    ShowData();
                }
            }
        };
    }
}

Could you let the user enter some words (with "special" characters) that are supposed to occur in the file?

You can search all encodings yourself to see if these words are present.
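A rough sketch of that search, assuming the original bytes were kept: decode the data with each candidate and keep the encodings whose output contains every word the user typed. EncodingSearch and FindMatches are hypothetical names, and the candidate list is supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingSearch
{
    // Returns the candidates whose decoding of the data contains
    // all of the expected words (ordinal, case-sensitive comparison).
    public static List<Encoding> FindMatches(
        byte[] data, IEnumerable<Encoding> candidates, params string[] expectedWords)
    {
        var matches = new List<Encoding>();
        foreach (Encoding candidate in candidates)
        {
            string text = candidate.GetString(data);
            bool allFound = true;
            foreach (string word in expectedWords)
            {
                if (text.IndexOf(word, StringComparison.Ordinal) < 0)
                {
                    allFound = false;
                    break;
                }
            }
            if (allFound) matches.Add(candidate);
        }
        return matches;
    }
}
```

More than one encoding can match (many single-byte code pages agree on common characters), so the result is best treated as a shortlist to show the user rather than a final answer.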

Beware of the infamous 'Notepad bug'. It's going to bite you whatever you try, though... You can find some good discussions about encodings and their challenges on MSDN (and other places).

You have to keep the original data as a byte array or MemoryStream, which you can then decode with each new encoding. Once you have converted your data to a string, you can't reliably return to the original representation.

How about something like this:

public string LoadFile(string path)
{
    stream = GetMemoryStream(path);
    // First attempt: UTF-8. Call TryEncoding again for other candidates.
    return TryEncoding(Encoding.UTF8);
}

public string TryEncoding(Encoding e)
{
    // Rewind so every attempt decodes the same original bytes.
    stream.Seek(0, SeekOrigin.Begin);
    // The reader is deliberately not disposed: disposing it would close the stream.
    StreamReader reader = new StreamReader(stream, e);
    return reader.ReadToEnd();
}

private MemoryStream stream = null;

private MemoryStream GetMemoryStream(string path)
{
    byte[] buffer = System.IO.File.ReadAllBytes(path);
    return new MemoryStream(buffer);
}

Use LoadFile on your first try; then use TryEncoding subsequently.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow