Question

I have a .NET plugin which needs to get the text of the current buffer. I found this page, which shows a way to do it:

public static string GetDocumentText(IntPtr curScintilla)
{
    int length = (int)Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
    StringBuilder sb = new StringBuilder(length);
    Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, sb);
    return sb.ToString();
}

And that's fine, until we reach the character encoding issues. I have a buffer that is set in the Encoding menu to "UTF-8 without BOM", and I write that text to a file:

System.IO.File.WriteAllText(@"C:\Users\davet\BBBBBB.txt", sb.ToString());

When I open that file in Notepad++, the Encoding menu shows "UTF-8 without BOM", but the ß character is broken (it shows up as ÃŸ).

I was able to get as far as finding the encoding for my current buffer:

int currentBuffer = (int)Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_GETCURRENTBUFFERID, 0, 0);
Console.WriteLine("currentBuffer: " + currentBuffer);
int encoding = (int) Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_GETBUFFERENCODING, currentBuffer, 0);
Console.WriteLine("encoding = " + encoding);

And that shows "4" for "UTF-8 without BOM" and "0" for "ASCII", but I cannot find what Notepad++ or Scintilla thinks those values are supposed to represent.

So I'm a bit lost for where to go next (Windows not being my natural habitat). Anyone know what I'm getting wrong, or how to debug it further?

Thanks.


Solution

Removing the StringBuilder fixes this problem. Read the text into a byte array instead, so the raw bytes can be decoded explicitly as UTF-8.

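public static string GetDocumentTextBytes(IntPtr curScintilla) {

    // SCI_GETLENGTH returns the length in bytes; add one for the terminating NUL.
    int length = (int) Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
    byte[] buffer = new byte[length];

    unsafe {
        // Pin the array so Scintilla can write the raw bytes into it.
        fixed (byte* p = buffer) {
            Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, (IntPtr) p);
        }
    }

    // Decode the bytes explicitly as UTF-8 and drop the trailing NUL.
    return System.Text.Encoding.UTF8.GetString(buffer).TrimEnd('\0');
}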
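
The decoding step can be tried without Notepad++. The following is only a sketch: it assumes the buffer holds the document bytes followed by exactly one terminating NUL (as allocated above), and the names BufferText, DecodeUtf8Buffer, WriteUtf8NoBom and the temp file name are made up for illustration. It leaves out just the terminator instead of trimming every trailing NUL, and writes the result as UTF-8 without a BOM.

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferText
{
    // Decodes a buffer laid out as "text bytes + one terminating NUL" (assumed layout).
    public static string DecodeUtf8Buffer(byte[] buffer)
    {
        int textLength = Math.Max(0, buffer.Length - 1); // skip only the terminator
        return Encoding.UTF8.GetString(buffer, 0, textLength);
    }

    // Writes the text as UTF-8 without a byte order mark.
    public static void WriteUtf8NoBom(string path, string text)
    {
        File.WriteAllText(path, text, new UTF8Encoding(false));
    }

    public static void Main()
    {
        // "Straße" as UTF-8 (ß is the two bytes C3 9F), plus the terminating NUL.
        byte[] buffer = { 0x53, 0x74, 0x72, 0x61, 0xC3, 0x9F, 0x65, 0x00 };
        string text = DecodeUtf8Buffer(buffer);

        // Hypothetical output location, used only for this demo.
        string path = Path.Combine(Path.GetTempPath(), "buffer-demo.txt");
        WriteUtf8NoBom(path, text);

        Console.WriteLine(text.Length);                     // 6 characters
        Console.WriteLine(File.ReadAllBytes(path).Length);  // 7 bytes, no BOM
    }
}
```

Passing new UTF8Encoding(false) makes the "without BOM" choice explicit; Encoding.UTF8 would prepend a BOM when used with File.WriteAllText.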

Other suggestions

Alternative approach:

The reason for the broken UTF-8 characters is that this line

Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, sb);

reads the string using [MarshalAs(UnmanagedType.LPStr)], which decodes strings with your computer's default ANSI code page (MSDN). With a single-byte code page you get one character per byte, so every multi-byte UTF-8 sequence is split into several wrong characters.

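Now, to save the original UTF-8 bytes to disk, you simply need to use the same default ANSI encoding when writing the file. Encoding the string that way reverses the marshaller's decoding byte for byte. This relies on Encoding.Default being the ANSI code page, which holds on .NET Framework; on .NET Core and .NET 5+ it is UTF-8.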

File.WriteAllText(@"C:\Users\davet\BBBBBB.txt", sb.ToString(), Encoding.Default);
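
The effect described above can be reproduced in isolation. This is a rough sketch rather than what the marshaller literally does: ISO-8859-1 stands in for the machine's ANSI code page (an assumption, chosen because it is available on every .NET runtime), and the MojibakeDemo class and its method names are invented for the example. It decodes UTF-8 bytes one byte per character, then shows that encoding with the same single-byte encoding gives the original bytes back.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Imitates the LPStr decoding: every byte becomes one character.
    public static string DecodeAsSingleByte(byte[] utf8Bytes, Encoding singleByte)
    {
        return singleByte.GetString(utf8Bytes);
    }

    // Encoding with the same single-byte encoding restores the original bytes,
    // which can then be decoded correctly as UTF-8.
    public static string Repair(string wronglyDecoded, Encoding singleByte)
    {
        byte[] originalBytes = singleByte.GetBytes(wronglyDecoded);
        return Encoding.UTF8.GetString(originalBytes);
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 plays the role of the default ANSI code page.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        byte[] raw = Encoding.UTF8.GetBytes("Stra\u00DFe"); // "Straße"
        string broken = DecodeAsSingleByte(raw, latin1);

        Console.WriteLine(raw.Length);    // 7 bytes: ß takes two
        Console.WriteLine(broken.Length); // 7 characters instead of 6
        Console.WriteLine(Repair(broken, latin1) == "Stra\u00DFe"); // True
    }
}
```

The round trip only works when the single-byte encoding maps every byte value to a character and back; a multi-byte ANSI code page would not necessarily preserve the bytes.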
Licensed under: CC-BY-SA with attribution