Question

I have a .NET plugin which needs to get the text of the current buffer. I found this page, which shows a way to do it:

public static string GetDocumentText(IntPtr curScintilla)
{
    int length = (int)Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
    StringBuilder sb = new StringBuilder(length);
    Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, sb);
    return sb.ToString();
}

And that's fine, until we reach the character encoding issues. I have a buffer that is set in the Encoding menu to "UTF-8 without BOM", and I write that text to a file:

System.IO.File.WriteAllText(@"C:\Users\davet\BBBBBB.txt", sb.ToString());

When I open that file (in Notepad++), the Encoding menu shows "UTF-8 without BOM", but the ß character is broken (it appears as ÃŸ).

I was able to get as far as finding the encoding for my current buffer:

int currentBuffer = (int)Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_GETCURRENTBUFFERID, 0, 0);
Console.WriteLine("currentBuffer: " + currentBuffer);
int encoding = (int) Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_GETBUFFERENCODING, currentBuffer, 0);
Console.WriteLine("encoding = " + encoding);

And that shows "4" for "UTF-8 without BOM" and "0" for "ASCII", but I cannot find what Notepad++ or Scintilla thinks those values are supposed to represent.

So I'm a bit lost for where to go next (Windows not being my natural habitat). Anyone know what I'm getting wrong, or how to debug it further?

Thanks.


Solution

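Removing the StringBuilder fixes this problem. Read the raw bytes into a byte array and decode them as UTF-8 yourself, so the marshaller never gets a chance to apply the ANSI code page.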

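public static string GetDocumentTextBytes(IntPtr curScintilla) {

    // SCI_GETLENGTH excludes the terminating NUL that SCI_GETTEXT writes, hence the + 1.
    int length = (int) Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
    byte[] buffer = new byte[length];

    // Needs the /unsafe compiler option: pin the array so Scintilla can write raw bytes into it.
    unsafe {
        fixed (byte* p = buffer) {
            Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, (IntPtr) p);
        }
    }

    // Decode the bytes as UTF-8 ourselves and drop the terminating NUL.
    return System.Text.Encoding.UTF8.GetString(buffer).TrimEnd('\0');
}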
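
The method above always decodes as UTF-8, which is only right when the buffer really is UTF-8. As a rough sketch of one way to handle other buffers: Scintilla's SCI_GETCODEPAGE message reports 65001 (SC_CP_UTF8) for UTF-8 documents, so the decoding step could be chosen from that value. The class and method names below are made up for illustration, and falling back to Encoding.Default for every other code page is an assumption, not something the plugin API prescribes.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the Notepad++ plugin template.
public static class ScintillaTextDecoder
{
    // Value Scintilla returns from SCI_GETCODEPAGE for UTF-8 documents (SC_CP_UTF8).
    public const int ScCpUtf8 = 65001;

    // buffer:   raw bytes filled in by SCI_GETTEXT, including the terminating NUL
    // codePage: result of SCI_GETCODEPAGE for the same Scintilla handle
    public static string Decode(byte[] buffer, int codePage)
    {
        // Ignore trailing NUL bytes instead of trimming the decoded string.
        int length = buffer.Length;
        while (length > 0 && buffer[length - 1] == 0)
        {
            length--;
        }

        // Assumption: anything that is not UTF-8 is treated as the system ANSI
        // code page (Encoding.Default on .NET Framework).
        Encoding encoding = codePage == ScCpUtf8 ? Encoding.UTF8 : Encoding.Default;
        return encoding.GetString(buffer, 0, length);
    }
}
```

Inside the method above, the final line would then become something like ScintillaTextDecoder.Decode(buffer, codePage), with codePage obtained by sending SCI_GETCODEPAGE to curScintilla.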

OTHER TIPS

Alternative approach:

The UTF-8 characters break because this line:

Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, sb);

marshals the StringBuilder as [MarshalAs(UnmanagedType.LPStr)], which decodes the buffer with your computer's default ANSI code page (MSDN). Every byte becomes one character, so a multi-byte UTF-8 sequence is split apart: ß, stored as the two bytes C3 9F, comes back as two separate characters.
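
To see the effect in isolation, here is a small sketch that decodes UTF-8 bytes one character per byte. It uses ISO-8859-1 as a stand-in for the ANSI code page, because that encoding is available on every .NET runtime and maps each byte to the code point of the same value; on a machine whose ANSI code page is Windows-1252 the second character would display as Ÿ instead of a control character. The class and method names are invented for this example.

```csharp
using System;
using System.Text;

// Illustrative only: reproduces the "one character per byte" decoding.
public static class MojibakeDemo
{
    // Stand-in for the ANSI code page (assumption: a single-byte encoding).
    private static readonly Encoding OneCharPerByte = Encoding.GetEncoding("iso-8859-1");

    // Encode the text as UTF-8, then decode those bytes the wrong way.
    public static string DecodeAsSingleByte(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return OneCharPerByte.GetString(utf8);
    }
}
```

Calling MojibakeDemo.DecodeAsSingleByte("ß") gives back a two-character string, one character for byte C3 and one for byte 9F, while plain ASCII text passes through unchanged.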

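Now, to save the original UTF-8 bytes to disk, you simply need to use the same default ANSI encoding when writing the file. Encoding each character back to a single byte reproduces the bytes Scintilla handed over, provided the ANSI code page is a single-byte one that round-trips every byte value: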

File.WriteAllText(@"C:\Users\davet\BBBBBB.txt", sb.ToString(), Encoding.Default);
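
Why this works can be sketched as a round trip: re-encoding the misdecoded string with the same single-byte encoding yields the original bytes, which are valid UTF-8 again. The sketch below again uses ISO-8859-1 in place of Encoding.Default (an assumption made so that it behaves the same on any runtime), and the names are hypothetical.

```csharp
using System;
using System.Text;

// Illustrative only: undoes a "one character per byte" misdecoding.
public static class RoundTripDemo
{
    // Stand-in for the ANSI code page used by the marshaller (assumption).
    private static readonly Encoding OneCharPerByte = Encoding.GetEncoding("iso-8859-1");

    // Turn each character back into its original byte, then decode as UTF-8.
    public static string Recover(string misdecoded)
    {
        byte[] originalBytes = OneCharPerByte.GetBytes(misdecoded);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

File.WriteAllText with Encoding.Default performs only the first half of this (characters back to bytes); the file then holds the original UTF-8 bytes, which is why Notepad++ displays it correctly.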