Question

I have a .NET plugin for Notepad++ that needs to get the text of the current buffer. I found this page, which shows a way to do it:

public static string GetDocumentText(IntPtr curScintilla)
{
    int length = (int)Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
    StringBuilder sb = new StringBuilder(length);
    Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, sb);
    return sb.ToString();
}

And that's fine, until we reach the character encoding issues. I have a buffer that is set in the Encoding menu to "UTF-8 without BOM", and I write that text to a file:

System.IO.File.WriteAllText(@"C:\Users\davet\BBBBBB.txt", sb.ToString());

When I open that file in Notepad++, the Encoding menu shows "UTF-8 without BOM", but the ß character is broken (it shows up as ÃŸ).

I was able to get as far as finding the encoding for my current buffer:

int currentBuffer = (int)Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_GETCURRENTBUFFERID, 0, 0);
Console.WriteLine("currentBuffer: " + currentBuffer);
int encoding = (int) Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_GETBUFFERENCODING, currentBuffer, 0);
Console.WriteLine("encoding = " + encoding);

That prints "4" for "UTF-8 without BOM" and "0" for "ASCII", but I cannot find what Notepad++ or Scintilla thinks those values are supposed to represent.

So I'm a bit lost for where to go next (Windows not being my natural habitat). Anyone know what I'm getting wrong, or how to debug it further?

Thanks.

Solution

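Removing the StringBuilder fixes this problem. Read the text into a byte array instead, so that no marshaling conversion touches it, and decode the bytes as UTF-8 yourself: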

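// Must be compiled with /unsafe. Assumes the buffer holds UTF-8 text.
public static string GetDocumentTextBytes(IntPtr curScintilla)
{
    // SCI_GETLENGTH returns the document length in bytes; add 1 for the NUL terminator.
    int length = (int)Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
    byte[] buffer = new byte[length];

    unsafe
    {
        // Pin the array so Scintilla can write the raw bytes straight into it.
        fixed (byte* p = buffer)
        {
            Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, (IntPtr)p);
        }
    }

    // Decode the bytes as UTF-8 ourselves, leaving out the trailing NUL terminator.
    return System.Text.Encoding.UTF8.GetString(buffer, 0, length - 1);
}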
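
The snippet above assumes the buffer is UTF-8. As a rough sketch of how the decoding step might be generalized, the helper below picks the encoding from the Scintilla code page. ScintillaTextDecoder and Decode are hypothetical names, not part of the plugin template. The codePage argument is assumed to come from SCI_GETCODEPAGE, which reports SC_CP_UTF8 (65001) for UTF-8 buffers. Treating every other value as the system ANSI code page via Encoding.Default is an assumption, and it only holds on .NET Framework, since newer .NET runtimes define Encoding.Default as UTF-8.

```csharp
using System;
using System.Text;

static class ScintillaTextDecoder
{
    // Scintilla's code page constant for UTF-8 buffers.
    public const int SC_CP_UTF8 = 65001;

    // Hypothetical helper: raw is the buffer filled by SCI_GETTEXT,
    // byteCount is the document length in bytes (without the NUL terminator),
    // codePage is assumed to be the value returned by SCI_GETCODEPAGE.
    public static string Decode(byte[] raw, int byteCount, int codePage)
    {
        if (raw == null) throw new ArgumentNullException(nameof(raw));

        // Clamp the count so a bad length can never read past the buffer.
        int count = Math.Max(0, Math.Min(byteCount, raw.Length));

        // Assumption: anything that is not UTF-8 is in the system ANSI code page.
        Encoding encoding = codePage == SC_CP_UTF8 ? Encoding.UTF8 : Encoding.Default;
        return encoding.GetString(raw, 0, count);
    }

    static void Main()
    {
        // The two UTF-8 bytes of U+00DF (sharp s) followed by a NUL terminator.
        byte[] raw = { 0xC3, 0x9F, 0x00 };
        string text = Decode(raw, raw.Length - 1, SC_CP_UTF8);
        Console.WriteLine(text == "\u00DF"); // prints "True"
    }
}
```

Passing the byte count explicitly, rather than trimming '\0' characters afterwards, keeps the terminator out of the decoded string without touching any characters that belong to the document.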

Other tips

Alternative approach:

The reason for the broken UTF-8 characters is that this line:

Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, sb);

reads the string using [MarshalAs(UnmanagedType.LPStr)], which decodes strings with your computer's default ANSI encoding (MSDN). You therefore get a string with one character per byte, so each multi-byte UTF-8 character is split into several unrelated characters.

Now, to save the original UTF-8 bytes to disk, write the file with that same default ANSI encoding, which turns each character back into the byte it came from:

File.WriteAllText(@"C:\Users\davet\BBBBBB.txt", sb.ToString(), Encoding.Default);
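
To make the effect concrete, here is a small sketch that simulates both outcomes without Notepad++. AnsiRoundTripDemo and its methods are hypothetical names, and ISO-8859-1 is used only as a stand-in for the machine's ANSI code page (an assumption chosen because it is available on every .NET runtime and maps each byte to exactly one character). The sketch compares the bytes that File.WriteAllText would write by default (UTF-8) with the bytes produced when the same single-byte encoding is used for writing.

```csharp
using System;
using System.Text;

static class AnsiRoundTripDemo
{
    // Stand-in for the system ANSI code page (assumption for this demo only).
    static readonly Encoding Ansi = Encoding.GetEncoding("iso-8859-1");

    // Simulates what LPStr marshaling does: one character per byte.
    public static string DecodeAsAnsi(byte[] raw)
    {
        return Ansi.GetString(raw);
    }

    // Simulates writing the string back out with the same single-byte encoding.
    public static byte[] EncodeAsAnsi(string text)
    {
        return Ansi.GetBytes(text);
    }

    static void Main()
    {
        // "Straße" as the UTF-8 bytes Scintilla would hand back.
        byte[] original = Encoding.UTF8.GetBytes("Stra\u00DFe");
        string marshaled = DecodeAsAnsi(original);

        // Default File.WriteAllText behaviour: encode the string as UTF-8.
        byte[] broken = Encoding.UTF8.GetBytes(marshaled);
        // Suggested fix: encode with the same single-byte encoding.
        byte[] restored = EncodeAsAnsi(marshaled);

        Console.WriteLine(BitConverter.ToString(original)); // 53-74-72-61-C3-9F-65
        Console.WriteLine(BitConverter.ToString(broken));   // 53-74-72-61-C3-83-C2-9F-65
        Console.WriteLine(BitConverter.ToString(restored)); // 53-74-72-61-C3-9F-65
    }
}
```

The round trip works here because the stand-in encoding maps every byte to a distinct character. Whether a given real ANSI code page does the same for every byte is not something this sketch checks, and Encoding.Default only means the ANSI code page on .NET Framework.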

License: CC-BY-SA with attribution
Not affiliated with StackOverflow