Question

I would like to know whether there is an easy way to detect if the text on the clipboard is encoded in ISO 8859 or UTF-8.

Here is my current code:

    COleDataObject obj;

    if (obj.AttachClipboard())
    {
        if (obj.IsDataAvailable(CF_TEXT))
        {
            HGLOBAL hmem = obj.GetGlobalData(CF_TEXT);
            CMemFile sf((BYTE*) ::GlobalLock(hmem), (UINT) ::GlobalSize(hmem));
            CString buffer;

            // Copy the raw clipboard bytes into the CString buffer
            LPSTR str = buffer.GetBufferSetLength((int) ::GlobalSize(hmem));
            sf.Read(str, (UINT) ::GlobalSize(hmem));
            ::GlobalUnlock(hmem);

            // GlobalSize can exceed the text length; trim at the first NUL
            buffer.ReleaseBuffer();

            // this is my string class
            s->SetEncoding(ENCODING_8BIT);
            s->SetString(buffer);
        }
    }

Solution

Check out the definition of CF_LOCALE in Microsoft's documentation of the standard clipboard formats. It tells you the locale of the text on the clipboard. Better yet, if you request CF_UNICODETEXT instead, Windows will convert the text to UTF-16 for you.

OTHER TIPS

UTF-8 has a defined structure for non-ASCII bytes. You can scan for bytes >= 128, and if any are detected, check if they form a valid UTF-8 string.

The valid UTF-8 byte formats can be found on Wikipedia:

Unicode             Byte1           Byte2           Byte3           Byte4
U+000000-U+00007F   0xxxxxxx
U+000080-U+0007FF   110xxxxx        10xxxxxx
U+000800-U+00FFFF   1110xxxx        10xxxxxx        10xxxxxx
U+010000-U+10FFFF   11110xxx        10xxxxxx        10xxxxxx        10xxxxxx
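The table above can be turned into a small structural check. This is only a sketch: the function name `looksLikeUtf8` is made up for illustration, and it checks lead and continuation byte patterns only. It does not reject overlong encodings, surrogates, or code points above U+10FFFF, so a stricter validator would need extra range checks.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if every byte in `s` fits the
// lead/continuation patterns from the table (structure only).
bool looksLikeUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;                        // continuation bytes expected
        if (b < 0x80)                extra = 0;   // 0xxxxxxx
        else if ((b & 0xE0) == 0xC0) extra = 1;   // 110xxxxx
        else if ((b & 0xF0) == 0xE0) extra = 2;   // 1110xxxx
        else if ((b & 0xF8) == 0xF0) extra = 3;   // 11110xxx
        else return false;                        // stray 10xxxxxx or invalid lead

        if (extra > s.size() - i - 1)
            return false;                         // sequence is cut off

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;                     // not a 10xxxxxx byte
        }
        i += extra + 1;
    }
    return true;
}
```

Text that passes this check and contains bytes >= 128 is very likely UTF-8, since typical ISO 8859 text with accented characters rarely happens to form valid multi-byte sequences. It remains a heuristic, not a proof.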

old answer:

You don't have to -- all ASCII text is valid UTF-8, so you can just decode it as UTF-8 and it will work as expected.

To test if it contains non-ASCII characters, you can scan for bytes >= 128.
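A minimal sketch of that scan, assuming the clipboard bytes have already been copied into a `std::string` (the name `isPureAscii` is hypothetical):

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: true if no byte has its high bit set (all bytes < 128).
bool isPureAscii(const std::string& s)
{
    return std::all_of(s.begin(), s.end(), [](char c) {
        // char may be signed, so compare as unsigned char
        return static_cast<unsigned char>(c) < 0x80;
    });
}
```

If this returns true, the ISO 8859 versus UTF-8 question does not matter for that text, because both encodings agree on the ASCII range.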

I may be mistaken, but I think you cannot. If I open a UTF-8 file without a BOM in my editor, it is displayed by default as ISO-8859-1 (my locale), and apart from some strange-looking runs of accented characters that are foreign to me, I have no strong visual hint that it is UTF-8 (unless the encoding is declared in some other way, e.g. a charset declaration in HTML or XML): it is perfectly valid ANSI text.
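To illustrate why UTF-8 bytes still read as valid 8-bit text, here is a small sketch that decodes bytes as ISO-8859-1, where each byte maps to the code point of the same value. The function name `decodeLatin1` is made up for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: interpret each byte as one ISO-8859-1 character.
// Every byte value 0..255 is accepted, so this can never fail.
std::vector<char32_t> decodeLatin1(const std::string& bytes)
{
    std::vector<char32_t> out;
    out.reserve(bytes.size());
    for (char c : bytes)
        out.push_back(static_cast<unsigned char>(c));
    return out;
}
```

For instance, the UTF-8 encoding of "é" is the two bytes 0xC3 0xA9. Decoded this way, they become the two code points U+00C3 and U+00A9, which display as "Ã©": odd-looking, but not invalid.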

John wrote "all ASCII text is valid UTF-8", but the problem lies in the other direction: UTF-8 text is also valid 8-bit (ANSI) text.

Windows XP and later use UTF-16 natively and have a clipboard format for it, but AFAIK they simply ignore UTF-8, with no special treatment for it.
(Well, there actually is an API to convert UTF-8 to UTF-16 (or ANSI, etc.), for example MultiByteToWideChar with CP_UTF8.)

You could call obj.IsDataAvailable(CF_UNICODETEXT) to check whether a Unicode version of what's on the clipboard is available.

-Adam

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow