How to discover what codepage to use when converting RTF hex literals to Unicode

https://stackoverflow.com/questions/3787837

05-10-2019

Question

I'm parsing RTF 1.5+ files generated by Word 2003+ that may contain content in other languages. This content is usually encoded as hex literals (\'xx). I would like to convert these literals to Unicode values.

I know my document's code page by looking for ansicpg (\ansi\ansicpg1252).

When I use the ansicpg codepage to decode to Unicode, many languages (like French) seem to convert to the Unicode char values that I expect.

However, when I see Russian text (like below), codepage 1252 decodes the content to gibberish.

\f277\lang1049\langfe1033\langnp1049\insrsid5989826\charrsid6817286 \'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb \'e1\'e5\'e7 \'ed\'e0\'e7\'e2\'e0\'ed\'e8\'ff. \'dd\'f2 \'e0 \'f1\'f2\'f0\'e0\'ed\'e8\'f6\'e0 \'ed\'e5 \'e4\'ee\'eb\'e6\'ed\'e0 \'ee\'f2\'ee\'e1\'f0\'e0\'e6\'e0\'f2\'fc\'f1\'ff \'e2 \'f2\'e0\'e1\'eb\'e8\'f6\'e5 \'e2 \'f1\'ee\'e4\'e5\'f0\'e6\'e0\'ed\'e8\'e8.

I assume that lang1049, langfe1033, langnp1049 should provide me clues so I can programmatically choose a different (non-default) code page for the text that they reference? If so, where can I find information that explains how to map a lang* code to a codepage? Or should I be looking for some other RTF command/directive to provide me with the information I'm looking for? (Or must I use \f277 as a font reference and see if it has an associated codepage?)

Solution

\lang really only marks up particular stretches of the text as being in a particular language, and shouldn't impact what code page is to be used for the old non-Unicode \' escapes.

Putting an \ansicpg token in the header should perhaps do it, but it seems to be ignored by Word (for both raw bytes and \' escapes).

"Or must I use \f277 as a font reference and see if it has an associated codepage?"

It looks that way. Changing the \fcharset of the font assigned to a particular stretch of text is the only way I can get Word to change how it treats the bytes, anyway. The codes used in this token (see e.g. here for a list) are, aggravatingly, different again from both the language ID and the code page number.
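As a rough illustration of the font-based approach, here is a minimal Python sketch. It assumes the font table entry for \f277 carries \fcharset204 (the question does not show the font table, so the table below is hypothetical), and it uses the commonly documented \fcharset to Windows code page correspondences. The helper names font_codecs and decode_hex_escapes are made up for this example, and the regexes are a simplification, not a full RTF parser.

```python
import re

# \fcharset value -> Python codec for the corresponding Windows code page.
# Only the common values are listed; anything else falls back to a default.
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI (Western European)
    128: "cp932",   # Shift-JIS (Japanese)
    129: "cp949",   # Hangul (Korean)
    134: "cp936",   # GB2312 (Simplified Chinese)
    136: "cp950",   # Big5 (Traditional Chinese)
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    163: "cp1258",  # Vietnamese
    177: "cp1255",  # Hebrew
    178: "cp1256",  # Arabic
    186: "cp1257",  # Baltic
    204: "cp1251",  # Russian (Cyrillic)
    222: "cp874",   # Thai
    238: "cp1250",  # Eastern European
}

# Matches "\fN ... \fcharsetM" inside a single font table entry.
FONT_ENTRY = re.compile(r"\\f(\d+)[^{};]*?\\fcharset(\d+)")
# Matches a run of consecutive \'xx escapes so multi-byte sequences stay together.
HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")


def font_codecs(font_table, default="cp1252"):
    """Map each font number in the font table to a codec name."""
    return {
        int(font): FCHARSET_TO_CODEC.get(int(charset), default)
        for font, charset in FONT_ENTRY.findall(font_table)
    }


def decode_hex_escapes(text, codec):
    """Replace runs of \\'xx escapes with the text they encode in `codec`."""
    def replace(match):
        raw = bytes.fromhex(match.group(0).replace("\\'", ""))
        return raw.decode(codec, errors="replace")
    return HEX_RUN.sub(replace, text)


# Hypothetical font table; the real definition of \f277 is not in the question.
font_table = r"{\fonttbl{\f0\froman\fcharset0 Times New Roman;}{\f277\fswiss\fcharset204 Arial Cyr;}}"
codecs = font_codecs(font_table)

sample = r"\'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb \'e1\'e5\'e7"
print(decode_hex_escapes(sample, codecs[277]))  # -> Страницы без
```

A real parser would track the current font through group scopes ({ ... }) and \plain resets, and apply the matching codec to each run of text. The sketch only shows the lookup and decode step.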

OTHER TIPS

It is not obvious, but according to MSDN you can use the RichEdit control to convert the RTF to UTF-8: http://msdn.microsoft.com/en-us/library/windows/desktop/bb774304(v=vs.85).aspx Take a look at the SF_USECODEPAGE flag for the EM_STREAMOUT message.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow