Converting UTF to RTF escape sequences in .NET

https://stackoverflow.com/questions/19174872

30-06-2022

Question

I have some UTF Cyrillic text that needs to be inserted in an RTF file. RTF files tend to store Cyrillic text as escape sequences, either using \'00 or \u0000.

Since the text is in .NET, I'm guessing it's UTF-16. As a specific example, I have this text "4 окт 2013". The "окт" part is Cyrillic.

Using the о as an example, the Unicode decimal is 1086 and the UTF-8 decimal is 208 190.
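The two figures above can be checked with a short sketch. The class and method names here (`CodePointDemo`, `Utf16Value`, `Utf8Decimals`) are made up for illustration; only `Encoding.UTF8.GetBytes` and the `char`-to-`int` conversion come from .NET itself.

```csharp
using System;
using System.Linq;
using System.Text;

public static class CodePointDemo
{
    // A .NET char is one UTF-16 code unit; converting it to int gives its decimal value.
    public static int Utf16Value(char c) => c;

    // Encodes the string as UTF-8 and lists each byte as a decimal number.
    public static string Utf8Decimals(string s) =>
        string.Join(" ", Encoding.UTF8.GetBytes(s).Select(b => b.ToString()));

    public static void Main()
    {
        // '\u043E' is the Cyrillic small letter o from the example text.
        Console.WriteLine(Utf16Value('\u043E'));    // prints 1086
        Console.WriteLine(Utf8Decimals("\u043E"));  // prints 208 190
    }
}
```

The `\uN` form in RTF uses the first number (the UTF-16 value), not the UTF-8 bytes.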

What I would like to do is have a Regex (in .NET) to recognise characters like this, which need to be converted to RTF escape sequences because they cannot be natively recognised.

What Regex options are available in .NET to assist with recognising characters like this?

Solution

I was able to use a Regex that matches every character outside Basic Latin and converts it to an RTF Unicode escape sequence.

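// Requires: using System.Text.RegularExpressions;

// Matches any single UTF-16 code unit outside the Basic Latin block (U+0000 to U+007F).
const string RTFSpecialsInUTF = @"(\P{IsBasicLatin})";

private static readonly Regex UTFSpecialRegex = new Regex(RTFSpecialsInUTF, RegexOptions.Compiled);

private static string ReplaceDirect(Match match) {
    int codepoint = (int)Convert.ToChar(match.Groups[1].Value);
    // RTF's \u takes a signed 16-bit decimal, so values of 32768 and above are written as negatives.
    if (codepoint >= 32768) {
        codepoint = codepoint - 65536;
    }
    // The trailing '?' is the fallback shown by readers that do not understand \u.
    return string.Format("\\u{0}?", codepoint);
}

/* Usage */
value = UTFSpecialRegex.Replace(value, new MatchEvaluator(PDFDocumentRTF.ReplaceDirect));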
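To see the effect on the sample text from the question, here is a minimal self-contained sketch of the same idea. The class name `RtfUnicodeSketch` and the method name `Escape` are placeholders, not part of the original code. The cast to `short` stands in for the manual subtraction of 65536, and the number is formatted with the invariant culture as a precaution.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RtfUnicodeSketch
{
    // Matches each UTF-16 code unit outside the Basic Latin block (U+0000 to U+007F).
    private static readonly Regex NonBasicLatin =
        new Regex(@"\P{IsBasicLatin}", RegexOptions.Compiled);

    public static string Escape(string value)
    {
        return NonBasicLatin.Replace(value, m =>
        {
            // Reinterpret the code unit as a signed 16-bit number, which is what \u expects.
            short signed = unchecked((short)m.Value[0]);
            // '?' is the fallback character for readers without \u support.
            return "\\u" + signed.ToString(CultureInfo.InvariantCulture) + "?";
        });
    }

    public static void Main()
    {
        // "4 окт 2013", written with escapes so the source file encoding does not matter.
        Console.WriteLine(Escape("4 \u043E\u043A\u0442 2013"));
        // prints: 4 \u1086?\u1082?\u1090? 2013
    }
}
```

Digits, spaces and other Basic Latin characters pass through unchanged; only the three Cyrillic letters are rewritten.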

Keeping my fingers crossed that this will work for other languages that don't fit into Basic Latin and RTF very well (like Arabic).
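As a rough check for other scripts, the sketch below runs the same conversion on an Arabic letter, on a character at or above 32768, and on a character outside the Basic Multilingual Plane. It uses a plain loop instead of the Regex above, purely to keep the example short; `RtfHighCodeUnits` and `Escape` are hypothetical names. The assumption that an RTF reader accepts a surrogate pair written as two consecutive `\u` words should be verified against the target application.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class RtfHighCodeUnits
{
    public static string Escape(string value)
    {
        var sb = new StringBuilder();
        foreach (char c in value)
        {
            // Basic Latin (below U+0080) is copied through untouched.
            if (c < 0x80) { sb.Append(c); continue; }

            // Code units of 32768 and above become negative when read as signed 16-bit.
            short signed = unchecked((short)c);
            sb.Append("\\u")
              .Append(signed.ToString(CultureInfo.InvariantCulture))
              .Append('?');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Escape("\u0628"));      // Arabic letter beh: \u1576?
        Console.WriteLine(Escape("\uFB50"));      // 64336 - 65536:     \u-1200?
        Console.WriteLine(Escape("\U0001F600"));  // surrogate pair:    \u-10179?\u-8704?
    }
}
```

Ordinary Arabic letters sit well below 32768, so they come out as positive numbers just like Cyrillic. Only presentation forms and surrogates reach the negative range.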

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow