
I am trying to read an RTF file and extract the characters in it. For example, below is the RTF version of ф:

{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0{\fonttbl {\f0\fswiss\fcharset0 Arial;} {\f1\fmodern Courier New;} {\f2\fnil\fcharset2 Symbol;} {\f3\fmodern\fcharset0 Courier New;} {\f4\fswiss\fcharset204 Arial;}} {\colortbl\red0\green0\blue0;\red0\green0\blue255;} \uc1\pard\plain\deftab360 \f0\fs20 \htmlrtf{\f4\fs20\htmlrtf0 \'f4\htmlrtf\f0}\htmlrtf0 \par }

As you can see, the encoding declared in it is Windows-1252 (\ansicpg1252). Here is my code:

use strict;
use warnings;
use utf8;
use Encode qw(decode);

binmode(STDOUT, ":encoding(UTF-8)");
my $runtime = chr(0x0444);    # U+0444, the character I expect
print "theta || ".$runtime." ||";

my $hexstr = "0xF4";          # the byte written as \'f4 in the RTF
my $num = hex $hexstr;
my $byte = pack("C", $num);   # a single byte; pack("N", ...) would prepend three NUL bytes
$runtime = decode("cp1252", $byte);
print "\n".$runtime."\n";     # prints ô

$runtime = decode("cp1251", $byte);
print "\n".$runtime."\n";     # prints ф


theta || ф ||


With cp1252 I get ô. Am I missing something? I wanted to take the encoding from the RTF itself. I expected ф to be printed, but ô was printed instead.

Solution

While the global codepage for the document is cp1252, there are local definitions that override it:

  • The byte 0xF4 (written \'f4 in RTF) is output with font f4: {\f4...\'f4.
  • But the definition for font f4 is: {\f4\fswiss\fcharset204 Arial;}
  • \fcharset204 sets the charset for this font to 204, i.e. Russian, which corresponds to codepage 1251.

And with codepage 1251 you get the expected character ф.
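
The lookup described above can be sketched in Perl roughly as follows. This is a minimal sketch, not a real RTF parser: the helper names font_encodings and decode_hex_escapes are made up for illustration, the \fcharset table is deliberately partial, and the code assumes the text switches fonts explicitly with \fN (as the sample does) rather than tracking {...} group scope.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Partial \fcharsetN -> Encode name table; only a few common values are listed.
my %CHARSET_TO_ENCODING = (
    0   => 'cp1252',   # ANSI
    161 => 'cp1253',   # Greek
    162 => 'cp1254',   # Turkish
    186 => 'cp1257',   # Baltic
    204 => 'cp1251',   # Russian
    238 => 'cp1250',   # Eastern European
);

# Hypothetical helper: map font numbers from the font table to encodings.
sub font_encodings {
    my ($rtf) = @_;
    my %encoding_for;
    while ($rtf =~ /\{\\f(\d+)[^{}]*?\\fcharset(\d+)/g) {
        $encoding_for{$1} = $CHARSET_TO_ENCODING{$2}
            if exists $CHARSET_TO_ENCODING{$2};
    }
    return %encoding_for;
}

# Hypothetical helper: decode every \'hh escape with the encoding of the
# font that was selected most recently. Group scope is NOT handled.
sub decode_hex_escapes {
    my ($rtf, $default_encoding) = @_;
    my %encoding_for = font_encodings($rtf);
    my $current = $default_encoding;
    my $text = '';
    while ($rtf =~ /\\f(\d+)|\\'([0-9a-fA-F]{2})/g) {
        if (defined $1) {
            # Font switch: fall back to the document codepage if unknown.
            $current = $encoding_for{$1} // $default_encoding;
        } else {
            # Hex escape: one byte in the current font's codepage.
            $text .= decode($current, chr hex $2);
        }
    }
    return $text;
}

# Shortened version of the sample from the question.
my $rtf = <<'RTF';
{\rtf1\ansi\ansicpg1252{\fonttbl{\f0\fswiss\fcharset0 Arial;}{\f4\fswiss\fcharset204 Arial;}}\f0\fs20 {\f4\fs20 \'f4\f0}\par }
RTF

binmode(STDOUT, ':encoding(UTF-8)');
print decode_hex_escapes($rtf, 'cp1252'), "\n";   # prints ф
```

For real documents a proper RTF tokenizer with a group stack would be the safer choice, since a font selected inside {...} stops applying when the group closes.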

BTW, codepage 1252 is similar to Latin-1 and does not contain the character ф at all.
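
One way to check this yourself is to try encoding ф into each codepage with Encode's strict mode. The helper name byte_for below is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the single byte for U+0444 in the given
# encoding, or undef if the encoding has no mapping for it.
sub byte_for {
    my ($encoding) = @_;
    my $ef = "\x{0444}";   # CYRILLIC SMALL LETTER EF; fresh copy per call
    # FB_CROAK makes encode() die instead of substituting a placeholder.
    return eval { encode($encoding, $ef, Encode::FB_CROAK) };
}

for my $encoding (qw(cp1251 cp1252)) {
    my $byte = byte_for($encoding);
    if (defined $byte) {
        printf "%s: 0x%02X\n", $encoding, ord $byte;
    } else {
        print "$encoding: no mapping for U+0444\n";
    }
}
# prints:
# cp1251: 0xF4
# cp1252: no mapping for U+0444
```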

