Question

I am trying to read an RTF file and extract the characters in it. For example, below is the RTF representation of ф:

{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0{\fonttbl {\f0\fswiss\fcharset0 Arial;} {\f1\fmodern Courier New;} {\f2\fnil\fcharset2 Symbol;} {\f3\fmodern\fcharset0 Courier New;} {\f4\fswiss\fcharset204 Arial;}} {\colortbl\red0\green0\blue0;\red0\green0\blue255;} \uc1\pard\plain\deftab360 \f0\fs20 \htmlrtf{\f4\fs20\htmlrtf0 \'f4\htmlrtf\f0}\htmlrtf0 \par }

As you can see, the encoding declared for the document is Windows-1252 (\ansicpg1252).

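#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw(decode);

binmode(STDOUT, ":encoding(UTF-8)");

# Reference value: U+0444 is the character expected from the RTF.
my $runtime = chr(0x0444);
print "theta || " . $runtime . " ||";

# The byte behind the RTF escape \'f4. pack("C") yields exactly one byte;
# pack("N") would yield four bytes ("\x00\x00\x00\xF4").
my $hexstr = "0xF4";
my $num    = hex $hexstr;
my $byte   = pack("C", $num);

# Decode the byte as Windows-1252.
$runtime = decode("cp1252", $byte);
print "\n" . $runtime . "\n";

# Decode the same byte as Windows-1251.
$runtime = decode("cp1251", $byte);
print "\n" . $runtime . "\n";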

Output

theta || ф ||
ô

ф

As you can see, with cp1252 I get ô. Am I missing something? I wanted to take the encoding from the RTF itself, so I expected ф to be printed, but got ô instead.


Solution

While the global code page for the document is cp1252, individual fonts can override it locally:

  • The escape \'f4 (byte 0xF4) is written with font f4: {\f4...\'f4.
  • But the definition for font f4 is: {\f4\fswiss\fcharset204 Arial;}
  • \fcharset204 sets the charset for this font to 204, i.e. Russian, which corresponds to code page 1251 (according to http://msdn.microsoft.com/en-us/library/cc194829.aspx)

And with codepage 1251 you get the expected character ф.
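The lookup described above could be sketched roughly like this. It is only a minimal illustration, not a real RTF parser: the helper name decode_rtf_hex and the charset table are my own assumptions, the table is deliberately partial, and group scoping ({ ... }) is ignored, so a font switch simply stays active until the next \fN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

binmode(STDOUT, ":encoding(UTF-8)");

# Partial, illustrative map from \fcharsetN to Encode names (extend as needed).
my %CHARSET_TO_ENCODING = (
    0   => "cp1252",    # ANSI / Western European
    161 => "cp1253",    # Greek
    162 => "cp1254",    # Turkish
    204 => "cp1251",    # Russian (Cyrillic)
    238 => "cp1250",    # Eastern European
);

# Hypothetical helper: decode every \'hh escape with the charset of the
# font that is active at that point in the text.
sub decode_rtf_hex {
    my ($rtf) = @_;

    # Document-wide defaults taken from \ansicpgN and \deffN.
    my ($cpg)  = $rtf =~ /\\ansicpg(\d+)/;
    my ($font) = $rtf =~ /\\deff(\d+)/;
    my $default = "cp" . ($cpg // 1252);
    $font //= 0;

    # Font table entries such as {\f4\fswiss\fcharset204 Arial;}
    my %enc_for;
    while ($rtf =~ /\{\\f(\d+)[^{}]*?\\fcharset(\d+)[^{}]*\}/g) {
        $enc_for{$1} = $CHARSET_TO_ENCODING{$2}
            if exists $CHARSET_TO_ENCODING{$2};
    }

    # Drop the font table so its \fN entries are not read as font switches.
    (my $body = $rtf) =~ s/\{\\fonttbl(?:\s*\{[^{}]*\})*\s*\}//;

    my $out = "";
    while ($body =~ /\\f(\d+)|\\'([0-9a-fA-F]{2})/g) {
        my ($new_font, $hex) = ($1, $2);
        if (defined $new_font) {
            $font = $new_font;    # font switch: remember it for later escapes
            next;
        }
        # Unknown fonts or charsets fall back to the document code page.
        $out .= decode($enc_for{$font} // $default, pack("C", hex $hex));
    }
    return $out;
}

# Cut-down version of the document from the question.
my $sample = <<'RTF';
{\rtf1\ansi\ansicpg1252\deff0{\fonttbl {\f0\fswiss\fcharset0 Arial;} {\f4\fswiss\fcharset204 Arial;}} \f0\fs20 {\f4\fs20 \'f4} \par }
RTF

print decode_rtf_hex($sample), "\n";    # prints ф
```

The key point is that the encoding is chosen per font, with \ansicpg used only as the fallback when a font has no usable \fcharset.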

BTW, code page 1252 is similar to Latin-1 and does not contain the character ф at all (see http://en.wikipedia.org/wiki/Windows-1252)
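If you want to check this yourself, a small sketch with Encode (variable names are just for illustration) is to encode ф into both code pages and ask Encode to fail instead of silently substituting a replacement character:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $ef = "\x{0444}";    # ф, CYRILLIC SMALL LETTER EF

# cp1251 has a slot for ф: it encodes to the single byte 0xF4.
my $cyrillic = encode("cp1251", $ef);
printf "cp1251: 0x%02X\n", ord $cyrillic;

# cp1252 has none. FB_CROAK makes encode() die rather than substitute,
# and LEAVE_SRC keeps $ef untouched; eval turns the failure into undef.
my $western = eval {
    encode("cp1252", $ef, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
print "cp1252: ", (defined $western ? "encoded" : "not representable"), "\n";
```

So no byte decoded as cp1252 can ever produce ф; the only way to get it from \'f4 is to decode with the font's own code page.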

Licensed under: CC-BY-SA with attribution