Question

I'm processing some data from XML files in Perl and want to use the FIFO module File::Queue to divide and speed up the work. One Perl script parses the XML file and prepares JSON output for another script:

#!/usr/bin/perl -w
binmode STDOUT, ":utf8";
use utf8;
use strict;
use XML::Rules;
use JSON;
use File::Queue;

# do the XML magic: %data contains the result

my $q = File::Queue->new(File => './importqueue', Mode => 0666);
my $json = JSON->new;
my $qItem = $json->allow_nonref->encode(\%data);
$q->enq($qItem);

As long as %data contains only numeric and a-z data, this works fine. But as soon as one of the wide characters occurs (e.g. ł, ą, ś, ż), I get: Wide character in syswrite at /usr/lib/perl/5.10/IO/Handle.pm line 207.

I tried to check whether the string is valid UTF-8:

print utf8::is_utf8($qItem) . ':' . utf8::valid($qItem);

and I did get 1:1, so yes, I seem to have a correct UTF-8 string.

I have found out that the reason could be that syswrite gets a filehandle to the queue file that is not marked as a :utf8 handle.

Am I right? If so, is there any way to force File::Queue to use a :utf8 filehandle? Or maybe File::Queue is not the best choice: should I use something else to create a FIFO queue between two Perl scripts?


Solution

utf8::is_utf8 does not tell you whether your string is encoded using UTF-8 or not. (That information is not even available.)

>perl -MEncode -E"say utf8::is_utf8(encode_utf8(chr(0xE9))) || 0"
0

utf8::valid does not tell you whether your string is valid UTF-8 or not.

>perl -MEncode -E"say utf8::valid(qq{\xE9}) || 0"
1

Both check some internal storage details. You should never have a need for either.


File::Queue can only transmit strings of bytes. It's up to you to serialise the data you want to transmit into a string.

The primary means of serialising text is character encoding, or just encoding for short. UTF-8 is a character encoding.

For example, the string

dostępu

consists of the following characters (each a Unicode code point, shown in hex):

64 6F 73 74 119 70 75

Not all of those characters fit in a byte (U+0119 is greater than 0xFF), so the string can't be sent using File::Queue. If you were to encode that string using UTF-8, you'd get a string composed of the following characters:

64 6F 73 74 C4 99 70 75

Those characters all fit in a byte, so that string can be sent using File::Queue.
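
A quick way to see this for yourself with the Encode module (a sketch I'm adding; the hex-dump helpers are mine, not part of the question):

 use strict;
 use warnings;
 use utf8;                       # the source below contains a literal wide character
 use Encode qw(encode_utf8);

 my $str   = "dostępu";          # 7 characters, one of them above 0xFF
 my $bytes = encode_utf8($str);  # 8 characters, all of them below 0x100

 print join(' ', map { sprintf '%X', ord } split //, $str),   "\n";  # 64 6F 73 74 119 70 75
 print join(' ', map { sprintf '%X', ord } split //, $bytes), "\n";  # 64 6F 73 74 C4 99 70 75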


JSON, as you used it, returns strings of Unicode code points. As such, you need to apply a character encoding.

File::Queue doesn't provide an option to automatically encode strings for you, so you'll have to do it yourself.

You could use encode_utf8 and decode_utf8 from the Encode module:

 use Encode qw(encode_utf8 decode_utf8);

 my $json = JSON->new->allow_nonref;
 $q->enq(encode_utf8($json->encode(\%data)));       # encode chars to UTF-8 bytes before enqueueing
 my $data = $json->decode(decode_utf8($q->deq()));  # decode bytes back to chars after dequeueing

or you can let JSON do the encoding/decoding for you.

 my $json = JSON->new->utf8->allow_nonref;  # ->utf8: encode produces and decode accepts UTF-8 bytes
 $q->enq($json->encode(\%data));
 my $data = $json->decode($q->deq());
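
For completeness, the receiving script would mirror this. A minimal consumer sketch, assuming the same './importqueue' path from the question and that deq() returns undef when the queue is empty:

 #!/usr/bin/perl
 use strict;
 use warnings;
 use JSON;
 use File::Queue;

 my $q    = File::Queue->new(File => './importqueue', Mode => 0666);
 my $json = JSON->new->utf8->allow_nonref;

 while (defined(my $item = $q->deq())) {
     my $data = $json->decode($item);  # $item holds UTF-8 bytes; decode yields characters
     # ... process $data ...
 }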

Other tips

Looking at the docs...

perldoc -f syswrite

    WARNING: If the filehandle is marked ":utf8", Unicode characters
    encoded in UTF-8 are written instead of bytes, and the LENGTH,
    OFFSET, and return value of syswrite() are in (UTF8-encoded
    Unicode) characters.  The ":encoding(...)" layer implicitly
    introduces the ":utf8" layer.  Alternately, if the handle is not
    marked with an encoding but you attempt to write characters with
    code points over 255, raises an exception.  See "binmode", "open",
    and the "open" pragma, open.

man 3perl open

    use open OUT => ':utf8';
    ...
    With the "OUT" subpragma you can declare the default layers of
    output streams.  With the "IO" subpragma you can control both
    input and output streams simultaneously.

So I'd guess that adding use open OUT => ':utf8' to the top of your program would help.
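
Applied to the script from the question, that guess would look like this (an untested sketch; note that use open is lexically scoped, so it is not certain it reaches the handle that File::Queue opens internally):

 #!/usr/bin/perl -w
 use strict;
 use utf8;
 use open OUT => ':utf8';  # default :utf8 layer for output handles opened in this scope
 use XML::Rules;
 use JSON;
 use File::Queue;
 # ... rest of the script unchanged ...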

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow