Question

I'm trying to read UTF-8 input in Perl in an unbuffered way (i.e. as soon as data is available, it should be returned):

die if !binmode STDIN, ':unix:utf8';
my $i;
my $buf;
while ($i = read(STDIN, $buf, 8192)) {
  print "$i\n";
}

However, it doesn't work if a UTF-8 character in the input is split across two writes:

$ perl -e '$|=1;print"\xc3";sleep 1;print"\xa1";sleep 1;print"AB"' | perl t.pl

This should print 1 and then 2, but it prints 3, so the buffering is withholding the first character even after it became available.

Is there an easy solution for this in Perl? Or maybe in another scripting language for Unix?

Solution

First, you need to switch from read to sysread. read keeps reading until it has the requested number of characters (or hits EOF), while sysread returns as soon as any data is available.

But returning data as soon as it arrives means you might have an incomplete UTF-8 character at the end, so you'll have to decode only the characters that were fully received and buffer the rest.

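use Encode qw( decode );

# Decodes the complete characters at the start of $_[0] and removes them
# from it, leaving any trailing partial character in place.
# Returns undef if $_[0] starts with an invalid sequence.
sub decode_utf8_partial {
   my $s = decode('UTF-8', $_[0], Encode::FB_QUIET);
   return undef
      if !length($s) && $_[0] =~ /
         ^
         (?: [\x80-\xBF]       # stray continuation byte
         |   [\xC0-\xDF].      # complete 2-byte sequence that failed to decode
         |   [\xE0-\xEF]..     # complete 3-byte sequence that failed to decode
         |   [\xF0-\xF7]...    # complete 4-byte sequence that failed to decode
         |   [\xF8-\xFF]       # byte that never occurs in UTF-8
         )
      /xs;

   return $s;
}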

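my $fh = \*STDIN;   # assumption: the input handle is STDIN, as in the question
binmode($fh);       # read raw bytes; decoding is done by hand below

my $buf = '';       # initialized so that length($buf) is 0 on the first read
while (1) {
   # Append whatever is available to the undecoded leftovers in $buf.
   my $rv = sysread($fh, $buf, 64*1024, length($buf));
   die $! if !defined($rv);
   last if !$rv;

   while (1) {
      # Leaves undecoded part in $buf
      my $s = decode_utf8_partial($buf);
      die "Bad UTF-8" if !defined($s);
      last if !length($s);

      # ... do something with $s ...
   }
}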
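
To see why buffering the undecoded tail works, here is a minimal sketch that replays the three writes from the question without any real I/O. The @chunks list and the @counts array are made up for illustration; the only library behavior it relies on is that decode with Encode::FB_QUIET returns the decoded prefix and leaves the unprocessed bytes in the source variable.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Simulated input: the three writes from the question's test command.
my @chunks = ("\xC3", "\xA1", "AB");

my $buf = '';   # undecoded bytes carried over between "reads"
my @counts;     # number of characters decoded after each chunk

for my $chunk (@chunks) {
   $buf .= $chunk;   # stands in for sysread appending to $buf

   # FB_QUIET decodes the valid prefix and leaves the remainder in $buf.
   my $s = decode('UTF-8', $buf, Encode::FB_QUIET);

   push @counts, length($s);
   print length($s), " character(s) decoded, ",
         length($buf), " byte(s) pending\n";
}
```

The first chunk should decode 0 characters and leave 1 byte pending; the second completes the character (1 decoded), and the third yields the 2 characters of "AB". That is the "1 and then 2" behavior the question asks for.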

OTHER TIPS

In UTF-8 mode, read retries when it sees a partial character, which defeats this particular use of read on a :unix handle. I guess this is a case of "Don't do this".

In this particular case, getc may be of use. That will read the minimum necessary. In other situations, decoding afterwards may be a better option.
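
As a rough sketch of the getc approach, the loop below reads one decoded character at a time. The helper name count_chars is hypothetical, and the demo uses an in-memory handle with an :encoding(UTF-8) layer instead of STDIN, so it only shows the shape of the loop, not the timing behavior on a pipe.

```perl
use strict;
use warnings;

# Hypothetical helper: consume a handle one character at a time with getc
# and return how many characters were read.
sub count_chars {
   my ($fh) = @_;
   my $n = 0;
   while (defined(my $ch = getc($fh))) {
      $n++;   # a real program would process $ch here
   }
   return $n;
}

# Demo on an in-memory handle holding the bytes from the question.
my $bytes = "\xC3\xA1AB";
open(my $fh, '<:encoding(UTF-8)', \$bytes) or die $!;
print count_chars($fh), "\n";
```

For the question's use case, the same loop would be run on STDIN after setting its layers with binmode.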

The following seems to work, though you will almost certainly want to add a sleep (perhaps Time::HiRes::sleep) or a select call to the loop so that it doesn't busy-wait:

die if !binmode STDIN, ':unix:utf8';
use IO::Handle;
die unless STDIN->blocking(0);
my $i;
my $buf;
while (1) {
    $i = read(STDIN, $buf, 8192);
    if ($i) {
        print "$i\n";
    }
    elsif (defined $i) {
        last;
    }
}
Licensed under: CC-BY-SA with attribution