Good Perl style: How to convert UTF-8 C string literals to \xXX sequences

Question 1

I think it's best if you don't use :raw. You are processing text, so you should properly decode and encode. That will be far less error prone, and it will allow your parser to use predefined character classes if you so desire.
You parse as if you expect slashes in the literal, but then you completely ignore then when you escape. Because of that, you could end up with "...\\xC3\xA3...". Working with decoded text will also help here.

So forget "perlish"; let's actually fix the bugs.

use open ':std', ':locale';

sub convert_char {
   my ($s) = @_;
   utf8::encode($s);
   $s = uc unpack 'H*', $s;
   $s =~ s/\G(..)/\\x$1/sg;
   return $s;
}

sub convert_literal {
   my $orig = my $s = substr($_[0], 1, -1);

   my $safe          = '\x20-\x7E';          # ASCII printables and space
   my $safe_no_slash = '\x20-\x5B\x5D-\x7E'; # ASCII printables and space, no \
   my $changed = $s =~ s{
      (?: \\? ( [^$safe] )
      |   ( (?: [$safe_no_slash] | \\[$safe] )+ )
      )
   }{
      defined($1) ? convert_char($1) : $2
   }egx;

   # XXX Assumes $orig doesn't contain "*/"
   return qq{"$s"} . ( $changed ? " /* $orig */" : '' );
}

while (<>) {
   s/(" (?:[^"\\]++|\\.)*+ ")/ convert_literal($1) /segx;
   print;
}

Question 2

Re: a more Perlish way.

You can use arbitrary delimiters for quote operators, so you can use string interpolation instead of explicit concatenation, which can look nicer. Also, counting the number of substitutions is unneccessary: Substitution in scalar context evaluates to the number of matches.

I would have written your (misnomed!) function as

use strict; use warnings;
use Carp;

sub escape_high_bytes {
  my ($orig) = @_;

  # Complain if the input is not a string of bytes.
  utf8::downgrade($orig, 1)
    or carp "Input must be binary data";

  if ((my $changed = $orig) =~ s/([\P{ASCII}\P{Print}])/sprintf '\\x%02X', ord $1/eg) {
    # TODO make sure $orig does not contain "*/"
    return qq("$changed" /* $orig */);
  } else {
    return qq("$orig");
  }
}

The (my $copy = $str) =~ s/foo/bar/ is the standard idiom to run a replace in a copy of a string. With 5.14, we could also use the /r modifier, but then we don't know whether the pattern matched, and we would have to resort to counting.

Please be aware that this function has nothing to do with Unicode or UTF-8. The utf8::downgrade($string, $fail_ok) makes sure that the string can be represented using single bytes. If this can't be done (and the second argument is true), then it returns a false value.

The regex operators \p{...} and the negation \P{...} match codepoints that have a certain Unicode property. E.g. \P{ASCII} matches all characters that are not in the range [\x00-\x7F], and \P{Print} matches all characters that are not visible, e.g. control codes like \x00 but not whitespace.

Your while (<>) loop is arguably buggy: This does not neccessarily iterate over STDIN. Rather, it iterates over the contents of the files listed in @ARGV (the command line arguments), or defaults to STDIN if that array is empty. Note that the :raw layer will not be declared for the files from @ARGV. Possible solutions:

You can use the open pragma to declare default layers for all filehandles.
You can while (<STDIN>).

Do you know what is Perlish? Using modules. As it happens, String::Escape already implements much of the functionality you want.

Question 3

Similar code written in Python

Python 2.7

import re
import sys

def utf8_to_esc(matched):
    s = matched.group(1)
    s2 = s.encode('string-escape')
    result = '"{}"'.format(s2)
    if s != s2:
        result += ' /* {} */'.format(s)
    return result

sys.stdout.writelines(re.sub(r'"([^"]+)"', utf8_to_esc, line) for line in sys.stdin)

Python 3.x

def utf8_to_esc(matched):
    ...
    s2 = s.encode('unicode-escape').decode('ascii')
    ...