Question

How can I find extended ASCII characters in a file using Perl? Can anyone provide a script?

Thanks in advance.

Solution

Since the extended ASCII characters have values of 128 and higher, you can call ord on individual characters and handle those with a value >= 128. The following code reads from stdin (or from any files named on the command line) and prints only the extended ASCII characters:

while (<>) {
  while (/(.)/g) {
    print($1) if (ord($1) >= 128);
  }
}

Alternatively, unpack together with chr will also work. Example:

while (<>) {
  foreach (unpack("C*", $_)) {
    print(chr($_)) if ($_ >= 128);
  }
}

(I'm sure some Perl guru can condense both of these to two one-liners...)


To print the line numbers instead, you can use the following (it prints the line number once per matching character, so duplicates are not removed, and it will behave oddly when Unicode input is passed):

while (<>) {
  while (/(.)/g) {
    print($. . "\n") if (ord($1) >= 128);
  }
}

(Thanks Yaakov Belch for the $. tip.)

OTHER TIPS

The first printable ASCII character is space (32). The last printable ASCII character is ~ (126). So I'd probably use

while (<>) {
  print "$.\n" if /[^ -~\n]/;
}

The \n inside the class is needed because $_ still contains the trailing newline, which would otherwise match on every line. Admittedly, this also reports lines containing control characters (tabs included), not just extended ASCII.

Edit: Changed to print the line number rather than the line itself.

One-liner:

perl -nE'say$.if/[\xE0-\xFF]/'

For older Perl versions (before 5.10, which introduced -E and say):

perl -lne'print$.if/[\xE0-\xFF]/'

A crucial question is whether the

use bytes;

pragma should be in effect. The poster should decide that. For picking characters with codes greater than 127, the following will suffice:

print grep 127 < ord, split // while <>;

or

print grep /[^[:ascii:]]/, split // while <>;

Hynek -Pichi- Vychodil's answer:

perl -nE'say$.if/[\xE0-\xFF]/'

only tests part of the extended range (0xE0 to 0xFF) and should presumably be

perl -nE'say$.if/[\x80-\xFF]/'

instead.

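What about grep? With GNU grep, the -P option enables Perl-style \x escapes, the pattern has to be quoted so the shell leaves it alone, and LC_ALL=C makes the match byte-wise:

LC_ALL=C grep -P '[\x00-\x1F\x7F-\xFF]' *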
Licensed under: CC-BY-SA with attribution