Improve performance of Perl search file script

Question

There are some things that could speed the script up, like removing unneccessary regex usage.

/^P\|/ is equivalent to "P|" eq substr $_, 0, 2.
$foo =~ "BAR" could be -1 != index $foo, "BAR".

Then there is some repeated code. Factoring that out into a sub will not increase performance per se, but makes it easier to reason about the behaviour of the script.

There are a lot of unneccessary stringifications like "$filename" – $filename alone is sufficient.

But the worst offender would be this:

for ( 1 .. $loop ) {
  ...
  open FILE, "<", $filename or ...
  while (<FILE>) {
    ...
  }
  ...
}

You only need to read that file in once, preferably into an array. You can the loop over the indices:

for ( 1 .. $loop ) {
  ...
  for my $i (0 .. $#file_contents) {
    my $line = $file_contents[$i];
    ... # swap $. for $i, but avoid off-by-one error
  }
  ...
}

Disk IO is slow, so cache where you can!

I also see that you are using the $exclude variable as a boolean with the values FALSE and TRUECANC. Why not 0 and 1, so you can use it directly in a conditional?

You can factor out common tests in if/elsif:

if    (FOO && BAR) { THING_A }
elsif (FOO && BAZ) { THING_B }

should be

if (FOO) {
    if    (BAR) { THING_A }
    elsif (BAZ) { THING_B }
}

The $. == $start && $line =~ /^P\|.../ test may be silly, because $start contains only the numbers of lines that start with P| – so the regex may be sufficient here.

Edit

If I have understood the script correctly then the following should yield a significant performance increase:

#!/usr/bin/perl
use strict;
use warnings;

my ($filename, $cancfile) = @ARGV;
open my $fh, "<", $filename or die "$0: Couldn't open $filename: $!";

my (@num, @lines);
while (<$fh>)
{
    push @lines, $_;
    push @num, $#lines if "P|" eq substr $_, 0, 2;
}

open my $outcanc, ">>", $cancfile or die "$0: Couldn't open $cancfile: $!";

for my $i ( 0 .. $#num )
{
    my $start = $num[$i];
    my $end   = ($num[$i+1] // @lines) - 1;
    # pre v5.10:
    # my $end = (defined $num[$i+1] ? $num[$i+1] : @lines) - 1

    if ($lines[$start] =~ /^P[|][0-9]{9}[|]1I[|]IR[|]/) {
        print {$outcanc} @lines[$start .. $end];
    } else {
        print STDOUT     @lines[$start .. $end];
    }
}

The script is cleaned up. The file is cached in an array. Only the parts of the array are iterated that are actually needed – we are down to O(n) from the previous O(n · m).

For your future scripts: Proving behaviour around loops and mutating variables is not impossible, but tedious and annoying. Realizing that

for (1 .. @num) {
  $start = shift @num unless $next;  # aka "do this only in the first iteration"
  $next = shift @num:
  $end = $next - 1:
  while (<FH>) {
    ...
    $test = $. if eof
    ...
  }
  $end = ++test if $end < $start;
  $start = $next if $next;
}

is actually all about circumventing a possible undef in the 2nd shift takes some time. Instead of testing for eof in the inner loop, we can just pick the line number after the loop, so we don't need $test. Then we get:

$start = shift @num;
for my $i (1 .. @num) {
  $end = $num[$i] - 1:

  while (<FH>) { ... }

  $end = $. + 1 if $end < $start;  # $end < $start only true if not defined $num[$i]
  $start = $num[$i] if $num[$i];
}

After translating $i down by one we confine the out-of-bounds problem to one point only:

for my $i (0 .. $#num) {
  $start = $num[$i];
  $end = $num[$i+1] - 1; # HERE: $end = -1 if $i == $#num

  while (<FH>) { ... }
}
$end = $. + 1 if $end < $start;

After replacing the file reading with an array (careful, there is a difference of one between the array index and the line number), we see that the final file reading loop can be avoided if we pull that iteration into the for loop, because we know how many lines there are in total. So to say, we do

$end = ($num[$i+1] // $last_line_number) - 1;

Hopefully my cleaned up code is indeed equivalent to the original.