Skipping particular positions in a string using substitution operator in perl

https://stackoverflow.com/questions/11914499

25-06-2021

Question

Yesterday I got stuck on a Perl script. To simplify it: suppose there is a string (say ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD). First, I have to break it at every position where "E" occurs; second, I have to break it only at the sites the user wants. The condition is that the program must not cut at sites where E is followed by P. For example, there are 6 Es in this sequence, so one should get 7 fragments, but as 2 Es are followed by P, one gets only 5 fragments in the output.

I need help with the second case. Suppose the user doesn't want to cut this sequence at, say, the 5th and 10th positions of E in the sequence. What should the script look like so that the program skips only these two sites? My script for the first case is:

my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';

$otext=~ s/([E])/$1=/g; #Main cut rule.

$otext=~ s/=P/P/g;

@output = split( /\=/, $otext);

print "@output";

Please do help!

Solution

To split on "E" except where it's followed by "P", use a negative look-ahead assertion.

From perldoc perlre "Look-Around Assertions" section:

(?!pattern)
A zero-width negative look-ahead assertion.
For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar".

my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';
#                E    E    EP    E    EP    E
my @output=split(/E(?!P)/, $otext);
use Data::Dumper; print Data::Dumper->Dump([\@output]);

$VAR1 = [
          'ABCD',
          'ABCD',
          'ABCDEPABCD',
          'ABCDEPABCD',
          'ABCD'
        ];
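
The split above consumes each delimiting E, whereas the script in the question keeps it at the end of every fragment. If that is the desired behaviour, one possible sketch is to split on a zero-width position instead, combining a look-behind with the look-ahead (the variable names here are only illustrative):

```perl
use strict;
use warnings;

my $s = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';

# Split at the empty position that has an "E" behind it and no "P" ahead.
# Nothing is consumed, so every fragment keeps its trailing "E".
my @kept = split /(?<=E)(?!P)/, $s;

print "$_\n" for @kept;
# ABCDE
# ABCDE
# ABCDEPABCDE
# ABCDEPABCDE
# ABCD
```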

Now, in order to NOT cut at occurrences #2 and #4, you can do one of 2 things:

Concoct a really fancy regex that automatically fails to match on a given occurrence. I will leave that to someone else to attempt in an answer, for completeness' sake.

Simply stitch together the correct fragments.

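I'm too brain-dead to come up with a good idiomatic way of doing it, but the simple and dirty way is either of the following (note that split has already discarded the delimiting E's, so the stitched fragments do not contain them):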

  my %no_cuts = map { ($_=>1) } (2,4); # Do not cut in positions 2,4
  my @output_final;
  for(my $i=0; $i < @output; $i++) {
      if ($no_cuts{$i}) {
          $output_final[-1] .= $output[$i];
      } else {
          push @output_final, $output[$i];
      } 
  }
  print Data::Dumper->Dump([\@output_final]);

  $VAR1 = [
            'ABCD',
            'ABCDABCDEPABCD',
            'ABCDEPABCDABCD'
          ];

Or, simpler:

  my %no_cuts = map { ($_=>1) } (2,4); # Do not cut in positions 2,4
  # Assumes no two consecutive positions are listed in %no_cuts
  for(my $i=0; $i < @output; $i++) {
      next unless $no_cuts{$i}; # Only merge at the no-cut positions
      $output[$i-1] .= $output[$i];
      $output[$i]=undef; # Make the slot empty
  }
  my @output_final = grep { defined } @output; # Skip empty slots
  print Data::Dumper->Dump([\@output_final]);

  $VAR1 = [
            'ABCD',
            'ABCDABCDEPABCD',
            'ABCDEPABCDABCD'
          ];
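
For completeness, the first option (skipping chosen occurrences while matching) does not strictly need a fancy regex: a counter inside a global match loop can do the same job. The following is only a sketch under some assumptions: the helper name split_skipping is made up for illustration, occurrences are numbered from 1 and count only the E's not followed by P, and the cut is made after the E so that each fragment keeps it (unlike the split-based code above, which drops it).

```perl
use strict;
use warnings;

# Hypothetical helper: cut $text after every "E" that is not followed by "P",
# except at the cut sites whose 1-based occurrence number appears in @$skip.
sub split_skipping {
    my ($text, $skip) = @_;
    my %no_cut = map { $_ => 1 } @$skip;
    my ($count, $start, @fragments) = (0, 0);
    while ($text =~ /E(?!P)/g) {
        next if $no_cut{ ++$count };    # leave this site uncut
        my $end = pos($text);           # offset just past the matched "E"
        push @fragments, substr($text, $start, $end - $start);
        $start = $end;
    }
    # Whatever follows the last cut is the final fragment.
    push @fragments, substr($text, $start) if $start < length $text;
    return @fragments;
}

my @parts = split_skipping('ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD', [2, 4]);
print "$_\n" for @parts;
# ABCDE
# ABCDEABCDEPABCDE
# ABCDEPABCDEABCD
```

Because the fragments are taken with substr from the untouched string, joining them back together reproduces the original input.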

OTHER TIPS

Here's a dirty trick that exploits two facts:

Normal text strings do not contain null bytes (if you don't know what a null byte is, see http://en.wikipedia.org/wiki/Null_character; note that it is not the same thing as the number 0 or the character "0").
Perl strings can contain null bytes if you put them there, but be careful, as some functions may not handle them the way you expect.

The "be careful" is just a point to be aware of. Anyway, the idea is to substitute a null byte at the point where you don't want breaks:

my $s = "ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD";

my @nobreak = (4,9);

foreach (@nobreak) {
    substr($s, $_, 1) = "\0";
}

"\0" is an escape sequence representing a null byte, just as "\t" represents a tab. Again: it is not the character 0. I used 4 and 9 because those are the (0-based) offsets of two E's in the string. If you print the string now it looks like:

ABCDABCDABCDEPABCDEABCDEPABCDEABCD

The null bytes don't display, but they are there, and we are going to swap them back later. First the split:

my @a = split(/E(?!P)/, $s);

Then swap the null bytes back:

$_ =~ s/\0/E/g foreach (@a);

If you print @a now, you get:

ABCDEABCDEABCDEPABCD
ABCDEPABCD
ABCD

Which is exactly what you want. Note that split removes the delimiter (in this case, the E); if you intended to keep those you can tack them back on again afterward. If the delimiter is from a more dynamic regex it is slightly more complicated, see here:

http://perlmeme.org/howtos/perlfunc/split_function.html

"Example 9. Keeping the delimiter"

If there is some possibility that the @nobreak positions are not E's, then you must also keep track of those when you swap them out to make sure you replace with the correct character again.
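
One way to do that bookkeeping is sketched below. It assumes the input contains no null bytes of its own, and the @saved queue is just an illustrative name: each protected character is remembered in left-to-right order before being replaced, and the placeholders are then restored in that same order after the split.

```perl
use strict;
use warnings;

my $s       = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';
my @nobreak = (4, 9);    # 0-based offsets to protect from the split

# Swap each protected character for a placeholder, queueing the originals
# in left-to-right order so they can be restored in the same order.
my @saved;
for my $pos (sort { $a <=> $b } @nobreak) {
    push @saved, substr($s, $pos, 1);
    substr($s, $pos, 1) = "\0";
}

my @a = split /E(?!P)/, $s;

# Fragments come back in string order, so the placeholders are met in the
# same order in which the originals were queued.
s/\0/shift @saved/ge for @a;

print "$_\n" for @a;
# ABCDEABCDEABCDEPABCD
# ABCDEPABCD
# ABCD
```

With this version the protected offsets may hold any character, not only E.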

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow