Question

I am trying to replace all numbers in a string with '#', provided that they don't have a specific prefix. The numbers may appear as part of a word, or as a word on their own.

For example, using ABC as the prefix, this is the desired result.

Input:

sdkfjsd 12312981 sdkfjsdfhbnmawd 1298 ,smdfsdnfk2342423 
sdlkfsdfs 20349 ABC1203912 2034234aac <-- ABC<number> stays, the other numbers do not
ABC1203912

Result (note that lines 2,3 have ABC with a number):

sdkfjsd # sdkfjsdfhbnmawd # ,smdfsdnfk#
sdlkfsdfs # ABC1203912 #aac <-- ABC<number> stays, the other numbers do not
ABC1203912

I tried to do it with a negative lookbehind regexp: s/(?<!ABC)\d+/#/g. In this case only the first digit after ABC is left alone; the rest are replaced.
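A minimal reproduction of that behaviour, using a made-up two-word input rather than the full sample. The lookbehind is only tested where the match starts, so \d+ is free to start at the second digit after ABC:

```perl
use strict;
use warnings;

# Hypothetical short input: one protected number, one ordinary number.
my $s = "ABC1203912 1298";

# The lookbehind blocks a match starting at the "1" right after ABC,
# but a match starting at the following "2" is not preceded by "ABC".
(my $t = $s) =~ s/(?<!ABC)\d+/#/g;

print "$t\n";   # prints "ABC1# #"
```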

My next step would be to split the string into parts that contain ABC\d+, and perform a simple replace on the other parts.
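For reference, that splitting approach could look roughly like the sketch below. It assumes split with a capturing group, so that the ABC<number> pieces are kept in the returned list; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $s = "sdlkfsdfs 20349 ABC1203912 2034234aac";

# split with a capturing group returns the separators (ABC<number>) too.
my $out = join '', map {
    my $part = $_;
    # Leave the ABC<number> pieces alone, replace numbers everywhere else.
    $part =~ s/\d+/#/g unless $part =~ /^ABC\d+$/;
    $part;
} split /(ABC\d+)/, $s;

print "$out\n";   # prints "sdlkfsdfs # ABC1203912 #aac"
```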

I would appreciate any advice on how to do the whole thing without splitting into multiple strings.

Thanks!

Edit 1: moved aac back to proper position.

Edit 2: I am using Perl 5.8.5, in case this is relevant. I can't update to a newer version due to compatibility issues with code that I don't control.


Solution

I don't understand what you mean by "My next step would be to split the string into parts that contain ABC\d+, and perform a simple replace on the other parts.", but it looks like it is not your main issue here. Do let me know otherwise.

To match every run of digits that is not preceded by the keyword ABC, you can use this regex:

(?<!ABC|\d)\d+

This prevents a match from starting at a digit if there is ABC before it, or another digit (thus preventing \d+ from matching when it starts in the middle of a number).

regex101 demo

Note that you had two parts of your string in your question moved around. I'm taking only the input that you used.


If the above doesn't work (e.g. the regex engine says the pattern in the lookbehind cannot be of variable width, as older versions of Perl do with "Variable length lookbehind not implemented"), then the equivalent alternative is:

(?<!ABC)(?<!\d)\d+

regex101 demo
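As a sketch of how this second form could be applied to the sample input from the question (with the trailing comment on line 2 left out), it uses only fixed-width lookbehinds, so it should also be acceptable to Perl 5.8:

```perl
use strict;
use warnings;

my $s = <<'__END_TEXT__';
sdkfjsd 12312981 sdkfjsdfhbnmawd 1298 ,smdfsdnfk2342423
sdlkfsdfs 20349 ABC1203912 2034234aac
ABC1203912
__END_TEXT__

# Two fixed-width lookbehinds: a match may not start right after "ABC",
# nor in the middle of a number (right after another digit).
(my $out = $s) =~ s/(?<!ABC)(?<!\d)\d+/#/g;

print $out;
# sdkfjsd # sdkfjsdfhbnmawd # ,smdfsdnfk#
# sdlkfsdfs # ABC1203912 #aac
# ABC1203912
```

Note that this protects a number after ABC anywhere, including inside a longer word such as XABC123, and that the match is case sensitive.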

OTHER TIPS

It's not entirely clear what it is you want, not least because the 2034234aac field has been modified strangely in your example.

But this modification of your own negative look-behind may be useful. Note that it leaves intact any sequence starting with ABC, such as ABCX1234. It's not clear whether that is correct behaviour.

use strict;
use warnings;

my $s = <<'__END_TEXT__';
sdkfjsd 12312981 sdkfjsdfhbnmawd 1298 ,smdfsdnfk2342423 
sdlkfsdfs 20349 ABC1203912 2034234aac <-- ABC<number> stays, the other numbers do not
ABC1203912
__END_TEXT__

$s =~ s/\b(?!ABC)[a-z]*\K\d+/#/gi;

print $s;

or, for versions of Perl earlier than 5.10, which do not support \K, use this

$s =~ s/\b((?!ABC)[a-z]*)\d+/$1#/gi;

output

sdkfjsd # sdkfjsdfhbnmawd # ,smdfsdnfk# 
sdlkfsdfs # ABC1203912 #aac <-- ABC<number> stays, the other numbers do not
ABC1203912
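To make the edge cases of this pattern concrete, here is a small check using the capture-group form so that it also runs on older Perls. The input words are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical words: ABC followed by a letter, ABC inside a word,
# lower-case abc, and an ordinary word ending in digits.
my $s = "ABCX1234 XABC1234 abc1234 x99";

# \b anchors at the start of a word; (?!ABC) rejects words that begin
# with ABC (in any case, because of /i); $1 puts the letters back.
(my $out = $s) =~ s/\b((?!ABC)[a-z]*)\d+/$1#/gi;

print "$out\n";   # prints "ABCX1234 XABC# abc1234 x#"
```

So any word beginning with ABC is protected (case-insensitively), while ABC in the middle of a word is not.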

You need to use a "zero-width negative lookbehind assertion": only match if not immediately preceded by something.

E.g. to match a digit not preceded by ABC:

(?<!ABC)\d

You have already got this far, but the obvious next step of matching multiple digits:

(?<!ABC)\d+

doesn't directly help, because the lookbehind is only checked where the match starts: \d+ can simply begin at the second digit after ABC.

So rephrase the question slightly:

Replace a digit not following the prefix and zero or more digits

I.e. in "ABC123" you don't want to replace the 1, 2 or 3. And we can extend the zero-width negative lookbehind assertion to include digits:

(?<!ABC\d*)\d

and thus exclude all the digits following the prefix, including the first one.

NB this assumes Perl supports variable-width lookbehinds. When lookbehinds were first added to regexes they certainly had to be fixed width, but it has been a while since I used Perl regexes seriously, so I am assuming the Perl regex implementation has since been extended to match other platforms.
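If variable-width lookbehind turns out not to be available (as far as I know, Perl 5.8 does not implement it), one alternative is to drop the lookbehind altogether. The sketch below is a different technique from the one described above: it matches the protected ABC<number> first, puts it back unchanged, and replaces any other run of digits.

```perl
use strict;
use warnings;

my $s = "sdlkfsdfs 20349 ABC1203912 2034234aac\n";

# The alternatives are tried left to right at each position: a protected
# ABC<number> is consumed whole and captured in $1; otherwise a plain
# run of digits matches and $1 stays undefined.
(my $out = $s) =~ s/(ABC\d+)|\d+/defined $1 ? $1 : '#'/ge;

print $out;   # prints "sdlkfsdfs # ABC1203912 #aac"
```

Because the /e replacement is ordinary Perl code, whole numbers are replaced by a single '#' rather than one '#' per digit.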

EDIT: Oops, s/positive/negative/ lookbehind.

Licensed under: CC-BY-SA with attribution