Question

So, here's my question:

I'm trying to replace anything in a string that is not an "A" or a "C" case insensitively. My strings are all three characters. (In reality the specific two letters will change, which is why I'm not hardcoding the negated values.)

So I thought I would do

re.sub(r'[^ac]', "X", "ABC", re.IGNORECASE)

But, what I got back was 'XXC'. I expected 'AXC'.

The full range of my data would be

map(lambda s: re.sub(r'[^ac]', "X", s, re.IGNORECASE), [ "ABC", "ABc", "AbC", "Abc", "aBC", "aBc", "abC", "abc" ])

and what I get back is this:

['XXC', 'XXc', 'XXC', 'XXc', 'aXX', 'aXc', 'aXX', 'aXc']

Why does re.IGNORECASE replace the "A"s? And why does it sometimes replace the "C"s? (Notice how it turned "abC" into "aXX".)

If I do this:

map(lambda s: re.sub(r'[^acAC]', "X", s), [ "ABC", "ABc", "AbC", "Abc", "aBC", "aBc", "abC", "abc" ])

I get what I want:

['AXC', 'AXc', 'AXC', 'AXc', 'aXC', 'aXc', 'aXC', 'aXc']

Must I use r'[^acAC]'? Is there no way to complement a regex case-insensitively?

It's also interesting to me that in vim, if I put all those strings into a text file and do

:%s/[^ac]/X/gi

I get the right thing. And, blasphemous as this might be, if I do this in Perl:

    #! /usr/bin/perl

    use strict;

    foreach my $gene ( "ABC", "ABc", "AbC", "Abc", "aBC", "aBc", "abC", "abc") {    
            my $replaced = $gene;
            $replaced =~ s/[^ac]/X/gi;
            printf("%s\n", $replaced);
    }

I get

AXC
AXc
AXC
AXc
aXC
aXc
aXC
aXc

Ruby gives the same result:

irb(main):001:0> ["ABC", "ABc", "AbC", "Abc", "aBC", "aBc", "abC", "abc"].collect{|s| s.gsub(/[^ac]/i,"X") }
=> ["AXC", "AXc", "AXC", "AXc", "aXC", "aXc", "aXC", "aXc"]

How can I do the equivalent in Python without writing r'[^acAC]'?

Thanks!


Solution

Pass flags as a keyword argument, not as a positional argument:

>>> re.sub(r'[^ac]', "X", "ABC", flags=re.IGNORECASE)
'AXC'
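
Applied to the full list from the question, the keyword form gives the expected result. This is a minimal sketch, assuming Python 3; the variable names `genes` and `replaced` are illustrative and not from the original post.

```python
import re

genes = ["ABC", "ABc", "AbC", "Abc", "aBC", "aBc", "abC", "abc"]

# flags= is passed by keyword, so it cannot be mistaken for count.
replaced = [re.sub(r'[^ac]', "X", s, flags=re.IGNORECASE) for s in genes]
print(replaced)
```

This should print the same list that the hardcoded r'[^acAC]' version produces in the question.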

Looking at the source code,

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)

it is clear that when you pass re.IGNORECASE as the fourth positional argument, it is bound to count, not to flags. This can be verified by the following error:

>>> re.sub(r'[^ac]', "X", "ABC", re.IGNORECASE, count=2)
Traceback (most recent call last):
  File "<ipython-input-82-8b949ec4f925>", line 1, in <module>
    re.sub(r'[^ac]', "X", "ABC", re.IGNORECASE, count=2)
TypeError: sub() got multiple values for keyword argument 'count'

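So, since re.IGNORECASE equals 2, the call runs case-sensitively with count=2: only the first two matches are replaced. For "ABC" those are "A" and "B" (neither is a lowercase "a" or "c"), which gives 'XXC'.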

>>> re.IGNORECASE
2
>>> re.sub(r'[^ac]', "X", "ABC", re.IGNORECASE)
'XXC'
>>> re.sub(r'[^ac]', "X", "ABC", count=2)
'XXC'
>>> re.sub(r'[^ac]', "X", "ABC", 2)
'XXC'
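
Two other ways to sidestep the positional-argument trap, sketched here assuming Python 3: compile the pattern first (the compiled pattern's sub method takes no flags argument, so nothing can land in the wrong slot), or embed the flag in the pattern itself with (?i). The helper `mask_except` and its parameters `keep` and `fill` are hypothetical names, introduced only to cover the case from the question where the two letters vary.

```python
import re

def mask_except(text, keep, fill="X"):
    """Replace every character of text that is not in keep, ignoring case."""
    # re.escape guards against characters such as ']' or '^' that are
    # special inside a character class. keep is assumed to be non-empty.
    pattern = re.compile('[^' + re.escape(keep) + ']', re.IGNORECASE)
    return pattern.sub(fill, text)

print(mask_except("ABC", "ac"))           # AXC

# Inline flag: (?i) at the start makes the whole pattern case-insensitive.
print(re.sub(r'(?i)[^ac]', "X", "abC"))   # aXC
```

Building the character class from a variable keeps the negated letters out of the source code, which is what the question asks for.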
Licensed under: CC-BY-SA with attribution