Question

I need to add ".php" to every URL in href="xxx" that doesn't already end with "php".
I tried a negative lookahead, (?!php):

import re

find = r'href="(.+?)(?!php)"'
replace = r'href="\1.php"'
re.sub(find, replace, 'href="url"')
re.sub(find, replace, 'href="url.php"')

Both calls add the extension:

href="url.php"
href="url.php.php"

Why doesn't the negative lookahead work?


Solution

The following does work:

In [49]: re.sub(r'href="([^"]*?)([.]php)?"', r'href="\1.php"', 'href="url.php"')
Out[49]: 'href="url.php"'

In [50]: re.sub(r'href="([^"]*?)([.]php)?"', r'href="\1.php"', 'href="url"')
Out[50]: 'href="url.php"'
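As a rough sketch of how the same pattern behaves on a string with more than one link, it can be wrapped in a small helper. The name add_php_extension and the sample HTML are made up for illustration and are not part of the original answer:

```python
import re

# The optional group ([.]php)? swallows an existing ".php" suffix,
# so the replacement always writes the extension exactly once.
HREF_RE = re.compile(r'href="([^"]*?)([.]php)?"')

def add_php_extension(html):
    """Hypothetical helper: make every href value end with .php."""
    return HREF_RE.sub(r'href="\1.php"', html)

sample = '<a href="url">a</a> <a href="page.php">b</a>'
print(add_php_extension(sample))
```

Note that an empty href="" would become href=".php" with this pattern, so you may want to use [^"]+? if empty values can occur.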

The reason your original regex (.+?)(?!php) doesn't quite work is that it matches url.php as follows:

  • (.+?) matches url.php;
  • at this point the negative lookahead is satisfied since the next character is a double quote.

In other words, .+? consumes the entire filename including the extension, making the lookahead a no-op.
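You can check this yourself by looking at what the group captures. A minimal sketch, using the pattern from the question unchanged:

```python
import re

# Pattern from the question: lazy group followed by a negative lookahead.
pattern = re.compile(r'href="(.+?)(?!php)"')

m = pattern.search('href="url.php"')
captured = m.group(1)

# The group already holds the extension, so the lookahead is only
# evaluated right before the closing quote, where "php" cannot follow.
print(captured)  # url.php
```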

Other tips

A negative lookahead checks whether a pattern matches at the current position without consuming any characters. In your pattern "(.+?)(?!php)", .+? matches one or more characters, and the lookahead only has to hold at the point where the closing " comes next. There the engine tries to match php, which always fails because the next character is ". Since this is a NEGATIVE lookahead, that failure counts as success, so the whole pattern matches.

What you need is a negative lookbehind, (?<!PATTERN), which checks the characters BEFORE the current position, i.e. the ones that have already been consumed. When the engine reaches ", the lookbehind tests the last 3 characters against php and rejects the match if they are equal.

In short, please try again with the pattern below:

find = 'href="(.+?)(?<!php)"'
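One caveat, shown as a sketch rather than a definitive fix: because . also matches ", the lazy group can run past the closing quote when the tag has further attributes. The variant below is my own adjustment, not part of the original answer. It assumes [^"]+ for the group and checks for \.php (with the dot) in the lookbehind. The sample HTML is made up.

```python
import re

html = '<a href="url" class="x">a</a> <a href="page.php" class="y">b</a>'
replace = r'href="\1.php"'

# Pattern as posted: .+? may cross the closing quote to find a later ".
loose = re.compile(r'href="(.+?)(?<!php)"')

# Assumed stricter variant: the group cannot contain a quote, and the
# lookbehind requires the literal ".php" suffix rather than just "php".
strict = re.compile(r'href="([^"]+)(?<!\.php)"')

loose_result = loose.sub(replace, html)
strict_result = strict.sub(replace, html)

print(loose_result)
print(strict_result)
```

On this input the strict variant leaves href="page.php" alone, while the posted pattern rewrites part of the class attribute.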
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow