Python Regular Expression: BackReference

https://stackoverflow.com/questions/10968617

13-06-2021

Question

Here is my Python 2.5 code. It is meant to replace the word fox with the link <a href="/fox">fox</a>, while skipping any fox that is already inside a link:

import re

content="""
<div>
    <p>The quick brown <a href='http://en.wikipedia.org/wiki/Fox'>fox</a> jumped over the lazy Dog</p>
    <p>The <a href='http://en.wikipedia.org/wiki/Dog'>dog</a>, who was, in reality, not so lazy, gave chase to the fox.</p>
    <p>See &quot;Dog chase Fox&quot; image for reference:</p>
    <img src='dog_chasing_fox.jpg' title='Dog chasing fox'/>
</div>
"""

p=re.compile(r'(?!((<.*?)|(<a.*?)))(fox)(?!(([^<>]*?)>)|([^>]*?</a>))',re.IGNORECASE|re.MULTILINE)
print p.findall(content)

for match in p.finditer(content):
  print match.groups()

output=p.sub(r'<a href="/fox">\3</a>',content)
print output

The output is:

[('', '', '', 'fox', '', '.', ''), ('', '', '', 'Fox', '', '', '')]
('', '', None, 'fox', '', '.', '')
('', '', None, 'Fox', None, None, None)

Traceback (most recent call last):
  File "C:/example.py", line 18, in <module>
    output=p.sub(r'<a href="fox">\3</a>',content)
  File "C:\Python25\lib\re.py", line 274, in filter
    return sre_parse.expand_template(template, match)
  File "C:\Python25\lib\sre_parse.py", line 793, in expand_template
    raise error, "unmatched group"
error: unmatched group

I am not sure why the backreference \3 won't work.

The pattern (?!((<.*?)|(<a.*?)))(fox)(?!(([^<>]*?)>)|([^>]*?</a>)) itself does match (see http://regexr.com?317bn), which surprises me. The first negative lookahead, (?!((<.*?)|(<a.*?))), puzzles me; in my opinion it should not work. Take the first match it finds, the fox in "gave chase to the fox.</p>". Earlier in the same paragraph there is <a href='http://en.wikipedia.org/wiki/Dog'>dog</a>, which matches ((<.*?)|(<a.*?)), so I would expect the negative lookahead to fail. I am not sure whether I am expressing myself clearly.

Thanks a lot!

(Note: I hate using BeautifulSoup and I enjoy writing my own regular expressions. I know many people here will say that regular expressions are not meant for HTML processing, but this is a small program, so I prefer a regular expression over BeautifulSoup.)
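
A note on the \3 error above: counting opening parentheses from the left, the pattern has seven capture groups, and (fox) is group 4, which agrees with the position of 'fox' in the printed tuples. Group 3 is (<a.*?), which sits inside the first negative lookahead and shows up as None in the finditer output, so a template that refers to \3 has nothing to insert. The sketch below is one way to check this, assuming Python 3; the sample string is a shortened, hypothetical stand-in for the HTML in the question.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(
    r'(?!((<.*?)|(<a.*?)))(fox)(?!(([^<>]*?)>)|([^>]*?</a>))',
    re.IGNORECASE,
)

# Hypothetical one-line sample, shorter than the HTML in the question.
sample = "<p>The <a href='/wiki/Fox'>fox</a> ran from the fox.</p>"

match = pattern.search(sample)

print(pattern.groups)   # total number of capture groups in the pattern
print(match.group(3))   # (<a.*?) lives inside a negative lookahead
print(match.group(4))   # (fox) is the group that holds the matched word

# Refer to group 4 instead of group 3 in the replacement template.
linked = pattern.sub(r'<a href="/fox">\g<4></a>', sample)
print(linked)
```

On Python 3.5 and later, an unmatched group in the template is replaced by an empty string instead of raising "unmatched group", so there the original \3 template would silently produce empty links rather than a traceback.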

Solution

I don't know why your expression doesn't work. The only thing I noticed is the lookahead group at the start, which doesn't make much sense to me. This one appears to work well:

import re

content="""fox
    <a>fox</a> fox <p fox> and <tag fox bar> 
    <a>small <b>fox</b> and</a>
fox"""

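rr = """
(fox)               # the word to wrap
(?! [^<>]*>)        # not inside a tag, e.g. <p fox>
(?!                 # not inside a link:
    (.(?!<a))*      #   no new <a opens before ...
    </a             #   ... the next closing </a
)
"""

p = re.compile(rr, re.IGNORECASE | re.MULTILINE | re.VERBOSE)
# Parenthesized so the line runs on both Python 2 and Python 3.
print(p.sub(r'((\g<1>))', content))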
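
As for the lookahead at the start of the original pattern: a lookahead is tested only at the position where the match is being attempted, not against the whole string. At the "f" of "fox" the next character is never "<", so a leading (?!<...) always succeeds and filters nothing. The following is a minimal sketch of that point, assuming Python 3; the sample text and variable names are hypothetical.

```python
import re

# Hypothetical sample: one bare "fox" and one already inside a link.
text = "<a href='/dog'>dog</a> chased the fox. <a href='/fox'>a fox</a>"

# With and without a leading negative lookahead: at the "f" of "fox" the
# lookahead sees "f", not "<", so it can never reject a candidate match.
leading = re.compile(r'(?!<.*?)(fox)', re.IGNORECASE)
plain = re.compile(r'(fox)', re.IGNORECASE)
same_spans = (
    [m.span() for m in leading.finditer(text)]
    == [m.span() for m in plain.finditer(text)]
)
print(same_spans)

# The pattern from this answer, written on one line: both lookaheads come
# after the word, so they inspect the text that follows "fox".
trailing = re.compile(r'(fox)(?![^<>]*>)(?!(.(?!<a))*</a)', re.IGNORECASE)
marked = trailing.sub(r'[\g<1>]', text)
print(marked)
```

Only the trailing lookaheads can see the ">" of an enclosing tag or the closing "</a", which is why the filtering has to happen after the word rather than before it.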

Other tips

If you don't like BeautifulSoup, try one of these other (X)HTML parsers:

html5lib
ElementTree
lxml

If you ever plan to, or need to, parse HTML (or a variant), it is worth learning these tools.
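
As a rough illustration of the parser route, here is a sketch using xml.etree.ElementTree from the standard library. It assumes the input is well-formed XML, as the sample in the question happens to be; real-world HTML would more likely need html5lib or lxml. The names WORD, split_text and linkify are hypothetical helpers written for this example, not part of any library.

```python
import re
import xml.etree.ElementTree as ET

WORD = re.compile(r'fox', re.IGNORECASE)

def split_text(text):
    """Split text around WORD.

    Returns (head, links): the text before the first match, and a list of
    new <a> elements whose .tail carries the text that follows each one.
    """
    if not text:
        return text, []
    head, links, pos = text, [], 0
    for m in WORD.finditer(text):
        before = text[pos:m.start()]
        if links:
            links[-1].tail = before
        else:
            head = before
        link = ET.Element('a', {'href': '/fox'})
        link.text = m.group(0)
        links.append(link)
        pos = m.end()
    if links:
        links[-1].tail = text[pos:]
    return head, links

def linkify(elem):
    """Wrap WORD in links everywhere except inside existing <a> elements."""
    if elem.tag == 'a':
        return  # leave the contents of existing links untouched
    children = list(elem)
    for child in children:
        elem.remove(child)
    elem.text, new_children = split_text(elem.text)
    for child in children:
        linkify(child)
        new_children.append(child)
        # Text after a child belongs to the parent, so it is always checked.
        child.tail, links = split_text(child.tail)
        new_children.extend(links)
    elem.extend(new_children)

root = ET.fromstring("<p>The <a href='/wiki/Fox'>fox</a> ran from the fox.</p>")
linkify(root)
result = ET.tostring(root, encoding='unicode')
print(result)
```

Because only element text is touched, attribute values such as title='Dog chasing fox' are never candidates for replacement, which is the case the regular expressions above handle with lookaheads.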

Licensed under: CC-BY-SA with attribution
