replace some part of a word with regex

https://stackoverflow.com/questions/4149517

08-10-2019

Question

How do you delete the text inside <ref>*some text*</ref> together with the ref tags themselves?

For example, in '...and so on<ref>Oxford University Press</ref>.'

re.sub(r'<ref>.+</ref>', '', string) only removes <ref> if <ref> is followed by a whitespace

EDIT: it has something to do with word boundaries, I guess... or?

EDIT2: What I need is for it to match the last (closing) </ref> even if it is on a new line.

Solution

I don't really see your problem, because the code you pasted will remove the <ref>...</ref> part of the string. But if what you mean is that an empty ref tag is not removed:

re.sub(r'<ref>.+</ref>', '', '...and so on<ref></ref>.')

Then what you need to do is replace the .+ with .*

A + means one or more, while * means zero or more.

From http://docs.python.org/library/re.html:

'.' (Dot.) In the default mode, this matches any character except a newline.
    If the DOTALL flag has been specified, this matches any character including
    a newline.
'*' Causes the resulting RE to match 0 or more repetitions of the preceding
    RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’
    followed by any number of ‘b’s.
'+' Causes the resulting RE to match 1 or more repetitions of the preceding
    RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will
    not match just ‘a’.
'?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
    ab? will match either ‘a’ or ‘ab’.
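
To see the difference in practice, here is a small sketch. The sample strings and variable names are made up for illustration; only re.sub and re.DOTALL come from the standard library.

```python
import re

empty = '...and so on<ref></ref>.'
# '.+' needs at least one character between the tags, so the empty tag survives.
plus_result = re.sub(r'<ref>.+</ref>', '', empty)
# '.*' also accepts zero characters, so the empty tag is removed.
star_result = re.sub(r'<ref>.*</ref>', '', empty)

multiline = 'so on<ref>Oxford\nUniversity Press</ref>.'
# In the default mode '.' does not match '\n', so nothing is removed.
no_flag = re.sub(r'<ref>.*</ref>', '', multiline)
# With re.DOTALL '.' matches newlines too, so the whole tag is removed.
with_flag = re.sub(r'<ref>.*</ref>', '', multiline, flags=re.DOTALL)

print(repr(plus_result))  # '...and so on<ref></ref>.'
print(repr(star_result))  # '...and so on.'
print(repr(no_flag))      # 'so on<ref>Oxford\nUniversity Press</ref>.'
print(repr(with_flag))    # 'so on.'
```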

OTHER TIPS

You could make a fancy regex to do just what you intend, but you need to use DOTALL and non-greedy search, and you need to understand how regexes work in general, which you don't.
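
For reference, a regex along those lines might look like the sketch below. It assumes the tags are always written exactly as <ref> and </ref> and are never nested; the sample text is invented.

```python
import re

text = 'first<ref>Oxford\nUniversity Press</ref> second<ref>XX\n</ref>.'

# '.*?' is non-greedy: it stops at the nearest '</ref>' instead of the last one.
# re.DOTALL lets '.' cross newlines, so a closing tag on a later line still matches.
pattern = re.compile(r'<ref>.*?</ref>', re.DOTALL)
cleaned = pattern.sub('', text)

print(repr(cleaned))  # 'first second.'
```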

Your best option is to use string methods rather than regexes, which is more pythonic anyway:

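# Repeatedly cut out the first <ref>...</ref> pair until none remain.
while '<ref>' in string:
    begin, end = string.split('<ref>', 1)
    # Raises ValueError if a <ref> has no closing </ref>.
    trash, end = end.split('</ref>', 1)
    string = begin + end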

If you want to be very generic, allowing strange capitalization of the tags, or whitespace and attributes inside the tags, you shouldn't do this either, but invest in learning an HTML/XML parsing library. lxml currently seems to be widely recommended and well-supported.

You might want to be careful not to remove a whole lot of text just because there is more than one closing </ref>. The regex below would be more accurate, in my opinion:

r'<ref>[^<]*</ref>'

This would prevent the 'greedy' matching, because [^<]* cannot run past the next '<'.
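
A quick sketch of how that pattern behaves (sample strings invented for illustration). Note that a negated character class also matches newlines, and that the pattern deliberately refuses to match when the ref contains another tag:

```python
import re

pattern = r'<ref>[^<]*</ref>'

text = 'a<ref>SDD</ref> b<ref>XX\n</ref> c'
# Each <ref>...</ref> is removed separately; '[^<]' matches '\n' as well.
cleaned = re.sub(pattern, '', text)

nested = 'x<ref>see <i>Title</i></ref> y'
# '[^<]*' stops at the '<' of '<i>', so this ref is left untouched.
untouched = re.sub(pattern, '', nested)

print(repr(cleaned))    # 'a b c'
print(repr(untouched))  # 'x<ref>see <i>Title</i></ref> y'
```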

BTW: There is a great tool called The Regex Coach to analyze and test your regexes. You can find it at: http://www.weitz.de/regex-coach/


If you try to do this with regular expressions you're in for a world of trouble. You're effectively trying to parse something but your parser isn't up to the task.

Matching greedily across strings probably eats up too much, as in this example:

<ref>SDD</ref>...<ref>XX</ref>

You'd end up cleaning up the entire middle.
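
Running the question's pattern on that string shows the effect; the lazy variant is included only for contrast (variable names are illustrative):

```python
import re

s = '<ref>SDD</ref>...<ref>XX</ref>'

# Greedy '.+' runs to the LAST '</ref>', swallowing the '...' in between.
greedy = re.sub(r'<ref>.+</ref>', '', s)
# Lazy '.+?' stops at the first '</ref>' it can, so the middle survives.
lazy = re.sub(r'<ref>.+?</ref>', '', s)

print(repr(greedy))  # ''
print(repr(lazy))    # '...'
```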

You really want a parser, something like Beautiful Soup.

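# Beautiful Soup 4 (pip install beautifulsoup4); BeautifulSoup 3 only runs on Python 2.
from bs4 import BeautifulSoup

s = "<a>sfsdf</a> <ref>XX</ref> || <ref>YY</ref>"
soup = BeautifulSoup(s, "html.parser")
# Replace every <ref> element with a placeholder string.
for ref in soup.find_all("ref"):
    ref.replace_with("!")
print(soup)  # <a>sfsdf</a> ! || !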

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow