Question

I have a string like this:

<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>

I would like to strip the first 3 opening and the last 3 closing tags from the string. I do not know the tag names in advance.

I can strip the first 3 tags with re.sub(r'<[^<>]+>', '', in_str, 3). How do I strip the closing tags? What should remain is:

<v1>aaa<b>bbb</b>ccc</v1>

I know I could maybe 'do it right', but I do not actually wish to do XML or HTML parsing for my purpose, which is to help me visualize the XML representation of some classes.

Instead, I realized that this problem is interesting. It seems I cannot simply search backwards with a regex, i.e. right to left, because that seems to be unsupported:

If you mean, find the right-most match of several (similar to the rfind method of a string) then no, it is not directly supported. You could use re.findall() and choose the last match but if the matches can overlap this may not give the correct result.

But .rstrip() strips a set of characters rather than words, and it won't do patterns either.

I looked at Strip HTML from strings in Python but I only wish to strip up to 3 tags.

What approach could be used here? Should I reverse the string (ugly in itself, and more so because of the '<>'s)? Tokenize (but then why not parse)? Or build static closing tags from the left-to-right matches?

Which strategy should I follow to strip patterns from the end of the string?


Solution

The simplest would be to use old-fashioned string splitting and limit the number of splits:

in_str.split('>', 3)[-1].rsplit('<', 3)[0]

Demo:

>>> in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
>>> in_str.split('>', 3)[-1].rsplit('<', 3)[0]
'<v1>aaa<b>bbb</b>ccc</v1>'

str.split() and str.rsplit() with a limit split the string from the start or from the end at most that many times, leaving the remainder unsplit so you can select it.
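To make the two steps easier to follow, here is a minimal sketch that wraps the one-liner in a function. The name strip_outer_tags and the parameter n are made up for illustration. It assumes the first n '>' characters belong to the opening tags to drop and the last n '<' characters belong to the closing tags to drop.

```python
def strip_outer_tags(s, n=3):
    """Hypothetical helper: drop n leading and n trailing tags by splitting."""
    # split('>', n) cuts at the first n '>' characters only;
    # the last list item is everything after the n-th '>'.
    inner = s.split('>', n)[-1]
    # rsplit('<', n) cuts at the last n '<' characters only;
    # the first list item is everything before the n-th '<' from the end.
    return inner.rsplit('<', n)[0]


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_outer_tags(in_str))
```

Because only '>' and '<' are counted, the tag names never need to be known in advance.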

OTHER TIPS

You've already got practically the whole solution. re can't work backwards, but you can:

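import re

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
# Strip the first 3 tags, working left to right.
in_str = re.sub(r'<[^<>]+>', '', in_str, count=3)
# Reverse the string so the closing tags come first.
in_str = in_str[::-1]
# Strip the first 3 reversed closing tags, e.g. '>oof/<'.
in_str = re.sub(r'>[^<>]+/<', '', in_str, count=3)
# Reverse again to restore the original orientation.
in_str = in_str[::-1]

print(in_str)
# prints: <v1>aaa<b>bbb</b>ccc</v1>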

Note that the regex is reversed to match the reversed string, so it effectively works back-to-front.
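If reversing feels too ugly, a variation on the same idea is to collect all matches with re.finditer() and cut out the last few by position. This is only a sketch: the names TAG and strip_last_tags are hypothetical, and it removes the last n tag-like matches of any kind, not just closing tags.

```python
import re

# Same tag pattern as in the question.
TAG = re.compile(r'<[^<>]+>')


def strip_last_tags(s, n=3):
    """Hypothetical helper: remove the last n tag-like matches from s."""
    if n <= 0:
        return s
    matches = list(TAG.finditer(s))
    # Delete from right to left so earlier match offsets stay valid.
    for m in reversed(matches[-n:]):
        s = s[:m.start()] + s[m.end():]
    return s


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
in_str = TAG.sub('', in_str, count=3)  # first 3 tags, left to right
print(strip_last_tags(in_str))         # last 3 tags, by position
```

The matches of this pattern cannot overlap, so picking the last ones from the list is safe here.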

Of course, as mentioned, this is way easier with a proper parser:

from lxml import etree

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
ix = etree.fromstring(in_str)
# ix[0][0][0] is the element three levels below the root, i.e. <v1>.
print(etree.tostring(ix[0][0][0], encoding='unicode'))
# prints: <v1>aaa<b>bbb</b>ccc</v1>

I would look into regular expressions and use a pattern to split the string:

http://docs.python.org/3/library/re.html?highlight=regex#re.regex.split
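One way this could look, as a rough sketch rather than a definitive answer: re.split() with a capturing group keeps the tags in the result, so the first and last few tags can be blanked out before joining again. The function name strip_with_split is invented for this example.

```python
import re


def strip_with_split(s, n=3):
    """Hypothetical helper: blank the first n and last n tags via re.split()."""
    # With one capturing group, re.split() returns text and tags alternately:
    # even positions are text, odd positions are the matched tags.
    parts = re.split(r'(<[^<>]+>)', s)
    tag_positions = list(range(1, len(parts), 2))
    if n > 0:
        for i in tag_positions[:n] + tag_positions[-n:]:
            parts[i] = ''
    return ''.join(parts)


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_with_split(in_str))
```

This is really a small tokenizer, which is the trade-off the question already hints at.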

Sorry, I can't comment, so I'll give this as an answer.

in_str.split('>', 3)[-1].rsplit('<', 3)[0] will work for the given example <foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>, but not for <foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo><another>test</another>, because there the last three '<' characters no longer belong to the three outer closing tags. You should just be aware of this.

To solve my counterexample, you will have to track the state (or count) of the tags and check that you match the correct pairs.
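A minimal sketch of the counting idea, under some loud assumptions: tags look like <name> or </name>, contain no '/' other than the leading one of a closing tag, and are properly nested. The names TAG and extract_at_depth are hypothetical, and the code only counts nesting depth; it does not verify that the opening and closing names actually match.

```python
import re

# Group 1 is '/' for a closing tag and '' for an opening tag.
TAG = re.compile(r'<(/?)([^<>/]+)>')


def extract_at_depth(s, depth=3):
    """Hypothetical helper: return the contents of every element at `depth`."""
    results = []
    level = 0
    start = None
    for m in TAG.finditer(s):
        if m.group(1):  # closing tag
            if level == depth and start is not None:
                # Leaving an element at the target depth: keep its contents.
                results.append(s[start:m.start()])
                start = None
            level -= 1
        else:  # opening tag
            level += 1
            if level == depth and start is None:
                # Entering an element at the target depth.
                start = m.end()
    return results


tricky = ('<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
          '<another>test</another>')
print(tricky.split('>', 3)[-1].rsplit('<', 3)[0])
# prints: <v1>aaa<b>bbb</b>ccc</v1></k2></bar>
print(extract_at_depth(tricky))
# prints: ['<v1>aaa<b>bbb</b>ccc</v1>']
```

Once you get this far, a real parser is probably the less fragile choice, but the counter is enough for the counterexample.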

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow