Question

I have some SGML that I'm trying to clean up by adding closing tags to the opening ones. Right now, the document has a structure like this:

<CAT>
<NAME>Daniel
<COLOR>White
<DESC>Daniel is a white cat <p>He was born in July</p><br />He's super cute.<p><br />He does not have any siblings.
<COUNTRY>USA
</CAT>

So far I can match an opening tag and capture its content as a group, as long as the content doesn't contain any <p>, </p>, or <br /> elements, using this regexp: <NAME>([^\\<]+)[^<]

But if I use <DESC>([^\\<]+)[^<], the match stops right before the first <p>.

I'm using < as the end of the pattern because none of the other nodes contain HTML elements that would stop the match early.

How can I make a regexp that matches the <DESC> node that includes <p>, </p>, <br /> and ends before the <COUNTRY> node?


Solution

How about this:

<DESC>((?:</?p>|<br />|[^\\<])+)

This allows these three tags to match and stops at the next < that doesn't belong to one of the three.
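
To see this in action, here is a minimal sketch in Python (the question doesn't say which regex engine is in use, so Python's re module is an assumption, as are the names sgml, DESC_PATTERN, desc_content and closed). It runs the pattern against the sample from the question and then uses the captured group to append a closing tag:

```python
import re

# Sample input copied from the question.
sgml = """<CAT>
<NAME>Daniel
<COLOR>White
<DESC>Daniel is a white cat <p>He was born in July</p><br />He's super cute.<p><br />He does not have any siblings.
<COUNTRY>USA
</CAT>
"""

# The pattern from the answer. A raw string is used so that \\ reaches the
# regex engine unchanged, where it means one literal backslash.
DESC_PATTERN = re.compile(r"<DESC>((?:</?p>|<br />|[^\\<])+)")

# Group 1 holds everything after <DESC> up to the next "<" that is not part
# of <p>, </p> or <br />. Note that it includes the trailing newline.
match = DESC_PATTERN.search(sgml)
desc_content = match.group(1)
print(repr(desc_content))

# Hypothetical follow-up step: strip the trailing whitespace from the
# captured content and put a closing tag right after it.
closed = DESC_PATTERN.sub(
    lambda m: "<DESC>" + m.group(1).rstrip() + "</DESC>\n",
    sgml,
)
print(closed)
```

Because the character class also matches newlines, the captured group runs to the end of the line and includes the line break, which is why the sketch strips it before appending </DESC>.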

By the way, why aren't you allowing the backslash as a valid character?
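
To illustrate the point, here is a small sketch, again assuming Python's re module and a made-up input string (with_backslash is purely hypothetical data). When the pattern reaches the engine exactly as written, [^\\<] rejects backslashes as well as <, so the match stops at the first backslash:

```python
import re

# Hypothetical content containing backslashes.
with_backslash = r"<NAME>C:\cats\Daniel"

# Character class from the question: excludes both "\" and "<".
excluding = re.search(r"<NAME>([^\\<]+)", with_backslash).group(1)

# Character class that only excludes "<".
allowing = re.search(r"<NAME>([^<]+)", with_backslash).group(1)

print(excluding)
print(allowing)
```

If the doubled backslash in the question only comes from string-literal escaping in the host language, the engine may actually be receiving [^\<], which excludes only <.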

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow