Regex select XML Element (containing hyphen) and inside content

https://stackoverflow.com/questions/18361169

26-06-2022

Question

I'm working with an enterprise CMS and, in order to create our weekly-updated dropdown menu without republishing our entire site, I have an XML document being generated that contains a number of useful XML elements. However, when pulling in a link with the CMS, the generated XML also outputs the link's contents (the entire HTML for the page). Needless to say, with roughly 50 items, the XML file is too big for use on the web (as it stands, I think it's over 600KB). The element is <page-content>filler here</page-content>.

What I'm trying to do is use TextWrangler to find and replace all <page-content> tags along with the content they contain.

I've tried a few different regexes, but I can't seem to match the closing tag, so the match just trails on.

Here's what I've tried:

(<page-content>)(.*?)

The above will match up until the next starting <page-content> tag, which is not what I want.

(<page-content>)(.*?)(<\/page-content>)
(<page-content>)(.*?)(<\/page\-content>)

The above finds no matches, even though the below will find the 7 matches it should.

(<content>)(.*?)(<\/content>)

I don't know if there's a special way to deal with hyphens (I'm inexperienced in regular expressions), but if anyone could help me out, it would be greatly appreciated.

Thanks!

EDIT: Before you tell me that regex isn't meant to parse HTML, I know that, but there seems to be no other way for me to easily find and replace this. There are too many occurrences to manually delete them and save the file again every week.

Solution

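It seems the problem is that your . is not matching the newlines that exist between your opening and closing tags. The hyphen is not the issue: outside a character class it is an ordinary literal character and does not need escaping.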

An easy solution for this would be to add the s flag so that your . also matches newlines. TextWrangler appears to support the inline modifier (?s). You could do it like this:

(<page-content>)(?s)(.*?)(<\/page-content>)

More information on modifiers here.
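
The effect of the s flag can also be checked outside the editor. Below is a minimal sketch in Python, not TextWrangler itself; the sample XML and variable names are made up for illustration. Recent Python versions require an inline flag such as (?s) to appear at the very start of the pattern, so it is placed there rather than in the middle.

```python
import re

# Hypothetical sample: <content> sits on one line, <page-content> spans several.
xml = (
    "<item>\n"
    "  <content>short</content>\n"
    "  <page-content>\n"
    "    <html>lots of markup</html>\n"
    "  </page-content>\n"
    "</item>\n"
)

# The hyphen needs no escaping; it is a literal outside a character class.
pattern = r"<page-content>.*?</page-content>"

# Without the s flag, . stops at newlines, so the multi-line element is missed.
without_flag = re.findall(pattern, xml)

# With the inline (?s) modifier, . also matches newlines.
with_flag = re.findall(r"(?s)" + pattern, xml)

# Replace each matched element with an empty string to strip it out.
stripped = re.sub(r"(?s)" + pattern, "", xml)
print(stripped)
```

The lazy quantifier .*? matters once the s flag is on: a greedy .* would run from the first opening tag to the last closing tag in the file and remove everything in between.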

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow