Question

I have been working hard to get a regular expression to work for me, but I'm stuck on the last part. My goal is to remove an xml element when it is contained within specific parent elements. The example xml looks like so:

<ac:image ac:width="500">
    <ri:attachment ri:filename="image2013-10-31 11:21:16.png">
        <ri:page ri:content-title="Banana Farts" />             /* REMOVE THIS */
    </ri:attachment>
</ac:image>

The expression I have written is:

(<ac:image.*?>)(<ri:attachment.*?)(<ri:page.*? />)(</ri:attachment></ac:image>)

In more readable format, I am searching on four groups

(<ac:image.*?>)                   //Find open image tag
(<ri:attachment.*?)               //Find open attachment tag
(<ri:page.*? />)                  //Find the page tag
(</ri:attachment></ac:image>)     //Find close image and attachment tags

This basically works: I can remove the page element in Notepad++ by replacing the match with:

\1\2\4

My issue is that the search is too greedy. In an example like below it grabs everything from start to finish, when really only the second image tag is a valid find.

<ac:image ac:width="500">
    <ri:attachment ri:filename="image2013-10-31 11:21:16.png" />
</ac:image>
<ac:image ac:width="500">
    <ri:attachment ri:filename="image2013-10-31 11:21:16.png">
        <ri:page ri:content-title="Employee Portal Editor" />
    </ri:attachment>
</ac:image>

Can anyone help me finish this up? I thought all I had to do was add ? to make the closing tag group not greedy, but it failed to work.

Solution

Keep in mind that a regex engine will try every possibility to make the pattern succeed. Because your pattern uses several .*?, the engine has a lot of freedom to do so: a lazy quantifier still expands as far as it has to for the rest of the pattern to match, even across tag boundaries. The pattern must be more restrictive.

To do that, you can replace every .*? with [^>]*, which cannot run past the end of the current tag.
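
As a rough illustration of the difference, here is a small sketch in Python rather than Notepad++. The compact one-line sample and the file names a.png and b.png are made up for the demo, and the two patterns are only fragments of the full expression.

```python
import re

# Hypothetical compact sample: the first image has a self-closing attachment,
# the second one contains the <ri:page> element.
xml = (
    '<ac:image ac:width="500">'
    '<ri:attachment ri:filename="a.png" />'
    '</ac:image>'
    '<ac:image ac:width="500">'
    '<ri:attachment ri:filename="b.png">'
    '<ri:page ri:content-title="Employee Portal Editor" />'
    '</ri:attachment>'
    '</ac:image>'
)

# Lazy version: .*? keeps expanding until "<ri:page" is found,
# so the match starts at the FIRST attachment and crosses into the second image.
lazy = re.search(r'<ri:attachment.*?<ri:page', xml).group(0)

# Restricted version: [^>]* must stop at the end of the attachment tag,
# so only an attachment directly followed by <ri:page> can match.
strict = re.search(r'<ri:attachment[^>]*>\s*<ri:page', xml).group(0)

print(lazy)
print(strict)
```

The lazy match contains both a.png and b.png, while the restricted match contains only b.png.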

Don't forget to allow optional whitespace between the tags by adding \s* to the pattern.

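Example (laid out in free-spacing style for readability; remove the spaces and # comments, or enable free-spacing mode with (?x) if your engine supports it, before using it):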

(<ac:image[^>]*> \s* <ri:attachment[^>]*> )     # group 1
 \s* <ri:page[^>]*/> \s*                        # what you need to remove
(</ri:attachment> \s* </ac:image>)              # group 2

replacement: $1$2
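
The same pattern can also be tried outside Notepad++. The following is a minimal Python sketch, assuming Python's re module with re.VERBOSE so that the spaces and # comments inside the pattern are ignored. Python writes the replacement as \1\2 instead of $1$2. The sample text is the two-image snippet from the question.

```python
import re

xml = '''<ac:image ac:width="500">
    <ri:attachment ri:filename="image2013-10-31 11:21:16.png" />
</ac:image>
<ac:image ac:width="500">
    <ri:attachment ri:filename="image2013-10-31 11:21:16.png">
        <ri:page ri:content-title="Employee Portal Editor" />
    </ri:attachment>
</ac:image>'''

pattern = re.compile(r'''
    (<ac:image[^>]*> \s* <ri:attachment[^>]*> )   # group 1: opening tags
     \s* <ri:page[^>]*/> \s*                      # the element to remove
    (</ri:attachment> \s* </ac:image>)            # group 2: closing tags
''', re.VERBOSE)

# Keep groups 1 and 2, drop the <ri:page> element between them.
result, count = pattern.subn(r'\1\2', xml)

print(count)
print(result)
```

Only the second image is rewritten. The first image has no ri:page element after its attachment tag, so the pattern cannot match there and it is left unchanged.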

Licensed under: CC-BY-SA with attribution