Java regex: how to back-reference capturing groups in a certain context when their number is not known in advance

StackOverflow https://stackoverflow.com/questions/21428545

  •  04-10-2022

Question

As an introductory note, I am aware of the old saying about solving problems with regex, and I am also aware of the warnings against processing XML with regex. But please bear with me for a moment...

I am trying to do a regex search and replace on a group of characters. I don't know in advance how often this group will be matched, and I want to search only within a certain context.

An example: If I have the following string "**ab**df**ab**sdf**ab**fdsa**ab**bb" and I want to search for "ab" and replace with "@ab@", this works fine using the following regex:

Search regex:

(.*?)(ab)(.*?)

Replace:

$1@$2@$3

I get four matches in total, as expected. Within each match, the group IDs are the same, so the back-references ($1, $2 ...) work fine, too.
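For reference, the context-free replacement above can be reproduced in Java roughly as follows. This is a minimal sketch; the class and method names are illustrative and not part of the question.

```java
import java.util.regex.Pattern;

final class PlainReplaceDemo {
    // Pattern and replacement string are taken from the question unchanged.
    private static final Pattern AB = Pattern.compile("(.*?)(ab)(.*?)");

    // Wraps every "ab" in '@' signs using the $1, $2, $3 back-references.
    static String wrapAb(String input) {
        return AB.matcher(input).replaceAll("$1@$2@$3");
    }

    public static void main(String[] args) {
        // prints "@ab@df@ab@sdf@ab@fdsa@ab@bb"
        System.out.println(wrapAb("abdfabsdfabfdsaabbb"));
    }
}
```

Because the trailing (.*?) is lazy and nothing follows it, group 3 is always empty here; the text between two occurrences ends up in group 1 of the next match.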

However, if I now add a certain context to the string, the regex above fails:

Search string:

<context>abdfabsdfabfdsaabbb</context>

Search regex:

<context>(.*?)(ab)(.*?)</context>

This finds only the first match. Wrapping the original regex in a repeated non-capturing group does not help either ("<context>(?:(.*?)(ab)(.*?))*</context>").
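The single-match behaviour can be reproduced with a short Java sketch (class and method names are illustrative). Since <context> and </context> each occur once, the whole element is consumed by one match, and the remaining occurrences of ab end up inside group 3.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContextSingleMatchDemo {
    // The context-bound regex from the question.
    private static final Pattern IN_CONTEXT =
            Pattern.compile("<context>(.*?)(ab)(.*?)</context>");

    // Counts how many times find() succeeds on the input.
    static int countMatches(String input) {
        Matcher m = IN_CONTEXT.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Returns group 3 of the first match, or null if nothing matches.
    static String thirdGroupOfFirstMatch(String input) {
        Matcher m = IN_CONTEXT.matcher(input);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        String input = "<context>abdfabsdfabfdsaabbb</context>";
        System.out.println(countMatches(input));            // prints 1
        System.out.println(thirdGroupOfFirstMatch(input));  // prints dfabsdfabfdsaabbb
    }
}
```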

What I would like is a list of matches as in the first search (without the context), whereby within each match the group IDs are the same.

Any idea how this could be achieved?

Solution

Your requirement is similar to the one in this question: match and capture multiple instances of a pattern between a prefix and a suffix. Using the method described in this answer of mine, the regex becomes:

(?s)(?:<context>|(?!^)\G)(?:(?!</context>|ab).)*ab

Add capturing groups as needed.
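As one possible way to add the groups, the sketch below captures everything before each ab in group 1 and the ab itself in group 2, then performs the @ab@ replacement from the question. The group layout, the class name and the sample input with ab outside the tag are assumptions made for illustration.

```java
import java.util.regex.Pattern;

final class ContextReplaceDemo {
    // Regex from the answer with two capturing groups added:
    //   group 1 = "<context>" or the continuation point, plus the skipped text
    //   group 2 = the "ab" to be wrapped
    private static final Pattern AB_IN_CONTEXT = Pattern.compile(
            "(?s)((?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*)(ab)");

    // Wraps every "ab" inside <context>...</context> in '@' signs.
    static String wrapAbInContext(String input) {
        return AB_IN_CONTEXT.matcher(input).replaceAll("$1@$2@");
    }

    public static void main(String[] args) {
        String input = "ab<context>abdfabsdfabfdsaabbb</context>ab";
        // prints "ab<context>@ab@df@ab@sdf@ab@fdsa@ab@bb</context>ab"
        System.out.println(wrapAbInContext(input));
    }
}
```

The group numbers are the same in every match, so a single replacement string works for all occurrences, and the ab before and after the element is left untouched.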

Caveat

Note that the regex works only for tags that may contain nothing but text. If a tag contains other tags, it will not work correctly.

It also matches ab inside a <context> tag that has no closing </context> tag. To prevent this, add a lookahead that requires the closing tag:

(?s)(?:<context>(?=.*?</context>)|(?!^)\G)(?:(?!</context>|ab).)*ab
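The difference between the two variants can be checked with a small Java sketch that counts matches on a closed and an unclosed element. The class name, constant names and sample inputs are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ClosingTagCheckDemo {
    // Variant without the lookahead: does not require a closing tag.
    static final Pattern LENIENT = Pattern.compile(
            "(?s)(?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*ab");

    // Variant with the lookahead: <context> must be followed by </context>.
    static final Pattern STRICT = Pattern.compile(
            "(?s)(?:<context>(?=.*?</context>)|(?!^)\\G)(?:(?!</context>|ab).)*ab");

    // Counts how many times the pattern is found in the input.
    static int countMatches(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String closed = "<context>abxab</context>";
        String unclosed = "<context>abxab";
        System.out.println(countMatches(LENIENT, closed));    // prints 2
        System.out.println(countMatches(STRICT, closed));     // prints 2
        System.out.println(countMatches(LENIENT, unclosed));  // prints 2
        System.out.println(countMatches(STRICT, unclosed));   // prints 0
    }
}
```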

Explanation

Let us break down the regex:

(?s)                        # Make . match any character, including line terminators
(?:
  <context>
    |
  (?!^)\G
)
(?:(?!</context>|ab).)*
ab

(?:<context>|(?!^)\G) makes sure that we either enter a new <context> tag or continue from the end of the previous match and attempt to match more instances of the sub-pattern. \G also matches at the start of the input when there is no previous match, so (?!^) is there to rule that position out.

(?:(?!</context>|ab).)* matches whatever text we don't care about (anything that is not ab) and prevents us from going past the closing tag </context>. Then we match the pattern we want, ab, at the end.
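To see the \G continuation at work, the following Java sketch records the start and end offsets of each match. Apart from the first match, which begins at <context>, every match starts exactly where the previous one ended. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContinuationDemo {
    // The regex from the answer, unchanged.
    private static final Pattern AB_IN_CONTEXT = Pattern.compile(
            "(?s)(?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*ab");

    // Returns {start, end} offsets for every match, in order.
    static List<int[]> matchSpans(String input) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = AB_IN_CONTEXT.matcher(input);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }
        return spans;
    }

    public static void main(String[] args) {
        // Prints: 0..11, 11..15, 15..20, 20..26 (one span per line)
        for (int[] span : matchSpans("<context>abdfabsdfabfdsaabbb</context>")) {
            System.out.println(span[0] + ".." + span[1]);
        }
    }
}
```

After the fourth match the engine tries once more at offset 26, cannot find ab before </context>, and stops, because \G does not match at any later position.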

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow