Java regex: how to back-reference capturing groups in a certain context when their number is not known in advance

StackOverflow https://stackoverflow.com/questions/21428545

  •  04-10-2022

Question

As an introductory note, I am aware of the old saying about solving problems with regex, and I am also aware of the precautions on processing XML with regex. But please bear with me for a moment...

I am trying to do a regex search and replace on a group of characters. I don't know in advance how often this group will be matched, and I want to search within a certain context only.

An example: if I have the string "abdfabsdfabfdsaabbb" and want to replace every "ab" with "@ab@", the following regex works fine:

Search regex:

(.*?)(ab)(.*?)

Replace:

$1@$2@$3

I get four matches in total, as expected. Within each match, the group IDs are the same, so the back-references ($1, $2 ...) work fine, too.
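For reference, the context-free replacement can be reproduced in Java roughly as follows. This is only a sketch: the class name SimpleReplaceDemo and the helper wrapAb are hypothetical names introduced for illustration, not part of the question.

```java
final class SimpleReplaceDemo {
    // Hypothetical helper: applies the search/replace pair from the question.
    static String wrapAb(String input) {
        // Each match is "<lazy prefix>ab<empty lazy suffix>", so $1, $2 and $3
        // refer to the same groups in every match.
        return input.replaceAll("(.*?)(ab)(.*?)", "$1@$2@$3");
    }

    public static void main(String[] args) {
        System.out.println(wrapAb("abdfabsdfabfdsaabbb"));
        // prints @ab@df@ab@sdf@ab@fdsa@ab@bb
    }
}
```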

However, if I now add a certain context to the string, the regex above fails:

Search string:

<context>abdfabsdfabfdsaabbb</context>

Search regex:

<context>(.*?)(ab)(.*?)</context>

This finds only the first match. Wrapping the original regex in a repeated non-capturing group doesn't help either ("<context>(?:(.*?)(ab)(.*?))*</context>").
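The difference can be made visible by counting matches in Java. The sketch below uses a hypothetical helper, countMatches, that is not part of the question. With the context tags in the pattern, a single match consumes the opening tag, one "ab" and everything up to the closing tag, so there is nothing left for a second match. In the repeated-group variant the whole string is again one match, and in Java a repeated capturing group keeps only the text of its last iteration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContextFailureDemo {
    // Hypothetical helper: counts how often the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("(.*?)(ab)(.*?)", "abdfabsdfabfdsaabbb"));
        // prints 4
        System.out.println(countMatches("<context>(.*?)(ab)(.*?)</context>",
                "<context>abdfabsdfabfdsaabbb</context>"));
        // prints 1
    }
}
```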

What I would like is a list of matches as in the first search (without the context), whereby within each match the group IDs are the same.

Any idea how this could be achieved?


Answer

Solution

Your requirement is similar to the one in this question: match and capture multiple instances of a pattern between a prefix and a suffix. Using the method described in this answer of mine, the regex becomes:

(?s)(?:<context>|(?!^)\G)(?:(?!</context>|ab).)*ab

Add capturing groups as needed.
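One possible way to add the groups for the "@ab@" replacement from the question is sketched below. The group placement is an assumption of this example (group 1 holds the tag or continuation point plus the skipped text, group 2 holds ab), and the names ContextReplaceDemo and wrapAbInContext are hypothetical.

```java
import java.util.regex.Pattern;

final class ContextReplaceDemo {
    // Group 1: "<context>" or the \G continuation point, plus the skipped text.
    // Group 2: the target "ab".
    private static final Pattern AB_IN_CONTEXT = Pattern.compile(
            "(?s)((?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*)(ab)");

    // Hypothetical helper: wraps every "ab" inside <context>...</context> in "@".
    static String wrapAbInContext(String input) {
        return AB_IN_CONTEXT.matcher(input).replaceAll("$1@$2@");
    }

    public static void main(String[] args) {
        System.out.println(wrapAbInContext("<context>abdfabsdfabfdsaabbb</context>"));
        // prints <context>@ab@df@ab@sdf@ab@fdsa@ab@bb</context>
        System.out.println(wrapAbInContext("ab<context>xaby</context>ab"));
        // prints ab<context>x@ab@y</context>ab
    }
}
```

The second call suggests that occurrences of ab outside the tags are left alone, because neither alternative of the leading group can match there.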

Caveat

Note that the regex only works for tags that contain nothing but text. If a tag contains other tags, it won't work correctly.

It also matches ab inside a <context> tag that has no closing </context> tag. To prevent this, add a look-ahead that requires a closing tag:

(?s)(?:<context>(?=.*?</context>)|(?!^)\G)(?:(?!</context>|ab).)*ab
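The effect of the extra look-ahead can be checked with a small Java comparison. This is a sketch under the same assumed group placement as a typical "@ab@" replacement (group 1 for the prefix, group 2 for ab); the names ClosedContextDemo, ANY_CONTEXT, CLOSED_CONTEXT and wrap are hypothetical.

```java
import java.util.regex.Pattern;

final class ClosedContextDemo {
    // Variant without the look-ahead: also matches after an unclosed <context>.
    static final Pattern ANY_CONTEXT = Pattern.compile(
            "(?s)((?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*)(ab)");

    // Variant with the look-ahead: requires a </context> somewhere after the tag.
    static final Pattern CLOSED_CONTEXT = Pattern.compile(
            "(?s)((?:<context>(?=.*?</context>)|(?!^)\\G)(?:(?!</context>|ab).)*)(ab)");

    // Hypothetical helper: wraps every matched "ab" in "@".
    static String wrap(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("$1@$2@");
    }

    public static void main(String[] args) {
        String unclosed = "<context>ab cd ab";
        System.out.println(wrap(ANY_CONTEXT, unclosed));
        // prints <context>@ab@ cd @ab@
        System.out.println(wrap(CLOSED_CONTEXT, unclosed));
        // prints <context>ab cd ab
    }
}
```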

Explanation

Let us break down the regex:

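(?s)                        # Make . match any character, including line terminators
(?:
  <context>                 # Either enter a new <context> tag
    |
  (?!^)\G                   # or continue from the end of the previous match
)
(?:(?!</context>|ab).)*     # Skip text that contains neither </context> nor ab
ab                          # The text we actually want to match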

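(?:<context>|(?!^)\G) makes sure that we either enter a new <context> tag or continue from the end of the previous match and try to match more instances of the sub-pattern. The (?!^) look-ahead is needed because \G also matches at the start of the input before any match has been made.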

(?:(?!</context>|ab).)* matches whatever text we don't care about (anything that is not ab) and prevents us from going past the closing tag </context>. Finally, ab at the end matches the pattern we want.
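To get the list of matches the question asks for, with the same group numbers in every match, the regex can be driven with Matcher.find(). The sketch below assumes one particular group placement (group 1 for the skipped text, group 2 for ab); the names ContextMatchList and describeMatches are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContextMatchList {
    // Group 1: text skipped before the target; group 2: the target "ab".
    private static final Pattern AB_IN_CONTEXT = Pattern.compile(
            "(?s)(?:<context>|(?!^)\\G)((?:(?!</context>|ab).)*)(ab)");

    // Hypothetical helper: renders each match as "[group1|group2]".
    static List<String> describeMatches(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = AB_IN_CONTEXT.matcher(input);
        // Each find() either starts at a new <context> tag or, via \G,
        // resumes exactly where the previous match ended.
        while (m.find()) {
            matches.add("[" + m.group(1) + "|" + m.group(2) + "]");
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("<context>abdfabsdfabfdsaabbb</context>"));
        // prints [[|ab], [df|ab], [sdf|ab], [fdsa|ab]]
    }
}
```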

License: CC-BY-SA with attribution
Not affiliated with StackOverflow