Using regex to match string between two strings while excluding strings

https://stackoverflow.com/questions/1993173

22-09-2019

Question

Following on from a previous question in which I asked:

How can I use a regular expression to match text that is between two strings, where those two strings are themselves enclosed by two other strings, with any amount of text between the inner and outer enclosing strings?

I got this answer:

/outer-start.*?inner-start(.*?)inner-end.*?outer-end/

I would now like to know how to exclude certain strings from the text between the outer enclosing strings and the inner enclosing strings.

For example, if I have this text:

outer-start some text inner-start text-that-i-want inner-end some more text outer-end

I would like 'some text' and 'some more text' not to contain the word 'unwanted'.

In other words, this is OK:

outer-start some wanted text inner-start text-that-i-want inner-end some more wanted text outer-end

But this is not OK:

outer-start some unwanted text inner-start text-that-i-want inner-end some more unwanted text outer-end

Or to explain further, the expression between outer and inner delimiters in the previous answer above should exclude the word 'unwanted'.

Is this easy to match using regexes?

Solution

Replace the first and last (but not the middle) .*? with (?:(?!unwanted).)*?. (Where (?:...) is a non-capturing group, and (?!...) is a negative lookahead.)

However, this quickly runs into corner cases and caveats in any real (rather than example) use. If you ask about what you're really doing (with real examples, even simplified ones, instead of made-up examples), you'll likely get better answers.
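As a minimal sketch of the substitution described above, here it is with Python's re module. Python is an assumption (the question does not name a language), and the helper name extract is made up for illustration. The two sample strings are the ones from the question.

```python
import re

# Tempered dot: consume one character at a time, but never step onto
# the start of the word "unwanted".
GAP = r"(?:(?!unwanted).)*?"

# First and last .*? replaced by GAP; the middle (.*?) is left alone.
PATTERN = re.compile(
    "outer-start" + GAP + r"inner-start(.*?)inner-end" + GAP + "outer-end"
)

def extract(text):
    """Return the text between inner-start and inner-end, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

ok = ("outer-start some wanted text inner-start text-that-i-want "
      "inner-end some more wanted text outer-end")
bad = ("outer-start some unwanted text inner-start text-that-i-want "
       "inner-end some more unwanted text outer-end")

print(repr(extract(ok)))   # ' text-that-i-want '
print(extract(bad))        # None
```

Without the DOTALL flag the dot does not cross line breaks in Python, so this sketch only handles blocks that sit on a single line.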

OTHER TIPS

A better question to ask yourself than "how do I do this with regular expressions?" is "how do I solve this problem?". In other words, don't get hung up on trying to solve a big problem with regular expressions. If you can solve half the problem with regular expressions, do so, then solve the other half with another regular expression or some other technique.

For example, make a first pass over your data that collects all matches regardless of the unwanted text (that is, get results both with and without it). Then make a second pass over the reduced set of data and weed out the results that contain the unwanted text. This sort of solution is easier to write, easier to understand and easier to maintain over time. And for any problem you're likely to need to solve with this approach, it will be fast enough.
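The two-pass idea could look roughly like this in Python. This is only a sketch: the language, the function name wanted_inner_texts and the banned parameter are assumptions for illustration, and the first pass reuses the regex from the question with extra capture groups around the two outer gaps.

```python
import re

# Pass 1: the simple regex, capturing the gap before inner-start,
# the inner text, and the gap after inner-end.
BLOCK = re.compile(
    r"outer-start(.*?)inner-start(.*?)inner-end(.*?)outer-end",
    re.DOTALL,
)

def wanted_inner_texts(text, banned="unwanted"):
    """Collect inner texts whose surrounding gaps do not contain `banned`."""
    results = []
    for before, inner, after in BLOCK.findall(text):
        # Pass 2: plain substring checks on the two outer gaps.
        if banned in before or banned in after:
            continue
        results.append(inner.strip())
    return results

sample = (
    "outer-start some wanted text inner-start text-that-i-want "
    "inner-end some more wanted text outer-end\n"
    "outer-start some unwanted text inner-start other-text "
    "inner-end some more unwanted text outer-end\n"
)

print(wanted_inner_texts(sample))  # ['text-that-i-want']
```

The filter is an ordinary substring test, so changing or extending the list of banned words does not require touching the regex at all.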

You can replace .*? with

 ([^u]|u[^n]|un[^w]|unw[^a]|unwa[^n]|unwan[^t]|unwant[^e]|unwante[^d])*?

This is a solution in "pure" regex; the language you are using might allow you to use some more elegant construct.
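Here is a small Python sketch of that alternation, applied to both outer gaps. Python and the helper name extract_pure are assumptions; the group is also made non-capturing here so that the wanted text stays in Group 1. The last sample probes an edge case: a branch such as u[^n] consumes two characters at once, so an input like "uunwanted" gets past this pattern.

```python
import re

# Lookahead-free alternation: each branch consumes a prefix of "unwanted"
# followed by one character that breaks the word.
GAP = (r"(?:[^u]|u[^n]|un[^w]|unw[^a]|unwa[^n]"
       r"|unwan[^t]|unwant[^e]|unwante[^d])*?")

PATTERN = re.compile(
    "outer-start" + GAP + r"inner-start(.*?)inner-end" + GAP + "outer-end"
)

def extract_pure(text):
    """Return the inner text if the block matches, else None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

ok = ("outer-start some wanted text inner-start text-that-i-want "
      "inner-end some more wanted text outer-end")
bad = ("outer-start some unwanted text inner-start text-that-i-want "
       "inner-end some more unwanted text outer-end")
# Edge case: "uu" is eaten by the u[^n] branch, hiding the word that follows.
tricky = "outer-start uunwanted inner-start x inner-end outer-end"

print(repr(extract_pure(ok)))      # ' text-that-i-want '
print(extract_pure(bad))           # None
print(repr(extract_pure(tricky)))  # ' x '
```

So the alternation handles the examples from the question, but it is only an approximation of "does not contain unwanted".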

You can't easily do that with plain regexes, but some systems such as Perl have extensions that make it easier. One way is to use a negative look-ahead assertion:

/outer-start(?:u(?!nwanted)|[^u])*?inner-start(.*?)inner-end.*?outer-end/

The key is to split up the "unwanted" into ("u" not followed by "nwanted") or (not "u"). That allows the pattern to advance, but will still find and reject all "unwanted" strings.

People may start hating your code if you do much of this though. ;)
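A quick way to try this out, sketched in Python (an assumption, since the answer mentions Perl; Python's re module supports the same negative look-ahead syntax). The pattern in the answer only guards the gap before inner-start, so this sketch uses the same construct for the gap after inner-end as well. The helper name extract_split is made up for illustration.

```python
import re

# Either a "u" that does not begin "unwanted", or any character other than "u".
GAP = r"(?:u(?!nwanted)|[^u])*?"

PATTERN = re.compile(
    "outer-start" + GAP + r"inner-start(.*?)inner-end" + GAP + "outer-end"
)

def extract_split(text):
    """Return the inner text if neither outer gap contains 'unwanted'."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

ok = ("outer-start some wanted text inner-start text-that-i-want "
      "inner-end some more wanted text outer-end")
bad = ("outer-start some unwanted text inner-start text-that-i-want "
       "inner-end some more unwanted text outer-end")
# A doubled "u" in front of the word is still rejected, because every
# branch consumes exactly one character.
doubled = "outer-start uunwanted inner-start x inner-end outer-end"

print(repr(extract_split(ok)))  # ' text-that-i-want '
print(extract_split(bad))       # None
print(extract_split(doubled))   # None
```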

Try replacing the last .*? with: (?!(.*unwanted text.*))

Did it work?

Tola, resurrecting this question because it had a fairly simple regex solution that wasn't mentioned. This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

The idea is to build an alternation (a series of |) where the left sides match what we don't want in order to get it out of the way... then the last side of the | matches what we do want, and captures it to Group 1. If Group 1 is set, you retrieve it and you have a match.

So what do we not want?

First, we want to eliminate the whole outer block if there is unwanted between outer-start and inner-start. You can do it with:

outer-start(?:(?!inner-start).)*?unwanted.*?outer-end

This will be to the left of the first |. It matches a whole outer block.

Second, we want to eliminate the whole outer block if there is unwanted between inner-end and outer-end. You can do it with:

outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end

This will be the middle |. It looks a bit complicated because we want to make sure that the "lazy" *? does not jump over the end of a block into a different block.

Third, we match and capture what we want. This is:

inner-start\s*(text-that-i-want)\s*inner-end

So the whole regex, in free-spacing mode, is:

(?xs)
outer-start(?:(?!inner-start).)*?unwanted.*?outer-end # don't want this
| # OR (also don't want that)
outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
| # OR capture what we want
inner-start\s*(text-that-i-want)\s*inner-end

On this demo, look at the Group 1 captures on the right: It contains what we want, and only for the right block.
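For readers without access to the demo, the Group 1 approach can be sketched in Python (an assumption; the answer itself is language-neutral up to this point). Two things differ from the regex above and are choices made for this sketch: the flags are passed as re.VERBOSE | re.DOTALL instead of the inline (?xs), and the literal text-that-i-want is generalized to (.*?) so that the three sample blocks can be told apart. The function name group1_captures is hypothetical.

```python
import re

PATTERN = re.compile(r"""
    # unwanted between outer-start and inner-start: match the block, capture nothing
    outer-start(?:(?!inner-start).)*?unwanted.*?outer-end
  |
    # unwanted between inner-end and outer-end: match the block, capture nothing
    outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
  |
    # what we want, captured in Group 1
    inner-start\s*(.*?)\s*inner-end
""", re.VERBOSE | re.DOTALL)

def group1_captures(text):
    """Return Group 1 for every match in which Group 1 was set."""
    return [m.group(1) for m in PATTERN.finditer(text)
            if m.group(1) is not None]

sample = (
    "outer-start some wanted text inner-start keep-me "
    "inner-end some more wanted text outer-end\n"
    "outer-start some unwanted text inner-start drop-me-1 "
    "inner-end fine outer-end\n"
    "outer-start fine inner-start drop-me-2 "
    "inner-end some unwanted text outer-end\n"
)

print(group1_captures(sample))  # ['keep-me']
```

The two bad blocks are still matched, but by the first two branches, which leave Group 1 unset. That is why the caller has to check Group 1 rather than the overall match.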

In Perl and PCRE (used for instance in PHP), you don't even have to look at Group 1: you can force the regex to skip the two blocks we don't want. The regex becomes:

(?xs)
(?: # non-capture group: the things we don't want
outer-start(?:(?!inner-start).)*?unwanted.*?outer-end # don't want this
| # OR (also don't want that)
outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
)
(*SKIP)(*F) # we don't want this, so fail and skip
| # OR capture what we want
inner-start\s*\Ktext-that-i-want(?=\s*inner-end)

See demo: it directly matches what you want.

The technique is explained in full detail in the question and article below.

Reference

Licensed under: CC-BY-SA with attribution
