Question

I need to fashion a regex with the following requirements:

Given sample text:

SEARCH_TERM_#1 find this text SEARCH-TERM_#2_more text_SEARCH-TERM_#3
SEARCH_TERM_#1 find this text SEARCH-TERM_#3

I want to extract the string that appears in the "find this text" area.

The regex should collect data after SEARCH_TERM_#1 up to but not including SEARCH_TERM_#2 or SEARCH-TERM_#3, whichever comes first. It should choose as the 'right-side' search border whichever of #2 and #3 it finds first.

I've tried (?>SEARCH_TERM_#2|SEARCH_TERM_#3), (?=(?>SEARCH_TERM_#2|SEARCH_TERM_#3)) and (?>(?=SEARCH_TERM_#2)|(?=SEARCH_TERM_#3)). They ALL include the second search term in the collected data and stop before the third, whereas I want the collected data to stop before #2 or #3, whichever comes first.

Solution

Description

This regular expression will:

  • find the first SEARCH_TERM_#1
  • capture text starting after SEARCH_TERM_#1
  • stop capturing text when it encounters either SEARCH_TERM_#2 or SEARCH_TERM_#3 (whichever comes first)

^.*?SEARCH_TERM_\#1((?:(?!SEARCH-TERM_\#2|SEARCH-TERM_\#3).)*)

Expanded

  • ^ match the beginning of the string; this forces the search to start at the beginning
  • .*? lazily match all characters up to the next expression. Note that this term should be used in conjunction with the s option, which allows the dot to match newline characters
  • SEARCH_TERM_\#1 the first search term
  • ( start the capture group; this set of parentheses puts the matched values into capture group 1
  • (?: start the non-capture group. This is the real magic: it allows the contained expression to continue matching until it stumbles on either SEARCH-TERM_\#2 or SEARCH-TERM_\#3
    • (?! start the negative lookahead. Think of the regex engine moving a cursor through the input string. The lookahead simply looks at the characters after the cursor without moving the cursor. The negative means that if the contained expression matches, the match is denied, and if the expression is not found, the match is allowed.
    • SEARCH-TERM_\#2|SEARCH-TERM_\#3 look for either value. The | is an "or" statement
    • ) close the negative lookahead
    • . match any character. The expression only gets to this spot if the preceding negative lookahead didn't find its search terms
    • ) close the non-capture group. At this point either the searching has stopped because it encountered the #2 or #3 end condition, or the non-capture group matched a single character
  • * continue greedily matching all characters. You can use greedy because the end condition is contained inside the expression.
  • ) close the capture group
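
For readers who would rather try this outside PHP, here is a minimal Python sketch of the same expression run against the two sample lines from the question. The helper name extract is made up for illustration, and re.DOTALL is assumed to play the role of the s option mentioned above.

```python
import re

# Same expression as above; re.DOTALL lets "." match newlines (the "s" option).
PATTERN = re.compile(
    r"^.*?SEARCH_TERM_\#1((?:(?!SEARCH-TERM_\#2|SEARCH-TERM_\#3).)*)",
    re.DOTALL,
)


def extract(text):
    """Return capture group 1, or None when SEARCH_TERM_#1 is absent (hypothetical helper)."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


samples = [
    "SEARCH_TERM_#1 find this text SEARCH-TERM_#2_more text_SEARCH-TERM_#3",
    "SEARCH_TERM_#1 find this text SEARCH-TERM_#3",
]

for sample in samples:
    # Both lines stop at whichever end term shows up first.
    print(repr(extract(sample)))
```

Both sample lines should yield the same captured text, because the lookahead stops the capture at the first end term it meets, whether that is #2 or #3.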

PHP code example

You didn't specify a language so I'm including this PHP example only to show how it works.

Input Text

skip this text SEARCH_TERM_#1 find this text SEARCH-TERM_#2 more text to ignore SEARCH_TERM_#3

Code

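<?php
// Input text from the example above
$sourcestring="skip this text SEARCH_TERM_#1 find this text SEARCH-TERM_#2 more text to ignore SEARCH_TERM_#3";
// i = case-insensitive, m = multiline, s = dot also matches newlines
preg_match('/^.*?SEARCH_TERM_\#1((?:(?!SEARCH-TERM_\#2|SEARCH-TERM_\#3).)*)/ims',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>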

Matches

$matches Array:
(
    [0] => skip this text SEARCH_TERM_#1 find this text 
    [1] =>  find this text 
)

Real World Example

Or to use your real world example included in the comments:

Regex: ^.*?style="background-image: url\(((?:(?!&cfs=1|\)).)*)

Input text: <a href=http://i.like.kittens.com style="background-image: url(http://I.like.kittens.com?Name=Boots&cfs=1)">

Matches:

[0] => <a href=http://i.like.kittens.com style="background-image: url(http://I.like.kittens.com?Name=Boots
[1] => http://I.like.kittens.com?Name=Boots
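
The same real-world expression can be tried in Python. This is only a sketch: the function name extract_background_url and the second test string (a URL without the &cfs=1 suffix) are made up for illustration.

```python
import re

# Capture the URL after url( and stop at "&cfs=1" or ")", whichever comes first.
URL_PATTERN = re.compile(
    r'^.*?style="background-image: url\(((?:(?!&cfs=1|\)).)*)',
    re.DOTALL,
)


def extract_background_url(tag):
    """Return the captured URL, or None if the style attribute is not found."""
    match = URL_PATTERN.search(tag)
    return match.group(1) if match else None


with_suffix = (
    '<a href=http://i.like.kittens.com '
    'style="background-image: url(http://I.like.kittens.com?Name=Boots&cfs=1)">'
)
# Hypothetical input where only the closing parenthesis ends the URL.
without_suffix = '<a style="background-image: url(http://I.like.kittens.com?Name=Boots)">'

print(extract_background_url(with_suffix))
print(extract_background_url(without_suffix))
```

Either end condition stops the capture, so both inputs should give the same URL.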

Disclaimer

This looks vaguely like the common problem of parsing HTML with a regex. If your input text is HTML, then you should investigate using an HTML parsing tool rather than a regular expression.
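
As one possible illustration of that advice, the sketch below uses Python's standard html.parser module to locate the style attribute and only then applies a small regex to the attribute value. The class and function names are made up for this example, and it assumes the URL contains no closing parenthesis; a dedicated CSS or HTML library may well be a better fit for real pages.

```python
import re
from html.parser import HTMLParser


class StyleUrlCollector(HTMLParser):
    """Collects url(...) values found in style attributes (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "style" and value:
                # The regex now only sees the attribute value, not the whole page.
                found = re.search(r"url\(([^)]*)\)", value)
                if found:
                    self.urls.append(found.group(1))


def background_urls(html_text):
    """Return every url(...) value found in a style attribute of html_text."""
    collector = StyleUrlCollector()
    collector.feed(html_text)
    collector.close()
    return collector.urls


tag = (
    '<a href=http://i.like.kittens.com '
    'style="background-image: url(http://I.like.kittens.com?Name=Boots&cfs=1)">'
)
for url in background_urls(tag):
    # Drop the "&cfs=1" suffix if present, as asked in the question.
    print(url.split("&cfs=1")[0])
```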

Other suggestions

This pattern works well:

SEARCH_TERM_#1(.*?)SEARCH-TERM_#2_OR_#3

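The content you are interested in is in the first capture group; see your language or software documentation to learn how to refer to the content of capture groups. Here SEARCH-TERM_#2_OR_#3 presumably stands for an alternation of the two end terms, such as (?:SEARCH-TERM_#2|SEARCH-TERM_#3).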

If your regex flavor supports them, you can use lookarounds:

(?<=SEARCH_TERM_#1).*?(?=SEARCH-TERM_#2_OR_#3)

In that case the result is the whole match.

Note that I use a lazy quantifier *? instead of a greedy quantifier *.
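
A short Python sketch of both patterns follows. It assumes that SEARCH-TERM_#2_OR_#3 is meant as an alternation of the two end terms; the names END, LAZY and LOOKAROUND are made up for illustration.

```python
import re

# Assumption: the "_OR_" placeholder is spelled out as an alternation.
END = r"(?:SEARCH-TERM_\#2|SEARCH-TERM_\#3)"

# Lazy capture group: the result is in group 1.
LAZY = re.compile(r"SEARCH_TERM_\#1(.*?)" + END, re.DOTALL)

# Lookarounds: the result is the whole match (group 0).
LOOKAROUND = re.compile(r"(?<=SEARCH_TERM_\#1).*?(?=" + END + r")", re.DOTALL)

samples = [
    "SEARCH_TERM_#1 find this text SEARCH-TERM_#2_more text_SEARCH-TERM_#3",
    "SEARCH_TERM_#1 find this text SEARCH-TERM_#3",
]

for sample in samples:
    print(repr(LAZY.search(sample).group(1)))
    print(repr(LOOKAROUND.search(sample).group(0)))

# Unlike the negative-lookahead pattern in the accepted answer, these two
# patterns need an end term to be present; otherwise there is no match.
print(LAZY.search("SEARCH_TERM_#1 find this text"))
```

The lazy quantifier expands one character at a time and stops as soon as the alternation matches, which is why the first end term encountered wins.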

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow