Question

I want to extract all IDs (integers) from several URLs embedded in a text. The URLs can look like this:

http://url.tld/index.php/p1
http://url.tld/p2#abc
http://url.tld/index.php/Page/3-xxx
http://url.tld/Page/4

For this, I've built two regexes (the URLs are wrapped in a [url] BBCode tag):

\[url\](http\://url\.tld/index\.php/p(\d+).*?)\[/url\]
\[url\](http\://url\.tld(?:/index\.php)?/Page/(\d+).*?)\[/url\]

If I run preg_match_all with each regex separately, I get an array that looks like this (which is correct):

array(3) {
  [0]=>
  array(2) {
    [0]=>
    string(62) "[url]http://url.tld/index.php/Page/6-fdgfh/[/url]"
    [1]=>
    string(50) "[url]http://url.tld/Page/7[/url]"
  }
  [1]=>
  array(2) {
    [0]=>
    string(51) "http://url.tld/index.php/Page/6-fdgfh/"
    [1]=>
    string(39) "http://url.tld/Page/7"
  }
  [2]=>
  array(2) {
    [0]=>
    string(1) "6"
    [1]=>
    string(1) "7"
  }
}

But if I combine both regexes with a pipe:

\[url\](http\://url\.tld/index\.php/p(\d+).*?|http\://url\.tld(?:/index\.php)?/Page/(\d+).*?)\[/url\]

it builds an array like this (which is wrong):

array(4) {
  [0]=>
  array(3) {
    [0]=>
    string(71) "[url]http://url.tld/index.php/p9-abc#hashtag[/url]"
    [1]=>
    string(62) "[url]http://url.tld/index.php/Page/6-fdgfh/[/url]"
    [2]=>
    string(50) "[url]http://url.tld/Page/7[/url]"
  }
  [1]=>
  array(3) {
    [0]=>
    string(60) "http://url.tld/index.php/p9-abc#hashtag"
    [1]=>
    string(51) "http://url.tld/index.php/Page/6-fdgfh/"
    [2]=>
    string(39) "http://url.tld/Page/7"
  }
  [2]=>
  array(3) {
    [0]=>
    string(1) "9"
    [1]=>
    string(0) ""
    [2]=>
    string(0) ""
  }
  [3]=>
  array(3) {
    [0]=>
    string(0) ""
    [1]=>
    string(1) "6"
    [2]=>
    string(1) "7"
  }
}

====

So, my question is: how can I fix this? I need the array structure from the first example, with the ID always in the same group, while combining both patterns into a single regular expression, because I need a consistent structure for a preg_replace_callback later.

Solution

I think you're looking for the branch reset group, (?|...), in which every alternative numbers its capture groups from the same starting index:

\[url]((?|http://url\.tld/index\.php/p(\d+).*?|http://url\.tld(?:/index\.php)?/Page/(\d+).*?))\[/url]

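Or, for the line-noise-challenged among us, spread over several lines (this layout needs the x modifier, and it uses [^[]* instead of .*? so the match stops at the next opening bracket):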

\[url]
(
  (?|
    http://url\.tld/index\.php/p(\d+)[^[]*
  |
    http://url\.tld(?:/index\.php)?/Page/(\d+)[^[]*
  )
)
\[/url]

This captures the numbers in group #2, no matter which part of the regex matched it. The whole URL is still captured in group #1.
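
To make the resulting structure concrete, here is a minimal sketch of the branch reset pattern run through preg_match_all and preg_replace_callback. The sample text, the ~ delimiter, and the [id] replacement tag are assumptions made up for illustration; they do not come from the question.

```php
<?php
// Sample input (an assumption), modeled on the URLs in the question.
$text = '[url]http://url.tld/index.php/p9-abc#hashtag[/url] '
      . '[url]http://url.tld/index.php/Page/6-fdgfh/[/url] '
      . '[url]http://url.tld/Page/7[/url]';

// (?| ... ) is the branch reset group: both (\d+) groups get number 2.
// The x modifier allows the multi-line layout; ~ is used as delimiter
// so the slashes in the URLs need no escaping.
$pattern = '~
    \[url]
    (
      (?|
        http://url\.tld/index\.php/p(\d+)[^[]*
      |
        http://url\.tld(?:/index\.php)?/Page/(\d+)[^[]*
      )
    )
    \[/url]
~x';

preg_match_all($pattern, $text, $matches);

// $matches[0]: full matches, $matches[1]: URLs, $matches[2]: IDs.
print_r($matches[2]); // the IDs 9, 6 and 7, all in the same group

// The callback can rely on $m[2] regardless of which alternative matched.
// The [id] tag is a hypothetical replacement, chosen only for the demo.
$result = preg_replace_callback(
    $pattern,
    function (array $m): string {
        return '[id]' . $m[2] . '[/id]';
    },
    $text
);

echo $result, PHP_EOL; // [id]9[/id] [id]6[/id] [id]7[/id]
```

With the plain pipe version from the question, the callback would have to check both group 2 and group 3; with the branch reset group there is only one ID group to read.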

Licensed under: CC-BY-SA with attribution