URL detection and BB-Style tags (regex, look-ahead issue)

https://stackoverflow.com/questions/15005621

10-03-2022

Question

I'm building a small CMS and I'd like to avoid allowing HTML in the content editor. For that reason I want to detect raw URLs in text as well as support BB-like tags, for better customization.

www.example.com
[link http://www.example.com]Click me[/link]

Unfortunately I'm fairly new to regular expressions and I just can't seem to get this working. I'm running two regular expressions over the string: the first detects raw URLs, the second BB-like URLs. The latter seems to work perfectly fine, but the first one interferes and converts URLs wrapped in tags too.

I started off with a piece of code I found here and made some additions.

This is the code for non-tag URLs:

/* don't match URLs preceded by '[link ' */
(?<!\[link\s)
(
    /* match all combinations of protocol and www. */
    (\bhttps?://www\.|\bhttps?://|(?<!//)\bwww\.)

    /* match URL (no changes made here) */
    ([^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

    /* but don't match if followed by [/link] - THIS DOESN'T WORK */
    (?!\[/link\])
)

The negative look-behind before the www. is there because / isn't a word character, so \b matches between // and www. Without it, something like

 [link http://www.example.com]example[/link]

would still match after http://.
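
The effect of that look-behind can be checked in isolation. This is only a minimal sketch in Python; the re module is an assumption (the question does not say which engine the CMS uses), and the variable names are made up for illustration.

```python
import re

tagged = "[link http://www.example.com]example[/link]"

# \b matches between '/' (non-word) and 'w' (word), so www. is found inside the tag.
plain = re.search(r"\bwww\.", tagged)

# The look-behind rejects a www. that directly follows '//'.
guarded = re.search(r"(?<!//)\bwww\.", tagged)

print(plain.start() if plain else None)  # index of the www. inside the tag
print(guarded)                           # None: no www. survives the look-behind
```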

The regex above produces the following matches (tested with http://gskinner.com/RegExr/; matches are in bold. I had to add spaces after http:// because I'm not allowed to post more URLs):

www.example.com
http:// www.example.com
http:// example.com
[link http://www.example.com]no problem 1[/link]
[link www.example.com]no problem 2[/link]
[link http://www.example.com]http://www.example.com[/link]

I've tried moving the negative look-ahead around and played with the parentheses (pretty aimlessly), without success.

For completeness, here's the tag-matching regex (which seems to work):

(?:\[link\s)(\bhttps?://|\bwww\.|\bhttps?://www\.)([^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))\](.*)(?:\[/link\])

I'm sure someone can spot the error immediately.

Thanks a lot in advance!

Solution

I have taken your regex, inserted it into regexr with the examples you have given and tried to make it work.

Step by step:

1) The original regex: http://regexr.com?33snj. The reason this regex also matches the [/link] lies in the URL-matching part:

[^\s()<>]+

This will also match the opening bracket character '[', so matching does not stop when it reaches the [/link] part. It could be argued that [ is a valid URI character, but only under rare conditions (see this Stack Overflow post for more info).
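
To see this on its own, here is a small Python sketch (Python's re module is an assumption, not the engine used in the question). It runs only the character class from step 1 against a URL that is directly followed by a closing tag.

```python
import re

# The URL-matching class from the original regex: '[' is not excluded.
url_part = re.compile(r"[^\s()<>]+")

text = "http://www.example.com[/link]"
consumed = url_part.match(text).group(0)

# The closing tag is swallowed along with the URL.
print(consumed)
```

Because the tag has already been consumed by the time the look-ahead is checked, there is no [/link] left for it to see.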

2) I decided to continue with your regex, but added the open bracket char to the negated character list:

[^\s()<>[]+

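This leads to another problem; see http://regexr.com?33snp. Because of backtracking, the engine now finds a way around the negative lookahead at the end: it gives back the last character of the URL, so the lookahead is tested one position before [/link] and succeeds, leaving a match that is one character short.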
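
The backtracking can be reproduced with a reduced pattern. The sketch below is in Python (again an assumption about the engine) and keeps only the corrected character class plus the look-ahead; the bracket is escaped because Python's re warns about an unescaped '[' inside a class.

```python
import re

text = "http://www.example.com[/link]"

# Corrected class ('[' excluded) followed by the negative look-ahead.
pattern = re.compile(r"[^\s()<>\[]+(?!\[/link\])")

# The greedy + first takes the whole URL, the look-ahead fails in front of
# [/link], and the engine retries with one character fewer.
shortened = pattern.match(text).group(0)
print(shortened)  # http://www.example.co
```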

3) Once you make the URL-matching group atomic (by adding ?> to the start of the capture group, which also makes it non-capturing), the engine can no longer backtrack into it, and we arrive at the desired outcome.

(?<!\[link\s)((\bhttps?://www\.|\bhttps?://|(?<!//)\bwww\.)(?>[^\s()<>[]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[/link\]))

See it in action: http://regexr.com?33sns.
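
For anyone who wants to try the final pattern outside regexr, the following is a rough Python port, not the exact regex above. Two things are swapped in and should be treated as assumptions: Python's re has no POSIX classes, so [^\W_] (letters and digits) approximates [^[:punct:]\s]; and because (?>...) is only available from Python 3.11, the atomic group is emulated with the look-ahead-plus-backreference idiom (?=(...))\1. The names URL_BODY, RAW_URL and sample are illustrative.

```python
import re

# Approximation: [^\W_] stands in for the POSIX-based [^[:punct:]\s].
URL_BODY = r"[^\s()<>\[]+(?:\([\w\d]+\)|[^\W_]|/)"

RAW_URL = re.compile("".join([
    r"(?<!\[link\s)",                                    # not right after '[link '
    r"(?:\bhttps?://www\.|\bhttps?://|(?<!//)\bwww\.)",  # protocol and/or www.
    r"(?=(" + URL_BODY + r"))\1",                        # emulated atomic group
    r"(?!\[/link\])",                                    # not right before '[/link]'
]))

sample = """www.example.com
http://www.example.com
http://example.com
[link http://www.example.com]no problem 1[/link]
[link www.example.com]no problem 2[/link]
[link http://www.example.com]http://www.example.com[/link]"""

found = [m.group(0) for m in RAW_URL.finditer(sample)]
print(found)
```

With this sample, only the three raw URLs on the first three lines should be reported; everything inside [link ...] tags, including the URL used as link text on the last line, is left alone.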

Licensed under: CC-BY-SA with attribution
