Question

I am trying to extract all unformatted URLs from a BBCode text. The regex should match:

1. "^http://xyz.abc$"
2. "(http://xyz.abc)"
3. " http://xyz.abc "

but not URLs that are already wrapped in BBCode tags, such as [url]http://xyz.abc[/url].

The regex I ended up with is:

#(?:^|\s|\()((?:www\.|https?:(?:\/\/)).*?\..*?)(?:\s|\)|$)#i
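A quick way to check the pattern against the cases listed above is sketched here. The helper name extractUrls is made up for this illustration and is not part of the original post.

```php
<?php
// Pattern from the post, unchanged.
$pattern = '#(?:^|\s|\()((?:www\.|https?:(?:\/\/)).*?\..*?)(?:\s|\)|$)#i';

// Hypothetical helper: returns the URLs captured by group 1 in $text.
function extractUrls(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[1];
}

$cases = [
    'http://xyz.abc',            // whole string
    '(http://xyz.abc)',          // in parentheses
    ' http://xyz.abc ',          // surrounded by whitespace
    '[url]http://xyz.abc[/url]', // BBCode, should not match
];

foreach ($cases as $case) {
    printf("%-28s => %d match(es)\n", $case, count(extractUrls($pattern, $case)));
}
```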

While testing I ran into trouble and found that the result depends on the line endings (\n versus \r\n). Example:

$text = "http://www.url.com/xxx/yyy/1.html
http://www.url.com/xxx/yyy/2.html
http://www.url.com/xxx/yyy/3.html";
//or with \n
//$text = "http://www.url.com/xxx/yyy/1.html\nhttp://www.url.com/xxx/yyy/2.html\nhttp://www.url.com/xxx/yyy/3.html";

preg_match_all('#(?:^|\s|\()((?:www\.|https?:(?:\/\/)).*?\..*?)(?:\s|\)|$)#i', $text, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

$matches contains only the first and the last URL.

But with \r\n:

$text = "http://www.url.com/xxx/yyy/1.html\r\nhttp://www.url.com/xxx/yyy/2.html\r\nhttp://www.url.com/xxx/yyy/3.html";

$matches contains all three URLs. Why doesn't it work with \n?

You can verify this here: http://www.functions-online.com/preg_match_all.html


Solution

I have now fixed it myself:

#(?:^|\s|\()((?:www\.|https?:(?:\/\/)).*?\..*?)(?:\s|\)|$)#im

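The "m" (multiline) modifier fixed the ^...$ problem. Without it, ^ matches only at the very start of the string. The trailing group (?:\s|\)|$) consumes the \n after the first URL, so the second URL has no leading delimiter left and is skipped. With "m", ^ also matches right after each \n. In the \r\n case the \r ends one URL and the \n starts the next, which is why that input already worked.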
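The difference can be reproduced with the short script below. The first two patterns are the ones from the post. The third is an assumed alternative, not part of the original answer: it checks the trailing delimiter with a lookahead so that the delimiter is not consumed and can still serve as the leading delimiter of the next URL. The variable names are only illustrative.

```php
<?php
$text = "http://www.url.com/xxx/yyy/1.html\nhttp://www.url.com/xxx/yyy/2.html\nhttp://www.url.com/xxx/yyy/3.html";

// Original pattern: the trailing group consumes "\n", so the second URL
// has no leading delimiter left to match.
$original = '#(?:^|\s|\()((?:www\.|https?:(?:\/\/)).*?\..*?)(?:\s|\)|$)#i';

// Fix from the answer: "m" lets ^ match right after each "\n".
$multiline = '#(?:^|\s|\()((?:www\.|https?:(?:\/\/)).*?\..*?)(?:\s|\)|$)#im';

// Assumed alternative (not from the post): the trailing delimiter is only
// asserted with a lookahead, never consumed.
$lookahead = '#(?:^|\s|\()((?:www\.|https?:(?:\/\/)).*?\..*?)(?=\s|\)|$)#i';

$countOriginal  = preg_match_all($original, $text);
$countMultiline = preg_match_all($multiline, $text);
$countLookahead = preg_match_all($lookahead, $text);

echo $countOriginal, ' ', $countMultiline, ' ', $countLookahead, PHP_EOL;

// Two URLs on one line, separated by a single space: "m" does not help
// here, because the second URL does not start a new line.
$sameLine = 'http://a.b http://c.d';
$sameLineMultiline = preg_match_all($multiline, $sameLine);
$sameLineLookahead = preg_match_all($lookahead, $sameLine);

echo $sameLineMultiline, ' ', $sameLineLookahead, PHP_EOL;
```

For the \n example this gives 2, 3 and 3 matches. For two URLs separated by a single space, the "m" version finds only the first one, while the lookahead variant finds both, so the lookahead may be worth considering if URLs can share a line.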

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow