Question
I have used the following regex to get the URLs from text (e.g. "this is text http://url.com/blabla possibly some more text"):
'@(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)@'
This worked for all the URLs I had tried, but I just found out it fails on shortened URLs: with "blabla bla http://ff.im/-bEnA blabla" the match is only http://ff.im/
I suspect it has to do with the dash - after the slash /.
No correct solution
OTHER TIPS
Short answer: [\w/_\.] doesn't match a hyphen (-), so make it [-\w/_\.]
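To check the short answer, here is a minimal sketch in Python (an assumption; the @ delimiters in the question suggest PHP's preg functions, which take the same pattern). The helper name first_url is made up for illustration. It runs the original pattern and the pattern with the hyphen added against the sample text from the question.

```python
import re

# The pattern from the question, without the PHP-style @ delimiters.
ORIGINAL = re.compile(r'(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)')
# Same pattern with a hyphen added to the path character class.
FIXED = re.compile(r'(https?://([-\w\.]+)+(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)?)')


def first_url(pattern, text):
    """Return the first URL matched by pattern in text, or None (hypothetical helper)."""
    match = pattern.search(text)
    return match.group(1) if match else None


text = "blabla bla http://ff.im/-bEnA blabla"
print(first_url(ORIGINAL, text))  # http://ff.im/
print(first_url(FIXED, text))     # http://ff.im/-bEnA
```

The hyphen goes first in the character class so that it is read as a literal character and not as a range.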
Long answer:
@ - delimiter
( - start of group
https?:// - http:// or https://
([-\w.]+)+ - capture 1 or more hyphens, word characters or dots, 1 or more times. This seems odd: the second + looks redundant, since the inner [-\w.]+ already repeats
(:\d+)? - optionally capture a : and some numbers (the port)
( - start of group
/ - leading slash
( - start of group
[\w/_\.]* - zero or more word characters, slashes, underscores or dots (the path + filename) - you need to add a hyphen to this list, or just make it [^?\s]* - any char except ? or whitespace
(\?\S+)? - optionally capture a ? followed by anything except whitespace (the querystring)
)? - close group, make it optional
)? - close group, make it optional
) - close group
@ - delimiter
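The breakdown above can also be written as a commented pattern. The following is only a sketch in Python's verbose regex mode, not the asker's code: it uses the broader path class [^?\s]* (lowercase s, meaning anything except ? or whitespace), drops the redundant outer + on the host, and uses non-capturing groups. The name find_urls and the example.com URL in the usage lines are made up for illustration.

```python
import re

URL_PATTERN = re.compile(r"""
    (                   # group 1: the whole URL
      https?://         # http:// or https://
      [-\w.]+           # host: hyphens, word characters or dots
      (?: :\d+ )?       # optional port
      (?:               # optional path and query string
        /               # leading slash
        [^?\s]*         # path + filename: anything except ? or whitespace
        (?: \?\S+ )?    # optional query string
      )?
    )
""", re.VERBOSE)


def find_urls(text):
    """Return every URL found in text (hypothetical helper)."""
    return [m.group(1) for m in URL_PATTERN.finditer(text)]


print(find_urls("blabla bla http://ff.im/-bEnA blabla"))
print(find_urls("see https://example.com:8080/a/b.html?x=1&y=2 now"))
```

Note that [^?\s]* is more permissive than listing the allowed characters: it also swallows punctuation that directly follows a URL in running text, such as a closing parenthesis or comma.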
Licensed under: CC-BY-SA with attribution