Question

I'm using the following regex, which I found in this SO answer:

(?:[\w[a-z]-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.-]+[.??][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()[]{};:'".,<>?«»“”‘’])

I am testing it on the following string:

"Quattro Amici in Concert Mar. 3, 2014. Long-time collaborators Lun Jiang, violin; Roberta Zalkind, viola; Pegsoon Whang, cello; and Karlyn Bond, piano, will perform works by Franz Joseph Haydn, Wolfgang Amadeus Mozart, Ludwig van Beethoven and Gabriel Faure. To purchase tickets visit westminstercollege.edu/culturalevents or call 801-832-2457. - See more at: http://entertainment.sltrib.com/events/view/quattro_amici_in_concert#sthash.QRsLXXiA.dpuf"

I'm simply trying to extract URLs from strings, and based on a bunch of SO answers, regex seems to be the recommended tool for the job. I'm not a regex expert (or even intermediate in my understanding), so I'm baffled by the empty strings that re.findall() keeps returning. I've stepped through the regex in RegexBuddy and still had no luck. Any help would be hugely appreciated.


Solution

I'm not sure that a big regex like that is entirely necessary - if you're just looking to get links, you could use a much simpler regex, like this:

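/(https?:\/\/[\w$\-.+!*'(),\/#%;?:@=&]+)/ig

According to RFC 1738, the only characters a URL may contain unencoded are alphanumerics, the special characters $-_.+!*'(), and the reserved characters ;/?:@=& (everything else is percent-encoded), so this class should cover most valid http and https URLs without such a gigantic mess of a regex. Note that the hyphen is escaped: an unescaped $-_ inside a character class is read as a range from $ to _, which also lets through characters such as < and >. This pattern only matches URLs that start with http:// or https://, so a bare westminstercollege.edu/culturalevents will not be picked up.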
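Since the question uses Python's re.findall(), here is a rough sketch of how a pattern like this might look in Python. The name URL_RE and the shortened sample text are just for illustration, and the character class (including the %;?:@=& characters) is an assumption about what you want to allow, not a definitive URL grammar.

```python
import re

# Python version of the simplified pattern. The hyphen is escaped so it is a
# literal hyphen, not a range. The exact character class is an assumption;
# adjust it to the URLs you expect to see.
URL_RE = re.compile(r"https?://[\w$\-.+!*'(),/#%;?:@=&]+", re.IGNORECASE)

# Shortened excerpt of the test string from the question.
text = (
    "To purchase tickets visit westminstercollege.edu/culturalevents "
    "or call 801-832-2457. - See more at: "
    "http://entertainment.sltrib.com/events/view/"
    "quattro_amici_in_concert#sthash.QRsLXXiA.dpuf"
)

# The pattern has no capturing groups, so findall returns whole matches.
urls = URL_RE.findall(text)
print(urls)
# ['http://entertainment.sltrib.com/events/view/quattro_amici_in_concert#sthash.QRsLXXiA.dpuf']
```

Only the link with an explicit scheme is returned. Also, if a URL is directly followed by sentence punctuation such as a period or comma, that character is part of the class and will be included in the match, so you may want to strip trailing punctuation afterwards.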

You can also use a tool like regexpal.com to validate regexes, which helps find issues. That said, I pasted your regex in there and it crashed Chrome, so it may not be a great help for a beast like that :)
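You can also check things directly in Python. One likely source of the empty strings (I can't be sure without seeing your exact code) is how re.findall() treats capturing groups: when the pattern contains groups, it returns the groups rather than the whole match, and any group that didn't take part in a match comes back as ''. A small sketch with made-up example.com / example.org URLs and deliberately tiny patterns:

```python
import re

# Hypothetical sample text, not taken from the question.
text = "see http://example.com/a and www.example.org/b"

# With capturing groups, findall returns one tuple of group contents per
# match; a group that did not participate in the match shows up as ''.
grouped = re.findall(r"(https?://\S+)|(www\.\S+)", text)
print(grouped)
# [('http://example.com/a', ''), ('', 'www.example.org/b')]

# With non-capturing groups (?:...), findall returns the whole match.
whole = re.findall(r"(?:https?://\S+)|(?:www\.\S+)", text)
print(whole)
# ['http://example.com/a', 'www.example.org/b']
```

If you want to keep the groups in your pattern, re.finditer() with m.group(0) also gives you the full match for each hit.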

Licensed under: CC-BY-SA with attribution