Matching URLs Inside Strings

https://stackoverflow.com/questions/20569321

01-09-2022

Question

I am trying to write a regex that matches URLs inside strings of text that may be HTML-encoded. I am having considerable trouble with lookaround, though. I need something that correctly matches both links in the string below:

 some text "http://www.notarealwebsite.com/?q=asdf&searchOrder=1" &quot;http://www.notarealwebsite.com&quot; some other text

A verbose description of what I want would be: "http://" followed by any number of characters that are not spaces, quotes, or the string "&quot[semicolon]" (I don't care about accepting other non-url-safe characters as delimiters)

I have tried a few regexes that use lookahead to check for an & followed by q, u, o, t, and a semicolon, but as soon as I put one inside a [^...] negation it breaks down completely. It evaluates more like "http:// followed by any number of characters that are not spaces, quotes, ampersands, q's, u's, o's, t's, or semicolons", which is obviously not what I am looking for.

This will correctly match the &'s at the beginning of the &quot[semicolon]:

&(?=q(?=u(?=o(?=t(?=;)))))

But this does not work:

http://[^ "&(?=q(?=u(?=o(?=t(?=;)))))]*

I know just enough about regexes to get into trouble, and that includes not knowing why this won't work the way I want it to. I understand to some extent positive and negative lookaround, but I don't understand why it breaks down inside the [^...]. Is it possible to do this with regexes? Or am I wasting my time trying to make it work?

Solution

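If your regex implementation supports it, use a positive lookahead and a backreference, with a non-greedy expression in the body. The idea is to capture the opening delimiter, match the URL lazily, and stop where that same delimiter appears again.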

Here is one with your conditions: (["\s]|&quot;)(http://.*?)(?=\1)

For example, in Python:

import re

# Group 1 captures the opening delimiter; the lookahead (?=\1) ends the
# lazy match where that same delimiter appears again.
p = re.compile(r'(["\s]|&quot;)(https?://.*?)(?=\1)', re.IGNORECASE)
url = "http://test.url/here.php?var1=val&var2=val2"
formatstr = 'text "{0}" more text {0} and more &quot;{0}&quot; test greed&quot;'
data = formatstr.format(url)
for m in p.finditer(data):
    print("Found:", m.group(2))

Produces:

Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2

Or in Java:

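import java.util.regex.Matcher;
import java.util.regex.Pattern;

// @Test is the JUnit annotation; the method is assumed to live inside a test class.
@Test
public void testRegex() {
    // Group 1 captures the opening delimiter; (?=\\1) stops the lazy match at its next occurrence.
    Pattern p = Pattern.compile("([\"\\s]|&quot;)(https?://.*?)(?=\\1)",
        Pattern.CASE_INSENSITIVE);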
    final String URL = "http://test.url/here.php?var1=val&var2=val2";
    final String INPUT = "some text " + URL + " more text + \"" + URL + 
            "\" more then &quot;" + URL + "&quot; testing greed &quot;";

    Matcher m = p.matcher(INPUT);
    while( m.find() ) {
        System.out.println("Found: " + m.group(2));
    }
}

Produces the same output.
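
On why the attempt in the question fails: inside [...] every character is a literal member of the set, so the lookahead syntax is read as the characters (, ?, = and ) rather than as an assertion. Below is a minimal Python sketch of that behaviour, run against the sample string from the question. It also shows one possible alternative that applies a negative lookahead before each consumed character. That alternative pattern is an assumption about how the verbal description in the question translates to a regex, not part of the original answer, and the variable names are illustrative only.

```python
import re

# Sample string taken from the question.
text = ('some text "http://www.notarealwebsite.com/?q=asdf&searchOrder=1" '
        '&quot;http://www.notarealwebsite.com&quot; some other text')

# The pattern from the question. Inside [^...] nothing is a lookahead:
# the class just excludes space, ", &, (, ?, =, q, u, o, t, ; and ).
broken = re.compile(r'http://[^ "&(?=q(?=u(?=o(?=t(?=;)))))]*')

# Assumed alternative: before consuming each character, assert that the
# text at this position does not start with &quot; and then take one
# character that is neither whitespace nor a double quote.
tempered = re.compile(r'https?://(?:(?!&quot;)[^\s"])*')

# Each match stops at the first 'o', because 'o' is in the excluded set.
print(broken.findall(text))
# ['http://www.n', 'http://www.n']

# An ordinary & in a query string is kept; only &quot; ends the match.
print(tempered.findall(text))
# ['http://www.notarealwebsite.com/?q=asdf&searchOrder=1', 'http://www.notarealwebsite.com']
```

Unlike the backreference version, this alternative does not require an opening delimiter and does not check that the closing delimiter matches the opening one, so it will also pick up a URL at the very start of a string.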

Licensed under: CC-BY-SA with attribution
