Question

I have some basic HTML that I am calling str_replace() on. I need to prefix every URL found in the HTML string with 'generate_book.php?link=', but exclude any external links, e.g.:

<a href="gst/3.html">Link</a> -- this should become -- <a href="generate_book.php?link=gst/3.html">Link</a>

<a href="http://example.com">Link</a> -- this should be left alone

Your brain powa is appreciated!


Solution

You'll want to use a negative look-ahead at the start of the match to make sure the URL does not begin with http:// or https://. You could also add mailto: if you are worried about it.

$str = preg_replace("/(?<=href=\")(?!http:\/\/|https:\/\/)([^\"]+)/i", "generate_book.php?link=$1", $str);

This regex also uses a look-behind, (?<=href=\"), so that the href=" prefix is required but is not part of the match and is therefore left in place.
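To see how the look-behind and the negative look-ahead behave together, here is a minimal sketch that runs the same idea against the two sample links from the question. The variable names are illustrative, and the pattern is written with https? as a shorthand for the two schemes rather than copied verbatim from the line above.

```php
<?php
// Sample input taken from the question: one local link, one external link.
$str = '<a href="gst/3.html">Link</a> <a href="http://example.com">Link</a>';

// (?<=href=")      look-behind: the match must be preceded by href="
// (?!https?:\/\/)  negative look-ahead: reject values starting with http:// or https://
// ([^"]+)          capture the attribute value up to the closing quote
$pattern = '/(?<=href=")(?!https?:\/\/)([^"]+)/i';

// $1 in the replacement refers to the captured attribute value.
$out = preg_replace($pattern, 'generate_book.php?link=$1', $str);

echo $out, "\n";
```

Only the first link is rewritten; the second is skipped because the look-ahead fails at the only position where the look-behind succeeds.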

Warnings:

  • You need to know which URL schemes other than http and https (mailto:, ftp:, and so on) can appear in the HTML, if any, and exclude them as well.
  • Some other tags, such as the link tag, also have an href attribute. Make sure you aren't replacing these. If you need to match only A tags with a regex, the pattern becomes considerably more complex and still won't really be safe.
  • The regex eval modifier (/e) is much less efficient and is unsafe; it was deprecated in PHP 5.5 and removed in PHP 7.0. If you need URL encoding, you can encode the URL at replace time, as the second return of the other answer does.
  • Overall, a regex is not necessarily the best solution for this. You might be better off with an HTML parser.
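As a rough illustration of the HTML parser route mentioned in the last warning, the sketch below uses PHP's bundled DOMDocument so that only a tags are touched and any URL with a scheme is skipped. The function name prefix_local_links, the wrapper div, and the choice to also skip protocol-relative URLs and in-page anchors are assumptions made for this example, not part of the original answer.

```php
<?php
// Hypothetical helper (name and defaults are assumptions for illustration).
function prefix_local_links(string $html, string $prefix = 'generate_book.php?link='): string
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    // Wrap the fragment in a <div> so the parser sees a single root element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    // Only <a> elements are visited, so <link href="..."> is never rewritten.
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');

        // Skip empty values, anything with a scheme (http:, https:, mailto:, ...),
        // protocol-relative URLs (//host/...) and in-page anchors (#...).
        // Skipping the last two is an assumption, not something the question asks for.
        if ($href === '' || preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $href)) {
            continue;
        }

        $a->setAttribute('href', $prefix . $href);
    }

    // Serialize the children of the wrapper <div>, dropping the wrapper itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo prefix_local_links(
    '<a href="gst/3.html">Link</a> <a href="http://example.com">Link</a>'
), "\n";
```

The parser may normalize the markup slightly when serializing, so this is better suited to cases where exact byte-for-byte preservation of the input is not required.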

OTHER TIPS

Give this a try:

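$str = preg_replace_callback(
    '/href="([^"]+)"/i',
    function ($m) {
        // Leave absolute http:// and https:// links untouched.
        if (preg_match('#^https?://#i', $m[1])) {
            return $m[0];
        }
        // Prefix local links and URL-encode the original target.
        return 'href="generate_book.php?link=' . urlencode($m[1]) . '"';
    },
    $str
);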
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow