Question

How can I exclude href matches for a domain (e.g. one.com)?

My current code:

$str = 'This string has <a href="http://one.com">one link</a> and <a href="http://two.com">another link</a>';
$str = preg_replace('~<a href="(https?://[^"]+)".*?>.*?</a>~', '$1', $str);
echo $str; // This string has http://one.com and http://two.com

Desired result:

This string has <a href="http://one.com">one link</a> and http://two.com

Solution

Using a regular expression

If you want to do this with a regular expression, you can use a negative lookahead. Here it asserts that the // in the href attribute is not immediately followed by one.com. A lookaround assertion does not consume any characters, so the rest of the pattern still matches the URL from the same position.

Here's what the regular expression looks like:

<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>
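As a quick check, here is a minimal sketch that plugs this pattern into the preg_replace call from the question. The input string and the ~ delimiters come from the question; the variable names $pattern and $result are only illustrative.

```php
<?php
// Sample input taken from the question.
$str = 'This string has <a href="http://one.com">one link</a> and '
     . '<a href="http://two.com">another link</a>';

// (?!one\.com) rejects any URL whose host starts with "one.com" right after "://".
$pattern = '~<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>~';

// $1 is the captured URL, so each matching link is replaced by its href.
$result = preg_replace($pattern, '$1', $str);

echo $result, PHP_EOL;
// This string has <a href="http://one.com">one link</a> and http://two.com
```

Note that the lookahead only checks a prefix, so a host such as one.community.org would also be left alone. If that matters, one option (an assumption, not part of the original answer) is to require a delimiter after the domain, for example (?!one\.com[/:?#"]).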


Using a DOM parser

Even though this is a fairly simple task, the more robust way to do it is with a DOM parser. That way, you don't have to change the regex if the format of your markup changes in the future. The regex solution breaks if, for example, href is not the first attribute of the <a> tag or the attribute value uses single quotes. To avoid those issues, you can use a DOM parser such as PHP's DOMDocument to handle the parsing.

Here's what the solution looks like:

$dom = new DOMDocument();
$dom->loadHTML($html); // $html is the string containing markup

$links = $dom->getElementsByTagName('a');

// The node list is live, so loop backwards while replacing nodes.
// Every link except the one.com link is replaced with its URL,
// which matches the desired result in the question.
for ($i = $links->length - 1; $i >= 0; $i--) {
    $node = $links->item($i);
    $href = $node->getAttribute('href');

    if ($href !== 'http://one.com') {
        $newTextNode = $dom->createTextNode($href);
        $node->parentNode->replaceChild($newTextNode, $node);
    }
}

// Prints the whole document, including the doctype, <html> and <body> wrappers
echo $dom->saveHTML();
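
The exact string comparison above only keeps links whose href is literally http://one.com. A possible variation, sketched below, compares the host returned by parse_url instead and serializes only the original fragment. The function name unlinkExceptHost, the wrapper <div>, and the /about path in the sample are assumptions made for illustration, and the sketch assumes ASCII input (loadHTML needs extra care with other encodings).

```php
<?php
// Hypothetical helper (not from the original answer): replaces every absolute
// link with its URL unless the link's host equals $keepHost.
function unlinkExceptHost(string $html, string $keepHost): string
{
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them

    $dom = new DOMDocument();
    // Wrap the fragment in a <div> so its child nodes are easy to find again.
    $dom->loadHTML('<div>' . $html . '</div>');

    $links = $dom->getElementsByTagName('a');

    // The node list is live, so walk it backwards while replacing nodes.
    for ($i = $links->length - 1; $i >= 0; $i--) {
        $node = $links->item($i);
        $href = $node->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);

        // Leave links without a host and links to the excluded host untouched.
        if (!is_string($host) || strtolower($host) === strtolower($keepHost)) {
            continue;
        }

        $node->parentNode->replaceChild($dom->createTextNode($href), $node);
    }

    // Serialize only the children of the wrapper <div>, not the whole document.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }

    return $out;
}

$html = 'This string has <a href="http://one.com/about">one link</a> and '
      . '<a href="http://two.com">another link</a>';

$result = unlinkExceptHost($html, 'one.com');
echo $result, PHP_EOL;
```

Comparing hosts rather than whole URLs means paths, query strings, and the https scheme no longer affect which links are kept.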

OTHER TIPS

This should do it:

<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>

We use a negative lookahead to make sure that one.com does not appear directly after the https?://.


If you also want to check for some subdomains of one.com, use this example:

<a href="(https?://(?!((www|example)\.)?one\.com)[^"]+)".*?>.*?</a>

Here we optionally check for www. or example. before one.com. This will still allow a URL like misc.one.com, though. If you want to exclude any single-level subdomain of one.com, use this:

<a href="(https?://(?!([^.]+\.)?one\.com)[^"]+)".*?>.*?</a>
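
To see how the three patterns differ, the following sketch runs each of them against a few made-up hosts and lists the ones whose links would be replaced. The labels, the sample hosts, and the helper replacedHosts are hypothetical; only the patterns come from the answer.

```php
<?php
// The three patterns from this answer, keyed by a hypothetical label.
$patterns = [
    'exact'  => '~<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>~',
    'listed' => '~<a href="(https?://(?!((www|example)\.)?one\.com)[^"]+)".*?>.*?</a>~',
    'any'    => '~<a href="(https?://(?!([^.]+\.)?one\.com)[^"]+)".*?>.*?</a>~',
];

// Made-up sample hosts for illustration.
$hosts = ['one.com', 'www.one.com', 'misc.one.com', 'two.com'];

// Returns the hosts whose links the pattern matches, i.e. would replace.
function replacedHosts(string $pattern, array $hosts): array
{
    $replaced = [];
    foreach ($hosts as $host) {
        $link = '<a href="http://' . $host . '">link</a>';
        if (preg_match($pattern, $link) === 1) {
            $replaced[] = $host;
        }
    }
    return $replaced;
}

foreach ($patterns as $label => $pattern) {
    echo $label, ': ', implode(', ', replacedHosts($pattern, $hosts)), PHP_EOL;
}
// exact: www.one.com, misc.one.com, two.com
// listed: misc.one.com, two.com
// any: two.com
```

The groups inside the lookahead open after the outer group, so $1 still refers to the full URL in all three patterns. Also, [^.]+\. covers only one subdomain level, so a host like a.b.one.com would still be replaced by the last pattern.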
Licensed under: CC-BY-SA with attribution