Question

I'm struggling with this. The idea is to remove all <link> tags that have a specific href attribute from a given string (the string comes from a buffer and is regular HTML, though sometimes malformed).

I've tried the PHP DOM approach and also the SimpleHTMLDOM parser library, but so far nothing works for me (the problem is that the DOM approach returns only the links inside the <body> element, not those in the <head> section of the page), so I decided to use regex. Here is the non-working PHP DOM code:

function remove_css_links($string = "", $css_files = array()) {
    $css_files = array("http://www.example.com/css/css.css?ver=2.70","style.css?ver=3.8.1");
    $xml = new DOMDocument();
    $xml->loadHTML($string);
    $link_list = $xml->getElementsByTagName('link');
    $link_list_length = $link_list->length;
    //The cycle
    for ($i = 0; $i < $link_list_length; $i++) {
        $attributes = $link_list->item($i)->attributes;
        $href = $attributes->getNamedItem('href');
        if (in_array($href->value, $css_files))  {
            //Remove the HTML node
        }
    }
    $string = $xml->saveHTML();
    return $string;
}

Here is the regex code. I know that most of you do not recommend regex for parsing HTML, but let's not discuss that here and now:

$html_text = '
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />
<link rel="alternate" type="application/rss+xml" title="Website &raquo; Feed" href="/feed/" />
<link rel=\'stylesheet\'  href=\'http://www.example.com/css/css.css?ver=2.70\' type=\'text/css\' media=\'all\' /></head>
<body>...some content...
<link rel=\'stylesheet\' id=\'css\'  href=\'style.css?ver=3.8.1\' type=\'text/css\' media=\'all\' />
</body></html>
';
$url = preg_quote("http://www.example.com/css/css.css?ver=2.70");
$pattern = "~<link([^>]+) href=".$url."/?>~";
$link = preg_replace($pattern, "", $html_text);

The problem with the regex is that the href attribute can appear at any position inside the <link> tag, and the pattern I use can match any type of <link> tag. As you can see, I do not want to remove the shortcut icon or alternate links, or anything whose href differs from the given URL. Note also that the <link> tags use different quote styles, single and/or double.

However, I'm open to suggestions, and if the DOM approach can be made to work instead of regex, that's fine.

Solution

OK, so here you are:

<?php

$html_text = '
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />
<link rel="alternate" type="application/rss+xml" title="Website &raquo; Feed" href="/feed/" />
<link rel="stylesheet"  href="http://www.example.com/css/css.css?ver=2.70" type="text/css" media="all" /></head>
<body>...some content...
<link rel="stylesheet" id="css"  href="style.css?ver=3.8.1" type="text/css" media="all" />
</body></html>
';

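$d = new DOMDocument();
// @ suppresses the warnings libxml raises on malformed markup
@$d->loadHTML($html_text);
$xpath = new DOMXPath($d);
// Select every <link> element, wherever it sits in the document
$result = $xpath->query("//link");

foreach ($result as $link)
{
    $href = $link->getAttribute("href");

    // Replace "whatyouwanttofilter" with the href value to remove
    if ($href == "whatyouwanttofilter")
    {
        // A node is removed through its parent
        $link->parentNode->removeChild($link);
    }
}

// Serialize the cleaned document back into a string
$output = $d->saveHTML();
echo $output;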

?>

Tested and working. Have fun! :-)


The general idea is:

  • Load your HTML into a DOMDocument
  • Look for link nodes, using XPath
  • Loop through the nodes
  • Depending on the node's href attribute, delete the node (strictly speaking, you ask the node's parent to remove it as a child; well, yep, that's the PHP way)
  • After doing all the cleaning-up, re-save the HTML and get it back into a string
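
The steps above can be sketched as a reusable function. This is only a minimal sketch, not the exact code from the answer: the function name remove_css_links and the two hrefs are borrowed from the question, the shortened HTML sample is made up for illustration, and libxml_use_internal_errors() is swapped in for the @ operator as one way to silence warnings on malformed markup. It assumes PHP 7 or later with the DOM extension enabled.

```php
<?php
// Hypothetical helper: name and signature follow the question's
// remove_css_links(); they are not part of the answer above.
function remove_css_links(string $html, array $css_files): string
{
    // Collect parser warnings internally instead of emitting them,
    // since the input may be malformed HTML.
    $previous = libxml_use_internal_errors(true);

    // Step 1: load the HTML into a DOMDocument.
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Steps 2 and 3: find every <link> element and loop over the result.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//link') as $link) {
        // Step 4: remove the node when its href is on the list.
        if (in_array($link->getAttribute('href'), $css_files, true)) {
            $link->parentNode->removeChild($link);
        }
    }

    // Step 5: serialize the cleaned document back into a string.
    return $doc->saveHTML();
}

// Illustrative input (assumption): a shortened version of the question's page.
$html = '<!DOCTYPE html>
<html><head>
<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />
<link rel=\'stylesheet\' href=\'http://www.example.com/css/css.css?ver=2.70\' type=\'text/css\' />
</head><body>
<p>...some content...</p>
<link rel=\'stylesheet\' id=\'css\' href=\'style.css?ver=3.8.1\' type=\'text/css\' />
</body></html>';

$clean = remove_css_links($html, [
    'http://www.example.com/css/css.css?ver=2.70',
    'style.css?ver=3.8.1',
]);

echo $clean;
```

The favicon link survives because its href is not on the list, and the quote style of the attributes does not matter because the comparison is done on the parsed attribute value rather than on the raw markup.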
Licensed under: CC-BY-SA with attribution