Question

How can I remove duplicate class names from the class="" attribute in the following string?

<li class="active active"><a href="http://netcoding.net/indev/sample-page/">Sample Page</a></li>

Please note that the classes shown can change and be in different positions.


Solution

You can use a DOM parser, then explode() and array_unique():

$html = '<li class="active active">
         <a href="http://netcoding.net/indev/sample-page/">Sample Page</a></li>';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//li");
for($i=0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $tok = explode(' ', $node->getAttribute('class'));
    $tok = array_unique($tok);
    $node->setAttribute('class', implode(' ', $tok));
}
$html = $doc->saveHTML();
echo $html;

OUTPUT:

<html><body>
<li class="active"><a href="http://netcoding.net/indev/sample-page/">Sample Page</a></li>
</body></html>
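The snippet above only touches <li> elements and splits on a single space. As a rough sketch of how the same idea could be generalized, the following wraps it in a function that visits every element carrying a class attribute and splits on any run of whitespace. The function name dedupeClassAttributes() is hypothetical and not part of the original answer; returning only the children of <body> is likewise my own choice, made to avoid the <html><body> wrapper that loadHTML() adds.

```php
<?php
// Hypothetical helper (not from the original answer): removes duplicate
// class names on every element that has a class attribute, not only <li>.
function dedupeClassAttributes(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@class]') as $node) {
        // Split on any run of whitespace so double spaces yield no empty tokens.
        $classes = preg_split('/\s+/', trim($node->getAttribute('class')));
        $node->setAttribute('class', implode(' ', array_unique($classes)));
    }

    // Serialize only the children of <body>, skipping the wrapper elements.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo dedupeClassAttributes(
    '<ul class="nav  nav"><li class="active first active"><a href="#">Sample Page</a></li></ul>'
);
```

With this input, the <ul> should end up with class="nav" and the <li> with class="active first", since array_unique() keeps the first occurrence of each name.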


OTHER TIPS

With regex you could use a lookbehind and lookahead for finding duplicates:

$pattern = '/(?<=class=")(?:([-\w]+) (?=\1[ "]))+/i';

This removes consecutive repeats of the class name captured in group 1 ([-\w]+), as long as the run starts directly after class=".

$str = '<li class="active active">';

echo preg_replace($pattern, "", $str);

output:

<li class="active">
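To make the scope of this first pattern concrete, here is a small sketch that runs it against three inputs. The sample strings are my own and only serve as illustration.

```php
<?php
$pattern = '/(?<=class=")(?:([-\w]+) (?=\1[ "]))+/i';

$samples = [
    '<li class="active active active">', // repeats directly after class="
    '<li class="first active active">',  // repeats later in the list
    '<li class="active first active">',  // repeats that are not adjacent
];

foreach ($samples as $sample) {
    echo preg_replace($pattern, '', $sample), "\n";
}
// prints:
// <li class="active">
// <li class="first active active">
// <li class="active first active">
```

Only the first sample changes: the lookbehind pins the match to the position right after class=", and the lookahead requires the very next name to be the same one.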



EDIT 08.04.2014

To remove duplicates that do not directly follow the lookbehind (?<=class=")...

The problem is that a lookbehind assertion must be of fixed length, so something like (?<=class="[^"]*?) is not possible. As an alternative, \K can be used, which resets the start of the reported match. A pattern could be:

$pattern = '/class="[^"]*?\K(?<=[ "])(?:([-\w]+) (?=\1[ "]))+/i';

You could imagine everything before \K as a virtual lookbehind of variable length.

Like the first one, this regex only removes a single run of consecutive repeats of one class name.
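A short sketch of what the \K pattern does and does not do. The sample strings and the repeat-until-stable loop are my own additions, not part of the original answer; the loop is just one possible way to catch several runs of adjacent duplicates.

```php
<?php
$pattern = '/class="[^"]*?\K(?<=[ "])(?:([-\w]+) (?=\1[ "]))+/i';

// A run of repeats that does not start right after class=" is now found.
echo preg_replace($pattern, '', '<li class="first active active last">'), "\n";
// prints: <li class="first active last">

// Only one run per class attribute is removed in a single pass ...
$str = '<li class="a a b b">';
echo preg_replace($pattern, '', $str), "\n";
// prints: <li class="a b b">

// ... so, as an assumption of mine, repeat until nothing changes.
do {
    $before = $str;
    $str = preg_replace($pattern, '', $str);
} while ($str !== $before);
echo $str, "\n";
// prints: <li class="a b">
```

Even with the loop, duplicates that are not adjacent (such as "a b a") stay in place, because the lookahead only checks the name that follows immediately.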


EDIT 11.09.2014

Finally, I think a single regex that strips out all duplicates, adjacent or not, gets rather complex:

/(?>(?<=class=")|(?!^)\G)(?>\b([-\w]++)\b(?=[^"]*?\s\1[\s"])\s+|[-\w]+\s+\K)/

This one uses \G to continue matching from where the previous match ended, once class=" has been found.
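As a sketch of how this pattern could be applied, the following runs it through preg_replace() and then checks that no class name is left twice. The sample input and the sanity check are my own. Because the lookahead drops a name only when the same name appears again later in the attribute, it is the last occurrence of each name that survives, so the resulting order can differ from what array_unique() would give.

```php
<?php
$pattern = '/(?>(?<=class=")|(?!^)\G)(?>\b([-\w]++)\b(?=[^"]*?\s\1[\s"])\s+|[-\w]+\s+\K)/';

$html   = '<li class="a1 a1 li li-home active li li active a1">';
$result = preg_replace($pattern, '', $html);
echo $result, "\n";

// Sanity check: extract the class list and confirm it has no duplicates.
preg_match('/class="([^"]*)"/', $result, $m);
$classes = preg_split('/\s+/', trim($m[1]));
var_dump($classes === array_unique($classes));
```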


A simpler way using regex is preg_replace_callback():

$html = '<li class="a1 a1 li li-home active li li active a1">';

$html = preg_replace_callback('/\sclass="\K[^"]+/', function ($m) {
  return trim(implode(" ",array_unique(preg_split('~\s+~', $m[0]))));
}, $html);

Note that PHP versions before 5.3 don't support anonymous functions; there, pass the name of a regular function instead.
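For reference, a minimal sketch of the same replacement with a named function in place of the closure. The function name dedupeClassMatch() is hypothetical; preg_replace_callback() accepts any callable, including a function name given as a string.

```php
<?php
// Hypothetical callback name; receives the match array from preg_replace_callback().
function dedupeClassMatch(array $m)
{
    // $m[0] is the class list itself, because \K drops ' class="' from the match.
    return implode(' ', array_unique(preg_split('~\s+~', trim($m[0]))));
}

$html = '<li class="a1 a1 li li-home active li li active a1">';
echo preg_replace_callback('/\sclass="\K[^"]+/', 'dedupeClassMatch', $html);
// prints: <li class="a1 li li-home active">
```

array_unique() keeps the first occurrence of each value, so the class names stay in the order in which they first appear.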

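Another way is to collect the class names into an array and filter out the duplicates:

<?php
   $htmlString = '<li class="active active"><a href="http://netcoding.net/indev/sample-page/">Sample Page</a></li>';
   // [-\w ] also allows hyphens and underscores in class names.
   preg_match_all('/class="([-\w ]+)"/', $htmlString, $result);
   // $result[1] holds the captured class lists; $result[0] holds the full matches.
   $classes = array_unique(explode(" ", $result[1][0]));
   echo "<li class=\"".implode(" ",$classes)."\"><a href=\"http://netcoding.net/indev/sample-page/\">Sample Page</a></li>";
?>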
Licensed under: CC-BY-SA with attribution