Question

I need to remove certain tags (e.g. <div></div>) from an HTML document while keeping their inner tags and text. I managed to do that with Simple HTML DOM Parser, but it cannot process big files because of its huge memory requirements. I would prefer to use native PHP tools such as DOMDocument, because I read that it is better optimized and faster at processing HTML documents. But I am stuck at the first step: how to remove some tags while keeping their inner text and tags.

Source HTML sample is:

<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>

I tried this code:

$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
foreach ($oldnodes as $node) {
    $fragment = $doc->createDocumentFragment();
    while($node->childNodes->length > 0) {
        $fragment->appendChild($node->childNodes->item(0));
    }
    $node->parentNode->replaceChild($fragment, $node);
}
echo $doc->saveHTML();

It produces the output:

<html><body>00000aaaaa<div>bbbbbbccc<a>link</a>cccdddddd</div>eeeee<div>1111</div></body></html>

I need the following:

<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>

Could someone please help me with proper code for the task?

Solution

I found a way to make it work. The code in the question fails because manipulating nodes while iterating over the node list breaks the iteration: the list returned by getElementsByTagName is live and changes as nodes are replaced. As a result, foreach visits only 2 of the 4 items in the list, and the other 2 are never processed.

So I process only the first element of the list, then rebuild the list, and repeat until no items are left.
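
The effect can be seen in isolation with a minimal sketch. The three-div document below is a made-up example, not the HTML from the question; it only shows that the same list object reports a smaller length after a node is removed:

```php
<?php
// Hypothetical three-div document, used only for this demonstration
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div>a</div><div>b</div><div>c</div></body></html>');

$divs = $doc->getElementsByTagName('div');
$before = $divs->length;                 // three <div> elements at this point

// Remove the first <div> from the document
$first = $divs->item(0);
$first->parentNode->removeChild($first);

$after = $divs->length;                  // same list object, now one item shorter
echo $before . ' -> ' . $after . "\n";   // prints "3 -> 2"
```

Because the list shrinks while foreach keeps advancing its position, every removal shifts the remaining elements and some of them are stepped over.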

The code is:

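$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
// Print the source HTML inside an HTML comment for comparison
echo "<!-- 
".$htmltext."
-->
";
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
// Repeat until no <div> is left in the document
while ($oldnodes->length>0){
    // Always work on the first remaining <div>
    $node=$oldnodes->item(0);
    // Move all children of the <div> into a fragment
    $fragment = $doc->createDocumentFragment();
    while($node->childNodes->length > 0) {
        $fragment->appendChild($node->childNodes->item(0));
    }
    // Put the fragment (the former children) in place of the <div>
    $node->parentNode->replaceChild($fragment, $node);
    // Rebuild the list before the next pass
    $oldnodes = $doc->getElementsByTagName('div');
}
echo $doc->saveHTML();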

I hope this helps anyone who runs into the same difficulty.
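
The same idea can be wrapped in a reusable function. The following is a sketch, not part of the original answer: the name unwrapElements is hypothetical, and instead of re-querying the list it copies the list into a plain array first, so that later removals cannot affect the iteration. It moves the original child nodes rather than copies, so nested elements stay attached to the document when their turn comes.

```php
<?php
// Hypothetical helper: removes every <$tagName> element but keeps its children in place.
function unwrapElements(DOMDocument $doc, string $tagName): void
{
    // Snapshot the live list into an array so removals do not disturb the loop
    $nodes = iterator_to_array($doc->getElementsByTagName($tagName));
    foreach ($nodes as $node) {
        $parent = $node->parentNode;
        if ($parent === null) {
            continue; // node is no longer attached to anything
        }
        // Move each child in front of the element, preserving order
        while ($node->firstChild !== null) {
            $parent->insertBefore($node->firstChild, $node);
        }
        // The element is now empty and can be dropped
        $parent->removeChild($node);
    }
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>');
unwrapElements($doc, 'div');
$result = $doc->saveHTML();
echo $result;
```

Note that saveHTML() also emits a DOCTYPE line in front of the <html> element, so the output is not byte-for-byte identical to the input with the div tags removed.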

Other tips

You can use the strip_tags function in PHP.

$htmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
// The second argument lists the tags to keep
echo strip_tags($htmltext, '<html><body><a>');

This removes all tags except html, body and a.

The output is:

<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>

EDIT: If the input comes from users, it is safer to whitelist the allowed tags than to blacklist the forbidden ones.
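
One caveat when relying on strip_tags for user input: it filters by tag name only and leaves the attributes of allowed tags untouched. A small sketch with a made-up input string (not from the question) illustrates this:

```php
<?php
// Hypothetical user input: an allowed <a> tag carrying an event-handler attribute
$input = '<div><a href="#" onclick="alert(1)">link</a> text</div>';

// Keep only <a>; the <div> wrapper is stripped
$clean = strip_tags($input, '<a>');
echo $clean, "\n"; // prints: <a href="#" onclick="alert(1)">link</a> text

// Since PHP 7.4 the allowed tags can also be given as an array of names
$cleanArray = strip_tags($input, ['a']);
```

So a whitelist of tags alone does not remove attributes such as onclick; if that matters, the attributes need separate handling, for example through DOMDocument.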

If your markup contains only simple HTML tags without any attributes, you can keep it simple:

$value = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
$pattern = '/<[\/]*(div|h1)>/';

$removedTags = preg_replace($pattern, '', $value);

Since you wrote in your comment that you want to remove more than just div tags, I added an h1 tag to the pattern in case you also want to remove h1 tags.

This snippet only works for simple markup, but it fits your HTML input and output example.
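
If the tags do carry attributes, the pattern can be loosened a little. This is only a sketch with a made-up input string; it assumes that no attribute value contains a ">" character, and a regular expression is still no substitute for a real HTML parser:

```php
<?php
// Hypothetical input with attributes, not taken from the question
$value = '<div class="box">aaa<h1 id="title">bbb</h1><a href="#">link</a></div>ccc';

// <\/?          opening or closing tag
// (?:div|h1)\b  tag name, with a word boundary so e.g. <divider> is not matched
// [^>]*>        any attributes up to the closing bracket
$pattern = '/<\/?(?:div|h1)\b[^>]*>/i';

$removedTags = preg_replace($pattern, '', $value);
echo $removedTags, "\n"; // prints: aaabbb<a href="#">link</a>ccc
```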

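Try this: replace the foreach loop with the code below. It uses while with item(0) instead of foreach, because removing nodes inside a foreach over the live node list makes the loop skip elements.

// Always take the first remaining <div>; the live list shrinks as nodes are removed
while ($oldnodes->length > 0) {
    $node = $oldnodes->item(0);
    // Serialize the children of the <div> into one XML string
    $string = "";
    foreach ($node->childNodes as $child) {
        $string .= $doc->saveXML($child);
    }
    // Re-insert the serialized children in front of the <div>
    if ($string !== "") {
        $fragment = $doc->createDocumentFragment();
        $fragment->appendXML($string);
        $node->parentNode->insertBefore($fragment, $node);
    }
    // Remove the now redundant <div>
    $node->parentNode->removeChild($node);
}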
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow