PHP DOMDocument: remove some HTML tags but keep inner tags and text

Question 1

I found a way to make it work. The code in the question fails because the node list returned by getElementsByTagName is live: manipulating nodes that belong to the list changes the list itself while it is being iterated. As a result, the foreach loop visits only 2 of the 4 items in the list, and the other 2 are never processed properly.

So I process only the first element of the list, then rebuild the list, and repeat until no items are left.

The code is:

$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo "<!-- 
".$htmltext."
-->
";
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
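// getElementsByTagName() returns a live list that shrinks as divs are removed.
$oldnodes = $doc->getElementsByTagName('div');
while ($oldnodes->length > 0) {
    // Always take the first remaining div.
    $node = $oldnodes->item(0);
    // Move all of its children into a fragment ...
    $fragment = $doc->createDocumentFragment();
    while ($node->childNodes->length > 0) {
        $fragment->appendChild($node->childNodes->item(0));
    }
    // ... and put the fragment where the div used to be.
    $node->parentNode->replaceChild($fragment, $node);
    // Rebuild the list before the next pass.
    $oldnodes = $doc->getElementsByTagName('div');
}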
echo $doc->saveHTML();

I hope this helps anyone who runs into the same difficulty.
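The same "always take the first remaining item" idea can be wrapped in a function that handles several tag names. This is only a sketch: the function name unwrapTags and its parameters are made up for illustration, and it moves the children directly in front of the element instead of going through a fragment.

```php
<?php
// Hypothetical helper (not part of the original answer): unwrap every
// element with one of the given tag names and keep its children in place.
function unwrapTags(string $html, array $tagNames): string
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    foreach ($tagNames as $tagName) {
        $nodes = $doc->getElementsByTagName($tagName);
        // The list is live: each removal shrinks it, so item(0) is always
        // the next element that still has to be unwrapped.
        while ($nodes->length > 0) {
            $node = $nodes->item(0);
            while ($node->firstChild !== null) {
                // insertBefore() moves the child out of $node.
                $node->parentNode->insertBefore($node->firstChild, $node);
            }
            $node->parentNode->removeChild($node);
        }
    }

    return $doc->saveHTML();
}

$html = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
$result = unwrapTags($html, ['div']);
echo $result;
```

Called with ['div', 'h1'] it would unwrap both kinds of elements in one go.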

Question 2

You can use the strip_tags function in PHP.

$htmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo strip_tags($htmltext, '<html><body><a>');

This removes all tags except html, body and a. The second argument lists the allowed tags one after another, without commas.

The output is:

<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>

EDIT: If the input comes from users, it is safer to use a whitelist of allowed tags rather than a blacklist of forbidden ones.
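A small sketch of the whitelist approach, assuming that only the a tag is considered acceptable. The input string is made up for illustration. Note that strip_tags removes only the tags themselves: text inside a removed element stays in the output, and attributes on allowed tags are left untouched, so this alone is not a complete sanitizer.

```php
<?php
// Assumption for this example: only <a> may survive.
$allowed = '<a>';

// Made-up user input containing a tag that is not on the whitelist.
$userInput = '<div>Hello <a>link</a><script>alert(1)</script></div>';

// Every tag that is not listed in $allowed is stripped,
// including tags nobody thought of blacklisting.
$clean = strip_tags($userInput, $allowed);
echo $clean;
```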

Question 3

If your HTML only contains simple tags without any attributes, you can keep it simple:

$value = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
$pattern = '/<[\/]*(div|h1)>/';

$removedTags = preg_replace($pattern, '', $value);

Since you wrote in your comment that there are more than just div tags you want to remove, I added an h1 tag to the pattern in case you also want to remove h1 tags.

This snippet only suits simple markup, but it fits your HTML input and output example.
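To illustrate what "simple" means here, the sketch below runs the pattern on a tag that has an attribute and compares it with a variant that tolerates attributes. The second pattern and the sample input are assumptions added for illustration; even the variant would still break on a ">" inside an attribute value, so a DOM parser remains the safer choice for arbitrary HTML.

```php
<?php
// Pattern from the answer: matches only bare <div>, </div>, <h1>, </h1>.
$simple = '/<[\/]*(div|h1)>/';

// Assumed variant (not from the answer): \b[^>]* also accepts attributes.
$withAttributes = '/<\/?(?:div|h1)\b[^>]*>/i';

// Made-up input with an attribute on the opening div.
$value = '<div class="box">text<h1>title</h1></div>';

$a = preg_replace($simple, '', $value);
$b = preg_replace($withAttributes, '', $value);

echo $a, "\n"; // the opening div with the attribute is left behind
echo $b, "\n"; // all div and h1 tags are gone
```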

Question 4

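Try this. Replace the loop in your code with the code below. It walks the node list backwards, because removing nodes inside a foreach over the live list makes the loop skip items; going backwards also unwraps nested divs before their parents.

for ($i = $oldnodes->length - 1; $i >= 0; $i--) {
    $node = $oldnodes->item($i);
    // Serialize all children of the div into one XML string.
    $string = "";
    foreach ($node->childNodes as $child) {
        $string .= $doc->saveXML($child);
    }
    // Re-create the children as a fragment and put it in front of the div.
    $fragment = $doc->createDocumentFragment();
    $fragment->appendXML($string);
    $node->parentNode->insertBefore($fragment, $node);
    // Finally drop the div itself.
    $node->parentNode->removeChild($node);
}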