Question

Below is an arbitrary, unpredictable set of tags wrapped inside a div tag. How can I extract the content of all the child tags into a flat array, preserving the order in which they occur?

Note: For img and iframe tags, only the URL (the src attribute) should be extracted.

 <div>
  <p>para-1</p>
  <p>para-2</p>
  <p>
    text-before-image
    <img src="text-image-src"/>
    text-after-image</p>
  <p>
    <iframe src="p-iframe-url"></iframe>
  </p>
  <iframe src="iframe-url"></iframe>
  <h1>header-1</h1>
  <img src="image-url"/>
  <p>
    <img src="p-image-url"/>
  </p>
  content not wrapped within any tags
  <h2>header-2</h2>
  <p>para-3</p>
  <ul>
    <li>list-item-1</li>
    <li>list-item-2</li>
  </ul>
  <span>span-content</span>
 content not wrapped within any tags
</div>

Expected array:

 ["para-1","para-2","text-before-image","text-image-src","text-after-image",
"p-iframe-url","iframe-url","header-1","image-url",
"p-image-url","content not wrapped within any tags","header-2","para-3",
"list-item-1","list-item-2","span-content","content not wrapped within any tags"]

Relevant code:

        $dom = new DOMDocument();
        @$dom->loadHTML( $content );
        // Get all the paragraph tags, to iterate over their nodes.
        $tags = $dom->getElementsByTagName( 'p' );
        $j = 0;
        foreach ( $tags as $tag ) {
            // get_inner_html() to preserve the node's text & tags
            $con[ $j ] = $this->get_inner_html( $tag );
            // Check if the Node has html content or not
            if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {      
                // Check if the node contains HTML along with plain text without any tags
                if ( $tag->nodeValue != '' ) {
                    /*
                     * DOM to get the Image SRC of a node
                     */
                    $domM      = new DOMDocument();
                    /*
                     * Setting encoding type http://in1.php.net/domdocument.loadhtml#74777
                     * Set after initializing DOMDocument();
                     */
                    $con[ $j ] = mb_convert_encoding( $con[ $j ], 'HTML-ENTITIES', "UTF-8" );
                    @$domM->loadHTML( $con[ $j ] );
                    $y = new DOMXPath( $domM );
                    foreach ( $y->query( "//img" ) as $node ) {
                        $con[ $j ] = "img=" . $node->getAttribute( "src" );
                        // Increment the array size to accommodate bad text and image tags.
                        $j++;
                        // Index incremented; fetch the node value to accommodate the text without any tags.
                        $con[ $j ] = $tag->nodeValue;
                    }
                    $domC      = new DOMDocument();
                    @$domC->loadHTML( $con[ $j ] );
                    $z = new DOMXPath( $domC );
                    foreach ( $z->query( "//iframe" ) as $node ) {
                        $con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
                        // Increment the array size to accommodate bad text and iframe tags.
                        $j++;
                        // Index incremented; fetch the node value to accommodate the text without any tags.
                        $con[ $j ] = $tag->nodeValue;
                    }
                } else {
                    /*
                     * DOM to get the Image SRC of a node
                     */
                    $domA      = new DOMDocument();
                    @$domA->loadHTML( $con[ $j ] );
                    $x = new DOMXPath( $domA );
                    foreach ( $x->query( "//img" ) as $node ) {
                        $con[ $j ] = "img=" . $node->getAttribute( "src" );
                    }

                    if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {
                        foreach ( $x->query( "//iframe" ) as $node ) {
                            $con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
                        }
                    }
                }
            }
            // Increment the array index for the next paragraph
            $j++;
        }

        $this->content = $con;

Solution

A quick and easy way of extracting interesting pieces of information from a DOM document is to make use of XPath. Below is a basic example showing how to get the text content and attribute text from a div element.

<?php

// Pre-amble, scroll down to interesting stuff...
$html = '<div>
  <p>para-1</p>
  <p>para-2</p>
  <p>
    <iframe src="p-iframe-url"></iframe>
  </p>
  <iframe src="iframe-url"></iframe>
  <h1>header-1</h1>
  <img src="image-url"/>
  <p>
    <img src="p-image-url"/>
  </p>
  content not wrapped within any tags
  <h2>header-2</h2>
  <p>para-3</p>
  <ul>
    <li>list-item-1</li>
    <li>list-item-2</li>
  </ul>
  <span>span-content</span>
 content not wrapped within any tags
</div>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$div = $doc->getElementsByTagName('div')->item(0);

// Interesting stuff:

// Use XPath to get all text nodes and attribute text
// $texts becomes a DOMNodeList filled with DOMText and DOMAttr objects
$xpath = new DOMXPath($doc);
$texts = $xpath->query('descendant-or-self::*/text()|descendant::*/@*', $div);

// You could include or exclude specific attributes by looking at their names
// e.g. multiple paths: .//@src|.//@href
// or whitelist:        descendant::*/@*[name()="src" or name()="href"]
// or blacklist:        descendant::*/@*[not(name()="ignore")]

// Build an array of the text held by the DOMText and DOMAttr objects
// skipping any boring whitespace
$results = array();
foreach ($texts as $text) {
    $trimmed_text = trim($text->nodeValue);
    if ($trimmed_text !== '') {
        $results[] = $trimmed_text;
    }
}

// Let's see what we have
var_dump($results);
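
Since the question only wants the src of img and iframe tags, the attribute part of the query can be narrowed as the comments above suggest. The following is a minimal sketch under that assumption, not a drop-in replacement; the shortened sample markup, the extra alt and width attributes, and the use of libxml_use_internal_errors() to silence parser warnings are illustrative choices that do not come from the original answer.

```php
<?php

// Hypothetical sample input, shortened from the question's markup.
// The alt and width attributes are there only to show that they are skipped.
$html = '<div>
  <p>para-1</p>
  <p>
    text-before-image
    <img src="text-image-src" alt="ignored"/>
    text-after-image</p>
  <iframe src="iframe-url" width="300"></iframe>
  content not wrapped within any tags
</div>';

// Collect parser warnings internally instead of suppressing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

$div = $doc->getElementsByTagName('div')->item(0);

// All text nodes, plus only the src attributes of img and iframe elements.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query(
    'descendant-or-self::*/text()|descendant::img/@src|descendant::iframe/@src',
    $div
);

// Keep the non-empty values in the order the query returns them.
$results = array();
foreach ($nodes as $node) {
    $value = trim($node->nodeValue);
    if ($value !== '') {
        $results[] = $value;
    }
}

print_r($results);
```

With this input, $results should hold "para-1", "text-before-image", "text-image-src", "text-after-image", "iframe-url" and "content not wrapped within any tags", in that order, while the alt and width values are left out.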

OTHER TIPS

Try a recursive approach! Add an empty array $parts to your class instance and a function extractSomething(DOMNode $source). Your function should process each separate case and then return. If source is a

  • TextNode: push to $parts
  • Element and name=img: push its src to $parts
  • other special cases
  • Element: for each TextNode or Element child call extractSomething(child)

Now when the call to extractSomething(yourRootDiv) returns, you will have the list in $this->parts.

Note that you have not defined what happens with <p> sometext1 <img src="ref" /> sometext2 </p>, but the above example is driving toward adding 3 elements ("sometext1", "ref" and "sometext2") on its behalf.

This is just a rough outline of the solution. The point is that you need to visit every node in the tree and, while walking the nodes in the right order, build your array by transforming each node into the desired text. Recursion is the fastest to code, but you may alternatively try a breadth-first traversal (provided the results still end up in document order) or walker tools.

The bottom line is that you have to accomplish two tasks: walk the nodes in the correct order and transform each one into the desired result.

This is basically a rule of thumb for processing a tree/graph structure.
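
The outline above could be sketched roughly as follows. This is one possible reading of it, not the answerer's actual code: the function name extractParts, the by-reference $parts array (used here instead of a class property so the sketch runs on its own), and the sample markup are all assumptions made for illustration.

```php
<?php

// Hypothetical helper: walks $node depth-first, in document order, and
// appends trimmed text and img/iframe src values to $parts.
function extractParts(DOMNode $node, array &$parts)
{
    // Text node: keep its content unless it is only whitespace.
    if ($node instanceof DOMText) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
        return;
    }

    // Anything that is not an element (comments etc.) is skipped.
    if (!($node instanceof DOMElement)) {
        return;
    }

    // Special cases: for img and iframe only the src attribute is wanted.
    $name = strtolower($node->nodeName);
    if ($name === 'img' || $name === 'iframe') {
        $src = $node->getAttribute('src');
        if ($src !== '') {
            $parts[] = $src;
        }
        return;
    }

    // Any other element: recurse into its children in order.
    foreach ($node->childNodes as $child) {
        extractParts($child, $parts);
    }
}

// Usage with a small, made-up fragment.
$html = '<div><p>sometext1 <img src="ref"/> sometext2</p><h1>header-1</h1>loose text</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

$parts = array();
extractParts($doc->getElementsByTagName('div')->item(0), $parts);

print_r($parts);
```

For this fragment, $parts should end up as "sometext1", "ref", "sometext2", "header-1", "loose text". Further special cases (for example a elements and their href) would be added as extra branches before the generic recursion.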

The simplest way is to use DOMDocument: http://www.php.net/manual/en/domdocument.loadhtmlfile.php
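
As a rough illustration of the method linked above, the sketch below loads markup from a file with DOMDocument::loadHTMLFile() and reads one text value and one src attribute. The temporary file and its contents are made up so that the example is self-contained; in practice the path would point at your own document.

```php
<?php

// Write a hypothetical sample file so the sketch can run on its own.
$path = tempnam(sys_get_temp_dir(), 'dom');
file_put_contents($path, '<div><p>para-1</p><img src="image-url"/></div>');

// Collect parser warnings internally instead of suppressing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTMLFile($path);
libxml_clear_errors();
unlink($path);

// Read values from the parsed tree.
$div = $doc->getElementsByTagName('div')->item(0);
$firstParagraph = $div->getElementsByTagName('p')->item(0)->textContent;
$imageSrc = $div->getElementsByTagName('img')->item(0)->getAttribute('src');

echo $firstParagraph, "\n", $imageSrc, "\n";
```

Loading the document is only the first step; collecting every value in order still needs a traversal such as the XPath query or the recursive walk described in the other answers.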

Licensed under: CC-BY-SA with attribution