How can I find text nodes in an HTML snippet?

https://stackoverflow.com/questions/4782152

23-10-2019

Question

I'm trying to parse an HTML snippet, using the PHP DOM functions. I have stripped out everything apart from paragraph, span and line break tags, and now I want to retrieve all the text, along with its accompanying styles.

So, I'd like to get each piece of text, one by one, and for each one I can then go back up the tree to get the values of particular attributes (I'm only interested in some specific ones, like color etc.).

How can I do this? Or am I thinking about it the wrong way?

Thanks!

No correct solution

OTHER TIPS

Suppose you have a DOMDocument here:

$doc = new DOMDocument();
$doc->loadHTMLFile('http://stackoverflow.com/');

You can find all text nodes using a simple XPath query:

$xpath = new DOMXPath($doc);
$textNodes = $xpath->query('//text()');

Then loop over the result with foreach to visit each text node:

foreach ($textNodes as $textNode) {
    echo $textNode->data . "\n";
}

From each text node, you can walk up the DOM tree through ->parentNode to read the attributes of its enclosing elements.
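As a rough illustration of walking up the tree, here is a minimal sketch that loads a snippet from a string instead of a URL and, for each text node, looks for the nearest ancestor carrying a given attribute. The sample markup, the helper name nearestAttribute and the $results array are made up for this example; only the DOM and XPath calls come from PHP itself.

```php
<?php
// Hypothetical sample snippet; substitute your own stripped-down HTML.
$html = '<p style="color: red">Hello <span color="blue">blue</span> world<br>next line</p>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html); // libxml wraps the fragment in <html><body> for us

$xpath = new DOMXPath($doc);

// Hypothetical helper: climb from a node through its element ancestors
// and return the first value found for the attribute, or null if none.
function nearestAttribute(DOMNode $node, string $name): ?string
{
    for ($el = $node->parentNode; $el instanceof DOMElement; $el = $el->parentNode) {
        if ($el->hasAttribute($name)) {
            return $el->getAttribute($name);
        }
    }
    return null;
}

$results = [];
foreach ($xpath->query('//text()') as $textNode) {
    $text = trim($textNode->nodeValue);
    if ($text === '') {
        continue; // skip whitespace-only nodes
    }
    $results[] = [
        'text'  => $text,
        'style' => nearestAttribute($textNode, 'style'),
        'color' => nearestAttribute($textNode, 'color'),
    ];
}

print_r($results);
```

Taking the nearest ancestor means a span without its own style picks up the one on its enclosing paragraph, which is probably what you want for inherited properties such as color. If you need every style along the path, you could collect all matches instead of returning at the first one.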

Hopefully this gives you a good start.

For those who are more comfortable with CSS selectors, and are willing to include a single extra PHP library in their project, I would suggest PHP Simple HTML DOM Parser. The solution would look something like the following:

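// Assumes the library file simple_html_dom.php sits next to this script.
require_once 'simple_html_dom.php';

$html = file_get_html('http://www.example.com/');

// Select every paragraph and span element.
$ret = $html->find('p, span');
$store = array();

foreach ($ret as $element) {
    // plaintext returns the text without nested tags; innertext would return the inner HTML.
    // color and style are read from the element's attributes of the same name.
    $store[] = array($element->tag => array('text'  => $element->plaintext,
                                            'color' => $element->color,
                                            'style' => $element->style));
}
print_r($store);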

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow