Question

I am trying to scrape an HTML page that consists of just links. I need to find, for example, all heading tags that are links, along with any image that belongs to each one. As an example, a news website will have a heading in one of two layouts:

//scenario 1
<h2><a href="link-to-page">myHeading</a></h2> //image as sibling
<a href="link-to-page"><img src="img.jpg" /></a>

//scenario 2
<h2><a href="link-to-page">myHeading
   <img src="img.jpg" />
</a></h2> // image as child

I can handle the image as a child by using:

$heading = array();

// find() returns a numerically indexed array of matching <a> nodes
foreach ($html->find('h2 a') as $i => $a) {
   $heading[] = array('link' => $a->href, 'text' => $a->plaintext, 'img' => $a->find('img', 0));
   echo $heading[$i]['link'].'<br />';
   echo $heading[$i]['text'].'<br />';
   echo $heading[$i]['img'].'<br />';
}
// Of course this will be laid out differently, but at the moment I am just trying to get the image.

The above code only works if the image is nested inside the h2 tag. In some cases the image is a sibling instead, and I am at a loss as to how to handle that. I have experimented with next_sibling() but I can't get it to work. Does anyone have suggestions for handling the case where the image is not a child of the parent tag but a sibling? Perhaps my approach needs to be rethought. What I have to do is find the image associated with the heading, and it can be in one of two places: a child or a sibling of the link.

Thank you in advance


Solution

This is possible with DOMDocument. To cover every valid heading tag (h1 through h6), loop over the six levels in a single loop. Each heading found is then used as the root node when searching for the link and image tags inside it.

$dom = new DOMDocument();

// $url is the page to scrape. The @ suppresses the warnings PHP raises
// for tags it treats as invalid, such as header and footer.
@$dom->loadHTMLFile($url);

$links  = array();
$images = array();

for ($i = 1; $i <= 6; $i++) {
  // Build the tag name: h1, h2, ... h6
  $heading = 'h' . $i;

  foreach($dom->getElementsByTagName($heading) as $h) {   
    foreach($h->getElementsByTagName('a') as $link) {
      array_push($links, array(
        "href"      => $link->getAttribute('href'),
        "innerHTML" => $link->nodeValue
      ));
    }
    foreach($h->getElementsByTagName('img') as $img) {
      array_push($images, array(
        "src" => $img->getAttribute('src')
      ));
    }
  }
}
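
The loop above only sees images nested inside a heading, so it covers scenario 2 from the question. What follows is a minimal sketch of one way to also cover scenario 1, where the image sits in a link right after the heading. It uses DOMXPath and is not part of the original answer. The helper name findHeadingImages is hypothetical, and the sketch assumes the sibling link directly follows the heading and has the same href as the heading link.

```php
<?php
// Hypothetical helper: pairs each heading link with its image, whether the
// image is inside the link (scenario 2) or in the next sibling link (scenario 1).
function findHeadingImages(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath   = new DOMXPath($dom);
    $results = array();

    // Every link that sits inside an h1..h6 heading.
    $links = '//h1//a | //h2//a | //h3//a | //h4//a | //h5//a | //h6//a';
    // Nearest heading element above a given link.
    $nearestHeading = 'ancestor::*[self::h1 or self::h2 or self::h3'
                    . ' or self::h4 or self::h5 or self::h6][1]';

    foreach ($xpath->query($links) as $link) {
        $href = $link->getAttribute('href');

        // Scenario 2: the image is a descendant of the heading link.
        $img = $xpath->query('.//img', $link)->item(0);

        if ($img === null) {
            // Scenario 1 (assumption): the element right after the heading
            // is a link with the same href that wraps the image.
            $heading = $xpath->query($nearestHeading, $link)->item(0);
            $sibling = $xpath->query('following-sibling::*[1][self::a]', $heading)->item(0);
            if ($sibling !== null && $sibling->getAttribute('href') === $href) {
                $img = $xpath->query('.//img', $sibling)->item(0);
            }
        }

        $results[] = array(
            'link' => $href,
            'text' => trim($link->textContent),
            'img'  => $img !== null ? $img->getAttribute('src') : null,
        );
    }

    return $results;
}

// Usage with the sibling layout from the question.
$scenario1 = '<h2><a href="link-to-page">myHeading</a></h2>'
           . '<a href="link-to-page"><img src="img.jpg" /></a>';
print_r(findHeadingImages($scenario1));
```

Matching on href is only one way to decide that a sibling image belongs to a heading. If the pages you scrape do not repeat the link, you may need a looser rule, such as accepting the first image found before the next heading.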

OTHER TIPS

include_once "simple_html_dom.php";

$url = "index.html";

$html = file_get_html($url);

foreach ($html->find("h2") as $h){

  foreach ($h->find("a") as $a){

    echo $a->href ."<br />";
    $img = $a->find("img",0);
    // find() returns null when the link contains no image
    if ($img) {
      echo $img->src ."<br />";
    }
  }
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow