Question

I am trying to get the review text, rating, and date from this page:

http://www.yelp.com/biz/franchino-san-francisco?start=80

Using a fragment of the page's HTML, I could get the result here:

https://eval.in/143036

But when I apply it to the entire page source, fetched with file_get_contents, it gives a set of warnings like:

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 700 in F:\wamp\www\htdocs\thenwat\yelp.php.

I tried escapeshellarg and nl2br to get rid of the warnings, but neither helped.

Please see: https://eval.in/143074

The code below works fine for a smaller source snippet: https://eval.in/143036

// $html holds the page source (see the eval.in links above)
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// First element whose class attribute is exactly 'rating-qualifier'
$classname = 'rating-qualifier';
$results = $xpath->query("//*[@class='" . $classname . "']");
if ($results->length > 0) {
    echo $review = $results->item(0)->nodeValue;
}

// First element whose class attribute is exactly 'review_comment ieSucks'
$classname = 'review_comment ieSucks';
$results = $xpath->query("//*[@class='" . $classname . "']");
if ($results->length > 0) {
    echo $review = $results->item(0)->nodeValue;
}

// content attribute of the first <meta> element
$meta = $dom->documentElement->getElementsByTagName("meta");
echo $meta->item(0)->getAttribute('content');

Solution

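DOMDocument is great for well-formed documents, but not all HTML pages are well formed. The warning in the question is raised when the parser meets an ampersand that does not start a valid entity, which is common in real-world pages. Use Simple HTML DOM (http://sourceforge.net/projects/simplehtmldom/) instead. I have created a working solution that extracts the data you requested.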

yelp.php

<?php

  ini_set('display_errors', 1);
  error_reporting(E_ALL ^ E_NOTICE);

   /************************************************
   *                                               *
   *    2014.04.28                                 *
   *    Developed by Ben McFarlin at Qeala Labs    *
   *    www.qeala.com                              *
   *                                               *
   ************************************************/

  include_once('simple_html_dom.php');

  function yelp($url){
    print("$url\n");

    $root = new stdClass();
    $items = array();
    $html = file_get_html($url);

    if($html){

      $containers = $html->find('div.review-list div.review div.review-wrapper');
      foreach($containers as $container){
        $comments = $container->find('div.review-content p.review_comment');
        $item = new stdClass();
        foreach($comments as $comment){
          $comment_html = $comment->innertext();
          $item->comment = $comment_html;
        }
        $metas = $container->find('div.review-content meta');
        foreach($metas as $meta){
          $itemprop = $meta->itemprop;
          $content = $meta->content;
          if($itemprop == 'ratingValue') $key = 'rating';
          else $key = 'date';
          $item->$key = $content;
        }
        $items[] = $item;
      }
    }

    $root->items = $items;

    if($html){
      $html->clear();
      unset($html);
    }

    return $root;
  }

  $url = 'http://www.yelp.com/biz/franchino-san-francisco?start=80';
  $root = yelp($url);
  var_dump($root);


?>
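
As a side note on the warnings quoted in the question: they are parse warnings from libxml, and DOMDocument usually still builds a usable tree after emitting them. The following is only a minimal sketch, assuming you want to stay with DOMDocument rather than switch libraries; it uses libxml_use_internal_errors() to collect the warnings instead of printing them. The sample markup and the variable names are made up for illustration and are not taken from the Yelp page.

```php
<?php
// Hypothetical markup containing a bare ampersand, the kind of input
// that makes loadHTML() complain about entity references.
$html = '<html><body>'
      . '<p class="review_comment ieSucks">Fish & chips were great</p>'
      . '</body></html>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// The collected warnings can be inspected or logged here if needed.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Match on a single class token rather than the exact attribute value,
// so extra classes such as "ieSucks" do not break the query.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' review_comment ')]"
);

$comment = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
echo $comment, "\n";
```

Whether this is good enough depends on how badly the page is broken; the Simple HTML DOM approach above avoids the question entirely.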

Update

To work out the selectors, I use Firefox with the Firebug extension installed. While viewing the web page, I right-click on the data I want to capture and choose Inspect Element with Firebug. The debug window opens with the HTML element already selected. I right-click on that element and choose Copy CSS Path, which gives the full CSS selector for the element. Normally it is far too specific and can be reduced to just a few elements, so I review the HTML structure (already open in the debug window) to determine what I can eliminate. At that point it is just a matter of knowing CSS selectors. Hope that helps. It may take some practice, but you will find the technique invaluable for any type of HTML/CSS work.
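
To illustrate the "reduce the copied path" step, here is a small sketch. It uses PHP's built-in DOMXPath so that it runs without extra libraries, which means the paths are XPath rather than CSS; with Simple HTML DOM the shortened CSS selector would simply be passed to find(). The markup is hypothetical, loosely modelled on the class names used in yelp.php, and the wrapper id is invented.

```php
<?php
// Hypothetical markup: two reviews nested inside several wrapper divs.
$html = '<html><body><div id="wrap"><div class="review-list">'
      . '<div class="review"><div class="review-wrapper"><div class="review-content">'
      . '<p class="review_comment">First review</p></div></div></div>'
      . '<div class="review"><div class="review-wrapper"><div class="review-content">'
      . '<p class="review_comment">Second review</p></div></div></div>'
      . '</div></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Roughly what a copied path looks like: every ancestor spelled out.
$full = "/html/body/div[@id='wrap']/div[@class='review-list']"
      . "/div[@class='review']/div[@class='review-wrapper']"
      . "/div[@class='review-content']/p[@class='review_comment']";

// Reduced form: keep one stable ancestor and the target element.
$short = "//div[@class='review-list']//p[@class='review_comment']";

$fullCount  = $xpath->query($full)->length;
$shortCount = $xpath->query($short)->length;

// Both queries should select the same paragraphs in this sample.
echo "full: $fullCount, short: $shortCount\n";
```

The shorter path is also less likely to break when the site adds or removes a wrapper element.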

Firefox Web Browser

Firebug Web Development Tool

Learn CSS at W3Schools

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow