Question

I want to retrieve an HTML element in a page.

<h2 id="resultCount" class="resultCount">

    <span>

        Showing 1 - 12 of 40,923 Results

    </span>

</h2>

I need the total number of results so that I can test it in my PHP code.

For now, I take everything between the h2 tags and explode it on spaces. Then I explode the count on the comma and concatenate the pieces, which converts the number to European format (no comma as thousands separator). Once that is done, I test the result count.

define("MAX_RESULT_ALL_PAGES", 1200);
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
$htmlResultCountPage = file_get_html($queryUrl);
$htmlResultCount = $htmlResultCountPage->find("h2[id=resultCount]");
$resultCountArray = explode(" ", $htmlResultCount[0]);

$explodeCount = explode(',', $resultCountArray[5]);
$europeFormatCount = '';
foreach ($explodeCount as $val) {
    $europeFormatCount .= $val;
}
if ($europeFormatCount > MAX_RESULT_ALL_PAGES) {

    $queryUrl = AMAZON_SEARCH_URL.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;

}

At the moment the total number of results is not extracted correctly, so the condition is never true even when it should be.

Does anyone have a solution to this problem, or another way to do it?

Solution

I would simply fetch the page as a plain string (rather than parsing it as HTML) and use a regular expression to get the total number of results. The code would look something like this:

define('MAX_RESULT_ALL_PAGES', 1200);

$queryUrl    = AMAZON_TOTAL_BOOKS_COUNT . $searchMonthUrlParam . $searchYearUrlParam . $searchTypeUrlParam . urlencode($keyword) . '&page=' . $pageNum;
$queryResult = file_get_contents($queryUrl);

if (preg_match('/of\s+([0-9,]+)\s+Results/', $queryResult, $matches)) {
    $totalResults = (int) str_replace(',', '', $matches[1]);
} else {
    throw new \RuntimeException('Total number of results not found');
}

if ($totalResults > MAX_RESULT_ALL_PAGES) {
    $queryUrl = AMAZON_SEARCH_URL . $searchMonthUrlParam . $searchYearUrlParam . $searchTypeUrlParam . urlencode($keyword) . '&page=' . $pageNum;
    // ...
}
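
To see the regular expression at work without hitting the network, here is a minimal sketch that runs it against the markup from the question. The helper name extractTotalResults is hypothetical (it is not part of the answer above), and it returns null instead of throwing so that the caller decides how to handle a missing count.

```php
<?php
// Hypothetical helper: returns the total as an int, or null when no count is found.
function extractTotalResults(string $html): ?int
{
    // Capture the digits and thousands separators between "of" and "Results".
    if (preg_match('/of\s+([0-9,]+)\s+Results/', $html, $matches) !== 1) {
        return null;
    }

    // Drop the commas so "40,923" becomes the integer 40923.
    return (int) str_replace(',', '', $matches[1]);
}

// Sample markup copied from the question.
$sampleHtml = <<<HTML
<h2 id="resultCount" class="resultCount">
    <span>
        Showing 1 - 12 of 40,923 Results
    </span>
</h2>
HTML;

var_dump(extractTotalResults($sampleHtml));        // int(40923)
var_dump(extractTotalResults('<p>no count</p>'));  // NULL
```

Because \s+ also matches line breaks, this keeps working if the page wraps the text differently, as long as the wording "of N Results" stays the same.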

Other answers

A regex would do it:

...
preg_match("/of ([0-9,]+) Results/", $htmlResultCount[0], $matches);
$europeFormatCount = intval(str_replace(",", "", $matches[1]));
...

Please try this code.

define("MAX_RESULT_ALL_PAGES", 1200);

// New DOM object
$dom = new DOMDocument();

// Do not keep whitespace-only text nodes (set before loading to take effect)
$dom->preserveWhiteSpace = false;

// Fetch the page as an HTML string
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
$html_string = file_get_contents($queryUrl);

// Load the HTML; collect libxml warnings about invalid markup instead of printing them
libxml_use_internal_errors(true);
$dom->loadHTML($html_string);
libxml_clear_errors();

// Get all h2 tags
$nodes = $dom->getElementsByTagName('h2');

// Store total result count
$totalCount = 0;

// Loop over all h2 tags and look for the one with class "resultCount"
foreach ($nodes as $node) {
    if ($node->hasAttributes()) {
        foreach ($node->attributes as $attribute) {
            if ($attribute->name === 'class' && $attribute->value === 'resultCount') {
                // Remove the thousands separators, then split the text on spaces
                $inner_html = str_replace(',', '', trim($node->nodeValue));
                $inner_html_array = explode(' ', $inner_html);

                // "Showing 1 - 12 of 40923 Results": the total is the sixth word (index 5)
                if (isset($inner_html_array[5])) {
                    $totalCount += (int) $inner_html_array[5];
                }
            }
        }
    }
}

// If the result count is greater than 1200, do this
if ($totalCount > MAX_RESULT_ALL_PAGES) {
    $queryUrl = AMAZON_SEARCH_URL.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
}
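
As a possible variation on the DOM approach above, the loop over every h2 and its attributes could be replaced by a single XPath query on the id. This is only a sketch under assumptions: it swaps in DOMXPath and a regular expression (neither is used in the code above), and it runs on the sample markup from the question rather than on a fetched page.

```php
<?php
// Sample markup copied from the question; a real page would come from file_get_contents().
$sampleHtml = '<html><body><h2 id="resultCount" class="resultCount">'
    . '<span> Showing 1 - 12 of 40,923 Results </span></h2></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect warnings about sloppy HTML instead of printing them
$dom->loadHTML($sampleHtml);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// string() yields the text content of the first matching node, or '' when nothing matches.
$text = $xpath->evaluate('string(//h2[@id="resultCount"])');

$totalCount = 0;
if (preg_match('/of\s+([0-9,]+)\s+Results/', $text, $matches) === 1) {
    // Remove the thousands separators before converting to an integer.
    $totalCount = (int) str_replace(',', '', $matches[1]);
}

var_dump($totalCount); // int(40923)
```

Selecting by id avoids depending on the word position in the text, so extra whitespace inside the span does not shift the index.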

Give this a try:

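$match = array();
// find() returns an array of nodes, so match against the first one cast to a string
preg_match('/(?<=of\s)(?:\d{1,3}+(?:,\d{3})*)(?=\sResults)/', (string) $htmlResultCount[0], $match);
$europeFormatCount = str_replace(',', '', $match[0]);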

The regex reads the number between "of " and " Results"; it matches numbers that use ',' as the thousands separator.
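
A small sketch of how this lookaround pattern behaves on a few inputs may help. The sample strings are made up for illustration, apart from the first one, which comes from the question.

```php
<?php
// Same pattern as above: lookbehind for "of ", lookahead for " Results".
$pattern = '/(?<=of\s)(?:\d{1,3}+(?:,\d{3})*)(?=\sResults)/';

$samples = [
    'Showing 1 - 12 of 40,923 Results',     // from the question
    'Showing 1 - 12 of 1,234,567 Results',  // hypothetical: two separators
    'Showing 1 - 7 of 7 Results',           // hypothetical: no separator needed
    'Showing 1 - 12 of 40923 Results',      // hypothetical: ungrouped digits, no match
];

$counts = [];
foreach ($samples as $text) {
    // $match[0] holds only the number, because lookarounds consume no characters.
    if (preg_match($pattern, $text, $match) === 1) {
        $counts[] = (int) str_replace(',', '', $match[0]);
    }
}

print_r($counts);
```

Note that the pattern expects digits grouped in threes, so a count of four or more digits written without commas is skipped, as the last sample shows. Checking the return value of preg_match before reading $match[0] avoids an undefined-index notice in that case.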

Licensed under: CC-BY-SA with attribution