Question

I am trying to scrape all the URLs on the home page of my client's site so I can migrate it to WordPress. The problem is I can't seem to arrive at a de-duplicated list of URLs.

Here's the code:

$html = file_get_contents('http://www.catwalkyourself.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
   $href = $hrefs->item($i);
   $url = $href->getAttribute('href');

   if($url = preg_match_all('((www|http://)(www)?.catwalkyourself.com\/?.*)', $url, $matches[0])){
    $urls = $matches[0][0][0];
    $list = implode( ', ', array_unique( explode(", ", $urls) ) );
    echo $list . '<br/>';
    //print_r($list);
   }
}

Instead I am getting duplicates like this:

http://www.catwalkyourself.com/rss.php
http://www.catwalkyourself.com/rss.php

How do I fix this?

Solution

The last part of your code shouldn't be in the loop. You're traversing a list containing every link on the page. Each pass through the loop handles only one link, so you're applying array_unique to an array that can never contain more than one element, and nothing is ever removed.

Try something like this:

$html = file_get_contents('http://www.catwalkyourself.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$urls = array();

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');

    // keep only links that point at catwalkyourself.com
    if (preg_match('((www|http://)(www)?.catwalkyourself.com\/?.*)', $url, $matches)) {
        $urls[] = $matches[0]; // collect now, de-duplicate after the loop
    }
}
$list = implode(', ', array_unique($urls));
echo $list . '<br/>';
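
To see why moving array_unique out of the loop matters, here is a minimal sketch that runs without any network access. The sample href values and the variable names $perItem and $collected are made up for illustration; they are not part of the original code.

```php
<?php
// Hypothetical sample of href values, as they might come out of the loop.
$hrefs = array(
    'http://www.catwalkyourself.com/rss.php',
    'http://www.catwalkyourself.com/rss.php',
    'http://www.catwalkyourself.com/',
);

// Inside the loop: each call sees a one-element array, so nothing is removed.
$perItem = array();
foreach ($hrefs as $href) {
    $perItem[] = implode(', ', array_unique(explode(', ', $href)));
}

// After the loop: array_unique sees the whole list and drops the repeat.
// array_values re-indexes the result, since array_unique preserves keys.
$collected = array_values(array_unique($hrefs));

echo count($perItem) . "\n";   // 3
echo count($collected) . "\n"; // 2
```

The per-item version still holds three entries, duplicate included, while the collected version holds two.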

OTHER TIPS

As the loop is structured right now, you are always calling array_unique on an array of size 1.

You need to build a list of URLs and then call array_unique. Try this:

<?php

$html = file_get_contents('http://www.catwalkyourself.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$urls  = array();

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url  = $href->getAttribute('href');

    if( ($count = preg_match_all('((www|http://)(www)?.catwalkyourself.com\/?.*)', $url, $matches[0])) > 0) {
        $urls[] = $matches[0][0][0]; // build list of URLs in the loop
    }
}

$list = implode( ', ', array_unique( $urls ) );
echo $list . '<br/>';
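
If it helps, the same "collect first, de-duplicate once" idea can be wrapped in a small function and tried on a fixed HTML string. This is only a sketch: the function name uniqueLinks, the $needle parameter, and the sample markup are hypothetical, and the plain substring check is a simplified stand-in for the regex used above.

```php
<?php
// Hypothetical helper: returns the unique hrefs in $html that contain $needle.
function uniqueLinks($html, $needle)
{
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = array();
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $url = trim($anchor->getAttribute('href'));
        // Simple substring check; a stand-in for the regex in the answers.
        if ($url !== '' && stripos($url, $needle) !== false) {
            $urls[] = $url;
        }
    }

    // De-duplicate once, after every link has been collected.
    return array_values(array_unique($urls));
}

// Made-up markup for illustration.
$html = '<html><body>'
      . '<a href="http://www.catwalkyourself.com/rss.php">RSS</a>'
      . '<a href="http://www.catwalkyourself.com/rss.php">RSS again</a>'
      . '<a href="http://example.com/">Elsewhere</a>'
      . '</body></html>';

$links = uniqueLinks($html, 'catwalkyourself.com');
echo implode(', ', $links) . "\n";
```

With the sample markup this prints the rss.php URL once. For the real site you would pass in the result of file_get_contents instead of the fixed string.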
Licensed under: CC-BY-SA with attribution