Question

I'm trying to write a script to scrape the canonical URL from a remote URL. I'm not a professional developer, so if something is ugly in my code, any explanation would (and will) be appreciated.

What I'm trying to do is either look for:

<meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />
<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />

... and extract the URL out of it.

My code so far:

    $content = file_get_contents($url);
    $content = strtolower($content);
    $content = preg_replace("'<style[^>]*>.*</style>'siU",'',$content);  // strip css
    $content = preg_replace("'<script[^>]*>.*</script>'siU",'',$content); // strip js
    $split = explode("\n",$content); // Separate each line

    foreach ($split as $k => $v) // For each line
    {
        if (strpos(' '.$v,'<meta') || strpos(' '.$v,'<link')) // If contains a <meta or <link
        {
        // Check with regex and if found, return what I need (the URL)
        }
    }
    return $split_content;

I've been fighting with regex for hours trying to figure out how to do this, but it seems to be well beyond my knowledge.

Would someone know how I need to define this rule? Also, does my script seem okay to you, or is there room for improvement?

Thanks a bunch!

Solution

Using DOMDocument, you can read the tag's attributes (here, property and content) like this:

$html = '<meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />';
$dom = new DOMDocument;
$dom->loadHTML($html);
$attr = array();
foreach ($dom->getElementsByTagName('meta') as $meta) {
    if ($meta->hasAttributes()) {
        foreach ($meta->attributes as $attribute) {
            $attr[$attribute->nodeName] = $attribute->nodeValue;
        }
    }
}

print_r($attr);

Output:

Array
(
    [property] => og:url
    [content] => http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html
)

You can get the attributes of the second tag (the canonical link) the same way:

$html = '<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />';
$dom = new DOMDocument;
$dom->loadHTML($html);
$attr = array();
foreach ($dom->getElementsByTagName('link') as $link) {
    if ($link->hasAttributes()) {
        foreach ($link->attributes as $attribute) {
            $attr[$attribute->nodeName] = $attribute->nodeValue;
        }
    }
}


print_r($attr);

Output:

Array
(
    [rel] => canonical
    [href] => http://www.another-canonical-url.com/is-here
)

OTHER TIPS

Consider using DOMDocument: load your HTML into a DOMDocument object, call getElementsByTagName, and loop over the results until one of them has the right attributes, much as you would in JavaScript.
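
As a rough illustration of that loop, here is a minimal sketch. The function name findCanonicalUrl is hypothetical (it is not from the question or the answer), and checking the link tag before the og:url meta tag is just an assumption about which one you would prefer. It also avoids lowercasing the whole document, since that would lowercase the URL itself.

```php
<?php
// Hypothetical helper (name and tag priority are assumptions, not part of
// the original answer): returns the canonical URL declared in an HTML
// string, or null if none is found.
function findCanonicalUrl(string $html): ?string
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return null;
    }

    $dom = new DOMDocument;

    // Real-world pages are rarely valid HTML; collect libxml parse
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // First choice: <link rel="canonical" href="...">
    foreach ($dom->getElementsByTagName('link') as $link) {
        // getAttribute() returns '' when the attribute is missing.
        if (strtolower($link->getAttribute('rel')) === 'canonical'
            && $link->getAttribute('href') !== '') {
            return $link->getAttribute('href');
        }
    }

    // Fallback: <meta property="og:url" content="...">
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('property')) === 'og:url'
            && $meta->getAttribute('content') !== '') {
            return $meta->getAttribute('content');
        }
    }

    return null;
}
```

With the question's code, this would presumably be called as findCanonicalUrl(file_get_contents($url)), after checking that file_get_contents did not return false.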

Licensed under: CC-BY-SA with attribution