Question

I'm trying to scrape some websites using cURL. To make the relative URLs resolve against the original site, I insert a <base> tag like this:

 $curl_scraped_page = preg_replace("/<head>/i", "<head><base href='$url' />", $curl_scraped_page, 1);

It works well for most websites, but not all of them. For instance, on this website ("NS Website") it has no effect at all: the relative URLs are still completed with my own domain as the base URL, e.g. mydomain.com/css.css

This is the complete code I'm using:

<?php

$url = $_GET['url'];

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);

$curl_scraped_page = preg_replace("/<head>/i", "<head><base href='$url' />", $curl_scraped_page, 1);

curl_close($ch);

echo $curl_scraped_page;

?>

Live example at phpfiddle


Solution

Your problem is in the regular expression.

Your pattern matches only a bare <head>, but the example website's tag carries an attribute: <head profile="http://gmpg.org/xfn/11">. The pattern never matches, so nothing is inserted.

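Replace your regular expression with one that allows attributes. Use [^>]* rather than .* here: .* is greedy and would swallow everything up to the last > on the line, while \b keeps <header> from matching. The $0 in the replacement puts the matched tag back, so its attributes are kept:

$curl_scraped_page = preg_replace('/<head\b[^>]*>/i', "$0<base href='$url' />", $curl_scraped_page, 1);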
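
A slightly more defensive variant, as a minimal sketch rather than a definitive fix: the helper name inject_base_tag is hypothetical and not from the original post. It matches the opening head tag with or without attributes, keeps the tag as it was, and HTML-escapes the URL before placing it in the href attribute.

```php
<?php
// Hypothetical helper (an assumption, not part of the original answer):
// inserts a <base> tag right after the first opening <head ...> tag.
function inject_base_tag(string $html, string $url): string
{
    // Escape the URL so quotes or angle brackets cannot break out of the attribute.
    $base = '<base href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '" />';

    // \b stops <header> from matching; [^>]* ends at the first ">".
    // Using a callback means "$1"-style sequences inside the URL are never
    // interpreted as backreferences.
    $result = preg_replace_callback(
        '/<head\b[^>]*>/i',
        function (array $m) use ($base): string {
            return $m[0] . $base; // original tag, attributes intact, then <base>
        },
        $html,
        1 // only the first match
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $html;
}
```

In the question's script this would stand in for the preg_replace line: $curl_scraped_page = inject_base_tag($curl_scraped_page, $url);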
Licensed under: CC-BY-SA with attribution