I'm trying to scrape some websites using cURL. To make the relative URLs resolve against the original site, I insert a <base> tag like this:

 $curl_scraped_page = preg_replace("/<head>/i", "<head><base href='$url' />", $curl_scraped_page, 1);

It works well for most websites, but not all of them. For instance, on this website, "NS Website", it has no effect at all: the relative URLs are still resolved against my own domain as the base URL, e.g. mydomain.com/css.css

This is the complete code I'm using:

<?php

$url = $_GET['url'];

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT,2);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);

$curl_scraped_page = preg_replace("/<head>/i", "<head><base href='$url' />", $curl_scraped_page, 1);

curl_close($ch);

echo $curl_scraped_page;

?>

Live example at phpfiddle


Solution

Your problem is in the regular expression.

You are looking for a literal <head>, but the example website uses <head profile="http://gmpg.org/xfn/11">, so the pattern never matches and no <base> tag is inserted.

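Replace your regular expression with one that allows attributes on the tag:

$curl_scraped_page = preg_replace("/<head\b[^>]*>/i", "$0<base href='$url' />", $curl_scraped_page, 1);

Here [^>]* stops at the end of the <head ...> tag itself (a plain .* would run on to the last > on the same line), \b keeps the pattern from matching <header>, and $0 puts the original tag back with its attributes intact.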
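If you would rather not interpolate $url straight into the replacement string, the same fix can be wrapped in a small helper. This is only a sketch under some assumptions: inject_base_tag is a hypothetical name (not something from your code), the sample HTML and http://example.com/ are made up for illustration, and it assumes the page has at most one opening <head> tag worth patching. It matches the opening <head> tag with any attributes, keeps the tag as-is, and appends an HTML-escaped <base> element after it.

```php
<?php

// Hypothetical helper (not from the original post): inserts a <base> tag
// right after the first <head ...> opening tag, keeping its attributes.
function inject_base_tag(string $html, string $baseUrl): string
{
    // Escape the URL so quotes or ampersands cannot break the attribute.
    $base = "<base href='" . htmlspecialchars($baseUrl, ENT_QUOTES, 'UTF-8') . "' />";

    // \b prevents a match on <header>; [^>]* consumes only the attributes
    // of the <head> tag itself instead of running to a later ">".
    $result = preg_replace_callback(
        '/<head\b[^>]*>/i',
        function (array $m) use ($base): string {
            return $m[0] . $base; // original tag, then the <base> element
        },
        $html,
        1 // only patch the first match
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

// Made-up sample input, modelled on the <head profile="..."> case above.
$sample = '<html><head profile="http://gmpg.org/xfn/11"><title>t</title></head>'
        . '<body><header>h</header></body></html>';

echo inject_base_tag($sample, 'http://example.com/');
```

In your script this would replace the preg_replace line, e.g. $curl_scraped_page = inject_base_tag($curl_scraped_page, $url);. Using a callback also means that characters such as $ or \ in the URL are never interpreted as backreferences in the replacement.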
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow