Change relative url with CURL [duplicate]

https://stackoverflow.com/questions/16383143

14-04-2022
Question

I'm trying to scrape some websites using cURL. To make the relative URLs resolve against the original site, I have inserted this:

 $curl_scraped_page = preg_replace("/<head>/i", "<head><base href='$url' />", $curl_scraped_page, 1);

It works well for most websites, but not all of them. For instance, on this website ("NS Website") it has no effect at all: the relative URLs are still resolved against my own domain, e.g. mydomain.com/css.css

This is the complete code I'm using:

<?php

$url = $_GET['url'];

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT,2);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);

$curl_scraped_page = preg_replace("/<head>/i", "<head><base href='$url' />", $curl_scraped_page, 1);

curl_close($ch);

echo $curl_scraped_page;

?>

Live example at phpfiddle

Solution

Your problem is in the regular expression.

You are looking for a literal <head>, but the example website's tag is <head profile="http://gmpg.org/xfn/11">. The pattern never matches, so no <base> tag is inserted.

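Replace your regular expression with one that accepts attributes on the tag. Use [^>]* rather than .* so the match stops at the first >, add \b so that <header> is not matched, and put $0 in the replacement so the original tag and its attributes are kept:

$curl_scraped_page = preg_replace("/<head\b[^>]*>/i", "$0<base href='$url' />", $curl_scraped_page, 1);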
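The replacement can also be wrapped in a small helper, which makes it easier to try against different <head> variants. This is a minimal sketch, not the answerer's code. The function name inject_base_tag is hypothetical, PHP 7 or later is assumed, and the pattern uses \b and [^>]* so that it matches <head> with or without attributes but not <header>. The URL is escaped because in the question it comes straight from $_GET.

```php
<?php
// Hypothetical helper (not from the original answer): inserts a <base> tag
// right after the opening <head ...> tag, whatever attributes it carries.
function inject_base_tag(string $html, string $url): string
{
    // Escape the URL so quotes or ampersands in it cannot break the attribute.
    $base = '<base href="' . htmlspecialchars($url, ENT_QUOTES) . '" />';

    // \b keeps <header> from matching; [^>]* stops at the first closing ">".
    // A callback is used so "$1" or "\1" inside the URL is never treated
    // as a backreference in the replacement.
    $result = preg_replace_callback(
        '/<head\b[^>]*>/i',
        function (array $m) use ($base): string {
            return $m[0] . $base; // keep the original tag, then append <base>
        },
        $html,
        1 // only the first <head> tag
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

// Example input modelled on the <head profile="..."> tag from the answer.
$page = '<html><head profile="http://gmpg.org/xfn/11"><title>t</title></head>'
      . '<body><header>h</header></body></html>';
echo inject_base_tag($page, 'http://example.com/'), "\n";
```

In the question's script this would replace the preg_replace line, e.g. $curl_scraped_page = inject_base_tag($curl_scraped_page, $url);. A page with no <head> tag at all is returned unchanged.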

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow