Question

Background info:

  • I'm collecting some URLs dynamically from various sources online.
  • I would like to get the URL's content if it's an HTML page or an image.
  • I do not want to download large files (such as a ZIP archive, a PDF, or similar) only to find out that the target is of no interest to me.

Is there a way to check the response type/format with PHP before actually fetching the content? (I want to avoid wasting my own and the target server's resources and bandwidth.)

(I found get_headers() in the PHP documentation, but it is unclear to me whether the function fetches the entire content and then returns the headers, or only requests the headers from the server without downloading the content first. I also found solutions that get the headers with cURL and fsockopen, but the question remains whether I can do it without loading the actual content.)


Solution

There is a PHP function for that:

$headers=get_headers("http://www.amazingjokes.com/img/2014/530c9613d29bd_CountvonCount.jpg");
print_r($headers);

returns the following:

Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Tue, 11 Mar 2014 22:44:38 GMT
    [2] => Server: Apache
    [3] => Last-Modified: Tue, 25 Feb 2014 14:08:40 GMT
    [4] => ETag: "54e35e8-8873-4f33ba00673f4"
    [5] => Accept-Ranges: bytes
    [6] => Content-Length: 34931
    [7] => Connection: close
    [8] => Content-Type: image/jpeg
)

From here it should be easy to pick out the Content-Type entry.
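As a minimal sketch of that last step, the helper below pulls the media type out of an array shaped like the output above. The function names `contentTypeFromHeaders` and `isWantedType` are made up for this illustration, and the "HTML or image" policy is taken from the question.

```php
<?php
/**
 * Return the media type (e.g. "image/jpeg") from a get_headers()-style
 * numeric array, or null if no Content-Type header is present.
 * If redirects were followed, several responses are listed in order,
 * so the last Content-Type seen belongs to the final response.
 */
function contentTypeFromHeaders(array $headers): ?string
{
    $type = null;
    foreach ($headers as $line) {
        // Header names are case-insensitive. Stopping at the first ";" or
        // whitespace drops parameters such as "; charset=UTF-8".
        if (is_string($line) && preg_match('/^content-type:\s*([^;\s]+)/i', $line, $m)) {
            $type = strtolower($m[1]);
        }
    }
    return $type;
}

/**
 * Hypothetical policy from the question: accept HTML pages and images only.
 */
function isWantedType(?string $type): bool
{
    return $type === 'text/html'
        || ($type !== null && strncmp($type, 'image/', 6) === 0);
}
```

With the array printed above, `contentTypeFromHeaders($headers)` would give `image/jpeg`, which `isWantedType()` accepts.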

Further reading: the get_headers() page in the PHP manual (php.net).
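On whether get_headers() downloads the content: according to the PHP manual it sends a GET request by default, so the server is free to start sending the body. The manual also shows that a stream context can switch the method to HEAD. The sketch below assumes PHP 7.1 or later (for the context argument of get_headers()); the helper names, the timeout, and the redirect limit are illustrative choices, not part of the original answer.

```php
<?php
/**
 * Build a stream context that makes get_headers() send HEAD instead of GET.
 * The default timeout and the redirect limit are illustrative values.
 */
function headContext(float $timeout = 5.0)
{
    return stream_context_create([
        'http' => [
            'method'        => 'HEAD',
            'timeout'       => $timeout,
            'max_redirects' => 5,
        ],
    ]);
}

/**
 * Fetch response headers with a HEAD request.
 * Returns the same numeric array as get_headers(), or false on failure.
 */
function headHeaders(string $url)
{
    return get_headers($url, false, headContext());
}
```

Calling `headHeaders()` with the image URL from the answer should return an array like the one shown above, assuming the server answers HEAD requests the same way it answers GET.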

Other tips

Try using an HTTP HEAD request to retrieve only the headers. Something like:

curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'HEAD');

or (what the manual recommends):

curl_setopt($ch, CURLOPT_NOBODY, true);

(I haven't tested either of these.)
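For context, a fuller sketch of the CURLOPT_NOBODY variant might look like the following. It is untested against real servers; the function names, the timeout, and the redirect limit are assumptions for illustration. The curl command-line tool itself warns that setting the method to HEAD through a custom request may not behave as expected, which is one reason to prefer CURLOPT_NOBODY.

```php
<?php
/**
 * Reduce a Content-Type value such as "text/html; charset=UTF-8"
 * to its lowercase media type ("text/html").
 */
function mediaType(?string $contentType): ?string
{
    if ($contentType === null || trim($contentType) === '') {
        return null;
    }
    return strtolower(trim(explode(';', $contentType, 2)[0]));
}

/**
 * Send a HEAD request with cURL and return the media type of the
 * final response, or null if the request failed or no type was sent.
 */
function headContentType(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // HEAD request, no body expected
        CURLOPT_RETURNTRANSFER => true,  // do not echo anything
        CURLOPT_FOLLOWLOCATION => true,  // report the type after redirects
        CURLOPT_MAXREDIRS      => 5,     // illustrative limit
        CURLOPT_TIMEOUT        => 10,    // illustrative timeout in seconds
    ]);

    if (curl_exec($ch) === false) {
        return null;
    }
    $raw = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    return is_string($raw) ? mediaType($raw) : null;
}
```

Some servers reject HEAD or answer it differently from GET, so a null result here does not prove the target is uninteresting.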

Here is a solution using cURL with a CURLOPT_WRITEFUNCTION callback. With CURLOPT_HEADER enabled, the callback receives the response headers before the body, so it can check the Content-Type as soon as it arrives. If the type is not one we want, the callback tells cURL to abort, so no time is wasted downloading the body.

$ch = curl_init('http://stackoverflow.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// pass the header lines to the write callback as well, ahead of the body
curl_setopt($ch, CURLOPT_HEADER, true);

$data = '';
$haveHeader = false;

curl_setopt($ch, CURLOPT_WRITEFUNCTION, function($ch, $chunk) use (&$haveHeader, &$data) {
    if (!$haveHeader && ($chunk == "\n" || $chunk == "\r\n")) {
        // a blank line marks the end of the header
        $haveHeader = true;
    } else if (!$haveHeader) {
        // look for the content type; [^;\s]+ stops before parameters such as
        // "; charset=..." and before the trailing "\r\n" (without this, a
        // bare "text/html\r\n" would never compare equal to 'text/html')
        if (preg_match('/^content-type:\s*([^;\s]+)/i', $chunk, $matches)) {
            $contentType = strtolower($matches[1]);
            // check if content type is what we want
            if ($contentType != 'text/html' && strpos($contentType, 'image/') !== 0) {
                // returning anything other than strlen($chunk) tells cURL to abort
                return 0;
            }
        }
    } else {
        // append to data (body/content)
        $data .= $chunk;
    }

    return strlen($chunk);
});

// curl_exec() returns false if the callback aborted the transfer
if (curl_exec($ch)) {
    // use $data here
    echo strlen($data);
}
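
A variation on the same idea, not part of the original answer, is to use CURLOPT_HEADERFUNCTION, which cURL calls once per header line, so headers and body no longer share one callback. The sketch below is a hedged illustration: `makeHeaderFilter` and the predicate passed to it are hypothetical names.

```php
<?php
/**
 * Build a CURLOPT_HEADERFUNCTION callback that aborts the transfer as soon
 * as a Content-Type header arrives that $isWanted rejects.
 * $isWanted receives the lowercase media type, e.g. "image/png".
 */
function makeHeaderFilter(callable $isWanted): callable
{
    return function ($ch, string $line) use ($isWanted): int {
        if (preg_match('/^content-type:\s*([^;\s]+)/i', $line, $m)
            && !$isWanted(strtolower($m[1]))) {
            // Header lines are never empty, so 0 differs from the line
            // length, which cURL treats as a request to abort.
            return 0;
        }
        // Report the whole line as handled so the transfer continues.
        return strlen($line);
    };
}

/*
 * Possible usage (performs a real request, so it is left commented out):
 *
 * $ch = curl_init('http://stackoverflow.com/');
 * curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
 * curl_setopt($ch, CURLOPT_HEADERFUNCTION, makeHeaderFilter(function (string $t): bool {
 *     return $t === 'text/html' || strncmp($t, 'image/', 6) === 0;
 * }));
 * $body = curl_exec($ch); // false if the filter aborted the transfer
 */
```

If redirects are followed, the filter also sees the headers of each intermediate response, so the predicate may need to tolerate whatever type the redirect pages declare.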
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow