Easy way to test a URL for 404 in PHP?

https://stackoverflow.com/questions/408405

03-07-2019

Question

I'm teaching myself some basic scraping, and I've found that sometimes the URL I feed into my code returns a 404, which gums up the rest of my code.

So I need a test at the top of the code to check whether the URL returns a 404 or not.

This would seem like a pretty straightforward task, but Google isn't giving me any answers. I worry I'm searching for the wrong thing.

One blog recommended using this:

$valid = @fsockopen($url, 80, $errno, $errstr, 30);

and then testing whether $valid is empty or not.

But I think the URL that's giving me problems has a redirect on it, so $valid comes up empty for all values. Or perhaps I'm doing something else wrong.

I've also looked into a "HEAD request", but I have yet to find any actual code examples I can play with or try out.

Suggestions? And what's this about curl?

Solution

If you are using PHP's curl bindings, you can check the HTTP status code with curl_getinfo, like this:

$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);

/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 404) {
    /* Handle 404 here. */
}

curl_close($handle);

/* Handle $response here. */

Other answers

If you are running PHP5 you can use:

$url = 'http://www.example.com';
print_r(get_headers($url, 1));

Alternatively, for PHP4 a user has contributed the following:

/**
This is a modified version of code from "stuart at sixletterwords dot com", at 14-Sep-2005 04:52. This version tries to emulate get_headers() function at PHP4. I think it works fairly well, and is simple. It is not the best emulation available, but it works.

Features:
- supports (and requires) full URLs.
- supports changing of default port in URL.
- stops downloading from socket as soon as end-of-headers is detected.

Limitations:
- only gets the root URL (see line with "GET / HTTP/1.1").
- don't support HTTPS (nor the default HTTPS port).
*/

if(!function_exists('get_headers'))
{
    function get_headers($url,$format=0)
    {
        $url=parse_url($url);
        $end = "\r\n\r\n";
        $fp = fsockopen($url['host'], (empty($url['port'])?80:$url['port']), $errno, $errstr, 30);
        if ($fp)
        {
            $out  = "GET / HTTP/1.1\r\n";
            $out .= "Host: ".$url['host']."\r\n";
            $out .= "Connection: Close\r\n\r\n";
            $var  = '';
            fwrite($fp, $out);
            while (!feof($fp))
            {
                $var.=fgets($fp, 1280);
                if(strpos($var,$end))
                    break;
            }
            fclose($fp);

            $var=preg_replace("/\r\n\r\n.*\$/",'',$var);
            $var=explode("\r\n",$var);
            if($format)
            {
                foreach($var as $i)
                {
                    if(preg_match('/^([a-zA-Z -]+): +(.*)$/',$i,$parts))
                        $v[$parts[1]]=$parts[2];
                }
                return $v;
            }
            else
                return $var;
        }
    }
}

Both would produce a result similar to:

Array
(
    [0] => HTTP/1.1 200 OK
    [Date] => Sat, 29 May 2004 12:28:14 GMT
    [Server] => Apache/1.3.27 (Unix)  (Red-Hat/Linux)
    [Last-Modified] => Wed, 08 Jan 2003 23:11:55 GMT
    [ETag] => "3f80f-1b6-3e1cb03b"
    [Accept-Ranges] => bytes
    [Content-Length] => 438
    [Connection] => close
    [Content-Type] => text/html
)

So you could simply check that the header response was OK, for example:

$headers = get_headers($url, 1);
if ($headers[0] == 'HTTP/1.1 200 OK') {
//valid 
}

if ($headers[0] == 'HTTP/1.1 301 Moved Permanently') {
//moved or redirect page
}

W3C codes and definitions
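Comparing the whole status line against a fixed string such as 'HTTP/1.1 200 OK' can miss valid responses that use another protocol version or reason phrase. In addition, get_headers() follows redirects by default, so the array may hold several status lines. Below is a minimal sketch of extracting the numeric code instead. The helper names status_code_from_line and final_status_code are hypothetical and not part of PHP.

```php
<?php
/**
 * Hypothetical helper: pull the three-digit code out of a status line
 * such as "HTTP/1.1 404 Not Found". Returns null for any other line.
 */
function status_code_from_line(string $statusLine): ?int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

/**
 * Hypothetical helper: given the array returned by get_headers(), return
 * the code of the last status line, i.e. the response after any redirects.
 * Status lines sit under integer keys in both output formats.
 */
function final_status_code(array $headers): ?int
{
    $code = null;
    foreach ($headers as $key => $value) {
        if (is_int($key) && is_string($value)) {
            $parsed = status_code_from_line($value);
            if ($parsed !== null) {
                $code = $parsed;
            }
        }
    }
    return $code;
}
```

With this, a check such as `final_status_code(get_headers($url, 1)) === 404` no longer depends on the exact wording of the status line. Remember that get_headers() returns false on failure, so guard against that before passing the result on.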

With strager's code, you can also check CURLINFO_HTTP_CODE for other codes. Some websites do not report a 404; instead they redirect to a custom 404 page and return a 302 (redirect) or something similar. I used this to check whether an actual file (e.g. robots.txt) existed on the server or not. Clearly this kind of file would not cause a redirect if it existed, but if it didn't, the server would redirect to a 404 page, which, as I said before, may not carry a 404 code.

function is_404($url) {
    $handle = curl_init($url);
    curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

    /* Get the HTML or whatever is linked in $url. */
    $response = curl_exec($handle);

    /* Check for 404 (file not found). */
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);

    /* If the document has loaded successfully without any redirection or error */
    if ($httpCode >= 200 && $httpCode < 300) {
        return false;
    } else {
        return true;
    }
}
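The function above returns true for anything outside the 2xx range, which lumps together redirects, real 404s, server errors, and the value 0 that curl reports when no HTTP response arrived at all. If the caller needs to tell these apart, one possible approach is a small classifier. The function name classify_status and its labels are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: map an HTTP status code (as reported by
 * CURLINFO_HTTP_CODE) to a coarse category.
 */
function classify_status(int $code): string
{
    if ($code === 0) {
        return 'no_response';   // curl received no HTTP response at all
    }
    if ($code >= 200 && $code < 300) {
        return 'ok';
    }
    if ($code >= 300 && $code < 400) {
        return 'redirect';      // possibly a "soft 404" redirect
    }
    if ($code === 404 || $code === 410) {
        return 'not_found';
    }
    if ($code >= 400 && $code < 500) {
        return 'client_error';
    }
    if ($code >= 500 && $code < 600) {
        return 'server_error';
    }
    return 'other';
}
```

A scraper could then skip 'not_found' URLs, retry 'server_error' ones later, and inspect the target of a 'redirect' before deciding.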

As strager suggests, look into using cURL. You may also be interested in setting CURLOPT_NOBODY with curl_setopt to skip downloading the whole page (you only want the headers).
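As a rough sketch of that idea, the function below sends a HEAD request, follows redirects, and returns the final status code. The name probe_status, the redirect limit of 5, and the default timeout are assumptions, not anything from the answers here. It returns null when curl gets no HTTP response, for example on a DNS failure or timeout.

```php
<?php
/**
 * Hypothetical helper: return the final HTTP status code for $url using a
 * HEAD request, or null when no HTTP response was received.
 */
function probe_status(string $url, int $timeout = 10): ?int
{
    $handle = curl_init($url);
    if ($handle === false) {
        return null;
    }
    curl_setopt_array($handle, [
        CURLOPT_NOBODY         => true,     // HEAD request, skip the body
        CURLOPT_RETURNTRANSFER => true,     // never echo the response
        CURLOPT_FOLLOWLOCATION => true,     // report the status after redirects
        CURLOPT_MAXREDIRS      => 5,        // assumed limit, adjust as needed
        CURLOPT_CONNECTTIMEOUT => $timeout, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout, // seconds allowed for the whole request
    ]);
    $result = curl_exec($handle);
    $code   = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);
    // curl_exec() returns false on transport errors; the code is then 0.
    // The handle is released when $handle goes out of scope.
    if ($result === false || $code === 0) {
        return null;
    }
    return $code;
}
```

Usage might look like `if (probe_status($url) === 404) { /* skip this URL */ }`. One caveat: some servers answer HEAD differently from GET (for instance with 405), so a fallback to a normal GET may be needed for such hosts.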

If you are looking for the easiest solution, one you can try in a single step on PHP5, do:

// The scheme is required; without it PHP looks for a local file.
file_get_contents('http://www.yoursite.com');
//and check by echoing
echo $http_response_header[0];

I found this answer here:

if(($twitter_XML_raw=file_get_contents($timeline))==false){
    // Retrieve HTTP status code
    list($version,$status_code,$msg) = explode(' ',$http_response_header[0], 3);

    // Check the HTTP Status code
    switch($status_code) {
        case 200:
                $error_status="200: Success";
                break;
        case 401:
                $error_status="401: Login failure.  Try logging out and back in.  Password are ONLY used when posting.";
                break;
        case 400:
                $error_status="400: Invalid request.  You may have exceeded your rate limit.";
                break;
        case 404:
                $error_status="404: Not found.  This shouldn't happen.  Please let me know what happened using the feedback link above.";
                break;
        case 500:
                $error_status="500: Twitter servers replied with an error. Hopefully they'll be OK soon!";
                break;
        case 502:
                $error_status="502: Twitter servers may be down or being upgraded. Hopefully they'll be OK soon!";
                break;
        case 503:
                $error_status="503: Twitter service unavailable. Hopefully they'll be OK soon!";
                break;
        default:
                $error_status="Undocumented error: " . $status_code;
                break;
    }
}

Basically, you use file_get_contents to retrieve the URL, which automatically populates the $http_response_header variable with the response headers, including the status line.
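By default file_get_contents() issues a GET and returns false with a warning on 4xx and 5xx responses. A stream context can switch it to a HEAD request and keep error responses readable. The following is a minimal sketch under those assumptions. The names head_context and status_via_stream are hypothetical.

```php
<?php
/**
 * Hypothetical helper: build a stream context that sends HEAD requests
 * and does not treat 4xx/5xx responses as failures.
 */
function head_context(int $timeout = 10)
{
    return stream_context_create([
        'http' => [
            'method'        => 'HEAD',
            'ignore_errors' => true,     // still return on 4xx/5xx
            'timeout'       => $timeout, // read timeout in seconds
        ],
    ]);
}

/**
 * Hypothetical helper: return the last status code seen for $url,
 * or null when no HTTP response was received.
 */
function status_via_stream(string $url): ?int
{
    @file_get_contents($url, false, head_context());
    // The HTTP wrapper fills $http_response_header in the local scope.
    $lines = $http_response_header ?? [];
    $code  = null;
    foreach ($lines as $line) {
        // With redirects there are several status lines; keep the last one.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}
```

This keeps the approach dependency-free, which may matter on hosts where the curl extension is not installed. It does require allow_url_fopen to be enabled.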

Addendum: I tested those three methods with performance in mind.

The result, at least in my testing environment:

curl wins

This test assumes that only the headers are needed (noBody). Test it yourself:

$url = "http://de.wikipedia.org/wiki/Pinocchio";

$start_time = microtime(TRUE);
$headers = get_headers($url);
echo $headers[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";


$start_time = microtime(TRUE);
$response = file_get_contents($url);
echo $http_response_header[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";

$start_time = microtime(TRUE);
$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle, CURLOPT_NOBODY, 1); // and *only* get the header 
/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);
/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
// if($httpCode == 404) {
    // /* Handle 404 here. */
// }
echo $httpCode."<br>";
curl_close($handle);
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";
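A single request per method gives noisy numbers, since network latency varies from call to call. One way to get a steadier comparison is to average several runs with a small timing helper. This is only a sketch: the name average_seconds and the default of 5 runs are assumptions, and hrtime() requires PHP 7.3 or later.

```php
<?php
/**
 * Hypothetical helper: call $fn $runs times and return the mean
 * wall-clock duration of one call in seconds.
 */
function average_seconds(callable $fn, int $runs = 5): float
{
    $runs  = max(1, $runs);
    $total = 0;
    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true);            // monotonic clock, nanoseconds
        $fn();
        $total += hrtime(true) - $start;
    }
    return $total / $runs / 1e9;          // nanoseconds to seconds
}
```

For example, `average_seconds(function () use ($url) { get_headers($url); })` could be compared against the same call wrapped around the curl variant.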

As an additional hint to the great accepted answer:

When using a variation of the proposed solution, I got errors because of PHP's 'max_execution_time' setting. So what I did was the following:

set_time_limit(120);
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_NOBODY, true);
$result = curl_exec($curl);
set_time_limit(ini_get('max_execution_time'));
curl_close($curl);

First I set the time limit to a higher number of seconds; at the end I set it back to the value defined in the PHP settings.

<?php

$url= 'www.something.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true);   
curl_setopt($ch, CURLOPT_NOBODY, true);    
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.4");
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,10);
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$output = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);


echo $httpcode;
?>

This will give you true if the URL does not return 200 OK:

function check_404($url) {
   $headers=get_headers($url, 1);
   if ($headers[0]!='HTTP/1.1 200 OK') return true; else return false;
}

You can also use this code to see the status of any link:

<?php

function get_url_status($url, $timeout = 10)
{
    $ch = curl_init();
    // set cURL options
    $opts = array(
        CURLOPT_RETURNTRANSFER => true,     // do not output to browser
        CURLOPT_URL            => $url,     // set URL
        CURLOPT_NOBODY         => true,     // do a HEAD request only
        CURLOPT_TIMEOUT        => $timeout, // set timeout
    );
    curl_setopt_array($ch, $opts);
    curl_exec($ch); // do it!
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // find HTTP status
    curl_close($ch); // close handle
    echo $status; // or return $status;
    // example checking
    if ($status == '302') { echo 'HEY, redirection'; }
}

get_url_status('http://yourpage.comm');
?>

Here is a short solution.

$handle = curl_init($uri);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle,CURLOPT_HTTPHEADER,array ("Accept: application/rdf+xml"));
curl_setopt($handle, CURLOPT_NOBODY, true);
curl_exec($handle);
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 200||$httpCode == 303) 
{
    echo "you might get a reply";
}
curl_close($handle);

In your case, you can change application/rdf+xml to whatever you use.

This is just a slice of code; I hope it works for you.

$ch = @curl_init();
@curl_setopt($ch, CURLOPT_URL, 'http://example.com');
@curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
@curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
@curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
@curl_setopt($ch, CURLOPT_TIMEOUT, 10);

$response = @curl_exec($ch);
$errno    = @curl_errno($ch);
$error    = @curl_error($ch);

$info = @curl_getinfo($ch);
return $info['http_code'];

To catch all errors, 4XX and 5XX, I use this little script:

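function URLIsValid($URL){
    $headers = @get_headers($URL);
    // get_headers() returns false when the request fails outright
    if ($headers === false) {
        return false;
    }
    preg_match("/ [45][0-9]{2} /", (string)$headers[0] , $match);
    return count($match) === 0;
}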

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow