Question

I have checked the logs and found that search engines visit a lot of bogus URLs on my website. They most likely date from before many of the links were changed, and even though I have set up 301 redirects, some links have been altered in very strange ways and aren't recognized by my .htaccess file.

All requests are handled by index.php. If a response can't be created due to a bad URL, a custom error page is presented instead. In simplified form, index.php looks like this:

try {
  $Request = new Request();
  $Request->respond();
} catch(NoresponseException $e) {
  $Request->presentErrorPage();
}

I just realized that this page returns status 200, telling the bot that the page is valid even though it isn't.

Is it enough to add a header with 404 in the catch statement to tell the bots to stop visiting that page?

Like this:

header("HTTP/1.0 404 Not Found");

It looks OK when I test it, but I'm worried that SE bots (and maybe user agents) will get confused.

Was it helpful?

Solution

You're getting there. The idea is correct: you want to return a 404. One small correction, though: if the client speaks HTTP/1.1 and you answer with HTTP/1.0, some clients will get confused.

The way around this is as follows:

header($_SERVER['SERVER_PROTOCOL']." 404 Not Found");
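Applied to the catch block from the question, the fix could look like the sketch below. Request, NoresponseException, and presentErrorPage() are the question's own names; the stub class bodies here are only stand-ins so the example is self-contained.

```php
<?php
// Stand-ins for the question's classes, so the sketch runs on its own.
class NoresponseException extends Exception {}

class Request {
    public function respond(): void {
        // Simulate a URL that cannot be resolved to a page.
        throw new NoresponseException('No page matches this URL');
    }
    public function presentErrorPage(): void {
        echo "<h1>Page not found</h1>\n";
    }
}

try {
    $Request = new Request();
    $Request->respond();
} catch (NoresponseException $e) {
    // Send the 404 using the protocol the client actually spoke.
    // The HTTP/1.0 fallback only matters outside a web SAPI (e.g. CLI),
    // where SERVER_PROTOCOL is not set.
    header(($_SERVER['SERVER_PROTOCOL'] ?? 'HTTP/1.0') . ' 404 Not Found');
    $Request->presentErrorPage();
}
```

The key point is that the status line is sent before presentErrorPage() produces any output; once output has started, header() can no longer change the response.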

Other tips

A well-behaved crawler respects robots.txt at the top level of your site. If you want to exclude crawlers, then @SalmanA's response will work. A sample robots.txt file follows:

User-agent: *
Disallow: /foo/*
Disallow: /bar/*
Disallow: /hd1/*

It needs to be world-readable. Note that this won't keep users off the pages, only bots that respect robots.txt, which most of them do.

The SE bots DO get confused when they see this:

HTTP/1.1 200 OK

<h1>The page you requested does not exist</h1>

Or this:

HTTP/1.1 302 Object moved
Location: /fancy-404-error-page.html

It is explained here:

Returning a code other than 404 or 410 for a non-existent page (or redirecting users to another page, such as the homepage, instead of returning a 404) can be problematic. Firstly, it tells search engines that there’s a real page at that URL. As a result, that URL may be crawled and its content indexed. Because of the time Googlebot spends on non-existent pages, your unique URLs may not be discovered as quickly or visited as frequently and your site’s crawl coverage may be impacted (also, you probably don’t want your site to rank well for the search query File not found).

Your idea of programmatically sending the 404 header is correct: it tells the search engine that the requested URL does not exist and that it should not attempt to crawl and index it. Ways to set the response status:

header($_SERVER["SERVER_PROTOCOL"] . " 404 Not Found");

header(":", true, 404);  // this is used to set a header AND modify the http response code
                         // ":" is used as a hack to avoid specifying a real header

http_response_code(404); // PHP >= 5.4
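Of the three, http_response_code() is the most convenient where available, because calling it without an argument reads the current code back, which makes the fix easy to check. A minimal sketch, assuming PHP 5.4 or later:

```php
<?php
// Set the status code for the current response (PHP >= 5.4).
http_response_code(404);

// With no argument, the function returns the code currently set,
// so the change can be verified within the same request.
echo http_response_code(), "\n";
```

On a web SAPI this replaces the default "200 OK" status line; the protocol version is taken care of for you, which avoids the HTTP/1.0-vs-1.1 pitfall entirely.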
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow