Question

I was reading Google's documentation on robots.txt and found that they will postpone crawling your site unless the request for robots.txt returns a 200 or 404 response:

Before Googlebot crawls your site, it accesses your robots.txt file to determine if your site is blocking Google from crawling any pages or URLs. If your robots.txt file exists but is unreachable (in other words, if it doesn’t return a 200 or 404 HTTP status code), we’ll postpone our crawl rather than risk crawling disallowed URLs.

On my site I am using a web.xml <error-page> entry to map error codes to a Spring MVC controller method:

<error-page>
  <error-code>404</error-code>
  <location>/showerror</location>
</error-page>

This forwards to an error page, which is returned with a 200 status (and an HTML body) instead of the original 404.
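For reference, the handler behind /showerror is roughly this shape (the class and view names below are just illustrative):

import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;

@Controller
public class ShowErrorController {
    // The <error-page> forward lands here; rendering a view sends the
    // error page to the client with a 200 status instead of the original 404.
    @RequestMapping("/showerror")
    public String showError() {
        return "error"; // illustrative view name
    }
}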

When the requested URL is /robots.txt, I actually want the 404 to be returned unhandled. Is there an easy way to exempt a specific URL (/robots.txt) from this error handling?

Of course, the other option is to return an empty robots.txt, which doesn't block anything.
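If you go that route, a minimal sketch (assuming Spring 3.1+; the class name is illustrative) that serves an allow-everything robots.txt would be:

import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

@Controller
public class AllowAllRobotsController {
    // Returns a permissive robots.txt with a 200 status, so the crawl is
    // never postponed. An empty body would behave the same way.
    @RequestMapping(value = "/robots.txt", produces = "text/plain")
    @ResponseBody
    public String robotsTxt() {
        return "User-agent: *\nDisallow:\n";
    }
}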


Solution

The approach I would probably take is to actually handle it in a Controller:

import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseStatus;

@Controller
public class RobotsTxtController {
    // Sets the 404 status directly, so the container's <error-page> mapping is never triggered.
    @RequestMapping("/robots.txt")
    @ResponseStatus(HttpStatus.NOT_FOUND)
    public void robotsTxt() {}
}

You could have the method return a view name if you want an actual page to appear.
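For example, reusing the imports above and swapping the handler for something like this (the view name is only illustrative):

@Controller
public class RobotsTxtController {
    // Still responds with a 404 status, but also renders a body for human visitors.
    @RequestMapping("/robots.txt")
    @ResponseStatus(HttpStatus.NOT_FOUND)
    public String robotsTxt() {
        return "robots-missing"; // illustrative view name
    }
}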

This way, it's an actual "page" mapped in Spring rather than something intercepted by your normal 404 handling, but the @ResponseStatus makes it return a 404. Because @ResponseStatus without a reason sets the status via setStatus() rather than sendError(), the web.xml <error-page> mapping never kicks in.

Licensed under: CC-BY-SA with attribution