Question

Googlebot is crawling our site. Based on our URL structure, it is guessing new possible URLs.

Our structure is of the kind /x/y/z/param1.value. Googlebot now exchanges the values of x, y, z, and value for tons of different keywords. The problem is that each call triggers a very expensive operation and returns positive results only in very rare cases.

I tried to set a URL parameter in the crawling section of Webmaster Tools (param1 -> no crawling), but this does not seem to work, probably because of our inline URL format (would it be better to use the HTML GET format ?param1=...?).

Since Disallow: */param1.* does not seem to be a valid robots.txt entry, is there another way to disallow Google from crawling these pages?
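
One variant I have been considering, since Googlebot reportedly supports the * wildcard in robots.txt paths as long as the rule starts with /, is something like this (a sketch only, not verified against our live site):

    User-agent: Googlebot
    # "*" matches any run of characters, so this should cover /x/y/z/param1.anything
    Disallow: /*param1.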

As another solution, I thought of detecting Googlebot and returning it a special page, but I have heard that Google penalizes this (cloaking).

Currently we always return HTTP status code 200 and a human-readable page which says: "No targets for your filter criteria found". Would it help to return another status code?

Solution

Note: This is probably not a general answer!

Joachim was right: it turned out that Googlebot is not guessing URLs.

Doing a bit of research, I found out that half a year ago I had added a new DIV to my site containing those special URLs (which I had unfortunately forgotten about). A week ago Googlebot started crawling it.

My solution: I deleted the DIV and now also return a 404 status code for those URLs. I think that, sooner or later, after revisiting my site Googlebot will stop crawling those URLs.
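
For the status-code part, here is a minimal sketch of what I mean, assuming a Flask app and a hypothetical find_targets() lookup (our real stack differs, so treat it as an illustration only):

    from flask import Flask, abort

    app = Flask(__name__)

    def find_targets(x, y, z, value):
        # Placeholder for the expensive filter operation; returns matching targets.
        return []

    @app.route("/<x>/<y>/<z>/param1.<value>")
    def filtered(x, y, z, value):
        results = find_targets(x, y, z, value)
        if not results:
            # Instead of a 200 page saying "No targets for your filter criteria found",
            # answer 404 so Googlebot learns that the guessed URL does not exist.
            abort(404)
        return "%d targets found" % len(results)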

Thanks for the help!

Licensed under: CC-BY-SA with attribution