Disallow certain URLs in robots.txt [closed]

https://stackoverflow.com/questions/2848140

robots.txt

27-09-2019

Question

We implemented a rating system on a site a while back that involves a link to a script. However, with the vast majority of ratings on the site at 3/5 and the ratings very even across 1-5, we're beginning to suspect that search engine crawlers and the like are getting through. The URLs used look like this:

http://www.thesite.com/path/to/the/page/rate?uid=abcdefghijk&value=3

When we started, we added the following to our robots.txt:

User-agent: *
Disallow: /rate

Is this incorrect, or are Googlebot and others simply ignoring our robots.txt?

Solution

You should use POST for actions that change things, as search engines usually do not submit forms. Additionally, this will prevent users who download your website recursively (e.g. with wget) from submitting tons of votes.
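As a rough illustration of the POST-only idea, here is a minimal sketch written as a plain WSGI application in Python. Everything in it is an assumption made for the example: the in-memory `ratings` store, the `rating_app` and `simulate` names, and the choice of WSGI itself (the original site's stack is not stated). The `uid` and `value` field names are taken from the URL in the question.

```python
import io
from urllib.parse import parse_qs

# Hypothetical in-memory store for the sketch: uid -> list of submitted values.
ratings = {}


def rating_app(environ, start_response):
    """Record a vote only on POST; answer any other method with 405."""
    if environ["REQUEST_METHOD"] != "POST":
        # Crawlers and recursive downloaders follow links with GET, so they end up here.
        start_response("405 Method Not Allowed",
                       [("Allow", "POST"), ("Content-Type", "text/plain")])
        return [b"Use POST to submit a rating.\n"]

    # Read the form body (uid=...&value=...) sent by the rating form.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    form = parse_qs(environ["wsgi.input"].read(length).decode("utf-8"))
    uid = form.get("uid", [""])[0]
    value = form.get("value", [""])[0]

    if not uid or value not in {"1", "2", "3", "4", "5"}:
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"Invalid rating.\n"]

    ratings.setdefault(uid, []).append(int(value))
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Rating recorded.\n"]


def simulate(method, body=b""):
    """Call rating_app without a real server and return the status line."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {
        "REQUEST_METHOD": method,
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    }
    b"".join(rating_app(environ, start_response))
    return captured["status"]
```

With this sketch, `simulate("GET")` is rejected and leaves `ratings` untouched, while `simulate("POST", b"uid=abcdefghijk&value=3")` records the vote. On the page itself, the rating link would become a small form with method="post".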

Depending on your site, handling voting through JavaScript might be a solution, too.

Regarding your robots.txt: it has to be in the root path, i.e. http://www.thesite.com/robots.txt, and if your rating system is at /blah/rate you need to use Disallow: /blah/rate instead of Disallow: /rate.
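One way to check a rule before deploying it is Python's standard urllib.robotparser module, which matches Disallow values as path prefixes. The sketch below is only an illustration: the `is_allowed` helper and the "ExampleBot" user agent are made up for the example, and real crawlers may interpret rules differently (for instance, some support wildcards that this parser does not).

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, agent="ExampleBot"):
    """Return True if the given robots.txt lines let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# The rating URL from the question.
RATE_URL = "http://www.thesite.com/path/to/the/page/rate?uid=abcdefghijk&value=3"

# The rule from the question: only matches paths that start with /rate.
original_rules = ["User-agent: *", "Disallow: /rate"]

# The rule with the full path to the rating script.
full_path_rules = ["User-agent: *", "Disallow: /path/to/the/page/rate"]

print(is_allowed(original_rules, RATE_URL))   # True: the rating URL is not blocked
print(is_allowed(full_path_rules, RATE_URL))  # False: the rating URL is blocked
```

Even a correct rule only affects crawlers that choose to honor robots.txt.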

OTHER TIPS

Looks incorrect to me. You're only disallowing access to http://www.thesite.com/rate (and pages below it, IIRC). Also, some crawlers ignore robots.txt entirely!

Better to make it so that ratings are only ever altered in response to a POST, rather than a GET. Search engine crawlers generally do not issue POST requests.

User-agent: *
Disallow: /path/to/the/page/rate

You have to use the full path.

Might want to read up here a bit: http://www.javascriptkit.com/howto/robots.shtml

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow