Question

Is there a way to configure the robots.txt so that the site accepts visits ONLY from Google, Yahoo! and MSN spiders?

Was it helpful?

Solution

User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
User-agent: Slurp
Allow: /
User-Agent: msnbot
Disallow: 

Slurp is Yahoo's robot

OTHER TIPS

Why?

Anyone doing evil (e.g., gathering email addresses to spam) will just ignore robots.txt. So you're only going to be blocking legitimate search engines, as robots.txt compliance is voluntary.

But — if you insist on doing it anyway — that's what the User-Agent: line in robots.txt is for.

User-agent: googlebot
Disallow: 

User-agent: *
Disallow: /

With lines for all the other search engines you'd like traffic from, of course. Robotstxt.org has a partial list.

There are more than 3 major search engines depending on which country you are talking. Facebook seem to be doing a good job listing only legitimate ones: https://facebook.com/robots.txt

So your robots.txt can be something like:

User-agent: Applebot
Allow: /

User-agent: baiduspider
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Facebot
Allow: /

User-agent: Googlebot
Allow: /

User-agent: msnbot
Allow: /

User-agent: Naverbot
Allow: /

User-agent: seznambot
Allow: /

User-agent: Slurp
Allow: /

User-agent: teoma
Allow: /

User-agent: Twitterbot
Allow: /

User-agent: Yandex
Allow: /

User-agent: Yeti
Allow: /

User-agent: *
Disallow: /

As everyone know, the robots.txt is a standard to be obeyed by the crawler and hence only well-behaved agents do so. So, putting it or not doesn't matter.

If you have some data, that you do not show on the site as well, you can just change the permission and improve the security.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top