Question

I have URLs like these: /products/:product_id/deals/new and /products/:product_id/deals/index

I'd like to disallow the "deals" folder in my robots.txt file.

[Edit] I'd like to disallow this folder for the Google, Yahoo and Bing bots. Does anyone know whether these bots support the wildcard character, and so would honor the following rule?

Disallow: /products/*/deals

Also, do you know of a really good tutorial on robots.txt rules? I haven't managed to find one, and I could use it.

And one last question: is robots.txt the best way to handle this, or would I be better off using the "noindex" meta tag?

Thank you all! :)

Solution

Yes, all the major search engines support the basic * wildcard, and your rule will work to disallow your deals directory.
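To make the wildcard behavior concrete, here is a minimal sketch of how such a rule could be matched against a URL path. It assumes the commonly documented semantics (rules are prefix matches, * matches any run of characters, a trailing $ anchors the end of the path). The helper names rule_to_regex and is_disallowed are hypothetical, and this is not how any particular crawler is implemented. Note that Python's built-in urllib.robotparser does not apply wildcards, which is why the sketch does its own matching.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule into a compiled regex (illustrative only)."""
    # A trailing '$' means the rule must match up to the end of the path.
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Each '*' becomes '.*'; every other character is matched literally.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(path: str, rule: str) -> bool:
    """Return True if the path falls under the given Disallow rule."""
    # re.match anchors at the start of the path, which gives prefix semantics.
    return rule_to_regex(rule).match(path) is not None


if __name__ == "__main__":
    rule = "/products/*/deals"
    for path in (
        "/products/42/deals/new",
        "/products/42/deals/index",
        "/products/42/reviews",
    ):
        print(path, is_disallowed(path, rule))
```

Under these assumptions the two deals URLs from the question match the rule and the reviews URL does not.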

The best place to learn about robots.txt is really the Google Developer page. It provides plenty of examples of what works and what doesn't. For instance, many people don't know that robots.txt files are protocol-specific. So if you want to block pages served over HTTPS, you'll need to make sure you have a robots.txt at https://yoursite.com/robots.txt
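As a small illustration of the protocol point, the sketch below derives which robots.txt file governs a given page URL by keeping the scheme and host and replacing the path. The function name robots_url is hypothetical; yoursite.com is the placeholder domain used above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("https://yoursite.com/products/1/deals/new"))
    print(robots_url("http://yoursite.com/products/1/deals/new"))
```

The two calls yield different URLs, one per protocol, so both files need to carry the Disallow rule if the site is reachable over both.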

You can also test a new robots.txt file through Google Webmaster Tools before applying it. That lets you verify with the search engine whether it will actually work before you deploy it.

With regard to blocking something with robots.txt versus just adding a noindex to the pages, I'm more inclined to use the noindex in most scenarios, unless I know I don't want the search engines crawling that section of my site at all.

There are some trade-offs. When you block the search engine altogether, you can save some of your "crawl budget": the search engines will crawl other pages rather than "waste" their time on pages you don't want them to visit. However, those URLs can still appear in the search results.

If you absolutely don't want any search referral traffic to those pages, it's better to use the noindex directive. Additionally, if you link to the deals pages often, a noindex not only removes them from the search results but still lets any link value / PageRank flow through those pages and be calculated accordingly. If you block them from being crawled, they become something of a black hole.
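One way the noindex approach could look in practice is sketched below. Instead of editing each page's meta tag, it returns an X-Robots-Tag: noindex response header for deals URLs, which search engines that support the header treat like the noindex meta tag. The function name extra_headers and the path pattern are assumptions made for illustration; wiring it into your framework's response handling is left out. Keep in mind that a crawler has to be able to fetch the page to see the directive, so these URLs must not also be disallowed in robots.txt.

```python
import re

# Assumed pattern for the deals URLs from the question:
# /products/<product_id>/deals and anything beneath it.
DEALS_PATH = re.compile(r"^/products/[^/]+/deals(/|$)")


def extra_headers(path: str) -> dict:
    """Return extra response headers for a request path (illustrative only)."""
    if DEALS_PATH.match(path):
        # Ask search engines not to index this page; links on it can still be followed.
        return {"X-Robots-Tag": "noindex"}
    return {}


if __name__ == "__main__":
    print(extra_headers("/products/42/deals/new"))
    print(extra_headers("/products/42"))
```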

OTHER TIPS

If you are not sure whether the syntax in your robots.txt is correct, you can test it at https://www.google.com/webmasters to see if there are any errors. Additionally, you can enter a page URL and the tool will tell you whether, according to your robots.txt, it should be blocked.

Licensed under: CC-BY-SA with attribution