Question

http://www.site.com/shop/maxi-dress?colourId=94&optId=694
http://www.site.com/shop/maxi-dress?colourId=94&optId=694&product_type=sale

I have thousands of URLs like the above, with different combinations and names. I also have duplicates of these URLs which carry the query string parameter product_type=sale

I want to stop Google from indexing anything with product_type=sale

Is this possible in robots.txt?


Solution

Google supports wildcards in robots.txt. The following directive will prevent Googlebot from crawling any page whose URL contains query parameters:

Disallow: /*?
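
In a complete robots.txt file the directive sits inside a user-agent group; a minimal sketch that applies the rule only to Googlebot could look something like this:

User-agent: Googlebot
Disallow: /*?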

This won't prevent many other spiders from crawling these URLs, because wildcards are not part of the original robots.txt standard.

Google may take its time to remove blocked URLs from the search index; the extra URLs may remain indexed for months. You can speed the process up by using the "Remove URLs" feature in Webmaster Tools after the URLs have been blocked, but that is a manual process in which you have to paste in each individual URL you want removed.

Using this robots.txt rule may also hurt your site's Google rankings if Googlebot doesn't find the version of the URL without parameters. If you commonly link to the versions with parameters, you probably don't want to block them in robots.txt; it would be better to use one of the other options below.


A better option is to add a rel="canonical" link tag to each of your pages.

So both your example URLs would have the following in the head section:

<link rel="canonical" href="http://www.site.com/shop/maxi-dress">

That tells Googlebot not to index all the variations of the page, but only the "canonical" version of the URL that you choose. Unlike blocking with robots.txt, Googlebot will still be able to crawl all your pages and assign value to them, even when they use a variety of URL parameters.
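
The same idea applies to every product page: each one points at its own parameter-free URL. For a hypothetical second product page at http://www.site.com/shop/midi-dress, all of its parameterised variants would carry:

<link rel="canonical" href="http://www.site.com/shop/midi-dress">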


Another option is to log into Google Webmaster Tools and use the "URL Parameters" feature in the "Crawl" section.

Once there, click on "Add parameter". You can set "product_type" to "Does not affect page content" so that Google doesn't crawl and index pages with that parameter.

[Screenshot: the "URL Parameters" settings in Google Webmaster Tools]

Do the same for each parameter you use that doesn't change the page content.

OTHER TIPS

Yes, this is quite straightforward to do. Add the following line to your robots.txt file:

Disallow: /*product_type=sale

The leading wildcard (*) means that any URL containing product_type=sale will no longer be crawled by Google.
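
If your robots.txt doesn't already contain a user-agent group, the rule needs to live inside one; a minimal sketch could look something like this:

User-agent: *
# the wildcard is honoured by Googlebot but may be ignored by other crawlers
Disallow: /*product_type=sale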

They may still stay in Google's index if they were there previously, but Google will no longer crawl them, and when they show up in a Google search the snippet will say: "A description for this result is not available because of this site's robots.txt – learn more."

Further reading here: Robots.txt Specifications

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow