I have configured multiple hostnames on our shared hosting account, which hosts an MVC4 website. I did this so static resources load from several hostnames, gaining some speed through parallel requests. All of these hostnames are mapped to the same site/application in IIS, and we then changed the URLs of static resources to load them from these hostnames. Essentially it behaves like a CDN (we're not actually using a CDN, just loading resources in parallel), as in the example below.
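
To illustrate, pages that used to reference everything on the main domain now reference the extra hostnames, something like this (the hostnames here are placeholders):

<!-- Before: everything loads from the main domain -->
<link rel="stylesheet" href="http://www.example.com/Content/site.css" />
<script src="http://www.example.com/Scripts/site.js"></script>

<!-- After: static resources are spread across extra hostnames that map to the same IIS site -->
<link rel="stylesheet" href="http://static1.example.com/Content/site.css" />
<script src="http://static2.example.com/Scripts/site.js"></script>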

However, I want to block search engines and other crawlers from accessing these additional hostnames/subdomains; otherwise they will appear in search listings.

I thought of adding a robots.txt, but these domains use the same application, so it already has the robots.txt that is there for my main domain.

Any idea on how to prevent crawlers from crawling these additional hostnames?


Solution

Add the rule below to your web.config, under the <system.webServer> node.

<rewrite>
  <rules>
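    <!-- Serve a separate robots.txt when the request arrives on the CDN hostname -->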
    <rule name="Imported Rule 1" stopProcessing="true">
      <match url="^robots\.txt$" ignoreCase="false" />
      <conditions>
        <add input="{HTTP_HOST}" pattern="^cdn\.yourdomain\.com$" />
      </conditions>
      <action type="Rewrite" url="/cdn.robots.txt" />
    </rule>
  </rules>
</rewrite>
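
The rule above only rewrites the request; you still need to create a cdn.robots.txt file in the application root. A minimal version that disallows all crawling on the CDN hostname would be the standard disallow-everything file (the file name matches the rewrite target above):

User-agent: *
Disallow: /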

Other tips

In Google Webmaster Tools you can set preferences for "canonicalization". This is the terminology used to describe duplicate content that has a preferred source (more precisely, it refers to the preferred source itself). Google discusses its policies on duplicate content and canonicalization in the answers section of Webmaster Tools.

To summarise the page, the simplest/best approach is to set a "preferred domain" in your Webmaster Tools site settings and to add link elements with rel="canonical" to your duplicate pages, indicating your preferred source for SEO purposes.

If you want http://www.example.com/dresses/greendress.html to be the canonical URL for your listing, you can indicate this to search engines by adding a <link> element with the attribute rel="canonical" to the <head> section of the non-canonical pages. To do this, create a link as follows:

<link rel="canonical" href="http://www.example.com/dresses/greendress.html">

Canonical links are not specific to Google. They are defined in RFC 6596 and have also been supported by Yahoo and Bing since 2009.

In regard to the link relation type, "canonical" can be described informally as the author's preferred version of a resource. More formally, the canonical link relation specifies the preferred IRI from a set of resources that return the context IRI's content in duplicated form. Once specified, applications such as search engines can focus processing on the canonical, and references to the context (referring) IRI can be updated to reference the target (canonical) IRI.
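
Because "canonical" is a registered link relation, RFC 6596 also allows it to be conveyed as an HTTP Link header, which is useful for resources such as images or stylesheets that cannot carry a <link> element; whether a particular crawler honours the header form is up to that crawler, though Google documents support for it. A sketch, using the same example URL as above:

Link: <http://www.example.com/dresses/greendress.html>; rel="canonical"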

Setting up canonical links does not prevent search engines from crawling your duplicate pages, but it should ensure your page rank and search links are correctly assigned (which is really the important part). In theory, Googlebot and other crawlers should eventually figure out which base URL holds the real content, and they shouldn't crawl your duplicate content as often or as intensely as your "primary" pages.

To avoid this issue, it is recommended to serve static content from a single subdomain and point all your CDN-style resource URLs at that subdomain. Then block that subdomain using a robots.txt file (as in the cdn.robots.txt example above) or using Google Webmaster Tools.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow