I have configured multiple hostnames on our shared hosting account, which hosts an MVC4 website. I did this so static resources load from several hostnames, gaining some speed through parallel requests. All of these hostnames are mapped to the same site/application in IIS, and we then changed the URLs of static resources to load them from these hostnames. Essentially it behaves like a CDN (we're not actually using a CDN, just loading resources in parallel), as in the example below.
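
To illustrate, pages that used to reference everything on the main domain now reference the extra hostnames, something like this (the hostnames here are placeholders):

<!-- Before: everything loads from the main domain -->
<link rel="stylesheet" href="http://www.example.com/Content/site.css" />
<script src="http://www.example.com/Scripts/site.js"></script>

<!-- After: static resources are spread across extra hostnames that map to the same IIS site -->
<link rel="stylesheet" href="http://static1.example.com/Content/site.css" />
<script src="http://static2.example.com/Scripts/site.js"></script>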

However, I want to block search engines and other crawlers from accessing these additional hostnames/subdomains; otherwise they will appear in search listings.

I thought of adding a robots.txt, but these domains use the same application, so it already has the robots.txt that is there for my main domain.

Any idea on how to prevent crawlers from crawling these additional hostnames?


Solution

Add the rule below to your web.config, under the <system.webServer> node.

<rewrite>
  <rules>
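    <!-- Serve a separate robots.txt when the request arrives on the CDN hostname -->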
    <rule name="Imported Rule 1" stopProcessing="true">
      <match url="^robots\.txt$" ignoreCase="false" />
      <conditions>
        <add input="{HTTP_HOST}" pattern="^cdn\.yourdomain\.com$" />
      </conditions>
      <action type="Rewrite" url="/cdn.robots.txt" />
    </rule>
  </rules>
</rewrite>
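
The rule above only rewrites the request; you still need to create a cdn.robots.txt file in the application root. A minimal version that disallows all crawling on the CDN hostname would be the standard disallow-everything file (the file name matches the rewrite target above):

User-agent: *
Disallow: /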

Other tips

In Google Webmaster Tools you can set preferences for "canonicalization". This is the terminology used to describe duplicate content that has a preferred source (more precisely, it refers to the preferred source itself). Google discusses its policies on duplicate content and canonicalization in the answers section of Webmaster Tools.

To summarise the page, the simplest/best approach is to set a "preferred domain" in your Webmaster Tools site settings and to add link elements with rel="canonical" to your duplicate pages, indicating your preferred source for SEO purposes.

If you want http://www.example.com/dresses/greendress.html to be the canonical URL for your listing, you can indicate this to search engines by adding a <link> element with the attribute rel="canonical" to the <head> section of the non-canonical pages. To do this, create a link as follows:

<link rel="canonical" href="http://www.example.com/dresses/greendress.html">

Canonical links are not specific to Google. They are defined in RFC 6596 and have also been supported by Yahoo and Bing since 2009.

In regard to the link relation type, "canonical" can be described informally as the author's preferred version of a resource. More formally, the canonical link relation specifies the preferred IRI from a set of resources that return the context IRI's content in duplicated form. Once specified, applications such as search engines can focus processing on the canonical, and references to the context (referring) IRI can be updated to reference the target (canonical) IRI.
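
Because "canonical" is a registered link relation, RFC 6596 also allows it to be conveyed as an HTTP Link header, which is useful for resources such as images or stylesheets that cannot carry a <link> element; whether a particular crawler honours the header form is up to that crawler, though Google documents support for it. A sketch, using the same example URL as above:

Link: <http://www.example.com/dresses/greendress.html>; rel="canonical"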

Setting up canonical links does not prevent search engines from crawling your duplicate pages, but it should ensure your page rank and search links are correctly assigned (which is really the important part). In theory, Googlebot and other crawlers should eventually figure out which base URL holds the real content, and they shouldn't crawl your duplicate content as often or as intensely as your "primary" pages.

To avoid this issue, it is recommended to serve static content from a single subdomain and point all your CDN-style resource URLs at that subdomain. Then block that subdomain using a robots.txt file (as in the cdn.robots.txt example above) or using Google Webmaster Tools.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow