I'm using robots.txt to exclude some pages from spiders:

User-agent: * 
Disallow: /track.php

When I search for something related to this page, Google says: "A description for this result is not available because of this site's robots.txt – learn more."

That means the robots.txt is working, but why is the link to the page still found by the spider? I'd like there to be no link to the 'track.php' page at all. How should I set up robots.txt (or something else, like .htaccess)?


Solution

Here's what happened:

  • Googlebot saw, on some other page, a link to track.php. Let's call that page "source.html".
  • Googlebot tried to visit your track.php file.
  • Your robots.txt told Googlebot not to read the file.

So Google knows that source.html links to track.php, but it doesn't know what track.php contains. You didn't tell Google not to index track.php; you told Googlebot not to read and index the data inside track.php.
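
To see the "not allowed to read" half of this for yourself, here is a small sketch using Python's standard urllib.robotparser with the same two rules from the question. The example.com URLs and the can_crawl helper are made up for illustration, and this only models the crawl permission check, not what Google decides to index:

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the question's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /track.php
"""


def can_crawl(url, user_agent="Googlebot"):
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder host, not the asker's site.
print(can_crawl("https://example.com/track.php"))    # False: crawling is blocked
print(can_crawl("https://example.com/source.html"))  # True: the linking page is readable
```

The crawler may read source.html and so learns that the track.php URL exists, but it may not fetch track.php itself. That is exactly the state that produces a result with a URL and no description.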

As Google's documentation says:

While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.

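There's not a lot you can do about this with robots.txt alone. For your own pages, you can use the X-Robots-Tag HTTP header or the noindex meta tag as described in that documentation. That will prevent Googlebot from indexing the URL if it finds a link in your pages. Note that Googlebot can only see either directive on a page it is allowed to crawl, so it has no effect on a URL that robots.txt still blocks. And if some page that you don't control links to that blocked track.php file, then Google is quite likely to index the URL.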
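
As a rough sketch of what sending that header looks like, here is a tiny WSGI app in Python. The app and headers_for names, the exact path check and the placeholder body are all assumptions for illustration; on a PHP site you would more likely set the same header from the script itself or from the server configuration.

```python
def app(environ, start_response):
    """Minimal WSGI app that marks /track.php as noindex via a response header."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if path == "/track.php":
        # A crawler only sees this header if it is allowed to request the URL,
        # i.e. if robots.txt does not block it.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"ok\n"]  # placeholder body


def headers_for(path):
    """Call the app directly (no server needed) and return its response headers."""
    captured = {}

    def start_response(status, headers):
        captured.update(headers)

    app({"PATH_INFO": path}, start_response)
    return captured
```

Calling headers_for("/track.php") returns a dict that includes the X-Robots-Tag entry, while any other path gets only the Content-Type header.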

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow