Googlebot not respecting robots.txt [closed]

https://stackoverflow.com/questions/463569

19-08-2019

Question

For some reason, when I use the "Analyze robots.txt" feature in Google Webmaster Tools to check which URLs are blocked by our robots.txt file, the results are not what I expect. Here is a snippet from the beginning of our file:

Sitemap: http://[omitted]/sitemap_index.xml

User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: http://[omitted]/Living/books/book-review-not-stupid.aspx
Disallow: http://[omitted]/Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: http://[omitted]/Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx

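Anything in the scripts folder is correctly blocked for both Googlebot and Mediapartners-Google. I can tell that the two robots are seeing the correct directives, because the tool reports that Googlebot is blocked from the scripts by line 7 while Mediapartners-Google is blocked by line 4. And yet none of the other disallowed URLs listed under the second User-agent directive are blocked!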

I'm wondering whether my comment or the use of absolute URLs is causing the problem...

Any insight is appreciated. Thanks.

Solution

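They are ignored because the Disallow entries in your robots.txt contain fully qualified URLs, which the specification does not allow. A Disallow value must be a path that starts with / and is resolved against the root of the site serving the file, not a URL with a scheme and host. Try the following: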

# Unlike Disallow, the Sitemap line takes a fully qualified URL
Sitemap: http://[omitted]/sitemap_index.xml

User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: /Living/books/book-review-not-stupid.aspx
Disallow: /Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: /Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
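If the article list arrives from the Content group as full URLs, the conversion above can be scripted. The following is a minimal sketch rather than a complete robots.txt tool: the function name `to_path_rule` and the host `example.com` (standing in for the omitted domain) are assumptions made for illustration. It rewrites only Allow/Disallow lines and leaves everything else, including the Sitemap line, untouched.

```python
from urllib.parse import urlsplit


def to_path_rule(line: str) -> str:
    """Rewrite 'Disallow: http://host/path' as 'Disallow: /path'.

    Lines that are not Allow/Disallow rules, or that already hold a
    path, are returned unchanged.
    """
    field, sep, value = line.partition(":")
    if not sep or field.strip().lower() not in ("allow", "disallow"):
        return line
    parts = urlsplit(value.strip())
    if not (parts.scheme and parts.netloc):
        return line  # already a path, or an empty value
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{field.strip()}: {path}"


# example.com is a placeholder for the omitted host.
original = [
    "Sitemap: http://example.com/sitemap_index.xml",
    "User-agent: *",
    "Disallow: /scripts",
    "# list of articles given by the Content group",
    "Disallow: http://example.com/Living/books/book-review-not-stupid.aspx",
]
fixed = [to_path_rule(line) for line in original]
print("\n".join(fixed))
```

Only the last line of the sample changes; the comment, the Sitemap line, and the rule that was already a path pass through as they were.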

As for caching, Google tries to fetch a fresh copy of the robots.txt file roughly every 24 hours on average.

Other answers

The absolute URLs are the problem. robots.txt is only supposed to contain relative URIs; the domain is inferred from the domain the robots.txt file was fetched from.
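One way to sanity-check path-based rules locally is Python's standard `urllib.robotparser`. This is only a sketch: Python's parser is a stand-in and is not guaranteed to match Googlebot's behaviour in every detail, and `example.com` is a placeholder for the omitted host. It does show the point made above: rules are written as paths, and the crawler compares them against the path of whatever full URL it is about to fetch.

```python
from urllib.robotparser import RobotFileParser

# Path-based rules, as in the corrected file (host is a placeholder).
ROBOTS_TXT = """
User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: /Living/books/book-review-not-stupid.aspx
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "http://example.com"
# The listed article is blocked for crawlers covered by the '*' group.
article_allowed = parser.can_fetch(
    "Googlebot", base + "/Living/books/book-review-not-stupid.aspx"
)
# A page that is not listed stays crawlable.
other_allowed = parser.can_fetch("Googlebot", base + "/Living/books/another-page.aspx")
# The dedicated group still applies to Mediapartners-Google.
scripts_allowed = parser.can_fetch("Mediapartners-Google", base + "/scripts/app.js")

print(article_allowed, other_allowed, scripts_allowed)
```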

The file has been live for at least a week, and Google says it was last downloaded 3 hours ago, so I'm sure it is seeing the current version.

Did you make this change to your robots.txt file recently? In my experience, Google seems to cache that file for a very long time.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow