Why can't I crawl bog-standard HTML internet sites?

https://sharepoint.stackexchange.com/questions/13799

16-10-2019

Question

This comes up in the crawl logs:

Access is denied. Verify that either the Default Content Access Account has access to this repository, or add a crawl rule to crawl this repository. If the repository being crawled is a SharePoint repository, verify that the account you are using has "Full Read" permissions on the SharePoint Web Application being crawled.

Now, this happens for a bunch of sites that have only one thing in common: they're static HTML.

This isn't a loopback problem; it happens no matter what site I point it at. It shouldn't be a content-access account problem because, hey, these are public-facing sites. What's going on?

Solution

HA! Nailed it.

I had a look at the logs of an IIS box in our DMZ that was showing the same activity. It turns out that in IIS, anonymous access was turned on (of course), but so was Windows Integrated Authentication. So the spider is trying to use its credentials (which are no good on this machine, as it's not on the domain) instead of requesting the pages anonymously. If I turn off integrated authentication, it indexes OK. So I guess I just change the content access account or some such for that content source.
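If you can't get at the IIS logs of the target site, a rough way to look at this from the outside is to request a page without any credentials and see what comes back. The sketch below is only an illustration, not part of SharePoint: the function names (`auth_schemes`, `offers_integrated_auth`, `probe_anonymously`) and the `crawl-probe` user agent are made up for this example, and the header parsing is deliberately simple (it reads only the first scheme in each `WWW-Authenticate` value).

```python
import urllib.error
import urllib.request

# Schemes used by Windows Integrated Authentication.
INTEGRATED_SCHEMES = {"negotiate", "ntlm"}


def auth_schemes(challenges):
    """Return the lowercase scheme name that starts each WWW-Authenticate value."""
    schemes = set()
    for value in challenges:
        parts = value.split()
        if parts:
            schemes.add(parts[0].lower())
    return schemes


def offers_integrated_auth(challenges):
    """True if any challenge asks for Negotiate or NTLM."""
    return bool(auth_schemes(challenges) & INTEGRATED_SCHEMES)


def probe_anonymously(url, timeout=10):
    """GET url without credentials; return (status, WWW-Authenticate values)."""
    # "crawl-probe" is an arbitrary user agent chosen for this sketch.
    request = urllib.request.Request(url, headers={"User-Agent": "crawl-probe"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            challenges = response.headers.get_all("WWW-Authenticate") or []
            return response.status, challenges
    except urllib.error.HTTPError as err:
        # 401/403 responses arrive here; their headers are still readable.
        challenges = err.headers.get_all("WWW-Authenticate") or []
        return err.code, challenges
```

Calling `probe_anonymously()` on one of the failing URLs and passing the returned challenges to `offers_integrated_auth()` gives a hint. A 200 with no challenges suggests the pages really are readable anonymously, so the denial is more likely happening when the crawler presents its own account. A 401 offering Negotiate or NTLM means the server is asking for Windows credentials outright.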

OTHER TIPS

The service account used by your crawler may not have the proxy settings it needs configured.

Have a look in your ULS or Windows Application Event Logs for details.

Licensed under: CC-BY-SA with attribution
