Optimizing Google Search Appliance on a remote server

Question 1

The best way to minimize the crawling of a remote site is to not crawl it. Failing that, there are a few settings that will help, as noted above:

1) Host Load Schedule

This sets the number of concurrent threads the crawler uses for the host. Note that this can be a decimal value (e.g. 2.5), including a value below 1 (e.g. 0.5), as also noted by BigMikeW.

2) Freshness Tuning

Crawl infrequently actually means "crawl never again". This works well in conjunction with a meta-url feed that tells the GSA to recrawl the page, or with a recrawl request from the administrative console. Crawl frequently actually means "crawl once per day". This setting doesn't really mean much now that the crawler has been retuned and the hardware is faster: the GSA will request the pages it finds several times within a day anyway.

3) Crawl schedule

I find that it's better not to turn off the crawler, but rather to keep it in continuous mode and set the load threshold to zero. This allows the natural GSA algorithms to play out. Anything you wish to achieve by scheduling can be achieved by setting the load to zero for the periods when you want the crawler quiet.

My recommendation for minimizing WAN traffic:

1) Review DNS and add an override if necessary to ensure you are routing to the nearest content source.
2) Set the content source's patterns to crawl infrequently.
3) Create a meta-url feed to push content updates.

The last one would take a bit of coding. There is an example sitemap feeder here: https://code.google.com/p/gsafeedmanager/

With this configuration, the GSA will never recrawl the content and will rely on the feed to inform it of updates.
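
To give an idea of the coding involved, here is a minimal sketch that builds a meta-url feed (feedtype metadata-and-url) for a list of changed pages. The element and attribute names follow the GSA Feeds Protocol as I understand it, so check them against the Feeds Protocol Developer's Guide for your software version. The function name, the datasource name and the example URL are made up for illustration, and pushing the XML to the appliance is not shown.

```python
import xml.etree.ElementTree as ET

# DOCTYPE line used by GSA feed files (assumed from the Feeds Protocol guide).
DOCTYPE = '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">'


def build_metadata_url_feed(datasource, urls):
    """Return feed XML that lists the URLs the GSA should (re)crawl."""
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(root, "group")
    for url in urls:
        # One record per changed page; the GSA fetches the content itself.
        ET.SubElement(group, "record", url=url, mimetype="text/html")
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + DOCTYPE + "\n" + body


if __name__ == "__main__":
    # Hypothetical datasource name and URL.
    print(build_metadata_url_feed(
        "intranet_updates",
        ["http://intranet.example.com/news/page1.html"],
    ))
```

Because only the URLs travel in the feed, the WAN cost is limited to the pages that actually changed.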

Alternatively: ensure the content source responds to HEAD requests with a Last-Modified date, and do not configure crawl infrequently. The GSA will detect deltas and slow the crawl down over time.
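
A quick way to verify that the content source behaves this way is to send it a HEAD request yourself and look for the Last-Modified header. This is only a sketch using the Python standard library. The function names are my own, and the URL you pass in would be one of your own pages.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def parse_last_modified(headers):
    """Return the Last-Modified header as a datetime, or None if absent or unparseable."""
    value = headers.get("Last-Modified")
    if value is None:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


def head_last_modified(url, timeout=10):
    """Send a HEAD request to url and return its Last-Modified date, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_last_modified(response.headers)
```

If `head_last_modified` returns None for typical pages, the server is not giving the crawler anything to compare against.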

Question 2

Yes, I would also look at the Freshness Tuning and Duplicate Hosts.

Host Load Schedule
- Web Server Host Load
- Exceptions to Web Server Host Load
Crawl Schedule
- Crawl Mode
Freshness Tuning
- Crawl Frequently
- Crawl Infrequently

Question 3

As Tan Hong Tat says, look at Freshness Tuning and Duplicate Hosts. I would set it to crawl infrequently at least until the initial crawl has completed.

Also do some content analysis. Using the Crawl patterns you can direct the GSA to ignore certain content types (based on file extension) or areas of the intranet that don't contain content of value to the search experience.
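
As a starting point for that analysis, you could tally URLs by file extension to see which content types dominate and are candidates for exclusion. A rough sketch, assuming you can get a list of URLs from somewhere (a web server log or a site export, for example). The function name is made up.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def count_extensions(urls):
    """Count URLs per lower-cased file extension, ignoring query strings."""
    counts = Counter()
    for url in urls:
        suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
        counts[suffix or "(none)"] += 1
    return counts
```

Calling `count_extensions(urls).most_common(10)` gives the ten most frequent extensions to review.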

When you're setting the host load, remember that you can use decimal values between 0 and 1, e.g. 0.1.

If they have a decent WAN optimizer in place you may find this is less of an issue than you think.