Question

I'm planning to deploy a Google Search Appliance to remotely index an intranet site (transcontinentally). So I will be using the company's network and potentially consuming too much bandwidth. Regarding the configurations that I can use to mitigate the effect of the initial crawl (which is the only one that is perceived as dangerous for the network) we have:

  • Crawl and Index > Host Load Schedule
    • Web Server Host Load: basically the number of concurrent connections to the crawled servers within 1 minute, so minimizing this setting should reduce the load.
    • Exceptions to Web Server Host Load: this is a schedule used for either increasing or decreasing the number of concurrent connections to the crawled server.
  • Crawl and Index > Crawl Schedule
    • Instead of a continuous crawl I should choose a Scheduled crawl.

Am I on the right track and can other settings be configured in order not to generate excessive network traffic between the GSA and the Web servers?


Solution

The best way to minimize the crawling of a remote site is to not crawl it. Failing that, there are a few settings that will help, as noted above:

1) Host Load Schedule

This sets the number of concurrent crawler threads for the host. Note that this can be a decimal value (e.g. 2.5), including a number below 1 (also noted by BigMikeW).

2) Freshness Tuning

Crawl Infrequently actually means "crawl never again". This works well in conjunction with a meta-URL feed that tells the GSA to recrawl a page, or with a recrawl request from the administrative console.

Crawl Frequently actually means "crawl once per day". This setting doesn't really mean much now that the crawler has been retuned and the hardware is faster: the GSA will submit requests throughout the day to the pages it finds.

3) Crawl schedule

I find that it's better not to turn off the crawler, but rather to keep it in continuous mode and set the host load to zero. This allows the natural GSA algorithms to play out. Anything you wish to achieve by scheduling can be achieved by setting the host load to zero for the periods you want the crawler quiet.

My recommendation for minimizing WAN traffic:

1) Review DNS and add an override if necessary to ensure you are routing to the nearest content source.
2) Set the content source's pattern to crawl infrequently.
3) Create a meta-URL feed to push content updates.

The last one would take a bit of coding. There is an example sitemap feeder here: https://code.google.com/p/gsafeedmanager/

With this configuration, the GSA will never recrawl the content and will rely on the feed to inform it of updates.
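To give an idea of what the feed might look like, here is a minimal Python sketch that builds a metadata-and-url feed document for a list of changed URLs. The element and attribute names follow the GSA Feeds Protocol as I remember it, so check them against the Feeds Protocol guide for your appliance version. The function name, the datasource name and the text/html MIME type are my own assumptions, and posting the XML to the appliance is left out.

```python
import xml.etree.ElementTree as ET

# ElementTree cannot emit a DOCTYPE, so the prolog is prepended by hand.
FEED_PROLOG = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">\n'
)


def build_metadata_url_feed(datasource, urls):
    """Return feed XML listing the URLs the appliance should (re)crawl.

    `datasource` is a hypothetical feed name chosen by the caller.
    """
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(root, "group")
    for url in urls:
        # Assumption: every page is HTML; adjust the mimetype per document.
        ET.SubElement(group, "record", {"url": url, "mimetype": "text/html"})
    return FEED_PROLOG + ET.tostring(root, encoding="unicode")
```

You would call it with whatever your CMS or change log reports as modified, e.g. build_metadata_url_feed("intranet_updates", changed_urls), and send the result to the appliance's feed endpoint.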

Alternatively, ensure the content source responds to HEAD requests with a Last-Modified header, and do not configure crawl infrequently. The GSA will detect which pages have changed and slow the crawl down over time.
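Before relying on this, it is worth checking that the web servers really send a usable Last-Modified date. A rough Python sketch for spot-checking a URL follows; the function names are mine, and it only uses the standard library.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def last_modified_from_headers(headers):
    """Return the Last-Modified header as a datetime, or None if missing or unparsable."""
    value = headers.get("Last-Modified")
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


def head_last_modified(url, timeout=10):
    """Send a HEAD request to `url` and return its Last-Modified date, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return last_modified_from_headers(response.headers)
```

If head_last_modified returns None for most of your sample pages, the crawler has nothing to compare against and this approach probably won't save much traffic.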

OTHER TIPS

Yes, I would also look at the Freshness Tuning and Duplicate Hosts.

  • Host Load Schedule

    • Web Server Host Load
    • Exceptions to Web Server Host Load
  • Crawl Schedule

    • Crawl Mode
  • Freshness Tuning

    • Crawl Frequently
    • Crawl Infrequently

As Tan Hong Tat says, look at Freshness Tuning and Duplicate Hosts. I would set it to crawl infrequently at least until the initial crawl has completed.

Also do some content analysis. Using the Crawl patterns you can direct the GSA to ignore certain content types (based on file extension) or areas of the intranet that don't contain content of value to the search experience.
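For the content analysis, something as simple as tallying file extensions over a list of known URLs (from a web server log or a site export, for example) can show which types are worth excluding. A small Python sketch, where the function name and the input list are assumptions of mine:

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def extension_counts(urls):
    """Count URLs per lower-cased file extension; extensionless paths go under "(none)"."""
    counts = Counter()
    for url in urls:
        path = urlparse(url).path  # drops the query string and fragment
        suffix = PurePosixPath(path).suffix.lower()
        counts[suffix or "(none)"] += 1
    return counts
```

Sorting the result with extension_counts(urls).most_common() gives a quick view of which extensions dominate, and those are the candidates for the exclusion patterns.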

When you're setting the host load, remember that you can use decimal values between 0 and 1, e.g. 0.1.

If they have a decent WAN optimizer in place you may find this is less of an issue than you think.

Licensed under: CC-BY-SA with attribution