Best way to crawl a page with multiple redirects

Question

I should think this site is very crawlable. To understand what is going on, turn off JavaScript in your browser and try to browse the site (to do this, I use the Disable->Disable JavaScript menu in Firebug, which is a Firefox plugin).

If you go to your first link and paste in your string, the form is submitted as a POST request, and the response is a page that effectively says your search is in progress. It will look something like this:

Job Title: Protein Sequence (333 letters)

Request ID: NR8ZP8E1071

Since there is not much of interest on this screen, I am assuming that you do not want to scrape from here - but that is effectively what you are currently doing.

What happens next is that a piece of JavaScript submits a hidden form, using this code:

<SCRIPT LANGUAGE="JavaScript">
setTimeout('document.forms[0].submit();',1000);
</SCRIPT>

My guess is that at times of heavy load, the delay here (presently set to 1000ms i.e. 1 second) would increase a bit. The hidden form looks like this:

<form action="Blast.cgi" enctype="application/x-www-form-urlencoded" method="post" name="RequestFormat" id="RequestFormat&quot;">               
<input name="CMD" value="Get" type="hidden">
<input name="ALIGNMENTS" value="100" type="hidden">
<input name="ALIGNMENT_VIEW" value="Pairwise" type="hidden">
<input name="BLAST_PROGRAMS" value="blastp" type="hidden">
<input name="CDD_RID" value="data_cache_seq:180192" type="hidden">
<input name="CDD_SEARCH" value="on" type="hidden">
<input name="CDD_SEARCH_STATE" value="4" type="hidden">
<input name="CLIENT" value="web" type="hidden">
<input name="COMPOSITION_BASED_STATISTICS" value="2" type="hidden">
<input name="CONFIG_DESCR" value="2,3,4,5,6,7,8" type="hidden">
<input name="DATABASE" value="nr" type="hidden">
<input name="DESCRIPTIONS" value="100" type="hidden">
<input name="EQ_OP" value="AND" type="hidden">
<input name="EXPECT" value="10" type="hidden">
<input name="FILTER" value="F" type="hidden">
<input name="FORMAT_NUM_ORG" value="1" type="hidden">
<input name="FORMAT_OBJECT" value="Alignment" type="hidden">
<input name="FORMAT_TYPE" value="HTML" type="hidden">
<input name="FULL_DBNAME" value="nr" type="hidden">
<input name="GAPCOSTS" value="11 1" type="hidden">
<input name="GET_SEQUENCE" value="on" type="hidden">
<input name="HSP_RANGE_MAX" value="0" type="hidden">
<input name="JOB_TITLE" value="Protein Sequence (333 letters)" type="hidden">
<input name="LAYOUT" value="OneWindow" type="hidden">
<input name="LINE_LENGTH" value="60" type="hidden">
<input name="MASK_CHAR" value="2" type="hidden">
<input name="MASK_COLOR" value="1" type="hidden">
<input name="MATRIX_NAME" value="BLOSUM62" type="hidden">
<input name="MAX_NUM_SEQ" value="100" type="hidden">
<input name="MYNCBI_USER" value="9311188414" type="hidden">
<input name="NEW_VIEW" value="on" type="hidden">
<input name="NUM_DIFFS" value="0" type="hidden">
<input name="NUM_OPTS_DIFFS" value="0" type="hidden">
<input name="NUM_ORG" value="1" type="hidden">
<input name="NUM_OVERVIEW" value="100" type="hidden">
<input name="OLD_BLAST" value="false" type="hidden">
<input name="OLD_VIEW" value="false" type="hidden">
<input name="PAGE" value="Proteins" type="hidden">
<input name="PAGE_TYPE" value="BlastSearch" type="hidden">
<input name="PROGRAM" value="blastp" type="hidden">
<input name="QUERY_INDEX" value="0" type="hidden">
<input name="QUERY_INFO" value="Protein Sequence (333 letters)" type="hidden">
<input name="QUERY_LENGTH" value="333" type="hidden">
<input name="REPEATS" value="5755" type="hidden">
<input name="RID" value="NR8ZP8E1071" type="hidden">
<input name="RTOE" value="21" type="hidden">
<input name="SELECTED_PROG_TYPE" value="blastp" type="hidden">
<input name="SERVICE" value="plain" type="hidden">
<input name="SHORT_QUERY_ADJUST" value="on" type="hidden">
<input name="SHOW_LINKOUT" value="on" type="hidden">
<input name="SHOW_OVERVIEW" value="on" type="hidden">
<input name="USER_DEFAULT_MATRIX" value="4" type="hidden">
<input name="USER_DEFAULT_PROG_TYPE" value="blastp" type="hidden">
<input name="USER_TYPE" value="2" type="hidden">
<input name="WORD_SIZE" value="3" type="hidden">
<input name="db" value="protein" type="hidden">
<input name="stype" value="protein" type="hidden">
<input name="x" value="41" type="hidden">
<input name="y" value="12" type="hidden">
</form>

Submitting this form also sends a POST request to the program. The field of most interest is RID, which links the request with your initial query parameters. The query is probably stored in a database or temporary file and assigned this ID, which expires in a matter of hours.
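To resubmit that hidden form from a script, you first need its fields as name/value pairs. Here is a minimal sketch using only the Python standard library; the names `HiddenFieldParser` and `extract_hidden_fields` are my own, and it assumes the waiting page keeps using plain `<input type="hidden">` elements as shown above.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            # A missing value attribute is treated as an empty string.
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of every hidden input found in the given HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Calling `extract_hidden_fields(page)["RID"]` on the waiting page should then give you the request ID, and the whole dict can be sent back as the body of the second POST.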

When this form is submitted, lots of interesting information is provided, rendered in the response to the POST request that the form makes. It is possible that one of the above fields specifies the initial number of alignments to show. If you then turn JavaScript back on, you'll find that scrolling to the end of the page (which itself is several screenfuls already) will load another chunk using this program:

http://blast.ncbi.nlm.nih.gov/t2g.cgi?CMD=Get&RID=NR8ZP8E1071&OLD_BLAST=false&DESCRIPTIONS=0&NUM_OVERVIEW=0&GET_SEQUENCE=on&DYNAMIC_FORMAT=on&ALIGN_SEQ_LIST=gi|160797,gi|9816,gi|121273,gi|428230092,gi|417051&HSP_SORT=0&SEQ_LIST_START=1&QUERY_INDEX=0&SHOW_LINKOUT=on&ALIGNMENT_VIEW=Pairwise&MASK_CHAR=2&MASK_COLOR=1&LINE_LENGTH=60

Interestingly, a GET request is used here. Using the network monitor in Firefox, I triggered a series of these to see if I could spot a sequence of incrementing numbers. I spotted that SEQ_LIST_START starts at 1 and increments in blocks of 5, but I am not sure where the elements in ALIGN_SEQ_LIST come from - maybe from the current page. It's worth having a look yourself to see if you can spot anything - especially since you will understand the subject matter in a way that I do not.
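If you want to generate these follow-up URLs yourself, something like the following might do as a starting point. It simply rebuilds the query string from the example link; `build_chunk_url` and `chunk_starts` are hypothetical helper names, the 1, 6, 11 progression is my reading of "increments in blocks of 5", and you still have to supply the ALIGN_SEQ_LIST identifiers from wherever the page gets them.

```python
from urllib.parse import urlencode

# Taken from the example link above.
T2G_URL = "http://blast.ncbi.nlm.nih.gov/t2g.cgi"


def build_chunk_url(rid, seq_ids, start, base_url=T2G_URL):
    """Build the GET URL that asks for one more chunk of alignments."""
    params = {
        "CMD": "Get",
        "RID": rid,
        "OLD_BLAST": "false",
        "DESCRIPTIONS": "0",
        "NUM_OVERVIEW": "0",
        "GET_SEQUENCE": "on",
        "DYNAMIC_FORMAT": "on",
        "ALIGN_SEQ_LIST": ",".join(seq_ids),
        "HSP_SORT": "0",
        "SEQ_LIST_START": str(start),
        "QUERY_INDEX": "0",
        "SHOW_LINKOUT": "on",
        "ALIGNMENT_VIEW": "Pairwise",
        "MASK_CHAR": "2",
        "MASK_COLOR": "1",
        "LINE_LENGTH": "60",
    }
    # Keep "|" and "," unescaped so the URL matches the one the browser sends.
    return base_url + "?" + urlencode(params, safe="|,")


def chunk_starts(first=1, step=5, count=3):
    """Assumed SEQ_LIST_START values: 1, 6, 11, ... (blocks of 5)."""
    return [first + i * step for i in range(count)]
```

Compare the output against what the network monitor shows before relying on it, since the parameter set may differ between searches.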

You may be able to tinker around with some of the query string parameters in this link to see what controls the number of items returned. However, be careful: if you request a much larger set than their systems are used to, you may be noticed and have a block placed on your IP address.

Further to that, remember that if you crawl a website, you are passing your costs on to a third party. Since the data appears to be available for free, this will be acceptable to them to some degree, and serving it is part of what their funding pays for. However, be mindful of the load you are placing on their server: don't request chunks that are excessively large, and put a few seconds' delay between each request.

If you plan to grab an enormous chunk of data (say more than half a gigabyte), then alternate between a few seconds and a couple of minutes waiting, or perhaps concentrate your downloading during the night (their time) when their servers might be less busy. Failure to "act responsibly" as a crawler may place your IP range on their blocklists, and in the worst cases could constitute a denial of service attack.
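One way to build that delay into a script is a small throttle object that you call before every request. This is only a sketch: `PoliteThrottle` is a name I made up, and the 3 to 8 second default range is an arbitrary choice rather than anything the site publishes.

```python
import random
import time


class PoliteThrottle:
    """Enforce a randomised pause between consecutive requests."""

    def __init__(self, min_delay=3.0, max_delay=8.0,
                 sleep=time.sleep, clock=time.monotonic):
        if min_delay < 0 or max_delay < min_delay:
            raise ValueError("need 0 <= min_delay <= max_delay")
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._sleep = sleep      # injectable so the class can be tested
        self._clock = clock
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Sleep until a randomly chosen delay has passed since the last call."""
        delay = random.uniform(self.min_delay, self.max_delay)
        if self._last is not None:
            remaining = delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

Create one instance and call `throttle.wait()` immediately before each POST or GET. For a very large download you could construct it with a wider range, for example `PoliteThrottle(5, 120)`.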

So, to summarise, here's what you need to do:

1. Make the initial POST request that retrieves the form
2. Wait a few seconds
3. Grab the response (in particular the request ID) and resubmit that data in a new POST
4. Harvest the data from the screen
5. Use GET requests to the second program to get new data
6. Harvest the new data from the response
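The first four steps could be wired together roughly as follows. Treat this as a sketch under assumptions: `fetch_results`, `http_post` and the parameters `url` and `initial_fields` are my own names, the URL and the initial form fields are left for you to fill in from your first link, and the follow-up GET requests are omitted because I don't know where the ALIGN_SEQ_LIST values come from.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen


class _HiddenFields(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        hidden = (a.get("type") or "").lower() == "hidden"
        if tag == "input" and hidden and a.get("name"):
            self.fields[a["name"]] = a.get("value") or ""


def http_post(url, fields):
    """POST url-encoded form data and return the decoded response body."""
    body = urlencode(fields).encode("ascii")
    with urlopen(Request(url, data=body), timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_results(url, initial_fields, post=http_post,
                  pause=time.sleep, delay=3.0):
    """Run steps 1-4: submit, wait, resubmit the hidden form, return results."""
    # Step 1: the reply to the first POST is the "search in progress" page.
    waiting_page = post(url, initial_fields)
    # Step 2: wait, as the page's own JavaScript timer would.
    pause(delay)
    # Step 3: copy every hidden field (including RID) into a new POST.
    parser = _HiddenFields()
    parser.feed(waiting_page)
    if "RID" not in parser.fields:
        raise ValueError("no RID field found in the waiting page")
    # Step 4: the reply to the second POST is the page to harvest.
    return parser.fields["RID"], post(url, parser.fields)
```

The `post` and `pause` arguments exist so you can swap in your own HTTP client or a longer, politer delay without touching the rest.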

Be willing to tinker with your POST and GET parameters to see the effect, and have fun!