Question

I am trying to fetch pages from NCBI, such as
http://www.ncbi.nlm.nih.gov/nuccore/NM_000036. However, when I use Perl's LWP::Simple 'get' function, the output differs from what I get when I save the page manually (with Firefox's 'save as html' option). What 'get' returns lacks the data I need.

Am I doing something wrong? Should I use another tool?

My script is :

use strict;
use warnings;
use LWP::Simple;


my $input_name='GENES.txt';

open (INPUT, $input_name ) || die "unable to open $input_name";
open (OUTPUT,'>', 'Selected_Genes')|| die;

my $line;


while ($line = <INPUT>)
{

    chomp $line;
    print OUTPUT '>'.$line."\n";
    my $URL='http://www.ncbi.nlm.nih.gov/nuccore/'.$line;
#e.g:
#$URL=http://www.ncbi.nlm.nih.gov/nuccore/NM_000036

    my $text=get($URL);
    print $text."\n";   
    $text=~m!\r?\n\r?\s+\/translation="((?:(?:[^"])\r?\n?\r?)*)"!;
    print OUTPUT $1."\n";

}

Thanks in advance!


Solution 2

The content you're looking for is generated by JavaScript. You need to parse the HTML from the first response and find the ID of the data you want:

<meta name="ncbi_uidlist" content="289547499" />

Next, make a second request to a URL of the form: http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=ID_YOU_HAVE

Something like this (untested!):

my $URL='http://www.ncbi.nlm.nih.gov/nuccore/'.$line;

my $html=get($URL);

my ($id) = $html =~m{name="ncbi_uidlist" \s+ content="([^"]+)"}xi;
if ($id) {
    # second request: the viewer returns the record that JavaScript would load
    $html=get( "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=" . $id );
    # only print when the match succeeds, otherwise $1 still holds the ID
    if ($html=~m!\r?\n\r?\s+\/translation="((?:(?:[^"])\r?\n?\r?)*)"!) {
        print OUTPUT $1."\n";
    }
}
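
The two extraction steps in the snippet above can be pulled into small functions and tried on canned strings before any request is sent. This is only a sketch: the names extract_uid and extract_translation are made up here, the sample strings are illustrative rather than real NCBI output, and the translation regex is a simplified stand-in for the one in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the UID out of the first page's
# <meta name="ncbi_uidlist" ...> tag. Returns undef if the tag is absent.
sub extract_uid {
    my ($html) = @_;
    my ($uid) = $html =~ m{name="ncbi_uidlist" \s+ content="([^"]+)"}xi;
    return $uid;
}

# Hypothetical helper: pull the protein sequence out of a /translation="..."
# qualifier and join its wrapped lines. Returns nothing if there is none.
sub extract_translation {
    my ($record) = @_;
    my ($protein) = $record =~ m{/translation="([^"]*)"};
    return unless defined $protein;
    $protein =~ s/\s+//g;    # the qualifier wraps across indented lines
    return $protein;
}

# Made-up inputs, shaped like the meta tag and the qualifier.
my $sample_html   = '<meta name="ncbi_uidlist" content="289547499" />';
my $sample_record = qq{     CDS  1..9\n     /translation="MKV\n     LAA"\n};

print extract_uid($sample_html), "\n";            # prints 289547499
print extract_translation($sample_record), "\n";  # prints MKVLAA
```

Keeping the parsing separate from the fetching makes it easier to see whether a missing result comes from the request or from the regex.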

OTHER TIPS

The page at http://www.ncbi.nlm.nih.gov/nuccore/NM_000036 does a lot of JavaScript processing and also loads a bunch of stuff dynamically. LWP::UserAgent does not do that for you as it cannot run JavaScript.

I suggest you take a look at what is happening in your browser, with Firebug or the Chrome Developer Tools. You'll see it does an XHR request to this URL: http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=289547499&db=nuccore&dopt=genbank&extrafeat=976&fmt_mask=0&retmode=html&withmarkup=on&log$=seqview&maxdownloadsize=1000000

I am not sure how these parameters relate to NM_000036, but you should be able to figure that out by looking at the JS code that runs on the page, or by trying several pages and comparing the URLs of the XHR calls.
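
For experimenting with those parameters, the XHR URL quoted above can be assembled from a list so that individual values are easy to change. A rough sketch: the function name viewer_url is invented here, the parameter names and values are copied from that URL as-is, and which of them the server actually requires has not been checked. Note the single quotes around 'log$': inside a double-quoted Perl string, "$=" would be interpolated as a special variable.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild the sviewer URL seen in the browser's
# network panel for a given UID. Values are assumed to be URL-safe,
# so no percent-encoding is done.
sub viewer_url {
    my ($uid) = @_;
    my @params = (
        val             => $uid,
        db              => 'nuccore',
        dopt            => 'genbank',
        extrafeat       => 976,
        fmt_mask        => 0,
        retmode         => 'html',
        withmarkup      => 'on',
        'log$'          => 'seqview',   # single quotes: "$=" would interpolate
        maxdownloadsize => 1000000,
    );
    my @pairs;
    # Take the list two items at a time to keep the original parameter order.
    while (my ($key, $value) = splice(@params, 0, 2)) {
        push @pairs, "$key=$value";
    }
    return 'http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?' . join('&', @pairs);
}

print viewer_url('289547499'), "\n";
```

Dropping parameters one at a time and comparing the responses is one way to find out which ones matter.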

Since this is probably a public service, and I'm assuming you are allowed to take that data, you should consider asking if they have a proper API that you can hit instead of screen scraping the stuff off of their website.
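
NCBI does publish such an API, the Entrez E-utilities, whose efetch endpoint can return a record as plain text. The following is a sketch under assumptions, not a tested client: the endpoint URL and the db, id, rettype and retmode parameters are written from memory and should be checked against NCBI's current documentation and usage policy, and efetch_url is a name made up for this example.

```perl
use strict;
use warnings;

# Assumed E-utilities endpoint; verify against NCBI's documentation.
my $EFETCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';

# Hypothetical helper: build an efetch URL asking for one nuccore record
# as GenBank flat text.
sub efetch_url {
    my ($accession) = @_;
    return "$EFETCH?db=nuccore&id=$accession&rettype=gb&retmode=text";
}

# Only touch the network when accessions are given on the command line.
if (@ARGV) {
    require LWP::Simple;    # https URLs also need LWP::Protocol::https
    for my $accession (@ARGV) {
        my $record = LWP::Simple::get(efetch_url($accession));
        unless (defined $record) {
            warn "no response for $accession\n";
            next;
        }
        my ($protein) = $record =~ m{/translation="([^"]*)"};
        unless (defined $protein) {
            warn "no /translation qualifier in $accession\n";
            next;
        }
        $protein =~ s/\s+//g;    # join the wrapped sequence lines
        print ">$accession\n$protein\n";
        sleep 1;                 # pause between requests to be polite
    }
}
```

Run as, for example, `perl fetch.pl NM_000036` (the script name is arbitrary). Because the response is plain text rather than a JavaScript-driven page, no second request or UID lookup is needed.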

Licensed under: CC-BY-SA with attribution