Loading a large file into a hash in Perl (BLAST tables)

https://stackoverflow.com//questions/9630911

10-12-2019

Question

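I'm a Perl beginner; please help me with my query. I'm trying to extract information from a BLAST table (a snippet of what it looks like is below). It is standard BLAST table input. Basically, I want to extract the rows for a given list of reads (see my second script below to get an idea of what I want to do). This is what the second script does: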

Inputs:

1) The BLAST table:

38.1    0.53    59544   GH8NFLV01A02ED  GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345  1   YP_003242370    Dynamin family protein [Paenibacillus sp. Y412MC10] -1  0   48.936170212766 40.4255319148936    47  345 1213    13.6231884057971    3.87469084913438    31  171 544 590
34.3    7.5 123828  GH8NFLV01A03QJ  GH8NFLV01A03QJ rank=0239249 x=305.0 y=1945.5 length=452 1   XP_002639994    Hypothetical protein CBG10824 [Caenorhabditis briggsae] 3   0   52.1739130434783    32.6086956521739    46  452 367 10.1769911504425    12.5340599455041    111 248 79  124
37.7    0.70    62716   GH8NFLV01A09B8  GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1   XP_002756773    PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus]   1   0   73.5294117647059    52.9411764705882    34  512 703 6.640625    4.83641536273115    43  144 273 306
37.7    0.98    33114   GH8NFLV01A0H5C  GH8NFLV01A0H5C rank=0066011 x=298.0 y=2638.5 length=573 1   XP_002756773    PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus]   -3  0   73.5294117647059    52.9411764705882    34  573 703 5.93368237347295    4.83641536273115    131 232 273 306
103 1e-020  65742   GH8NFLV01A0MXI  GH8NFLV01A0MXI rank=0124865 x=300.5 y=644.0 length=475  1   ABZ08973    hypothetical protein ALOHA_HF4000APKG6B14ctg1g18 [uncultured marine crenarchaeote HF4000_APKG6B14]  2   0   77.9411764705882    77.9411764705882    68  475 151 14.3157894736842    45.0331125827815    2   205 1   68
41.6    0.053   36083   GH8NFLV01A0QKX  GH8NFLV01A0QKX rank=0071366 x=301.0 y=1279.0 length=526 1   XP_766153   hypothetical protein [Theileria parva strain Muguga]    -1  0   66.6666666666667    56.6666666666667    30  526 304 5.70342205323194    9.86842105263158    392 481 31  60
45.4    0.003   78246   GH8NFLV01A0Z29  GH8NFLV01A0Z29 rank=0148293 x=304.0 y=1315.0 length=432 1   ZP_04111769 hypothetical protein bthur0007_56280 [Bacillus thuringiensis serovar monterrey BGSC 4AJ1]   3   0   51.8518518518518    38.8888888888889    54  432 193 12.5    27.979274611399 48  209 97  150
71.6    4e-011  97250   GH8NFLV01A14MR  GH8NFLV01A14MR rank=0184885 x=317.5 y=609.5 length=314  1   ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244]  1   0   92.5    92.5    40  314 311 12.7388535031847    12.8617363344051    193 312 13  52
58.2    5e-007  154555  GH8NFLV01A1KCH  GH8NFLV01A1KCH rank=0309994 x=310.0 y=2991.0 length=267 1   ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244]  1   0   82.051282051282 82.051282051282 39  267 311 14.6067415730337    12.540192926045 142 258 1   39

2) The reads list:

GH8NFLV01A09B8
GH8NFLV01A02ED
etc
etc

3) The output I want:

37.7    0.70    62716   GH8NFLV01A09B8  GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1   XP_002756773    PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus]   1   0   73.5294117647059    52.9411764705882    34  512 703 6.640625    4.83641536273115    43  144 273 306
38.1    0.53    59544   GH8NFLV01A02ED  GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345  1   YP_003242370    Dynamin family protein [Paenibacillus sp. Y412MC10] -1  0   48.936170212766 40.4255319148936    47  345 1213    13.6231884057971    3.87469084913438    31  171 544 590

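I want a subset of the information in the first list, given a list of the read names I want to extract (the read name is found in the 4th column). Instead of hashing the reads list, I want to hash the BLAST table itself and use the information in column 4 of the BLAST table as the keys, so that I can extract the values for each key even when that key has more than one value (that is, each read name may have more than one hit, or associated BLAST result, in the table). Keep in mind that the value must contain the whole row that has that key (read name) in it.

My greplist.pl script does this but is very slow; I think I need to load the whole table into a hash to speed it up.

Thank you for your help.

My scripts: the broken one (mambo5.pl)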

#!/usr/bin/perl -w
# purpose:  extract blastX data from a list of readnames
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash = <DATA>;
close (DATA);
my $filename=$ARGV[0];
open(OUT, "> $filename.bololom");

my $readName;

while ( <LIST> )
{
    #########;
    if(/^(.*?)$/)#
    {
        $readName=$1;#
        chomp $readName;
        if (exists $hash{$readName})
        {
            print "bingo!";
            my $output =$hash{$readName};
            print OUT "$output\n";
        }
        else 
        {
            print "it aint workin\n";
            #print %hash;
        }           
    }
}
close (LIST);

The slow and quick cheat. It works, but it is very slow (my BLAST tables are about 400 MB to 2 GB in size, so I'm sure you can see why):

#!/usr/bin/perl -w
## 
# This script finds a list of names in a blast table and outputs the result in a new file
# name must exist and list must be correctly formatted
# will not output anything using a "normal" blast file, must be a table blast
# if you have the standard blast output use blast2table script

use strict;
my $filein=$ARGV[0] or die ("usage: ./listgrep.pl readslist blast_table\n");
my $db=$ARGV[1] or die ("usage: ./listgrep.pl readslist blast_table\n");
#open the reads you want to grep
my $read;
my $line;
open(READSLIST,$filein);
while($line=<READSLIST>)
{
    if ($line=~/^(.*)$/) 
    {
        $read = $1;
        print "$read\n";
        system("grep \"$read\" $db >$read\_.out\n");
    }


    #system("grep $read $db >$read\_.out\n");
}
system("cat *\_.out >$filein\_greps.txt\n");
system("rm *.out\n");

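I don't know how to define the 4th column as the key. Maybe I could use the split function, but I tried to find a way to do this for a table that does not have just two columns, to no avail. Please help! If there is an easy way, please let me know.

Thanks!

Solution 3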

Voilà, two ways of doing this, one of which has nothing to do with Perl:

awk 'BEGIN { while ((getline < "reads_list") > 0) ar[$1] = 1 } $4 in ar' blast_table > new_blast_table

Mambo6.pl

#!/usr/bin/perl -w
# purpose:  extract blastX data from a list of readnames. HINT: Make sure your list file only has unique names , that way you save time. 
use strict;
open (DATA, '<', $ARGV[0]) or die ("Usage: ./mambo6.pl BlastXTable readslist\n");
open (LIST, '<', $ARGV[1]) or die ("Usage: ./mambo6.pl BlastXTable readslist\n");
my %hash;
my $val;
my $key;
while (<DATA>)
{
    #chomp;
    if(/((.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?))$/)
    {
        #print "$1\n";
        $key= $5;#read
        $val= $1;#whole row; notice the brackets around the whole match.
        $hash{$key} .= "$val\n";# append every hit on its own line; a read can have several hits
    }
    else {
        print STDERR "something wrong with format at line $.\n";
    }
}
close (DATA);
open(OUT, '>', "$ARGV[1]_out.txt") or die ("Can't write $ARGV[1]_out.txt: $!\n");

my $readName;

while ( <LIST> )
{
    #########;
    if(/^(.*?)$/)#
    {
        $readName=$1;#
        chomp $readName;
        if (exists $hash{$readName})
        {
            print "$readName\n";
            my $output =$hash{$readName};
            print OUT "$output";
        }
        else 
        {
            #print "it aint workin\n";
        }           
    }
}
close (LIST);
close (OUT);

The one-liner is faster and probably better than my script, and I'm sure some people can find easier ways to do it. I just thought I'd put this up since it does what I want.
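The hash-building step can also be sketched with split and a hash of arrays, which keeps each hit as a separate element instead of gluing rows into one string. This is only a minimal sketch: the helper name index_by_read and the tiny in-memory table are made up for illustration, and it assumes the table is tab-separated with the read name in the 4th column.

```perl
use strict;
use warnings;

# Hypothetical helper: index BLAST table rows by read name (4th column).
# Returns a reference to a hash of arrays, one array of rows per read.
sub index_by_read {
    my ($fh) = @_;
    my %rows_for;
    while (my $line = <$fh>) {
        chomp $line;
        # Only the first four tab-separated fields need to be split off.
        my @fields = split /\t/, $line, 5;
        next if @fields < 4;    # skip malformed or empty lines
        push @{ $rows_for{ $fields[3] } }, $line;
    }
    return \%rows_for;
}

# Tiny made-up table (not real BLAST output): READ_A has two hits.
my $table = join '',
    "38.1\t0.53\t59544\tREAD_A\tdesc\thit 1\n",
    "37.7\t0.70\t62716\tREAD_B\tdesc\thit 2\n",
    "34.3\t7.5\t12382\tREAD_A\tdesc\thit 3\n";

open my $fh, '<', \$table or die "Can't open in-memory table: $!";
my $rows_for = index_by_read($fh);
close $fh;

# Print every row for each requested read, in the order of the request.
for my $read (qw(READ_B READ_A READ_MISSING)) {
    next unless exists $rows_for->{$read};
    print "$_\n" for @{ $rows_for->{$read} };
}
```

As with Mambo6.pl, the whole table ends up in memory, so for tables in the gigabyte range this may need more RAM than is comfortable.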

Other tips

I'd do the opposite, i.e. read the readslist file into a hash, then walk through the big BLAST file and print the desired lines.

#!/usr/bin/perl 
use strict;
use warnings;
use 5.010;

# Read the readslist file into a hash
open my $fh, '<', 'readslist' or die "Can't open 'readslist' for reading:$!";
my %readslist = map { chomp; $_ => 1 }<$fh>;
close $fh;

open my $fh_blast, '<', 'blastfile' or die "Can't open 'blastfile' for reading:$!";
# loop on all the blastfile lines
while (<$fh_blast>) {
    chomp;
    # retrieve the key (4th column)
    my ($key) = (split/\s+/)[3];
    # print the line if the key exists in the hash
    say $_ if exists $readslist{$key};
}
close $fh_blast;
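The same idea can be wrapped in a sub that works on filehandles, which makes it easy to try on small data before pointing it at the real files. This is a sketch under assumptions: filter_blast is a hypothetical name, the in-memory strings stand in for the readslist and BLAST files, and lines with fewer than four columns are silently skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: copy to $out_fh every line of $blast_fh whose
# 4th whitespace-separated column is listed in $reads_fh.
sub filter_blast {
    my ($reads_fh, $blast_fh, $out_fh) = @_;

    # Small file: one read name per line, kept in memory as hash keys.
    my %wanted;
    while (my $name = <$reads_fh>) {
        $name =~ s/^\s+|\s+$//g;    # trim the newline and stray spaces
        $wanted{$name} = 1 if length $name;
    }

    # Big file: streamed one line at a time, never held in memory.
    my $kept = 0;
    while (my $line = <$blast_fh>) {
        my $key = (split ' ', $line, 5)[3];
        next unless defined $key && $wanted{$key};
        print {$out_fh} $line;
        $kept++;
    }
    return $kept;
}

# Made-up demo data standing in for the real files.
my $reads  = "READ_B\nREAD_A\n";
my $blast  = "1 2 3 READ_A x\n1 2 3 READ_C y\n1 2 3 READ_B z\n";
my $result = '';

open my $reads_fh, '<', \$reads  or die $!;
open my $blast_fh, '<', \$blast  or die $!;
open my $out_fh,   '>', \$result or die $!;
my $kept = filter_blast($reads_fh, $blast_fh, $out_fh);
close $out_fh;

print $result;
print STDERR "kept $kept line(s)\n";
```

Note that with this approach the output follows the order of the BLAST table, not the order of the reads list.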

I suggest you build an index to turn your blasts file temporarily into an indexed-sequential file. Read through it and build a hash of addresses within the file where every record for each key starts.

After that it is just a matter of seeking to the correct places in the file to pick up the records required. This will certainly be faster than most simple solutions, as it entails reading the big file only once. This example code demonstrates.

use strict;
use warnings;

use Fcntl qw/SEEK_SET/;

my %index;

open my $blast, '<', 'blast.txt' or die $!;

until (eof $blast) {
  my $place = tell $blast;
  my $line = <$blast>;
  my $key = (split ' ', $line, 5)[3];
  push @{$index{$key}}, $place;
}

open my $reads, '<', 'reads.txt' or die $!;

while (<$reads>) {

  next unless my ($key) = /(\S+)/;
  next unless my $places = $index{$key};

  foreach my $place (@$places) {
    seek $blast, $place, SEEK_SET;
    my $line = <$blast>;
    print $line;
  }
}
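To try the tell/seek idea without a large blast.txt, the two steps can be split into subs and run against an in-memory file. This is a rough sketch only: build_offset_index and rows_for are hypothetical names, and the three-line data string is made up.

```perl
use strict;
use warnings;
use Fcntl qw/SEEK_SET/;

# Hypothetical helper: map each read name (4th column) to the byte
# offsets at which its rows start in the file behind $fh.
sub build_offset_index {
    my ($fh) = @_;
    my %index;
    until (eof $fh) {
        my $place = tell $fh;     # offset of the line about to be read
        my $line  = <$fh>;
        my $key   = (split ' ', $line, 5)[3];
        push @{ $index{$key} }, $place if defined $key;
    }
    return \%index;
}

# Hypothetical helper: fetch all rows for one read by seeking to them.
sub rows_for {
    my ($fh, $index, $key) = @_;
    my @rows;
    for my $place (@{ $index->{$key} || [] }) {
        seek $fh, $place, SEEK_SET or die "seek failed: $!";
        push @rows, scalar <$fh>;
    }
    return @rows;
}

# Made-up data in an in-memory file, so the sketch runs without blast.txt.
my $data = "1 2 3 READ_A x\n1 2 3 READ_B y\n1 2 3 READ_A z\n";
open my $blast, '<', \$data or die $!;

my $index = build_offset_index($blast);
print rows_for($blast, $index, $_) for qw(READ_B READ_A);
```

The index holds one number per row rather than the row text, so it stays much smaller than a hash of whole lines.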

License: CC-BY-SA with attribution

Not affiliated with StackOverflow