Perl脚本从BLAST结果中选择最佳BLAST基因线

https://stackoverflow.com//questions/21004024

20-12-2019
|

题

我有BLAST输出的巨大文件，我需要选择查询ID，主题GI和帧（基本上整行），省略重复行（省略与其他更高的电子值省略所有其他行）。这就是文件的样子：

# BLASTX 2.2.28+
# 0 hits found
# BLASTX 2.2.28+
# Query: Tx6_c1_seq1
# Database: /mnt/swissprot
# Fields: query id, subject gi, subject title, subject length, gap opens, q. start, q. end, s. start, s. end, evalue, % subject coverage, % identity, query/sbjct frames
# 24 hits found
Tx6_c1_seq1 6439823 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 434 1   9   173 224 282 1e-06   65  32.20   3/0
Tx6_c1_seq1 577332  RecName: Full=Putative E3 ubiquitin-protein ligase SINAT1; AltName: Full=Seven in absentia homolog 1    305 1   9   179 111 171 3e-05   67  32.79   3/0
Tx6_c1_seq1 3548505 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 419 2   9   173 209 267 8e-05   65  32.20   3/0
Tx6_c1_seq1 577547  RecName: Full=E3 ubiquitin-protein ligase siah2; AltName: Full=Seven in absentia homolog 2; AltName: Full=Xsiah-2   313 1   15  173 125 181 2e-04   62  29.82   3/0
Tx6_c1_seq1 577417  RecName: Full=E3 ubiquitin-protein ligase Siah1; AltName: Full=Seven in absentia homolog 1; Short=Siah-1    282 1   15  173 96  152 3e-04   62  29.82   3/0
Tx6_c1_seq1 577554  RecName: Full=E3 ubiquitin-protein ligase SINAT2; AltName: Full=Seven in absentia homolog 2 308 1   9   179 114 174 4e-04   67  31.15   3/0
# BLASTX 2.2.28+
# Query: Tx_11_c0_seq1
# Database: /mnt/swissprot
# Fields: query id, subject gi, subject title, subject length, gap opens, q. start, q. end, s. start, s. end, evalue, % subject coverage, % identity, query/sbjct frames
# 1 hits found
Tx_11_c0_seq1   977285  RecName: Full=120.7 kDa protein in NOF-FB transposable element  1056    15  957 28  147 455 8e-13   79  27.81   -2/0
# BLASTX 2.2.28+
# Query: Tx_11_c1_seq1

这种情况下的预期输出应该只有这两行，因为它们是具有最小e_value的那些：

Tx6_c1_seq1 6439823 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 434 1   9   173 224 282 1e-06   65  32.20   3/0
Tx_11_c0_seq1   977285  RecName: Full=120.7 kDa protein in NOF-FB transposable element  1056

我有我的代码，但似乎没有工作。你可以帮我解决这个问题吗？我真的很感激你的时间和帮助。这是我到目前为止的内容：

#!/usr/bin/perl -w

# Author:
# 01/07/2014
# This script removes duplicate records from a "short" format BLAST output file, and keeps only the "best" records  (sorts by smallest e-value and then biggest percent identity)
# Usage: bestblast.pl <input file> <output file>

#-----------------------------------------------------------------------------------------------------------------------------------------
#Deal with passed parameters
#-----------------------------------------------------------------------------------------------------------------------------------------
#If no arguments are passed, show usage message and exit program.
if ($#ARGV == -1) {
    usage("BLAST BEST 1.0 2014");
    exit;
}

#get the names of the input file (first argument passed) and output file (second argument passed)
$in_file = $ARGV[0];
$out_file = $ARGV[1];

#Open the input file for reading, open the output file for writing.
#If either are unsuccessful, print an error message and exit program.
unless ( open(IN, "$in_file") ) {
    usage("Got a bad input file: $in_file");
    exit;
}
unless ( open(OUT, ">$out_file") ) {
    usage("Got a bad output file: $out_file");
    exit;
}

#Everything looks good. Print the parameters we've found.
print "Parameters:\ninput file = $in_file\noutput file = $out_file\n\n";

#-----------------------------------------------------------------------------------------------------------------------------------------
#The main event
#-----------------------------------------------------------------------------------------------------------------------------------------

$counter = 0;
$total_counter = 0;

print "De-duplicating File...\n";

@in = <IN>;

#Do stuff for each line of text in the input file.
foreach $line (@in) {
    #if the line starts with a pound symbol, it is not real data, so skip this line.
    if ( $line =~ /^#/ ) {
     next;
     }

    #Count the total number of data lines in the file.
    $total_counter++;

    #The chomp commands removes any new line (and carriage return) characters from the end of the line.
    chomp($line);

    #Split up the tab delimited line, naming only the variables we are interested in (i.e. query id, subject gi, subject title, subject length, gap opens, q. start, q. end, s. start, s. end, evalue, % subject coverage, % identity, query/sbjct frames)
    ($query_id, $subject_gi, $subject_title, $subject_length, $gap_opens, $q_start, $q_end, $s_start, $s_end, $evalue, $subject_coverage, $identity, $query_sbjct_frames) = split(/\t/, $line);

    #check to see if the id label is already in the list of ids (called dedupe)
    #if its not there, add it.
    if ( $dedupe{$query_id} ) {
    #if it is, look at the old line to see if it is still "better" than the new one.
    ($query_id, $subject_gi, $subject_title, $subject_length, $gap_opens, $q_start, $q_end, $s_start, $s_end, $list_evalue, $subject_coverage, $list_identity, $query_sbjct_frames) = split(/\t/,$dedupe{$query_id});

    #if the new evalue is better than the old one, change the value of this id to the new line.
    #otherwise, if the the new evalue is the same, and the percent_identity is better, change the value of this id to the new line.
    #otherwise, don't do anything (keep the old line).
    if ( $evalue < $list_evalue ) {
        $dedupe{$query_id} = $line;
    }
    elsif ( $evalue == $list_evalue ) {
        if ( $identity > $list_identity ) {
        $dedupe{$query_id} = $line;
        }
    }
    }
    else {
    $dedupe{$query_id} = $line;
    #count the number of non-duplicated lines we have.
    $counter++;
    }
}
print "Total # records = $total_counter\nBest only # records = $counter\n";
print "Writing to output file...\n";

#Print the final "dedupe" list to the new file (adding the new line back on the end).
foreach $query_id (sort keys %dedupe) {
    print OUT "$dedupe{$query_id}\n";
}

#Close the files.
close(IN);
close(OUT);
print "Done.\n";

#-----------------------------------------------------------------------------------------------------------------------------------------
#Subroutines
#-----------------------------------------------------------------------------------------------------------------------------------------
sub usage {
    my($message) = @_;
    print "\n$message\n";

    print "\nThis script removes duplicate records from a \"short\" format BLAST output file, and keeps only the \"best\" records.\nIt sorts by smallest e-value and then biggest percent identity.\n";
    print "Usage: bestbenter code herelast.pl <input file> <output file>\n";
    print "\n Author \n";
    print "01/07/2014\n";
}

解决方案

在谢信后尝试在顶部添加“使用strict” - 它可能有助于找到别的东西。

尝试替换“如果（定义（$depeue {$ query_id}）”（$ dedupe {$ query_id}）“（$ dedupe {$ query_id}））”

牢记大多数人都没有生物学家/基因组师（！），不知道你在谈论什么，我们只是看到对我们没有任何意义的数字和单词，所以如果你可以更好地解释，我们可以能够帮助更多。

顺便说一句，以下是更加惯用的。

next if $line =~ /^#/;

代码始终从第64行到81。它永远不会进入次要测试 - 它永远不会找到重复。尝试在调试器中运行它，其中：

perl -d yourprog INFILE OUTFILE

. 然后重复为“n”进行“n”。您可以使用“p变量名称”打印变量值。

如果我将split（）s从选项卡更改为空间，我得到它 - 至少有正确的输出记录数量：

De-duplicating File...
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "AltName:" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "Full=Seven" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "absentia" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "Full=Seven" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "absentia" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "Full=Seven" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "absentia" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Total # records = 7
Best only # records = 2
Writing to output file...
Done.
iMac:~/tmp: more out
Tx6_c1_seq1 6439823 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 434 1   9   173 224 282 1e-06   65  32.20   3/0
Tx_11_c0_seq1   977285  RecName: Full=120.7 kDa protein in NOF-FB transposable element  1056    15  957 28  147 455 8e-13   79  27.81   -2/0

其他提示

为什么要编写自己的爆炸解析器。使用BioPerl

http://www.bioperl.org/wiki/howto:searchio#ncbi-blast_parsing_problems

我不再使用perl，但这是粗略的想法

while (my $result = $report->next_result) {
    print "Query: ".$result->query_name."\n";
    while (my $hit = $result->next_hit) {
        while ($hsp = $hit->next_hsp) {
            my evalue = $hsp->evalue;
            #convert to decimal notation
            $decimal_notation = sprintf("%.10g", $scientific_notation);

            ##... i'll leave the rest up to you
        }
     }
}

值在科学符号中，Perl将在执行少于比较时将其视为字符串。

我想我还会以不同的方式做的...

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow