
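I have a huge file of BLAST output and need to select, for each query ID, the line with the lowest E-value (query ID, subject GI and frame; basically the whole line), omitting the duplicate lines, i.e. all the other lines for that query. This is what the file looks like: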

# BLASTX 2.2.28+
# 0 hits found
# BLASTX 2.2.28+
# Query: Tx6_c1_seq1
# Database: /mnt/swissprot
# Fields: query id, subject gi, subject title, subject length, gap opens, q. start, q. end, s. start, s. end, evalue, % subject coverage, % identity, query/sbjct frames
# 24 hits found
Tx6_c1_seq1 6439823 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 434 1   9   173 224 282 1e-06   65  32.20   3/0
Tx6_c1_seq1 577332  RecName: Full=Putative E3 ubiquitin-protein ligase SINAT1; AltName: Full=Seven in absentia homolog 1    305 1   9   179 111 171 3e-05   67  32.79   3/0
Tx6_c1_seq1 3548505 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 419 2   9   173 209 267 8e-05   65  32.20   3/0
Tx6_c1_seq1 577547  RecName: Full=E3 ubiquitin-protein ligase siah2; AltName: Full=Seven in absentia homolog 2; AltName: Full=Xsiah-2   313 1   15  173 125 181 2e-04   62  29.82   3/0
Tx6_c1_seq1 577417  RecName: Full=E3 ubiquitin-protein ligase Siah1; AltName: Full=Seven in absentia homolog 1; Short=Siah-1    282 1   15  173 96  152 3e-04   62  29.82   3/0
Tx6_c1_seq1 577554  RecName: Full=E3 ubiquitin-protein ligase SINAT2; AltName: Full=Seven in absentia homolog 2 308 1   9   179 114 174 4e-04   67  31.15   3/0
# BLASTX 2.2.28+
# Query: Tx_11_c0_seq1
# Database: /mnt/swissprot
# Fields: query id, subject gi, subject title, subject length, gap opens, q. start, q. end, s. start, s. end, evalue, % subject coverage, % identity, query/sbjct frames
# 1 hits found
Tx_11_c0_seq1   977285  RecName: Full=120.7 kDa protein in NOF-FB transposable element  1056    15  957 28  147 455 8e-13   79  27.81   -2/0
# BLASTX 2.2.28+
# Query: Tx_11_c1_seq1

In this case the expected output should be just these two lines, since they have the smallest e-value for each query:

Tx6_c1_seq1 6439823 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 434 1   9   173 224 282 1e-06   65  32.20   3/0
Tx_11_c0_seq1   977285  RecName: Full=120.7 kDa protein in NOF-FB transposable element  1056

I wrote some code, but it does not seem to work. Could you please help me fix it? I would really appreciate your time and help. This is what I have so far:

#!/usr/bin/perl -w

# Author:
# 01/07/2014
# This script removes duplicate records from a "short" format BLAST output file, and keeps only the "best" records  (sorts by smallest e-value and then biggest percent identity)
# Usage: bestblast.pl <input file> <output file>

#Deal with passed parameters
#If no arguments are passed, show usage message and exit program.
if ($#ARGV == -1) {
    usage("BLAST BEST 1.0 2014");
    exit;
}

#get the names of the input file (first argument passed) and output file (second argument passed)
$in_file = $ARGV[0];
$out_file = $ARGV[1];

#Open the input file for reading, open the output file for writing.
#If either are unsuccessful, print an error message and exit program.
unless ( open(IN, "$in_file") ) {
    usage("Got a bad input file: $in_file");
    exit;
}
unless ( open(OUT, ">$out_file") ) {
    usage("Got a bad output file: $out_file");
    exit;
}

#Everything looks good. Print the parameters we've found.
print "Parameters:\ninput file = $in_file\noutput file = $out_file\n\n";

#The main event

$counter = 0;
$total_counter = 0;

print "De-duplicating File...\n";

@in = <IN>;

#Do stuff for each line of text in the input file.
foreach $line (@in) {
    #if the line starts with a pound symbol, it is not real data, so skip this line.
    if ( $line =~ /^#/ ) {
        next;
    }

    #Count the total number of data lines in the file.
    $total_counter++;

    #The chomp command removes the trailing newline from the end of the line.
    chomp($line);

    #Split up the tab delimited line, naming only the variables we are interested in (i.e. query id, subject gi, subject title, subject length, gap opens, q. start, q. end, s. start, s. end, evalue, % subject coverage, % identity, query/sbjct frames)
    ($query_id, $subject_gi, $subject_title, $subject_length, $gap_opens, $q_start, $q_end, $s_start, $s_end, $evalue, $subject_coverage, $identity, $query_sbjct_frames) = split(/\t/, $line);

    #check to see if the id label is already in the list of ids (called dedupe)
    #if its not there, add it.
    if ( $dedupe{$query_id} ) {
        #if it is, look at the old line to see if it is still "better" than the new one.
        ($query_id, $subject_gi, $subject_title, $subject_length, $gap_opens, $q_start, $q_end, $s_start, $s_end, $list_evalue, $subject_coverage, $list_identity, $query_sbjct_frames) = split(/\t/, $dedupe{$query_id});

        #if the new evalue is better than the old one, change the value of this id to the new line.
        #otherwise, if the new evalue is the same, and the percent_identity is better, change the value of this id to the new line.
        #otherwise, don't do anything (keep the old line).
        if ( $evalue < $list_evalue ) {
            $dedupe{$query_id} = $line;
        }
        elsif ( $evalue == $list_evalue ) {
            if ( $identity > $list_identity ) {
                $dedupe{$query_id} = $line;
            }
        }
    }
    else {
        $dedupe{$query_id} = $line;
        #count the number of non-duplicated lines we have.
        $counter++;
    }
}
print "Total # records = $total_counter\nBest only # records = $counter\n";
print "Writing to output file...\n";

#Print the final "dedupe" list to the new file (adding the new line back on the end).
foreach $query_id (sort keys %dedupe) {
    print OUT "$dedupe{$query_id}\n";
}

#Close the files.
close(IN);
close(OUT);
print "Done.\n";

sub usage {
    my($message) = @_;
    print "\n$message\n";

    print "\nThis script removes duplicate records from a \"short\" format BLAST output file, and keeps only the \"best\" records.\nIt sorts by smallest e-value and then biggest percent identity.\n";
    print "Usage: bestblast.pl <input file> <output file>\n";
    print "\n Author \n";
    print "01/07/2014\n";
}



Try adding "use strict;" after the shebang. It may help you find other problems.

Replace "($dedupe{$query_id})" with "(defined($dedupe{$query_id}))".
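To see what that change buys, here is a small sketch of the difference between testing a hash value for truth and testing whether it is defined or exists. The hash, its keys and the helper name classify are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical hash, only for illustration.
my %seen = (a => "some line", b => "0", c => "");

# Hypothetical helper: report "true,defined,exists" as 0/1 flags for one key.
sub classify {
    my ($hash, $key) = @_;
    my $true    = $hash->{$key}          ? 1 : 0;   # false for "0" and ""
    my $defined = defined($hash->{$key}) ? 1 : 0;   # true for any stored non-undef value
    my $exists  = exists($hash->{$key})  ? 1 : 0;   # true if the key is present at all
    return "$true,$defined,$exists";
}

print "$_: ", classify(\%seen, $_), "\n" for qw(a b c d);
```

In the script from the question the stored values are whole data lines, so the plain truth test would only misbehave for an empty line or a line consisting of the single character 0; defined (or exists) simply states the intent more precisely.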

Bear in mind that most of us are not biologists or genomicists (!) and have no idea what you are talking about; we just see numbers and words that mean nothing to us. So the better you can explain the problem, the more we can help.

The following is more idiomatic:

next if $line =~ /^#/;

The code always goes from line 64 to line 81. It never enters the second test at all, so it never finds a duplicate. Try running it in the debugger:

perl -d yourprog INFILE OUTFILE

Then press "n" repeatedly to step to the next line. You can print the value of a variable with "p" followed by the variable name (for example "p $query_id").

If I change the delimiter in the split() calls from a tab to a space, I at least get the correct number of output records:

De-duplicating File...
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "AltName:" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "Full=Seven" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "absentia" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "Full=Seven" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "absentia" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "Full=Seven" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "absentia" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Total # records = 7
Best only # records = 2
Writing to output file...
iMac:~/tmp: more out
Tx6_c1_seq1 6439823 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 434 1   9   173 224 282 1e-06   65  32.20   3/0
Tx_11_c0_seq1   977285  RecName: Full=120.7 kDa protein in NOF-FB transposable element  1056    15  957 28  147 455 8e-13   79  27.81   -2/0
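The warnings in that run come from splitting on a single space: the subject title itself contains spaces, so words from the title land in the e-value and identity variables. One possible workaround, sketched here under the assumption that only the title contains whitespace and the ten columns after it never do, is to split on whitespace and count fields from the end of the line. The helper name parse_hit is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the fields of interest out of one data line.
# Assumes only the subject title contains whitespace, so the last ten
# columns can be addressed by counting back from the end of the line.
sub parse_hit {
    my ($line) = @_;
    my @f = split ' ', $line;      # split on any run of whitespace
    return if @f < 13;             # comment, blank or incomplete line
    return {
        query_id => $f[0],
        evalue   => $f[-4],        # 4th column from the end
        identity => $f[-2],        # 2nd column from the end
        frames   => $f[-1],        # last column
    };
}

my $hit = parse_hit("Tx6_c1_seq1 6439823 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 434 1   9   173 224 282 1e-06   65  32.20   3/0");
printf "%s evalue=%s identity=%s frames=%s\n",
    @{$hit}{qw(query_id evalue identity frames)};
```

If the real file is genuinely tab-delimited, splitting on /\t/ as in the original script is simpler and does not depend on this assumption.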

Other tips

Why are you trying to write your own BLAST parser? Use BioPerl.


I don't use Perl much any more, but here is a rough idea of what to do:

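use Bio::SearchIO;

# $report is assumed to be a Bio::SearchIO parser already opened on the BLAST report
while (my $result = $report->next_result) {
    print "Query: ".$result->query_name."\n";
    while (my $hit = $result->next_hit) {
        while (my $hsp = $hit->next_hsp) {
            my $evalue = $hsp->evalue;
            #format the number (note that %g keeps exponent notation for very small values)
            my $decimal_notation = sprintf("%.10g", $evalue);

            ##... i'll leave the rest up to you
        }
    }
}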

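The e-values are in scientific notation. Perl's numeric operators (<, ==) do convert strings such as "1e-06" to numbers, but a string comparison (lt, or a default sort) would compare them character by character, so make sure the comparison is numeric.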
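A quick way to check how Perl actually compares e-values written in scientific notation is a snippet like the following. The sample values are taken from the data in the question; the variable names are only for illustration.

```perl
use strict;
use warnings;

# E-values as they would be read from the file (plain strings).
my ($x, $y) = ("8e-13", "1e-06");

# Numeric operators convert the strings to numbers before comparing.
my $numeric_less = ($x <  $y) ? 1 : 0;
# String operators compare character by character ("8" sorts after "1").
my $string_less  = ($x lt $y) ? 1 : 0;

print "numeric: $numeric_less, string: $string_less\n";

# Sorting needs the numeric comparator for the same reason.
my @sorted = sort { $a <=> $b } ("3e-05", "1e-06", "8e-13");
print "@sorted\n";
```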

I would also do the de-duplication part differently...
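One way the de-duplication could be done differently, offered as a rough sketch and not necessarily what was meant here: read the file line by line instead of slurping it into an array (the file is described as huge), and keep the parsed e-value and identity next to each best line so the stored line never has to be split again. The sub name best_hits is hypothetical, the demo records are made up, and the input is assumed to be genuinely tab-delimited with the column order given in the "# Fields:" header.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash of query id => best data line.
# Assumes tab-delimited columns with evalue in column 10 and
# % identity in column 12 (1-based), as in the "# Fields:" header.
sub best_hits {
    my ($fh) = @_;
    my %best;    # query id => [ evalue, identity, line ]
    while (my $line = <$fh>) {
        next if $line =~ /^#/;          # skip comment lines
        chomp $line;
        my @f = split /\t/, $line;
        next if @f < 13;                # skip blank or malformed lines
        my ($id, $evalue, $identity) = @f[0, 9, 11];
        my $old = $best{$id};
        # Keep the new line if it is the first one for this query, has a
        # smaller e-value, or ties on e-value with a higher identity.
        if (   !$old
            || $evalue < $old->[0]
            || ($evalue == $old->[0] && $identity > $old->[1])) {
            $best{$id} = [ $evalue, $identity, $line ];
        }
    }
    return map { $_ => $best{$_}[2] } keys %best;
}

# Small in-memory demonstration with made-up records.
my $data = join "", map { join("\t", @$_) . "\n" }
    [ qw(q1 11 titleA 100 1 1 9 1 9 3e-05 60 30.00 1/0) ],
    [ qw(q1 12 titleB 100 1 1 9 1 9 1e-06 60 29.00 1/0) ],
    [ qw(q2 13 titleC 100 1 1 9 1 9 8e-13 70 27.81 -2/0) ];
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my %result = best_hits($fh);
close $fh;
print "$result{$_}\n" for sort keys %result;
```

Memory use then grows with the number of distinct queries, not with the number of lines in the file.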

License: CC-BY-SA with attribution
Not affiliated with StackOverflow