Script Perl pour sélectionner la meilleure ligne de gènes de souffle du résultat de l'explosion

https://stackoverflow.com//questions/21004024

20-12-2019
|

Question

J'ai d'énormes fichiers avec une sortie d'explosion et j'ai besoin de sélectionner l'ID de requête, du sujet GI et de la trame (essentiellement de la ligne entière) avec la valeur électronique la plus faible en omettant les lignes en double (omettant toutes les autres valeurs électroniques supérieures.).Voici comment le fichier ressemble à:

# BLASTX 2.2.28+
# 0 hits found
# BLASTX 2.2.28+
# Query: Tx6_c1_seq1
# Database: /mnt/swissprot
# Fields: query id, subject gi, subject title, subject length, gap opens, q. start, q. end, s. start, s. end, evalue, % subject coverage, % identity, query/sbjct frames
# 24 hits found
Tx6_c1_seq1 6439823 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 434 1   9   173 224 282 1e-06   65  32.20   3/0
Tx6_c1_seq1 577332  RecName: Full=Putative E3 ubiquitin-protein ligase SINAT1; AltName: Full=Seven in absentia homolog 1    305 1   9   179 111 171 3e-05   67  32.79   3/0
Tx6_c1_seq1 3548505 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 419 2   9   173 209 267 8e-05   65  32.20   3/0
Tx6_c1_seq1 577547  RecName: Full=E3 ubiquitin-protein ligase siah2; AltName: Full=Seven in absentia homolog 2; AltName: Full=Xsiah-2   313 1   15  173 125 181 2e-04   62  29.82   3/0
Tx6_c1_seq1 577417  RecName: Full=E3 ubiquitin-protein ligase Siah1; AltName: Full=Seven in absentia homolog 1; Short=Siah-1    282 1   15  173 96  152 3e-04   62  29.82   3/0
Tx6_c1_seq1 577554  RecName: Full=E3 ubiquitin-protein ligase SINAT2; AltName: Full=Seven in absentia homolog 2 308 1   9   179 114 174 4e-04   67  31.15   3/0
# BLASTX 2.2.28+
# Query: Tx_11_c0_seq1
# Database: /mnt/swissprot
# Fields: query id, subject gi, subject title, subject length, gap opens, q. start, q. end, s. start, s. end, evalue, % subject coverage, % identity, query/sbjct frames
# 1 hits found
Tx_11_c0_seq1   977285  RecName: Full=120.7 kDa protein in NOF-FB transposable element  1056    15  957 28  147 455 8e-13   79  27.81   -2/0
# BLASTX 2.2.28+
# Query: Tx_11_c1_seq1

La production attendue dans ce cas ne devrait être que ces deux lignes, car elles sont celles avec les plus petites e_value:

Tx6_c1_seq1 6439823 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 434 1   9   173 224 282 1e-06   65  32.20   3/0
Tx_11_c0_seq1   977285  RecName: Full=120.7 kDa protein in NOF-FB transposable element  1056

J'ai mon code écrit, mais ne semble pas fonctionner.Pourriez-vous les gars s'il vous plaît m'aider à résoudre ce problème.J'apprécierais vraiment votre temps et votre aide.C'est ce que j'ai jusqu'à présent:

#!/usr/bin/perl -w

# Author:
# 01/07/2014
# This script removes duplicate records from a "short" format BLAST output file, and keeps only the "best" records  (sorts by smallest e-value and then biggest percent identity)
# Usage: bestblast.pl <input file> <output file>

#-----------------------------------------------------------------------------------------------------------------------------------------
#Deal with passed parameters
#-----------------------------------------------------------------------------------------------------------------------------------------
#If no arguments are passed, show usage message and exit program.
if ($#ARGV == -1) {
    usage("BLAST BEST 1.0 2014");
    exit;
}

#get the names of the input file (first argument passed) and output file (second argument passed)
$in_file = $ARGV[0];
$out_file = $ARGV[1];

#Open the input file for reading, open the output file for writing.
#If either are unsuccessful, print an error message and exit program.
unless ( open(IN, "$in_file") ) {
    usage("Got a bad input file: $in_file");
    exit;
}
unless ( open(OUT, ">$out_file") ) {
    usage("Got a bad output file: $out_file");
    exit;
}

#Everything looks good. Print the parameters we've found.
print "Parameters:\ninput file = $in_file\noutput file = $out_file\n\n";

#-----------------------------------------------------------------------------------------------------------------------------------------
#The main event
#-----------------------------------------------------------------------------------------------------------------------------------------

$counter = 0;
$total_counter = 0;

print "De-duplicating File...\n";

@in = <IN>;

#Do stuff for each line of text in the input file.
foreach $line (@in) {
    #if the line starts with a pound symbol, it is not real data, so skip this line.
    if ( $line =~ /^#/ ) {
     next;
     }

    #Count the total number of data lines in the file.
    $total_counter++;

    #The chomp commands removes any new line (and carriage return) characters from the end of the line.
    chomp($line);

    #Split up the tab delimited line, naming only the variables we are interested in (i.e. query id, subject gi, subject title, subject length, gap opens, q. start, q. end, s. start, s. end, evalue, % subject coverage, % identity, query/sbjct frames)
    ($query_id, $subject_gi, $subject_title, $subject_length, $gap_opens, $q_start, $q_end, $s_start, $s_end, $evalue, $subject_coverage, $identity, $query_sbjct_frames) = split(/\t/, $line);

    #check to see if the id label is already in the list of ids (called dedupe)
    #if its not there, add it.
    if ( $dedupe{$query_id} ) {
    #if it is, look at the old line to see if it is still "better" than the new one.
    ($query_id, $subject_gi, $subject_title, $subject_length, $gap_opens, $q_start, $q_end, $s_start, $s_end, $list_evalue, $subject_coverage, $list_identity, $query_sbjct_frames) = split(/\t/,$dedupe{$query_id});

    #if the new evalue is better than the old one, change the value of this id to the new line.
    #otherwise, if the the new evalue is the same, and the percent_identity is better, change the value of this id to the new line.
    #otherwise, don't do anything (keep the old line).
    if ( $evalue < $list_evalue ) {
        $dedupe{$query_id} = $line;
    }
    elsif ( $evalue == $list_evalue ) {
        if ( $identity > $list_identity ) {
        $dedupe{$query_id} = $line;
        }
    }
    }
    else {
    $dedupe{$query_id} = $line;
    #count the number of non-duplicated lines we have.
    $counter++;
    }
}
print "Total # records = $total_counter\nBest only # records = $counter\n";
print "Writing to output file...\n";

#Print the final "dedupe" list to the new file (adding the new line back on the end).
foreach $query_id (sort keys %dedupe) {
    print OUT "$dedupe{$query_id}\n";
}

#Close the files.
close(IN);
close(OUT);
print "Done.\n";

#-----------------------------------------------------------------------------------------------------------------------------------------
#Subroutines
#-----------------------------------------------------------------------------------------------------------------------------------------
sub usage {
    my($message) = @_;
    print "\n$message\n";

    print "\nThis script removes duplicate records from a \"short\" format BLAST output file, and keeps only the \"best\" records.\nIt sorts by smallest e-value and then biggest percent identity.\n";
    print "Usage: bestbenter code herelast.pl <input file> <output file>\n";
    print "\n Author \n";
    print "01/07/2014\n";
}

La solution

Essayez d'ajouter "Utiliser strict" en haut après le shebang - cela peut aider à localiser autre chose.

Essayez de remplacer "si ($ dédupe {$ query_id})" with "si (défini ($ dédupe {$ query_id}))"

Gardez à l'esprit que la plupart des gens ne sont donc pas des biologistes / génomistes (!) et ne savent pas de quoi vous parlez, nous voyons simplement des chiffres et des mots qui ne veulent rien dire pour nous, donc si vous pouvez expliquer mieux, nous pouvonsêtre capable d'aider plus.

Ce qui suit est plus idiomatique, au fait.

next if $line =~ /^#/;

Votre code va toujours de la ligne 64 à 81. Il n'entre jamais les tests secondaires du tout - il ne trouve jamais des doublons.Essayez de l'exécuter dans le débogueur, avec ceci:

perl -d yourprog INFILE OUTFILE

Alors do "n" pour "ligne suivante" à plusieurs reprises.Vous pouvez imprimer des valeurs variables avec «Nom variable P».

Si je change le séparateur pour la division () S de l'onglet en espace, je reçois ceci - qui a le nombre correct d'enregistrements de sortie au moins:

De-duplicating File...
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "AltName:" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "Full=Seven" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "absentia" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "Full=Seven" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "absentia" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "Full=Seven" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "absentia" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Total # records = 7
Best only # records = 2
Writing to output file...
Done.
iMac:~/tmp: more out
Tx6_c1_seq1 6439823 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 434 1   9   173 224 282 1e-06   65  32.20   3/0
Tx_11_c0_seq1   977285  RecName: Full=120.7 kDa protein in NOF-FB transposable element  1056    15  957 28  147 455 8e-13   79  27.81   -2/0

Autres conseils

Pourquoi essayez-vous d'écrire votre propre analyseur de souffle.Utilisez BIOPERL

http://www.bioperl.org/wiki/howto:rsearchIOIO#Ncbi-blast_parsing_problems

Je n'utilise pas trop de Perl, mais voici l'idée approximative de quoi faire

while (my $result = $report->next_result) {
    print "Query: ".$result->query_name."\n";
    while (my $hit = $result->next_hit) {
        while ($hsp = $hit->next_hsp) {
            my evalue = $hsp->evalue;
            #convert to decimal notation
            $decimal_notation = sprintf("%.10g", $scientific_notation);

            ##... i'll leave the rest up to you
        }
     }
}

La valeur est en notation scientifique, Perl traitera-la comme une chaîne lorsque vous faites moins que la comparaison.

Je pense que je ferais aussi la dédénité de choses différemment ...

Licencié sous: CC-BY-SA avec attribution

Non affilié à StackOverflow