How to Rewrite of One Line Code (or Less Line Code in command line) of this code in Perl?

https://stackoverflow.com/questions/5402405

28-10-2019
|

Question

I have a code like that:

#!/usr/bin/perl
use strict;
use warnings;      
my %proteins = qw/
    UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
    CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
    AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
    GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
    /;
open(INPUT,"<dna.txt");
while (<INPUT>) {    
    tr/[a,c,g,t]/[A,C,G,T]/;
    y/GCTA/CGAU/;    
    foreach my $protein (/(...)/g) {
        if (defined $proteins{$protein}) {
        print $proteins{$protein};
        }
}
}
close(INPUT);

This code is related to my other question's answer: DNA to RNA and Getting Proteins with Perl

The output of the program is:

SIMQNISGREAT

How can I rewrite that code with Perl, it will run on command line and it will be rewritten with less code(if possible one line code)?

PS 1: dna.txt is like that:

TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT

PS 2: If the code will be less line, it is accepted to write the my %proteins variable into a file.

Solution

Somebody (@kamaci) called my name in another thread. This is the best I can come up with while keeping the protein table on the command line:

perl -nE'say+map+substr("FYVDINLHL%VEMKLQL%VEIKLQFYVDINLHCSGASTRPWSGARTRP%SGARTRPCSGASTR",(s/GGG/GGC/i,vec($_,0,32)&101058048)%63,1),/.../g' dna.txt

(Shell quoting, for Windows quoting swap ' and " characters). This version marks invalid codons with %, you can probably fix that by adding =~y/%//d at an appropriate spot.

Hint: This picks out 6 bits from the raw ASCII encoding of an RNA triple, giving 64 codes between 0 and 101058048; to get a string index, I reduce the result modulo 63, but this creates one double mapping which regrettably had to code two different proteins. The s/GGG/GGC/i maps one of them to another that codes the right protein.

Also note the parentheses before the % operator which both isolate the , operator from the argument list of substr and fix the precedence of & vs %. If you ever use that in production code, you're a bad, bad person.

OTHER TIPS

The only changes I would recommend making are simplifying your while loop:

while (<INPUT>) {
    tr/acgt/ACGT/;
    tr/GCTA/CGAU/;
    foreach my $protein (/(...)/g) {
        if (defined $proteins{$protein}) {
            print $proteins{$protein};
        }
    }
}

Since y and tr are synonyms, you should only use one of them. I think tr reads better than y, so I picked tr. Further, you were calling them very differently, but this should be the same effect and only mentions the letters you actually change. (All the other characters were being transposed to themselves. That makes it much harder to see what is actually being changed.)

You might want to remove the open(INPUT,"<dna.txt"); and corresponding close(INPUT); lines, as they make it much harder to use your program in shell pipelines or with different input files. But that's up to you, if the input file will always be dna.txt and never anything different, this is alright.

#!/usr/bin/perl
%p=qw/UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G/;
$_=uc<DATA>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g
__DATA__
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT

Phew. Best I can come up with, at least this quickly. If you're sure the input is always already in uppercase, you can also drop the uc saving another two characters. Or if the input is always the same, you could assign it to $_ straight away instead of reading it from anywhere.

I guess I don't need to say that this code should not be used in production environments or anywhere else other than pure fun. When doing actual programming, readability almost always wins over compactness.

A few other versions I mentioned in the comments:

Reading %p and the DNA from files:

#!/usr/bin/perl
open A,"<p.txt";map{map{/(...)/;$p{$1}=chop}/(... .)/g}<A>;
open B,"<dna.txt";$_=uc<B>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g

From shell with perl -e:

perl -e 'open A,"<p.txt";map{map{/(...)/;$p{$1}=chop}/(... .)/g}<A>;open B,"<dna.txt";$_=uc<B>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g'

Most things have already been pointed out, especially that readability matters. I wouldn't try to reduce the program more than what follows.

use strict;
use warnings;
# http://stackoverflow.com/questions/5402405/
my $fnprot = shift || 'proteins.txt';
my $fndna  = shift || 'dna.txt';
# build protein table
open my $fhprot, '<', $fnprot or die "open $fnprot: $!";
my %proteins = split /\s+/, do { local $/; <$fhprot> };
close $fhprot;
# process dna data
my @result;
open my $fhdna, '<', $fndna or die "open $fndna: $!";
while (<$fhdna>) {
    tr/acgt/ACGT/;
    tr/GCTA/CGAU/;
    push @result, map $proteins{$_}, grep defined $proteins{$_}, m/(...)/g;
}
close $fhdna;
# check correctness of result (given input as per original post)
my $expected = 'SIMQNISGREAT';
my $got = join '', @result;
die "@result is not expected" if $got ne $expected;
print "@result - $got\n";

The only "one-liner" thing I added is the push map grep m//g in the while loop. Note that Perl 5.10 adds the "defined or" operator - // - which allows you to write:

push @result, map $proteins{$_} // (), m/(...)/g;

Ah okay, the open do local $/ file slurp idiom is handy for slurping small files into memory. Hope you find it a bit inspiring. :-)

If write proteins data to another file, space delimited and without line break. So, you can import data by reading file once time.

#!/usr/bin/perl
use strict;
use warnings;      

open(INPUT, "<mydata.txt");
open(DATA, "<proteins.txt");
my %proteins = split(" ",<DATA>);

while (<INPUT>) {
    tr/GCTA/CGAU/;
    while(/(\w{3})/gi) {print $proteins{$1} if (exists($proteins{$1}))};
}
close(INPUT);
close(DATA);

You can remove line of code "tr/a,c,g,t/A,C,G,T/" because match operator has option for case insensitive (i option). And original foreach loop can be optimized like code above. $1 variable here is matched pattern result inside parentheses of match operation /(\w{3})/gi

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow