Counting occurrences of a word in a file, counting the number of occurrences of any repeated word. Perl

StackOverflow https://stackoverflow.com/questions/22961594

Question

I read a file line by line and use a regex to split each line into scalar variables, as below. The regex works fine.

while (<GTFFILE>) {

        if ($_ =~ /(^\d)\s+\w+\s+(\w+)\s+(\d+)\s+(\d+)\s+\.\s+\W\s+\.\s+(\w+\_\w+\s+\"\w+\"\;)/){

    my $gene = $1;
    my $type = $2;
    my $start = $3;
    my $end = $4;
    my $geneId = $5;

Next, I attempt to build hashes from the values captured by the regex.

    $featurestart{$start} = $start;
    $featureend{$end} = $end;   

I need to find the length of the exons using the hashes built from the regex captures. This is done per line, but I receive the error: Missing $ on loop variable. Any ideas?

            for each ($_) { 
            $exonlength = ($featureend{$_} - $featurestart{$_});
            printf ("Exon lengths: = %1.1f\n", $exonlength);
            }

Here I am clueless. I want to find the occurrences of each word in $geneId. How would I go about matching unknown words and counting the distinct occurrences of each one? I am guessing I need some way to group the repeats of a word together, perhaps in a hash or array, and then count each group somehow?

                    $geneCount{$geneId} = $type; 
                foreach $geneId { 

                }
        }   
    }
}

Each line of a GTF file looks like this:

1 unknown exon 3204563 3207049 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15240"; transcript_id "NM_001011874.1"; tss_id "TSS13146";

This is what the regex is reading. The third field varies between lines: it can be exon, cds, etc., with only one per line, so counting the occurrences of the word exon counts the number of exons in the file. The two numbers separated by a space after 'exon' are coordinates; the exon length is calculated by subtracting the first number from the second. The phrases separated by ';' are grouped as geneId. I want to count the occurrences of this section throughout the whole file. As with exon, its value changes, but here it is not known in advance what the string might be, so the idea is to find how many different strings occur in this variable.


Solution

Okay, there are several issues. First, allow me to put in the obligatory link to https://metacpan.org/pod/Bio::Perl that I always itch to post when anyone mentions genes and parses files.

When you get to

if ($_ =~ /exon/)

$_ is still the whole line, so you check whether the current line contains the string "exon". I assume you want to count occurrences of that string? Sadly, tr/// will not do that for you. It treats its argument as a set of characters, so it replaces every "e", "x", "o" or "n" with itself and counts how often that happened. You end up counting characters, not occurrences of the word "exon". If you insist on this clunky way of counting, s/exon/exon/g instead of the tr/// stuff should work, because s///g returns the number of substitutions made.
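
To make the difference concrete, here is a minimal sketch comparing the two counts. The sample string and the variable names are made up for illustration, and the global match in list context is a common Perl idiom swapped in here as an alternative to s/exon/exon/g, not something from the question.

```perl
use strict;
use warnings;

# Hypothetical input line, chosen so that the two counts differ.
my $line = 'exon one, exon two, and a noun';

# tr/// treats "exon" as a set of characters, so this counts every
# e, x, o and n in the line. With an empty replacement list and no /d,
# the string itself is left unchanged.
my $charCount = ($line =~ tr/exon//);

# A global match in list context returns one element per match;
# assigning that list to () in scalar context yields the match count.
my $wordCount = () = $line =~ /\bexon\b/g;

print "characters: $charCount, words: $wordCount\n";
```

Unlike s/exon/exon/g, the match-based count never writes to the string, and the \b anchors keep it from counting "exon" inside a longer word.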

edit: Okay, sorry, I had to interrupt writing.

For your error: What are you trying to loop over? If you mean

foreach ($_) {

then that does not make a lot of sense, as $_ is only one element. And what is the length of an exon? I have no idea what an exon is at all. But I assume you meant to fill your hashes in some other way. As written, each hash has the same keys as values, so there is not much point in having them in the first place.
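
As a rough sketch of one way the hashes could be filled so that a loop over them is meaningful: the keyword is written as one word, foreach (or for), and it takes a loop variable and a list. The hash layout here (start coordinate as key, end coordinate as value) and the second coordinate pair are assumptions for illustration, not taken from the question.

```perl
use strict;
use warnings;

# Hypothetical layout: start coordinate => end coordinate, one pair per exon.
# The first pair is the sample line from the question; the second is made up.
my %featureEnd = (
    3204563 => 3207049,
    3411783 => 3411982,
);

my %exonLength;

# Loop over the start coordinates in ascending numeric order.
foreach my $start (sort { $a <=> $b } keys %featureEnd) {
    $exonLength{$start} = $featureEnd{$start} - $start;
    printf "Exon starting at %d has length %d\n", $start, $exonLength{$start};
}
```

Keying by start coordinate only works if no two exons share a start; if they can, pushing the lengths onto an array avoids overwriting entries.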

If you want help with anything after the counting, you will certainly have to give more information about what your input looks like and what you are trying to do.

edit 2 After the question has been edited:

Okay, if that is what you want to do, you can do something like the following:

my $numberOfExon = 0;   # We will increase this whenever we meet an exon.
my @exonLength;         # This array will store all the exon lengths          
my %geneCount;          # This hash will store the counts per geneId

while (<GTFFILE>) {

    if ($_ =~ /(^\d)\s+\w+\s+(\w+)\s+(\d+)\s+(\d+)\s+\.\s+\W\s+\.\s+(\w+\_\w+\s+\"\w+\"\;)/){

        my $gene = $1;
        my $type = $2;
        my $start = $3;
        my $end = $4;
        my $geneId = $5;

        if ($_ =~ /exon/){

            $numberOfExon++;              # just count the lines that have exon in them
            my $length = $end - $start;   # just calculate the length

            push @exonLength, $length;    # Do with the length whatever you want

            $geneCount{$geneId}++;        # Increase the number of times this Id was seen
                                          # If this was the first time, a new field is created
        }
    }
}

print "Number of Exon: $numberOfExon \n";
print "Count of Ids:\n";
use Data::Dumper;
print Dumper(\%geneCount);

This counts only the IDs of exons, not of other feature types. If you want the others too, move $geneCount{$geneId}++ to just after the closing } of the if-exon block.
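
For counting every feature type and every ID regardless of type, a self-contained sketch might look like the following. It reads from an in-memory list instead of the GTFFILE handle so it can be run on its own; the second sample line (the CDS) and the names %typeCount and $distinctIds are made up for illustration, and the regex is the question's pattern reduced to the two captures used here.

```perl
use strict;
use warnings;

# Sample lines in the format described in the question. The first is a
# shortened form of the question's example; the CDS line is hypothetical.
my @lines = (
    '1 unknown exon 3204563 3207049 . - . gene_id "Xkr4"; gene_name "Xkr4";',
    '1 unknown CDS 3206103 3207049 . - . gene_id "Xkr4"; gene_name "Xkr4";',
);

my (%typeCount, %geneCount);

for my $line (@lines) {
    # Skip lines that do not look like a feature line.
    next unless $line =~ /^\d\s+\w+\s+(\w+)\s+\d+\s+\d+\s+\.\s+\W\s+\.\s+(\w+\s+"\w+";)/;
    my ($type, $geneId) = ($1, $2);

    $typeCount{$type}++;     # one counter per feature type (exon, CDS, ...)
    $geneCount{$geneId}++;   # one counter per distinct ID string
}

# keys in scalar context gives the number of distinct IDs seen.
my $distinctIds = keys %geneCount;

print "$_: $typeCount{$_}\n" for sort keys %typeCount;
print "Distinct gene ids: $distinctIds\n";
```

Because the hash keys are whatever strings turn up in the file, nothing needs to be known about them in advance; the number of keys answers "how many different strings occur", and each value answers "how often".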

License: CC-BY-SA with attribution
Not affiliated with StackOverflow