Counting occurrences of a word in a file, counting the number of occurrences of any repeated word. Perl

Question

Okay, there are several issues. First, allow me to post the obligatory link to https://metacpan.org/pod/Bio::Perl, which I always itch to share when anyone mentions genes and parsing files.

When you get to

if ($_ =~ /exon/)

$_ is still the whole line, so you are checking whether the current line contains the string "exon". I assume you want to count occurrences of that string? Sadly, tr/// will not do that for you. It treats its argument as a list of characters: it replaces every "e", "x", "o" or "n" with itself and returns how often that happened. So you are counting characters, not occurrences of the word "exon". If you insist on this clunky way of counting, s/exon/exon/g in place of the tr/// should work, since s///g returns the number of substitutions it made.
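To make the difference concrete, here is a small sketch comparing the character count from tr/// with two ways of counting the word itself. The sample line is made up for illustration, and the list-context match is an alternative to the s///g approach, not something taken from your code:

```perl
use strict;
use warnings;

# Made-up sample line, only for illustration.
my $line = "exon one, exon two, and a non-exon";

# tr/// with an empty replacement list counts matching characters:
# every "e", "x", "o" or "n" anywhere in the line.
my $chars = ($line =~ tr/exon//);

# s///g returns the number of substitutions, so replacing "exon" with
# itself counts occurrences of the word (the clunky way).
my $subs = ($line =~ s/exon/exon/g);

# A global match in list context returns all matches; assigning that
# list to () and then to a scalar yields the number of matches.
my $matches = () = $line =~ /exon/g;

print "characters: $chars, s///g: $subs, m//g: $matches\n";
```

The tr/// count is much larger than the other two because it also picks up the letters in words like "one" and "and".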

Edit: Okay, sorry, I had to interrupt writing.

For your error: What are you trying to loop over? If you mean

foreach ($_) {

then that does not make much sense, as $_ holds only one element. And what is the length of an exon? I have no idea what an exon is at all. But I assume you meant to fill your hashes in some other way. As it stands, each key is identical to its value, so there is not much point in having them in the first place.
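A tiny sketch of why looping over $_ adds nothing: a one-element list makes the loop body run exactly once per line. The two strings below are made-up stand-ins for lines read from the file:

```perl
use strict;
use warnings;

my $iterations = 0;

# The outer loop plays the role of while (<GTFFILE>): $_ is one line.
for ("first line\n", "second line\n") {
    foreach ($_) {        # a list with a single element
        $iterations++;    # so this runs once per line, no more
    }
}

print "inner loop body ran $iterations time(s) for 2 lines\n";
```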

If you want help with anything after the counting, you will have to give more information about what your input looks like and what you are trying to do.

Edit 2, after the question was edited:

Okay, if that is what you want to do, you can do something like the following:

my $numberOfExon = 0;   # We will increase this whenever we meet an exon.
my @exonLength;         # This array will store all the exon lengths
my %geneCount;          # This hash will store the counts per geneId

while (<GTFFILE>) {

    if ($_ =~ /(^\d)\s+\w+\s+(\w+)\s+(\d+)\s+(\d+)\s+\.\s+\W\s+\.\s+(\w+\_\w+\s+\"\w+\"\;)/){

        my $gene = $1;
        my $type = $2;
        my $start = $3;
        my $end = $4;
        my $geneId = $5;

        if ($_ =~ /exon/){

            $numberOfExon++;              # just count the lines that have exon in them
            my $length = $end - $start;   # just calculate the length; GTF start and end are
                                          # both inclusive, so add 1 for the number of bases

            push @exonLength, $length;    # Do with the length whatever you want

            $geneCount{$geneId}++;        # Increase the number of times this Id was seen
                                          # If this was the first time, a new field is created
        }
    }
}

print "Number of Exon: $numberOfExon \n";
print "Count of Ids:\n";
use Data::Dumper;
print Dumper(\%geneCount);

This counts only the IDs of exons, not of any other feature types. If you want the others as well, move $geneCount{$geneId}++ to after the closing } of the if exon block, so that it runs for every line that matches the pattern.
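That variation could be sketched roughly as follows. Several things here are assumptions for illustration: the sample lines and gene IDs are made up, the lines come from an array rather than a file, %exonCount is a hypothetical extra hash, and the exon test uses the captured feature column ($type eq 'exon') instead of matching /exon/ against the whole line:

```perl
use strict;
use warnings;

# Made-up, GTF-like sample lines (tab-separated), only for illustration.
my @lines = (
    qq{1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "geneA";},
    qq{1\tsrc\tCDS\t120\t180\t.\t+\t.\tgene_id "geneA";},
    qq{1\tsrc\texon\t300\t450\t.\t-\t.\tgene_id "geneB";},
);

my $numberOfExon = 0;
my %geneCount;    # matching lines per gene ID, whatever the feature type
my %exonCount;    # exon lines per gene ID (hypothetical extra hash)

for my $line (@lines) {
    # Same pattern as above, minus the unneeded backslashes.
    next unless $line =~ /^(\d)\s+\w+\s+(\w+)\s+(\d+)\s+(\d+)\s+\.\s+\W\s+\.\s+(\w+_\w+\s+"\w+";)/;
    my ($type, $geneId) = ($2, $5);

    $geneCount{$geneId}++;      # runs for every line that matches

    if ($type eq 'exon') {      # runs for exon lines only
        $numberOfExon++;
        $exonCount{$geneId}++;
    }
}

print "Number of Exon: $numberOfExon\n";
print "$_: $geneCount{$_} line(s), ", ($exonCount{$_} || 0), " exon(s)\n"
    for sort keys %geneCount;
```

Keeping the two hashes separate lets you compare exon counts against total counts per ID; if you only need one of them, drop the other.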