Question

I have a series of tab-delimited files (up to 16 of them). Each one looks something like:

gi|100816391|ref|NM_003934.1|   1   162 192
gi|104485445|ref|NM_138572.2|   7   2316    2376
gi|105554499|ref|NR_002791.2|   1   2792    2867

Each file could contain as many as 20 million lines. Some of these lines will be unique; some of them will be repeated many times. What I need to do is to create a table that lists each unique line as well as how frequently that line occurs in each of the files. Output would ideally look something like:

"Gene Name" \t "Read start" \t "alignstart" \t "alignend" \t "freq in file1" \t "freq in file2" \t etc.

gi|100816391|ref|NM_003934.1| \t 1 \t 162 \t 192 \t 10000 \t 200
gi|104485445|ref|NM_138572.2| \t 7 \t 2316 \t 2376 \t 2 \t 500

Etc.

I am relatively new at programming and am trying to get up to speed as fast as possible, focusing on perl. I haven't yet seen any posts that are close enough to what I'm doing that I think I can modify them, but am happy to take suggestions if you think this has been addressed before.

Was it helpful?

Solution

Assuming the contents of the files are as follows (two files here):

my %files = (
    file1 => [
        'gi|100816391|ref|NM_003934.1|   1   162 192',
        'gi|104485445|ref|NM_138572.2|   7   2316    2376',
        'gi|105554499|ref|NR_002791.2|   1   2792    2867',
        'gi|100816391|ref|NM_003934.1|   1   162 192',
        'gi|104485445|ref|NM_138572.2|   7   2316    2376',
    ],
    file2 => [
        'gi|104485445|ref|NM_138572.2|   7   2316    2376',
        'gi|105554499|ref|NR_002791.2|   1   2792    2867',
        'gi|105554499|ref|NR_002791.2|   1   2792    2867',
        'gi|104485445|ref|NM_138572.2|   7   2316    2376',
    ]
);

Piece of script:

my %data;
# Here you would loop over all your real files
# with open ... while() ... instead of this foreach loop
foreach my $file (keys %files) {
    foreach (@{$files{$file}}) {
        $data{$_}{$file}++;
    }
}
foreach my $line (keys %data) {
    my $freq = $line;
    foreach my $file (sort keys %files) {
        $freq .= "\t$file:" . (exists $data{$line}{$file} ? $data{$line}{$file} : 0);
    }
    print $freq, "\n";
}

Output:

gi|105554499|ref|NR_002791.2|   1   2792    2867    file1:1 file2:2
gi|100816391|ref|NM_003934.1|   1   162 192 file1:2 file2:0
gi|104485445|ref|NM_138572.2|   7   2316    2376    file1:2 file2:2
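The foreach loop over the in-memory arrays stands in for real file I/O. A minimal sketch of the open ... while() version the comment alludes to, assuming the actual file names are passed on the command line (the count_lines sub and its output format here are illustrative, not the answer's exact code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the same line -> file -> count structure, but from real files.
sub count_lines {
    my @files = @_;
    my %data;
    foreach my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            $data{$line}{$file}++;   # tally this line for this file
        }
        close $fh;
    }
    return \%data;
}

# One row per unique line, with a per-file count column.
my @files = @ARGV;    # e.g. perl count.pl file1.txt file2.txt ...
my $data  = count_lines(@files);
foreach my $line (sort keys %$data) {
    my $row = $line;
    foreach my $file (@files) {
        $row .= "\t$file:" . (exists $data->{$line}{$file} ? $data->{$line}{$file} : 0);
    }
    print $row, "\n";
}
```

With 20 million lines per file, memory use is proportional to the number of *unique* lines, since only the hash keys and counts are kept, not the files themselves.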

Other tips

Have a try with this sort of thing to get you going:

File1:

gi|100816391|ref|NM_003934.1|   1       162     192
gi|104485445|ref|NM_138572.2|   7       2316    2376
gi|105554499|ref|NR_002791.2|   1       2792    2867

File2:

gi|100816391|ref|NM_003934.1|   1       162     192 # The same as in file 1
gi|104485445|ref|NM_111111.2|   7       2316    2376 # Different from file 1
gi|105554499|ref|NR_222222.2|   1       2792    2867 # Different from file 1

Code:

#!/usr/bin/perl
use warnings;
use strict;

open my $input, '<', 'in.txt' or die "Cannot open in.txt: $!";

my (%file1, %seen);
while (<$input>){
    chomp;
    my @split = split /\t/;
    $file1{$split[0]} = $_; # index the line by its first field (gene name)
    $seen{$_}++;            # count each time you see an identical line
}
close $input;

open my $input2, '<', 'in.2.txt' or die "Cannot open in.2.txt: $!";

my %file2;
while (<$input2>){
    chomp;
    my @split = split /\t/;
    $file2{$split[0]} = $_;
    $seen{$_}++;
}
close $input2;

foreach my $key (keys %seen){
    print "$key\tfreq: $seen{$key}\n"; # each line with its frequency of occurrence
}

Output:

gi|105554499|ref|NR_222222.2|   1   2792    2867    freq: 1
gi|100816391|ref|NM_003934.1|   1   162 192 freq: 2
gi|105554499|ref|NR_002791.2|   1   2792    2867    freq: 1
gi|104485445|ref|NM_111111.2|   7   2316    2376    freq: 1
gi|104485445|ref|NM_138572.2|   7   2316    2376    freq: 1

You can do this with awk:

awk '{a[$0]++}END{for (i in a){print i,a[i]}}' yourfile

As each line is read, the element of array a[] indexed by the whole line ($0) is incremented, counting that occurrence. In the END block, each key of a[] (each distinct line) is printed along with its count.

So, after the first line, array a[] will look like this:

a["gi|100816391|ref|NM_003934.1|   1   162 192"]=1

and after the second line it will look like this:

a["gi|100816391|ref|NM_003934.1|   1   162 192"]=1
a["gi|104485445|ref|NM_138572.2|   7   2316    2376"]=1

If you have 16 files to process, put the above in a loop:

#!/usr/bin/bash
for f in *.csv
do
  echo "Processing file $f"
  awk '{a[$0]++} END {for (i in a) print i, a[i]}' "$f"
done
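The loop above produces a separate frequency table per file, so the per-file counts still have to be merged into the single table the question asks for. As a sketch, one awk pass over all the files can build the combined table directly by keying the count array on (line, FILENAME) pairs; file1 and file2 here are placeholder names standing in for the real 16 files:

```shell
# two tiny sample files standing in for the real inputs
printf 'lineA\nlineA\nlineB\n' > file1
printf 'lineB\n' > file2

awk '
  FNR == 1 { files[++nf] = FILENAME }        # remember each file, in order
  { count[$0, FILENAME]++; lines[$0] = 1 }   # tally this line for this file
  END {
    for (l in lines) {
      row = l
      for (i = 1; i <= nf; i++)
        row = row "\t" (((l, files[i]) in count) ? count[l, files[i]] : 0)
      print row
    }
  }
' file1 file2
```

The (key, FILENAME) subscript and the `(l, files[i]) in count` membership test are plain POSIX awk, so no GNU extensions are needed; note the output row order is whatever `for (l in lines)` yields, so pipe through sort if you need a stable order.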

The answer by M42 was the one I most readily understood and could modify; I will let people with actual programming experience say whether this is actually the best approach or not. In any event, I modified his program slightly to fit my situation. The final program that worked was:

#!/usr/bin/perl
use strict;
use warnings;

my $sourcefolder    = "/home/guests/etc";
my $destfolder      = "/home/guests/etc";
my $sourceextension = "fwd"; # the extension of the files I want to process
my $outfile         = "combine_sum-out";

my (%data, @samples);

opendir my $dir, $sourcefolder or die "Cannot open directory $sourcefolder: $!";
while (my $filename = readdir $dir) {
    next unless $filename =~ /\.\Q$sourceextension\E$/;
    print "Now processing: $filename\n";
    my $sample = (split /\./, $filename)[0]; # strip the extension from the source file name
    push @samples, $sample;

    open my $in, '<', "$sourcefolder/$filename"
        or die "Can't open $filename: $!\n";
    while (my $line = <$in>) {
        chomp $line;
        $data{$line}{$sample}++; # builds the hash of hashes
    }
    close $in;
}
closedir $dir;

open my $out, '>', "$destfolder/$outfile"
    or die "Can't write to $outfile: $!\n";
foreach my $line (keys %data) {
    my $freq = $line;
    foreach my $sa (@samples) {
        $freq .= "\t$sa:" . (exists $data{$line}{$sa} ? $data{$line}{$sa} : 0);
    }
    print $out $freq, "\n";
}
close $out;

I may eventually modify the last block so that only the count values are printed as columns, with a header row at the beginning.
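One reading of that modification, as a minimal sketch: the toy %data and @samples below stand in for the structures built by the program above, and the sample names become the header row while each data row carries only the counts.

```perl
use strict;
use warnings;

# Toy data standing in for the %data hash of hashes and @samples list
# built by the program above (names are hypothetical).
my @samples = ('sample1', 'sample2');
my %data = (
    "geneA\t1\t162\t192"   => { sample1 => 3 },
    "geneB\t7\t2316\t2376" => { sample1 => 1, sample2 => 5 },
);

# Header row: one label column, then the sample names.
print join("\t", "line", @samples), "\n";

# One row per unique line: the line itself, then counts only (0 if absent).
foreach my $line (sort keys %data) {
    my @counts = map { exists $data{$line}{$_} ? $data{$line}{$_} : 0 } @samples;
    print join("\t", $line, @counts), "\n";
}
```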

Thanks to all for the help!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow