Question

I have a series of tab-delimited files (up to 16 of them). Each one looks something like:

gi|100816391|ref|NM_003934.1|   1   162 192
gi|104485445|ref|NM_138572.2|   7   2316    2376
gi|105554499|ref|NR_002791.2|   1   2792    2867

Each file could contain as many as 20 million lines. Some of these lines will be unique; some of them will be repeated many times. What I need to do is to create a table that lists each unique line as well as how frequently that line occurs in each of the files. Output would ideally look something like:

"Gene Name" \t "Read start" \t "alignstart" \t "alignend" \t "freq in file1" \t "freq in file2" \t etc.

gi|100816391|ref|NM_003934.1| \t 1 \t 162 \t 192 \t 10000 \t 200
gi|104485445|ref|NM_138572.2| \t 7 \t 2316 \t 2376 \t 2 \t 500

Etc.

I am relatively new to programming and am trying to get up to speed as fast as possible, focusing on Perl. I haven't yet seen any posts close enough to what I'm doing that I could modify them, but I'm happy to take suggestions if you think this has been addressed before.


Solution

Assuming the contents of the files are (2 files here):

my %files = (
    file1 => [
        'gi|100816391|ref|NM_003934.1|   1   162 192',
        'gi|104485445|ref|NM_138572.2|   7   2316    2376',
        'gi|105554499|ref|NR_002791.2|   1   2792    2867',
        'gi|100816391|ref|NM_003934.1|   1   162 192',
        'gi|104485445|ref|NM_138572.2|   7   2316    2376',
    ],
    file2 => [
        'gi|104485445|ref|NM_138572.2|   7   2316    2376',
        'gi|105554499|ref|NR_002791.2|   1   2792    2867',
        'gi|105554499|ref|NR_002791.2|   1   2792    2867',
        'gi|104485445|ref|NM_138572.2|   7   2316    2376',
    ]
);

Piece of script:

my %data;
# Here you have to loop on all your files
# and do open ... while() ... instead of this foreach loop
foreach my $file (keys %files) {
    foreach (@{$files{$file}}) {
        $data{$_}{$file}++;
    }
}
foreach my $line (keys %data) {
    my $freq = $line;
    foreach my $file (sort keys %files) {
        $freq .= "\t$file:" . (exists $data{$line}{$file} ? $data{$line}{$file} : 0);
    }
    print $freq, "\n";
}

output:

gi|105554499|ref|NR_002791.2|   1   2792    2867    file1:1 file2:2
gi|100816391|ref|NM_003934.1|   1   162 192 file1:2 file2:0
gi|104485445|ref|NM_138572.2|   7   2316    2376    file1:2 file2:2

Other tips

Have a try with this sort of thing to get you going:

File1:

gi|100816391|ref|NM_003934.1|   1       162     192
gi|104485445|ref|NM_138572.2|   7       2316    2376
gi|105554499|ref|NR_002791.2|   1       2792    2867

File2:

gi|100816391|ref|NM_003934.1|   1       162     192 # The same as in file 1
gi|104485445|ref|NM_111111.2|   7       2316    2376 # Different from file 1
gi|105554499|ref|NR_222222.2|   1       2792    2867 # Different from file 1

Code:

#!/usr/bin/perl
use warnings;
use strict; 

open my $input, '<', 'in.txt' or die "Cannot open in.txt: $!";

my (%file1, %seen);
while (<$input>){
    chomp;
    my @split = split(/\t/);
    $file1{$split[0]} = $_;
    $seen{$_}++; # Count each time you see an identical line in file
}

open my $input2, '<', 'in.2.txt' or die "Cannot open in.2.txt: $!";

my %file2;
while (<$input2>){
    chomp;
    my @split = split(/\t/);
    $file2{$split[0]} = $_;
    $seen{$_}++; # Keep counting across both files
}


foreach my $key (keys %seen){
    print "$key\tfreq: $seen{$key}\n"; # Print out all lines with their frequency of occurrence
}

Output:

gi|105554499|ref|NR_222222.2|   1   2792    2867    freq: 1
gi|100816391|ref|NM_003934.1|   1   162 192 freq: 2
gi|105554499|ref|NR_002791.2|   1   2792    2867    freq: 1
gi|104485445|ref|NM_111111.2|   7   2316    2376    freq: 1
gi|104485445|ref|NM_138572.2|   7   2316    2376    freq: 1

You can do this with awk:

awk '{a[$0]++}END{for (i in a){print i,a[i]}}' yourfile

As each line is encountered, the element of array a[] indexed by the line is incremented to count that occurrence of the line. Then, at the end, the keys of a[] are printed along with their counts.

So, after the first line, array a[] will look like this:

a["gi|100816391|ref|NM_003934.1|   1   162 192"]=1

after the second line, array a[] will look like this:

a["gi|100816391|ref|NM_003934.1|   1   162 192"]=1
a["gi|104485445|ref|NM_138572.2|   7   2316    2376"]=1
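A quick way to see the one-liner in action, on a tiny hypothetical input (`x` twice, `y` once):

```shell
f=$(mktemp)
printf 'x\nx\ny\n' > "$f"
# Count each distinct line; the order of "for (i in a)" is unspecified, so sort it.
counts=$(awk '{a[$0]++} END {for (i in a) print i, a[i]}' "$f" | sort)
echo "$counts"
rm -f "$f"
```

This prints `x 2` and `y 1`.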

If you have 16 to do, put the above in a loop:

#!/usr/bin/bash
for f in *.csv
do
  echo Processing file "$f"
  awk '{a[$0]++}END{for (i in a){print i,a[i]}}' "$f"
done
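Note that this loop prints a separate table per file, while the question asks for one combined table with a count column per file. A sketch of that combined form in a single awk pass, using tiny hypothetical inputs `f1` and `f2`:

```shell
dir=$(mktemp -d)
printf 'a\na\nb\n' > "$dir/f1"
printf 'b\nb\n'    > "$dir/f2"

combined=$(cd "$dir" && awk '
  FNR == 1 { files[++nf] = FILENAME }        # record files in command-line order
  { count[$0 "\t" FILENAME]++; lines[$0] }   # tally per line, per file
  END {
    for (l in lines) {
      row = l
      for (i = 1; i <= nf; i++) {
        k = l "\t" files[i]
        row = row "\t" (k in count ? count[k] : 0)
      }
      print row
    }
  }' f1 f2 | sort)
echo "$combined"
rm -rf "$dir"
```

Each output row is a unique line followed by its count in every file, zero included.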

The answer by M42 was the one I most readily understood and could modify; I will let people with actual programming experience say whether this is actually the best approach or not. In any event, I modified his program slightly to fit my situation. The final program that worked was:

#!/usr/bin/perl
use strict;
use warnings;

my $sourcefolder    = "/home/guests/etc";
my $destfolder      = "/home/guests/etc";
my $sourceextension = "fwd"; # the extension of the files I want to change

my (%data, @samples);

opendir my $dir, $sourcefolder or die "Cannot open directory $sourcefolder: $!";

my $outfile = "combine_sum-out";
open my $out, '>>', "$destfolder/$outfile" or die "Can't write to $outfile: $!";

while (my $filename = readdir $dir) {
    next unless $filename =~ /\.\Q$sourceextension\E$/;
    print "Now processing: $filename\n";
    my $sample = (split /\./, $filename)[0]; # this is to get rid of the extension on the source files
    push @samples, $sample;

    open my $in, '<', "$sourcefolder/$filename" or die "Can't open $filename: $!";
    while (my $line = <$in>) {
        chomp $line;
        $data{$line}{$sample}++; # builds the hash of hashes
    }
    close $in;
}

foreach my $line (keys %data) {
    my $freq = $line;
    foreach my $sa (@samples) {
        $freq .= "\t$sa:" . (exists $data{$line}{$sa} ? $data{$line}{$sa} : 0);
    }
    print $out $freq, "\n";
}

I may eventually modify the last block so that only the count values are printed in each row, with a header row of sample names at the beginning.

Thanks to all for the help!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow