Question

I have a tab-delimited text file like this:

contig11 GO:100 other columns of data
contig11 GO:289 other columns of data
contig11 GO:113 other columns of data
contig22 GO:388 other columns of data
contig22 GO:101 other columns of data

And another like this:

contig11 3 N
contig11 1 Y
contig22 1 Y
contig22 2 N

I need to combine them so that each 'multiple' entry of one of the files is duplicated and populated with its data in the other, so that I get:

contig11 3 N GO:100 other columns of data
contig11 3 N GO:289 other columns of data
contig11 3 N GO:113 other columns of data
contig11 1 Y GO:100 other columns of data
contig11 1 Y GO:289 other columns of data
contig11 1 Y GO:113 other columns of data
contig22 1 Y GO:388 other columns of data
contig22 1 Y GO:101 other columns of data
contig22 2 N GO:388 other columns of data
contig22 2 N GO:101 other columns of data

I have little scripting experience, but have done this where e.g. "contig11" occurs only once in one of the files, with hashes/keys. But I can't even begin to get my head around to do this! Really appreciate some help or hints as to how to tackle this problem.

EDIT So I have tried ikegami's suggestion (see answers) with this: However, this has produced the output I needed except the GO:100 column onwards ($rest in script???) - any ideas what I'm doing wrong?

#!/usr/bin/env/perl

use warnings;

open (GOTERMS, "$ARGV[0]") or die "Error opening the input file with GO terms";
open (SNPS, "$ARGV[1]") or die "Error opening the input file with SNPs";

my %goterm;

while (<GOTERMS>)
{
    my($id, $rest) = /^(\S++)(,*)/s;
    push @{$goterm{$id}}, $rest;
}

while (my $row2 = <SNPS>)
{
    chomp($row2);
    my ($id) = $row2 =~ /^(\S+)/;
    for my $rest (@{ $goterm{$id} })
    {
        print("$row2$rest\n");
    }
}

close GOTERMS;
close SNPS;
Was it helpful?

Solution

Look at your output. It's clearly produced by

  • for each row of the second file,
    • for each row of the first file with the same id,
      • print out the combined rows

So the question is: How does you find the rows of the first file with the same id as a row of the second file?

The answer is: You store the rows of the first file in a hash indexed by the row's id.

my %file1;
while (<$file1_fh>) {
   my ($id, $rest) = /^(\S++)(.*)/s;
   push @{ $file1{$id} }, $rest;
}

So the earlier pseudo code resolves to

while (my $row2 = <$file2_fh>) {
   chomp($row2);
   my ($id) = $row2 =~ /^(\S+)/;
   for my $rest (@{ $file1{$id} }) {
      print("$row2$rest");
   }
}

#!/usr/bin/env perl

use strict;   
use warnings;

open(my $GOTERMS, $ARGV[0])
     or die("Error opening GO terms file \"$ARGV[0]\": $!\n");
open(my $SNPS, $ARGV[1])
     or die("Error opening SNP file \"$ARGV[1]\": $!\n");

my %goterm;
while (<$GOTERMS>) {
    my ($id, $rest) = /^(\S++)(.*)/s;
    push @{ $goterm{$id} }, $rest;
}

while (my $row2 = <$SNPS>) {
    chomp($row2);
    my ($id) = $row2 =~ /^(\S+)/;
    for my $rest (@{ $goterm{$id} }) {
        print("$row2$rest");
    }
}

OTHER TIPS

I will describe how you can do this. You need each file pu to array (each libe is an array item). Then you just need to compare these array in needed way. You need 2 loops. Main loops for each record of array/file which contains string which you you will use to campare (in your example it will be 2nd file). Under this loop you need to have another loop for each record in a array/file with records which you will compare with. And just check each record of array with the each recrod of another array and process results.

foreach my $record2 (@array2) {
    foreach my $record1 (@array1){
        if ($record2->{field} eq $record1->{field}){
            #here you need to create the string which you will show
            my $res_string = $record2->{field}.$record1->{field};
            print "$res_string\n";
        }
    }
}

Or dont use array. Just read files and compare each line with each line of another file. General idea is the same ))

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top