Perl, how to merge data with duplicate identifier and overlapping values into a hash

https://stackoverflow.com/questions/10170926

31-05-2021

Question

I was wondering if you could help me with a coding problem that I can't get my head around. The tab-delimited data I have looks something like the following:

00001  AU:137  AU:150  AU:180
00001  AU:137  AU:170
00002  AU:180
00003  AU:147  AU:155
00003  AU:155

The output I want is:

00001  AU:137  AU:150  AU:180  AU:170
00002  AU:180
00003  AU:147  AU:155

So lines that share the same first column (the identifier) should have their values merged, with duplicates removed, so that the result can be stored in a hash. I'm not sure how to handle my current data, because a hash can't have duplicate keys. I'm also not sure how to push the data into an array when the identifier is the same.

I apologize for not having any code to show. I did try a few approaches, quite a lot actually, but they don't look right even to a newbie like me.

Any help or suggestions would be greatly appreciated. Thank you so much for your time.

Solution

Script:

#!/usr/bin/perl

use strict;
use warnings;

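my %hash;
# Return the distinct values of a list (order is not preserved).
sub uniq { return keys %{{map {$_=>1} @_}}; }

open my $fh, '<', 'input.txt' or die "Cannot open input.txt: $!";
while (<$fh>) {
  # Append everything after the identifier to that identifier's string.
  $hash{$1} .= $2 if /^(\S+)(\s.*?)[\n\r]*$/;
}
close $fh;

foreach (sort keys %hash) {
  # split ' ' skips leading whitespace, so no empty first element appears.
  my @elements = uniq split ' ', $hash{$_};
  print "$_\t", join(' ', sort @elements), "\n";
}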

Output:

00001    AU:137 AU:150 AU:170 AU:180
00002    AU:180
00003    AU:147 AU:155
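
Note that this output is sorted, whereas the output asked for in the question keeps the values in the order they first appear (AU:170 comes after AU:180). Below is a minimal sketch of an order-preserving variant, assuming a hash of arrays plus a separate "seen" hash. The sub name merge_lines and the inline sample lines are made up for illustration and are not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: merge lines by identifier, keeping each value
# in the order in which it was first seen.
sub merge_lines {
    my @lines = @_;
    my (%items, %seen, @order);

    for my $line (@lines) {
        my ($key, @values) = split ' ', $line;
        next unless defined $key;                 # skip blank lines
        push @order, $key unless exists $items{$key};
        $items{$key} ||= [];
        for my $value (@values) {
            # Push a value only the first time it appears for this key.
            push @{ $items{$key} }, $value unless $seen{$key}{$value}++;
        }
    }
    return map { join "\t", $_, @{ $items{$_} } } @order;
}

# Sample data from the question.
my @input = (
    "00001\tAU:137\tAU:150\tAU:180",
    "00001\tAU:137\tAU:170",
    "00002\tAU:180",
    "00003\tAU:147\tAU:155",
    "00003\tAU:155",
);
print "$_\n" for merge_lines(@input);
```

The array holds the values in input order, while the %seen hash only answers "has this value already been stored for this identifier?".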

OTHER TIPS

I hope this gives you some idea of how to solve your problem:

use strict;
use warnings;
use Data::Dumper;

my %hash = ();

while (<DATA>) {
    chomp;
    my (@row) = split(/\s+/);
    my $firstkey = shift @row;

    foreach my $secondkey (@row) {
        $hash{$firstkey}{$secondkey}++;
    }
}

print Dumper \%hash;

__DATA__
00001  AU:137  AU:150  AU:180
00001  AU:137  AU:170
00002  AU:180
00003  AU:147  AU:155
00003  AU:155

The classical solution to this uses a hash; in fact a hash of hashes, as there are duplicate identifiers across lines as well as duplicate values per identifier.

This program produces the output you need. It expects the data file to be passed on the command line.

use strict;
use warnings;

my %data;

while (<>) {
  chomp;
  my ($key, @items) = split /\t/;
  $data{$key}{$_}++ for @items;
}

print join("\t", $_, sort keys %{$data{$_}}), "\n" for sort keys %data;

output

00001 AU:137  AU:150  AU:170  AU:180
00002 AU:180
00003 AU:147  AU:155
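
To see why the hash of hashes removes duplicates, it may help to look at the intermediate structure: each distinct value is a key of the inner hash exactly once, and the number stored under it is how often it occurred. The sketch below is only an illustration. The sub name build_counts and the inline sample lines are invented here, and it assumes tab-separated input as in the program above.

```perl
use strict;
use warnings;

# Hypothetical helper: build { identifier => { value => count } }.
sub build_counts {
    my %data;
    for my $line (@_) {
        my ($key, @items) = split /\t/, $line;
        $data{$key}{$_}++ for @items;
    }
    return \%data;
}

my $data = build_counts(
    "00001\tAU:137\tAU:150\tAU:180",
    "00001\tAU:137\tAU:170",
);

# Each distinct value is a key exactly once; the count records repeats.
for my $item (sort keys %{ $data->{'00001'} }) {
    print "$item seen $data->{'00001'}{$item} time(s)\n";
}
```

Printing only the keys, as the program above does, discards the counts and leaves the de-duplicated list.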

Or, if you prefer a command-line solution:

perl -F'\t' -lane '$k=shift @F; $d{$k}{$_}++ for @F; END{print join "\t", $_, sort keys %{$d{$_}} for sort keys %d}' myfile

(It may need a little tweaking as I can only test on Windows at present.)

Licensed under: CC-BY-SA with attribution
