Question

I basically want to do an out-of-order diff between two text files (in CSV style), comparing the fields in the first two columns (I don't care about the 3rd column's value). I then print out the lines that file1.txt has but file2.txt doesn't, and vice versa for file2.txt compared to file1.txt.

file1.txt:

cat,val 1,43432
cat,val 2,4342
dog,value,23
cat2,value,2222
hedgehog,input,233

file2.txt:

cat2,value,312
cat,val 2,11
cat,val 3,22
dog,value,23
hedgehog,input,2145
bird,output,9999

Output would be something like this:

file1.txt:
cat,val 1,43432

file2.txt:
cat,val 3,22
bird,output,9999

I'm new to Perl, so some of the better, less ugly ways of doing this are currently outside my knowledge. Thanks for any help.

current code:

#!/usr/bin/perl -w

use Cwd;
use strict;
use Data::Dumper;
use Getopt::Long;

my $myName = 'MyDiff.pl';
my $usage = "$myName is blah blah blah";

#retrieve the command line options, set up the environment
use vars qw($file1 $file2);

#grab the specified values or exit program
GetOptions("file1=s" => \$file1,
           "file2=s" => \$file2)
    or die $usage;
($file1 and $file2) or die $usage;

open (FH, "< $file1") or die "Can't open $file1 for read: $!";
my @array1 = <FH>;
close FH or die "Cannot close $file1: $!";
open (FH, "< $file2") or die "Can't open $file2 for read: $!";
my @array2 = <FH>;
close FH or die "Cannot close $file2: $!";

 #...do a sort and match

Solution 2

Perhaps the following will be helpful:

use strict;
use warnings;

my @files = @ARGV;
pop;    # with no argument, pop removes the last element of @ARGV,
        # so only the first file is left for the <> below
my %file1 = map { chomp; /(.+),/; $1 => $_ } <>;    # key: everything before the last comma

push @ARGV, $files[1];    # put the second file back so the next <> reads it
my %file2 = map { chomp; /(.+),/; $1 => $_ } <>;

print "$files[0]:\n";
print $file1{$_}, "\n" for grep !exists $file2{$_}, keys %file1;

print "\n$files[1]:\n";
print $file2{$_}, "\n" for grep !exists $file1{$_}, keys %file2;

Usage: perl script.pl file1.txt file2.txt

Output on your datasets:

file1.txt:
cat,val 1,43432

file2.txt:
cat,val 3,22
bird,output,9999

This builds a hash for each file. The keys are the first two columns (everything before the last comma) and the associated values are the full lines. grep is then used to keep only the keys that don't exist in the other file's hash.
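
For the sample file1.txt above, the first hash would end up roughly like this (a sketch; each key is whatever precedes the last comma on its line):

my %file1 = (
    'cat,val 1'      => 'cat,val 1,43432',
    'cat,val 2'      => 'cat,val 2,4342',
    'dog,value'      => 'dog,value,23',
    'cat2,value'     => 'cat2,value,2222',
    'hedgehog,input' => 'hedgehog,input,233',
);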

Edit: On relatively small files, using map as above to process the file's lines works fine. However, a list of all the file's lines is created first and then passed to map. On larger files it may be better to use a while (<>) { ... } construct to read one line at a time. The code below does this (generating the same output as above) and uses a hash of hashes (HoH). Because it uses a HoH, you'll notice some dereferencing:

use strict;
use warnings;

my %hash;
my @files = @ARGV;

while (<>) {
    chomp;
    # $ARGV holds the name of the file <> is currently reading
    $hash{$ARGV}{$1} = $_ if /(.+),/;
}

print "$files[0]:\n";
print $hash{ $files[0] }{$_}, "\n"
  for grep !exists $hash{ $files[1] }{$_}, keys %{ $hash{ $files[0] } };

print "\n$files[1]:\n";
print $hash{ $files[1] }{$_}, "\n"
  for grep !exists $hash{ $files[0] }{$_}, keys %{ $hash{ $files[1] } };

OTHER TIPS

Use a hash for this, with the first two columns as the key. Once you have the two hashes, you can iterate over them and delete the common entries; whatever remains in the respective hashes is what you are looking for.

Initialize,

my %hash1 = ();
my %hash2 = ();

Read in the first file, join the first two columns to form the key, and save the full line in the hash. This assumes the fields are comma separated; you could also use a CSV module for this (a Text::CSV sketch follows the loop below).

open( my $fh1, "<", $file1 ) || die "Can't open $file1: $!";
while(my $line = <$fh1>) {
    chomp $line;

    # join first two columns for key
    my $key = join ",", (split ",", $line)[0,1];

    # create hash entry for file1
    $hash1{$key} = $line;
}
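
If the fields could themselves contain commas (quoted CSV), the plain split above would mis-split them. Here is a minimal sketch of the same loop using the Text::CSV module from CPAN, assuming it is installed and reusing the $file1 and %hash1 declarations above:

use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

open( my $fh1, "<", $file1 ) || die "Can't open $file1: $!";
while (my $line = <$fh1>) {
    chomp $line;

    # let Text::CSV split the line, then build the key from the first two fields
    $csv->parse($line) or die "Bad CSV line: $line";
    my $key = join ",", ( $csv->fields() )[0, 1];

    # keep the original line as the value, as before
    $hash1{$key} = $line;
}
close $fh1;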

Do the same for file2 and create %hash2

open( my $fh2, "<", $file2 ) || die "Can't open $file2: $!";
while(my $line = <$fh2>) {
    chomp $line;

    # join first two columns for key
    my $key = join ",", (split ",", $line)[0,1];

    # create hash entry for file2
    $hash2{$key} = $line;
}

Now go over the entries and delete the common ones,

foreach my $key (keys %hash1) {
    if (exists $hash2{$key}) {
        # common entry, delete from both hashes
        delete $hash1{$key};
        delete $hash2{$key};
    }
}

%hash1 will now contain only the lines unique to file1, and %hash2 only the lines unique to file2.

You could print them as,

foreach my $key (keys %hash1) {
    print "$hash1{$key}\n";
}

foreach my $key (keys %hash2) {
    print "$hash2{$key}\n";
}

I think the above problem can be solved with either of the following algorithms:

a) Use a hash, as mentioned above.

b) Sort both files on key1 and key2 (using sort), then walk through them in a merge-like fashion (a Perl sketch follows the pseudocode):

Iterate through FILE1

  Compare the key1/key2 entry of the current FILE1 line with the current FILE2 line
      If they match
        take action by printing the common line to the desired file as required
        move to the next row in FILE1 (continue with the loop)
      If they do not match
        iterate through FILE2, starting from the saved position POS-FILE2, until a match is found
            compare the key1/key2 entry of FILE1 with the current FILE2 line
            if they match
              take action by printing the common line to the desired file as required
              exit the inner loop, noting the new position of FILE2
            if they do not match
              take action by printing the unmatched FILE2 line to the desired file as required
              move to the next row in FILE2
            if FILE2 runs out of rows, set FILE2-END to true
  If FILE2-END is true
     the rest of the lines in FILE1 do not exist in FILE2
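
A minimal Perl sketch of approach (b), under a few assumptions: both files fit in memory, the third column never contains a comma, and the key (the first two columns) is unique within each file. Note that the output comes out in sorted key order rather than in the files' original order:

use strict;
use warnings;

my ($file1, $file2) = @ARGV;

# read a file into a reference to a list of [key, line] pairs,
# sorted by key (key = first two comma-separated columns)
sub read_sorted {
    my ($file) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!";
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        my $key = join ",", (split ",", $line)[0, 1];
        push @rows, [ $key, $line ];
    }
    close $fh;
    return [ sort { $a->[0] cmp $b->[0] } @rows ];
}

my $rows1 = read_sorted($file1);
my $rows2 = read_sorted($file2);

# merge-style scan over the two sorted lists
my (@only1, @only2);
my ($i, $j) = (0, 0);
while ($i < @$rows1 and $j < @$rows2) {
    my $cmp = $rows1->[$i][0] cmp $rows2->[$j][0];
    if    ($cmp == 0) { $i++; $j++; }                       # common key: skip both
    elsif ($cmp < 0)  { push @only1, $rows1->[$i++][1]; }   # key only in file1
    else              { push @only2, $rows2->[$j++][1]; }   # key only in file2
}
push @only1, map { $_->[1] } @{$rows1}[ $i .. $#{$rows1} ]; # leftover file1 rows
push @only2, map { $_->[1] } @{$rows2}[ $j .. $#{$rows2} ]; # leftover file2 rows

print "$file1:\n",   map { "$_\n" } @only1;
print "\n$file2:\n", map { "$_\n" } @only2;

Usage is the same as before: perl script.pl file1.txt file2.txt
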
Licensed under: CC-BY-SA with attribution