In Windows, how to compare two files and return just the records missing in the second file that were originally present in the first file?
05-07-2019
Question
At regular intervals we are receiving CSV files from an external source that we have little control over. These files are complete sets of current records; however, any records that have been deleted since the previous file are simply not present. We would like to compare the two files and create a separate file of deleted records so we can do some additional processing on them. In an application in another area we have a commercial sort package (CoSort) that does this out of the box; however, we don't have access to that here. The volumes aren't that large, though, and it seems like this is something that standard or free tools could handle quite easily. Ideally this would take the form of a Windows batch file, but Perl or awk solutions would be okay too. Example input files:
Previous File:
X_KEY,X_NAME,X_ATTRIBUTE
123,Name 123,ATT X
111,Name 111,ATT X
777,Name 777,ATT Y
Incoming File:
X_KEY,X_NAME,X_ATTRIBUTE
777,Name 777,ATT Y
123,Name 123,ATT CHANGED
Resulting File should be at a minimum:
111,Name 111
But if the attributes from the deleted records come through too, that is fine.
So far I have a batch file that uses freeware CMSort to sort the two files minus the header record to make it easier for some type of diff process:
REM Sort Previous File, Skip Header
C:\Software\CMSort\cmsort.exe /H=1 x_previous.txt x_previous_sorted.txt
REM Sort Incoming File, Skip Header
C:\Software\CMSort\cmsort.exe /H=1 x_incoming.txt x_incoming_sorted.txt
But the 'compare and show only the missing records from the first file' bit is eluding me. Part of the complexity is numerous attributes can change among the records that are left, so it isn't a pure diff. However, it feels like a specialized diff command--one that is limited to checking just the key field, not the entire record. I can't seem to get the syntax correct, though. Ideas? Record counts shouldn't exceed 50k records.
Note: If this were SQL and the data were sitting in tables, we could use the EXCEPT operator but moving the data to the database in this case is not an option.
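Since awk was mentioned as acceptable, here is a minimal sketch of that EXCEPT-style, key-only comparison (assuming a POSIX awk is available, e.g. via Cygwin or GnuWin32; the sample data is rebuilt inline for illustration, and the file names follow the batch snippet above):

```shell
# Recreate the question's sample files (names follow the batch snippet above).
printf 'X_KEY,X_NAME,X_ATTRIBUTE\n123,Name 123,ATT X\n111,Name 111,ATT X\n777,Name 777,ATT Y\n' > x_previous.txt
printf 'X_KEY,X_NAME,X_ATTRIBUTE\n777,Name 777,ATT Y\n123,Name 123,ATT CHANGED\n' > x_incoming.txt

# Pass 1 (NR==FNR): remember every key seen in the incoming file.
# Pass 2: print previous-file records whose key was never seen.
# FNR > 1 skips each file's header record.
awk -F, 'NR==FNR { if (FNR > 1) seen[$1]; next }
         FNR > 1 && !($1 in seen)' x_incoming.txt x_previous.txt > x_deleted.txt

cat x_deleted.txt   # -> 111,Name 111,ATT X
```

Because the comparison looks only at the key field, changed attributes in surviving records do not disturb it, and 50k records is trivial for an associative-array lookup like this.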
Solution
If I were to do this in Perl, I'd just use a couple of hashes:
#!/usr/bin/perl
use strict;
use warnings;

my (%orig, %new, %changed);

open(my $f1, '<', $ARGV[0]) or die "Couldn't open $ARGV[0]: $!\n";
<$f1>;    # skip header record
while (<$f1>) {
    chomp;
    my @fields = split /,/;
    $orig{$fields[0]} = $_;
}
close($f1);

open(my $f2, '<', $ARGV[1]) or die "Couldn't open $ARGV[1]: $!\n";
<$f2>;    # skip header record
while (<$f2>) {
    chomp;
    my @fields = split /,/;
    if (exists $orig{$fields[0]}) {
        if ($orig{$fields[0]} ne $_) {
            $changed{$fields[0]} = $orig{$fields[0]} . "||" . $_;
        }
        delete $orig{$fields[0]};    # anything left in %orig was deleted
    } else {
        $new{$fields[0]} = $_;
    }
}
close($f2);

print "Deleted:\n";
print map { $orig{$_} . "\n" } sort { $b <=> $a } keys %orig;
print "Added:\n";
print map { $new{$_} . "\n" } sort { $b <=> $a } keys %new;
print "Changed:\n";
print map { $changed{$_} . "\n" } sort { $b <=> $a } keys %changed;
assuming your text examples are in files f1.txt and f2.txt,
kettle$ ./compare.pl f1.txt f2.txt
Deleted:
111,Name 111,ATT X
Added:
Changed:
123,Name 123,ATT X||123,Name 123,ATT CHANGED
OTHER TIPS
diff a.input b.input | grep "^<"
(Records present only in the first file come out of diff prefixed with <. Note that records whose attributes merely changed also appear with a < prefix, so this over-reports when changes are common.)
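Along the same Unix-tools line, sorted inputs let `join` do the key-only comparison directly, avoiding the changed-attribute noise that diff picks up. A sketch assuming GNU coreutils (e.g. under Cygwin); the file names are illustrative and headers are assumed already stripped:

```shell
# Sample data with headers already stripped (file names are illustrative).
printf '123,Name 123,ATT X\n111,Name 111,ATT X\n777,Name 777,ATT Y\n' > prev_nohdr.csv
printf '777,Name 777,ATT Y\n123,Name 123,ATT CHANGED\n' > inc_nohdr.csv

# join requires both inputs sorted on the join field (the key column).
sort -t, -k1,1 prev_nohdr.csv > prev_sorted.csv
sort -t, -k1,1 inc_nohdr.csv  > inc_sorted.csv

# -v 1: emit records from file 1 whose key is unmatched in file 2.
join -t, -j 1 -v 1 prev_sorted.csv inc_sorted.csv   # -> 111,Name 111,ATT X
```

This fits neatly after the CMSort step in the question, since the files are being sorted anyway.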
#!/usr/bin/perl
use strict;
use warnings;
@ARGV == 2 or die "mycompare oldfile newfile\n";
my ($oldfile, $newfile) = @ARGV;
my %newrecords;
open my $new, '<', $newfile
or die "Cannot open '$newfile':$!";
scalar <$new>; # skip header
while ( my $line = <$new> ) {
next unless $line =~ /\S/;
my ($record) = split /,/, $line;
$newrecords{ $record } = 1;
}
close $new;
open my $old, '<', $oldfile
or die "Cannot open '$oldfile': $!";
scalar <$old>; # skip header
while ( my $line = <$old> ) {
next unless $line =~ /\S/;
my ($record) = split /,/, $line;
print $line unless exists $newrecords{ $record };
}
close $old;
__END__
C:\Temp> mycompare old.csv new.csv
111,Name 111,ATT X
It would not be hard to write a small console application that reads the first file, parses out the keys, checks the second file for rows matching those keys, and then creates the third file. I guess I'm saying this seems like a good case to roll your own. :) By the way, done as a naive nested scan this is an O(mn) operation, where m and n are the sizes of the two files, so it's probably not going to be uber speedy; loading one file's keys into a hash set first brings it down to O(m+n).
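The set-lookup shape described above (collect the new file's keys, then stream the old file once) can even be approximated with stock text tools, without writing an application. A rough sketch assuming GNU grep/sed (e.g. under Cygwin) and keys free of regex metacharacters; all file names are illustrative:

```shell
# Sample previous/incoming data (file names are illustrative).
printf 'X_KEY,X_NAME,X_ATTRIBUTE\n123,Name 123,ATT X\n111,Name 111,ATT X\n777,Name 777,ATT Y\n' > prev.csv
printf 'X_KEY,X_NAME,X_ATTRIBUTE\n777,Name 777,ATT Y\n123,Name 123,ATT CHANGED\n' > incoming.csv

# Turn each incoming key into an anchored pattern "^key," ...
# (keys containing regex metacharacters would need escaping first).
tail -n +2 incoming.csv | cut -d, -f1 | sed 's/^/^/; s/$/,/' > keys.txt

# ... then keep only previous-file records matching none of them.
tail -n +2 prev.csv | grep -v -f keys.txt   # -> 111,Name 111,ATT X
```

At ~50k records any of these approaches finishes effectively instantly.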
I would use diff, and if it's not available directly, use it through Cygwin.
I worked on a faintly similar system once upon a long time ago (distressingly close to 20 years ago, in fact). The data was stored in a database, and the incoming data had to be compared with the previous week's data. There were 74 columns in the data, and some of them took the form of 'prev code, prev date' pairs (repeated for several cycles). So, one legitimate set of changes was:
Old:
name1: value1 date1: 2000-01-01
name2: value2 date2: 1995-05-31
name3: value3 date3: 1990-10-13
New:
name1: New Name date1: 2009-03-31
name2: Other Name date2: 2005-04-12
name3: value1 date3: 2000-01-01
In this scenario, what was 'name1, date1' has become 'name3, date3'; there appears to have been an intermediate name change at a date some time in the past, and a recent name change. The intelligence about name changes could be delayed. Working out what was common with what went before and what was brand new was tough. Indeed, the whole process was tricky.
I actually developed a bilingual self-regenerating program to manage this stuff. The top part of the source code took advantage of the comment notations in the primary language (Informix 4GL), which used both shell-style #...eol comments and {...} comments (the latter able to spread over multiple lines). Of course, this is a shell construct for I/O redirection, too. So, I had the shell script that would generate the I4GL code embedded in a { ... } comment, and the code was generated from a table defining the 70+ columns and how each one needed to be handled. This saved a lot of typing over time.
There's also Algorithm::Diff available from CPAN.