Perl - Prevent duplicates by checking if a pattern already exists in an opened file before writing

https://stackoverflow.com/questions/21842367

13-10-2022

Question

I have a Perl script that converts a specific file format into CSV files that I can process later.

I need this script to avoid generating duplicate lines:

  # get timestamp
  if ((rindex $l, "ZZZZ,") > -1) {
      (my $t1, my $t2, my $timestamptmp1, my $timestamptmp2) = split(",", $l);
      $timestamp = $timestamptmp2 . " " . $timestamptmp1;
  }

  if (((rindex $l, "TOP,") > -1) && (length($timestamp) > 0)) {
      (my @top) = split(",", $l);
      my $aecrire = $SerialNumber . "," . $hostnameT . "," . $timestamp . "," . $virtual_cpus . "," . $logical_cpus . "," . $smt_threads . "," . $top[1];
      my $i = 3;
      while ($i <= $#top) {
          $aecrire = $aecrire . ',' . $top[$i];
          $i = $i + 1;
      }
      print (FIC2 $aecrire."\n");
  }

My source file is FIC1 and my destination file is FIC2; the unique key is $timestamp.

I want the script to check whether $timestamp already exists in FIC1 (which is opened at the beginning of the process) and, if it does, to exclude the line from being written to FIC2. If $timestamp is not present, the line should be written as normal.

Currently, if I rerun the script over an already processed file, each line is sorted by timestamp and duplicated.

My goal is to be able to run this script periodically over a file without duplicating events.

I'm quite new to Perl. As far as I've seen, this should be achievable simply by using a %seen hash within the while loop, but I have not yet managed to get it working...

Thank you very much in advance for any help :-)

Solution

What you are describing is a hash.

You would define a hash in your code:

my %seen = ();

Then, when you read a line and before you decide to write it, you could do something like:

# Check the hash to see whether we have already seen this line before writing it out
# (a plain truth test avoids an "uninitialized value" warning for unseen keys)
if ($seen{$aecrire}) {
    # Do nothing - skip the line
} else {
    $seen{$aecrire} = 1;
    print (FIC2 $aecrire."\n");
}

I haven't tested this code, but that is the gist.
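
Because the goal is to rerun the script periodically, the hash also has to know about lines written by earlier runs. A minimal sketch of one way to do that, assuming the destination file is small enough to read into memory: seed the hash from the existing destination file, then append only unseen lines. The helper names load_seen and append_unique are hypothetical and not part of the original script, and the sketch keys on the whole output line, as in the answer above, rather than on $timestamp alone.

```perl
use strict;
use warnings;

# Hypothetical helper: record every line already present in $path.
# A file that cannot be opened (e.g. first run) yields an empty hash.
sub load_seen {
    my ($path) = @_;
    my %seen;
    if (open my $in, '<', $path) {
        while (my $line = <$in>) {
            chomp $line;
            $seen{$line} = 1;
        }
        close $in;
    }
    return \%seen;
}

# Hypothetical helper: append only the lines not yet recorded in %$seen,
# marking each one as seen when it is written.
# Returns the number of lines actually written.
sub append_unique {
    my ($path, $seen, @lines) = @_;
    open my $out, '>>', $path or die "Cannot append to '$path': $!";
    my $written = 0;
    for my $line (@lines) {
        next if $seen->{$line}++;    # already in the file or written earlier in this run
        print {$out} "$line\n";
        $written++;
    }
    close $out or die "Cannot close '$path': $!";
    return $written;
}
```

In the original script, this would mean calling load_seen once on the destination file before the main loop and passing each $aecrire to append_unique instead of printing it to FIC2 directly.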

OTHER TIPS

I ended up adding the following code at the end of my process:

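my (@final, %hash);

foreach my $file ($dstfile_CPU_ALL, $dstfile_MEM, $dstfile_VM, $dstfile_PROC, $dstfile_TOP) {

        # Open for read/write so the file can be rewritten in place
        my $fh;
        if (!open($fh, '+<', $file)) {
                print "Nothing to dedup, '$file' $!\n";
                next;
        }

        # Keep only the first occurrence of each line, preserving order
        while (<$fh>) {
                if (not exists $hash{$_}) {
                        push @final, $_;
                        $hash{$_} = 1;
                }
        }

        # Rewind, empty the file, and write back the unique lines
        seek $fh, 0, 0;
        truncate $fh, 0;
        print {$fh} @final;
        close $fh;

        # Reset state for the next file
        %hash = ();
        @final = ();
}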
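
The filtering step in the loop above can also be written with the common grep/%seen idiom. This is only a sketch of the in-memory part: the dedup_lines name is hypothetical, and reading and rewriting the file would stay as in the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the input lines with later duplicates dropped,
# keeping the first occurrence of each line in its original position.
sub dedup_lines {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

my @unique = dedup_lines("a\n", "b\n", "a\n", "c\n", "b\n");
print @unique;    # prints a, b and c, one per line
```

The post-increment makes the test false the first time a line is met and true afterwards, so no separate exists check is needed.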

Licensed under: CC-BY-SA with attribution
