Question

I'm trying to read data from a fairly big file. I need to read through the lines of the file and report any duplicate records that begin with a G.

THIS IS THE DATA:
E123456789
G123456789
h12345
E1234567
E7899874
G123456798
G123465798
h1245

This is example data; the real file has about 6000 lines muddled in amongst it, but the important records are the ones beginning with E, G or h.

Here is my code so far:

#!/usr/bin/perl

use strict;
use warnings;

my $infile  = $ARGV[0];
my $found_E = 0;
my $sets    = 0;

open my $ifh, '<', $infile;
while (<$ifh>) {

  if (/^E/) {
    $found_E = 1;
    next;
  }

  if ($found_E) {

    if (/^G/) {
      $sets += 1;
      $found_E = 0;
      next;
    }

    if (/^h/) {
      print "Error! No G Record at line  $.\n";
      exit;
    }
  }
}
close($ifh);

printf "Found %d sets of Enrichment data with G Records \n", $sets;

my @lines;
my %duplicates;
open $ifh, '<', $infile;
while (<$ifh>) {
  @lines = split('', $_);
  if ($lines[0] eq 'G') {
    print if !defined $duplicates{$_};
    $duplicates{$_}++;
  }
}
close($ifh);

As you can see, I'm checking that G records occur only after E records and before h records. The second loop is intended to find duplicates, but right now it just prints each G record the first time it appears instead of reporting the duplicates.

Also, if someone could advise on how to report the case where there are no E records in the file at all, that would be appreciated.

Was it helpful?

Solution

Grouped Duplicate Checking

If you just want to check for duplicates which are grouped together, that's easy. You can just check if the current line is the same as the last line:

my $line;

while(<$ifh>) {
    next if (defined $line && $line eq $_);
    $line = $_;
    ...
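
Tying that back to the question's goal of reporting the duplicates rather than skipping them, a minimal self-contained sketch might look like the following. The $infile / $ARGV[0] handling is borrowed from the question and the message wording is just illustrative:

#!/usr/bin/perl

use strict;
use warnings;

# Sketch only: report G records that are repeated on consecutive lines.
my $infile = $ARGV[0];
open my $ifh, '<', $infile or die "Cannot open $infile: $!";

my $last;
while ( my $line = <$ifh> ) {
    if ( $line =~ /^G/ && defined $last && $line eq $last ) {
        print "Duplicate G record at line $.: $line";
    }
    $last = $line;
}
close $ifh;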

All Duplicate Checking

If you want to check for all duplicate lines in the file, regardless of their positioning, you'll have to do something like this:

my %seen;

while (<$ifh>) {
   next if exists $seen{$_};
   $seen{$_} = 1;
   ...

This will use more memory on a large file, since every distinct line has to be kept in the hash, but it's the best option if you don't want to modify the source file.
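
If the aim is to report the duplicates (as the question asks) rather than skip them, the same %seen idea works with the test turned around. Again only a sketch, with the file handling borrowed from the question and an illustrative message format:

#!/usr/bin/perl

use strict;
use warnings;

# Sketch only: report every G record that has already been seen earlier
# in the file, no matter how far apart the copies are.
my $infile = $ARGV[0];
open my $ifh, '<', $infile or die "Cannot open $infile: $!";

my %seen;
while ( my $line = <$ifh> ) {
    next unless $line =~ /^G/;
    print "Duplicate G record at line $.: $line" if $seen{$line}++;
}
close $ifh;

The post-increment makes the test false the first time a record is seen and true for every repeat, so only the second and later copies get reported.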

Other tips

# This assumes $ifh and $found_E from the question's code are still in scope.
my %seen_G;
while (<$ifh>)
{
    my $c = substr( $_, 0, 1 );      # the record type is the first character
    if ( $found_E ) {
        die "Error! No G Record at line $." if $c eq 'h';
        print if ( $c eq 'G' and not $seen_G{ $_ }++ );   # print a G record only the first time it is seen
    }
    $found_E = ( $c eq 'E' );        # remember whether this line was an E record
}
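
The question also asked how to report when the file contains no E records at all. One possible approach, sketched under the assumption that $ifh is the handle already opened above, is to count E records during the loop and warn afterwards:

# Sketch only: $ifh is assumed to be the filehandle opened earlier.
my $e_count = 0;
while (<$ifh>) {
    $e_count++ if /^E/;
    # ... the existing G/h checks would go here ...
}
close $ifh;

warn "No E records found in the file\n" if $e_count == 0;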

It's not clear whether you want to skip lines that are duplicates of the previous line or lines that are duplicates of any earlier line.

Skip lines that are duplicates of the previous line

Just fetch another line (with next) if the current line is the same as the previous one.

my $last;
while (<>) {
   next if /^G/ && defined($last) && $_ eq $last;
   $last = $_;
   ...
}

I'll leave it to you to determine when you actually want to look for duplicates, but I think you want to add a $found_G check to that if.

Skip lines that are duplicates of any previous line

Maintain a collection of the lines you've already seen. Using a hash will allow for quick insertion and lookup.

my %seen;
while (<>) {
   next if /^G/ && $seen{$_}++;
   ...
}
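
If you would rather list each duplicated G record once, together with how many times it occurred, a small variation (again only a sketch) is to tally the counts first and report after the whole file has been read:

my %count;
while (<>) {
    $count{$_}++ if /^G/;            # tally every G record
}

for my $record ( sort keys %count ) {
    next unless $count{$record} > 1; # only the duplicated ones
    my $text = $record;
    chomp $text;
    printf "%s appeared %d times\n", $text, $count{$record};
}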
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow