How to print lines from a file that have repeated more than six times [closed]

https://stackoverflow.com/questions/12300096

30-06-2021
|

Question

I have a file containing the data shown below. The first comma-delimited field may be repeated any number of times, and I want to print only the lines after the sixth repetition of any value of this field

For example, there are eight fields with 1111111 as the first field, and I want to print only the seventh and eighth of these records

Input file:

1111111,aaaaaaaa,14
1111111,bbbbbbbb,14
1111111,cccccccc,14
1111111,dddddddd,14
1111111,eeeeeeee,14
1111111,ffffffff,14
1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,aaaaaaaa,14
2222222,bbbbbbbb,14
2222222,cccccccc,14
2222222,dddddddd,14
2222222,eeeeeeee,14
2222222,ffffffff,14
2222222,gggggggg,14
3333333,aaaaaaaa,14
3333333,bbbbbbbb,14
3333333,cccccccc,14
3333333,dddddddd,14
3333333,eeeeeeee,14
3333333,ffffffff,14
3333333,gggggggg,14
3333333,hhhhhhhh,14

Output:

1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,gggggggg,14
3333333,gggggggg,14
3333333,hhhhhhhh,14

What I have tried is to transponse the 2nd and 3rd fields with respect to 1st, so that I can use nawk on the field of $7 or $8

#!/usr/bin/ksh awk -F"," '{ a[$1]; b[$1]=b[$1]","$2 c[$1]=c[$1]","$3} END{ for(i in a){ print i","b[i]","c[i]} } ' file > output.txt

Solution

If your records are unordered

i.e. you may have "1111111" items distributed randomly throughout your input:

$ awk -F, '++a[$1] > 6' input.txt
1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,gggggggg,14
3333333,gggggggg,14
3333333,hhhhhhhh,14

How does this work?

As you know, awk's -F option sets the delimiter. If it's not a special character, there's no pressing need to quote it.

Awk scripts consist of a series of blocks of condition { action; }. If the condition is missing, action is applied to every line. If the action is missing, it is implied to be print;. So an awk script that consists of simply a condition will print every input line for which that condition evaluates to true.

In this case, our condition also has elements of an action. That it, it increments elements of an associative array where the keys are your first field. The increment happens regardless of whether the condition evaluates to true. Also, putting ++ ahead rather than following the variable causes the increment to happen before the evaluation rather than after it. (I'm talking about the difference between ++var and var++.) And if the resultant incremented array element is greater than 6, the condition evaluates to true, causing the line to print.

This is functionally equivalent to the perl solutions in other answers, but because of the nature awk scripts is even tighter and (arguably) simpler. And of course, it's likely to be faster. (In my informal test just now, the awk script above executed more than twice as fast as an equivalent perl script from another answer, processing 250000 lines of input in 0.23s of user time vs 0.61 seconds in perl.)

If your records are ordered

i.e. all your "1111111" lines are together:

$ awk -F, '$1!=f{c=0;f=$1} ++c>6' input.txt
1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,gggggggg,14
3333333,gggggggg,14
3333333,hhhhhhhh,14

How does this work?

If we're on a different $1 than last time (which is also true on the first line), we reset our counter and save $1 to a variable for future comparisons.
Then we increment the counter and print the line (implicitly) if the counter goes above 6.

This has the advantage of not eating memory with an array, but is only appropriate if your goal is to match sequential sets of lines with common $1 rather than handle matching lines that may be randomly distributed throughout your input.

OTHER TIPS

$ perl -F',' -ane 'print unless $seen{ $F[0] }++ < 6' file.txt

Explanation

-a enables autosplit mode, -F',' specifies ',' as the split token, resultant list stored in @F as a result
-n enables implicit line-by-line looping
-e executes following argument ('...' in this case) as Perl code
%seen keeps track of how many times the first field has been seen

Assuming the dots in your data should be commas, this Perl command will do what you ask

perl -aF, -ne 'print if ++$n{$F[0]} > 6' myfile

output

1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,gggggggg,14
3333333,gggggggg,14
3333333,hhhhhhhh,14

With awk (filter.awk):

BEGIN    { FS = "[,.]"          }   
o == $1  { cnt++                } 
o != $1  { o=$1; cnt = 0;       }
cnt >= 6 { print $0             }

To use:

awk -f filter.awk input_file

If you want to specifically have the awk solution then its here below:

awk -F, '{if(seen==$1){count++;}else{seen=$1;count=1}if(count>6)print }' file

tested below:

> awk -F, '{if(seen==$1){count++;}else{seen=$1;count=1}if(count>6)print }' temp
1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,gggggggg,14
3333333,gggggggg,14
3333333,hhhhhhhh,14

If you need a perl script fo rthis then see below:

#!/usr/bin/perl

use strict;
use warnings;

my $count=0;
my $prev="";
open (MYFILE, 'temp');
while (<MYFILE>) {
       my @a=split(/,/);
       if($prev==$a[0])
       {
        $count++;
        if($count>6)
         {
           print "$_";       
          }
        }
        else
        {
        $prev=$a[0];
        $count=1; 
        }

 }
close (MYFILE);

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow