Question

I have data in the following sorted order (the data is sorted first by v1, then by v2, then by v3 and then by v4):

v1=1    v2=8513481      v3=119330184    v4=0
 v1=1    v2=8521383      v3=119330182    v4=0
 v1=1    v2=10630231     v3=60529116     v4=18
 v1=1    v2=60528877     v3=60529221     v4=17
 v1=1    v2=90351079     v3=90351078     v4=20
 v1=1    v2=271669588    v3=271669683    v4=101
 v1=2    v2=8513481      v3=10583646     v4=0
 v1=2    v2=10175437     v3=10175436     v4=0
 v1=2    v2=10630231     v3=60528947     v4=17
 v1=2    v2=10630231     v3=60529119     v4=18
 v1=2    v2=10630232     v3=605291191     v4=18

Now I want to find the rows where the v1 and v2 values of two lines are equal, i.e. among the data given above I want to find rows of the following form:

 v1=2    v2=10630231     v3=60528947     v4=17
 v1=2    v2=10630231     v3=60529119     v4=18

I know how to do this in Python by comparing successive rows and outputting the row whenever there is a match. Is there an easy way to do the same using Linux commands like sed? I know how to use sed to find words when two values are given, but I don't know how to use sed in this context. A bit of explanation would be highly appreciated.

Solution

It will be a little easier with awk:

awk '{
    lines[$1,$2]=(lines[$1,$2]?lines[$1,$2] RS $0:$0)
    dups[$1,$2]++
}
END {
    for(line in lines) 
        if(dups[line]>1) print lines[line]
}' file

Output:

v1=2    v2=10630231     v3=60528947     v4=17
v1=2    v2=10630231     v3=60529119     v4=18

  • We create two arrays, lines and dups, both keyed on the first and second columns.
  • dups counts how many times each combination of first and second column has been seen.
  • In lines, we check whether a line with the same first and second column is already stored. If so, we append the current line to it, separated by a newline (RS).
  • In the END block we iterate over the lines array. If the first and second columns were seen more than once according to dups, we print the stored lines.
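One thing to be aware of: awk does not guarantee the order in which for (line in lines) visits the array, so groups of duplicates may come out in a different order than they appear in the file. A minimal sketch of a two-pass variant that keeps the input order and stores only the counts is shown below. The function name show_dup_keys and the sample file are made up for illustration, and reading the file twice assumes it is a regular file rather than a pipe.

```shell
# Hypothetical helper: print every line whose (column 1, column 2) pair
# occurs more than once, in the original input order.
# The file is read twice, so it must be a regular file, not a pipe.
show_dup_keys() {
    awk 'NR == FNR { count[$1, $2]++; next }   # pass 1: count each (v1, v2) pair
         count[$1, $2] > 1                      # pass 2: print lines whose pair repeats
    ' "$1" "$1"
}

# Small sample in the same layout as the data in the question.
sample=$(mktemp)
printf '%s\n' \
    'v1=1 v2=10630231 v3=60529116 v4=18' \
    'v1=2 v2=10175437 v3=10175436 v4=0' \
    'v1=2 v2=10630231 v3=60528947 v4=17' \
    'v1=2 v2=10630231 v3=60529119 v4=18' \
    'v1=2 v2=10630232 v3=605291191 v4=18' > "$sample"

show_dup_keys "$sample"
rm -f "$sample"
```

On this sample it prints only the two v1=2, v2=10630231 lines. Unlike the approaches that compare neighbouring lines, this one does not rely on the data being sorted.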

Alternatively, if you don't want to keep entire file in memory, you can do the following (since you stated your data is already sorted):

awk '($1==c1 && $2==c2){print line RS $0}{line=$0;c1=$1;c2=$2}' file
  • After each line is processed, we store the whole line in line, column 1 in c1 and column 2 in c2.
  • If columns 1 and 2 of the current line are the same as c1 and c2 from the previous line, we print the previous line followed by the current line.
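Note that when three or more successive lines share the same first two columns, the one-liner above prints the middle lines twice (once as "current", once as "previous"). If that can happen in your data, a sketch along the following lines prints each line of a run exactly once while still keeping only one line in memory. The function name print_dup_runs and the in_run flag are illustrative, not part of the original answer, and sorted input is still assumed.

```shell
# Hypothetical helper: print each line that belongs to a run of successive
# lines sharing columns 1 and 2, without repeating any line.
# Reads the files given as arguments, or standard input if there are none.
print_dup_runs() {
    awk '
    {
        key = $1 SUBSEP $2
        if (NR > 1 && key == prev_key) {
            if (!in_run) print prev_line   # first repeat: emit the line that opened the run
            print                          # emit the current line once
            in_run = 1
        } else {
            in_run = 0                     # key changed: any previous run has ended
        }
        prev_key = key
        prev_line = $0
    }' "$@"
}

# Usage: three successive lines with the same key, then a different one.
printf '%s\n' \
    'v1=2 v2=10630231 v3=60528947 v4=17' \
    'v1=2 v2=10630231 v3=60529119 v4=18' \
    'v1=2 v2=10630231 v3=60529200 v4=19' \
    'v1=2 v2=10630232 v3=605291191 v4=18' | print_dup_runs
```

Here the first three lines are printed once each and the last line is skipped.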

Other tips

First, let me start by saying that the list you are showing is not strictly sorted in the Linux sense (spaces and tabs do affect the sorting). The best Linux solution for your question is to use awk. Here is a command that should do what you are looking for:

awk '{cur=$1 " " $2; if (NR>1 && cur==prev) {print "line:"NR " " cur} prev=cur}' < input_file

This builds a string, cur, from the first and second columns of the current line ($1 and $2, joined by a space for cleaner output) and compares it with prev, the same string from the previous line. If the two match, it prints the line number and the matching key. The NR>1 condition skips the first line of the file, where there is nothing to compare against yet.

This might work for you (GNU sed):

sed -rn '$!N;/^\s*(\S+)\s+(\S+)\s+.*\n\s*\1\s+\2/p;D' file

This uses back references to compare two lines and prints those lines that duplicate the first two values.

However, if the duplicates may span three or more successive lines, another approach can be used. Duplicates are printed and flagged using the hold buffer. When a duplicate followed by a non-duplicate line is encountered, the last duplicate line is also printed and the flag is reset:

sed -rn '$!N;/^\s*(\S+)\s+(\S+)\s+.*\n\s*\1\s+\2/{h;P;D};x;/./{z;x;P;D};x;D' file

One approach would be to find out how many characters are the same at the beginning of the line (looks like about 25?) and only compare that many via uniq:

uniq --check-chars=25 --repeated < input_file

To print both lines, use --all-repeated instead of --repeated.
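As a rough sketch of how this could be wrapped up, assuming GNU uniq and input whose columns are padded to fixed widths so that v1 and v2 always occupy the same leading characters: the function name dups_by_prefix and the column widths in the sample (8 + 16 = 24 characters) are assumptions made for this example, not something taken from the question's real file.

```shell
# Hypothetical helper: print all lines whose first N characters match
# those of a neighbouring line. Requires GNU uniq and sorted input.
# $1 = number of leading characters to compare; any further
# arguments (e.g. an input file) are passed through to uniq.
dups_by_prefix() {
    width=$1
    shift
    uniq --all-repeated --check-chars="$width" "$@"
}

# Sample with fixed-width columns: v1 padded to 8 characters and
# v2 padded to 16, so the first 24 characters hold exactly v1 and v2.
printf '%-8s%-16s%-16s%s\n' \
    v1=2 v2=10175437 v3=10175436 v4=0 \
    v1=2 v2=10630231 v3=60528947 v4=17 \
    v1=2 v2=10630231 v3=60529119 v4=18 \
    v1=2 v2=10630232 v3=605291191 v4=18 |
    dups_by_prefix 24
```

This prints the two v2=10630231 lines. Because the comparison is purely character-based, it breaks as soon as the columns are not aligned identically on every line (for example a mix of tabs and spaces), in which case one of the awk approaches is the safer choice.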

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow