Question

I have two data files 1.txt and 2.txt

1.txt contains valid lines.

For example:

1 2 1 2 
1 3 1 3

2.txt has an extra column; if you ignore it, the file contains some valid lines and some invalid lines. The same line can occur multiple times in 2.txt.

For example:

1 2 1 2 1.9
1 3 1 3 3.4
1 3 1 3 3.4
2 3 2 3 5.6
2 3 2 3 5.6

The second and third lines are the same and valid.

The fourth and fifth lines are also the same but invalid.

I want to write a shell script that compares these two files and writes two output files, valid.txt and invalid.txt, which should look like this:

valid.txt :

1 2 1 2 1
1 3 1 3 2

and invalid.txt :

2 3 2 3 2

The extra last column in valid.txt and invalid.txt is the number of times the line occurs in 2.txt.


Solution

This awk script works for the example data (the output file is named invalid.txt here, matching the question, rather than the inv.txt used in the original run):

    awk 'NR==FNR{sub(/ *$/,"");a[$0]++;next}  # 1st file: strip trailing blanks, remember valid lines
         {sub(/ [^ ]*$/,"")                   # 2nd file: drop the extra last column
          if($0 in a)
                  v[$0]++                      # line exists in 1.txt: count as valid
          else
                  n[$0]++                      # otherwise: count as invalid
         }
         END{
             for(x in v) print x, v[x] > "valid.txt"
             for(x in n) print x, n[x] > "invalid.txt"
         }' 1.txt 2.txt

Output:

    kent$  head invalid.txt valid.txt
    ==> invalid.txt <==
    2 3 2 3 2

    ==> valid.txt <==
    1 3 1 3 2
    1 2 1 2 1

Note that awk's for (x in ...) iterates in an unspecified order, so the lines in each output file may come out in any order; pipe the results through sort if you need them ordered.
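If awk feels opaque, the same classification can be sketched with sort, uniq -c, and grep -Fx. This is a portable-shell sketch, not the original answer's method; it assumes the lines in 1.txt carry no trailing whitespace (the awk version strips it explicitly). The sample input files from the question are created inline so the snippet is self-contained:

```shell
# Sample data from the question
printf '1 2 1 2\n1 3 1 3\n' > 1.txt
printf '1 2 1 2 1.9\n1 3 1 3 3.4\n1 3 1 3 3.4\n2 3 2 3 5.6\n2 3 2 3 5.6\n' > 2.txt

# Start with empty output files
: > valid.txt
: > invalid.txt

# Drop the last column, count duplicates, then split each distinct
# line by whether it appears verbatim in 1.txt.
sed 's/ *[^ ]*$//' 2.txt | sort | uniq -c | while read -r count line; do
    if grep -qFx "$line" 1.txt; then      # -F literal, -x whole-line match
        printf '%s %s\n' "$line" "$count" >> valid.txt
    else
        printf '%s %s\n' "$line" "$count" >> invalid.txt
    fi
done
```

Unlike the awk version, this produces sorted output for free, since uniq -c preserves the order established by sort.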
Licensed under: CC-BY-SA with attribution