Question

I'm trying to compare the content of two files and tell whether the content of one is entirely included in the other (meaning: if one file has three lines, A, B and C, can I find those three lines, in that order, in the second file?). I've looked at diff and grep but wasn't able to find a relevant option (if any).

Examples:

file1.txt   file2.txt  <= should return true (file2 is included in file1)
---------   ---------
abc         def
def         ghi
ghi
jkl    

file1.txt   file2.txt  <= should return false (file2 is not included in file1)
---------   ---------
abc         abc
def         ghi
ghi
jkl    

Any idea?

Solution

Using the answer from here

Use the following python function:

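def sublistExists(list1, list2):
    # Wrap each item in newlines so a match cannot straddle item boundaries
    # (a plain ''.join would let ['ab', 'c'] match inside ['a', 'bc']).
    return ''.join('\n%s\n' % x for x in list2) in ''.join('\n%s\n' % x for x in list1)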

In action:

In [35]: a=[i.strip() for i in open("f1")]
In [36]: b=[i.strip() for i in open("f2")]
In [37]: c=[i.strip() for i in open("f3")]

In [38]: a
Out[38]: ['abc', 'def', 'ghi', 'jkl']

In [39]: b
Out[39]: ['def', 'ghi']

In [40]: c
Out[40]: ['abc', 'ghi']

In [41]: sublistExists(a, b)
Out[41]: True

In [42]: sublistExists(a, c)
Out[42]: False
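
A variant that compares whole lines instead of one joined string may be easier to reason about. This is only a minimal sketch and not part of the original answer: the names read_lines and is_contiguous_sublist are made up for illustration, and it assumes that surrounding whitespace should be ignored, as with the strip() calls above.

```python
def read_lines(path):
    """Return the lines of a file with surrounding whitespace stripped."""
    with open(path) as handle:
        return [line.strip() for line in handle]


def is_contiguous_sublist(haystack, needle):
    """True if needle occurs in haystack as one unbroken run of items."""
    n = len(needle)
    # Slide a window of len(needle) over haystack and compare slices.
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))


a = ['abc', 'def', 'ghi', 'jkl']
print(is_contiguous_sublist(a, ['def', 'ghi']))  # True
print(is_contiguous_sublist(a, ['abc', 'ghi']))  # False
```

With real files this would be called as is_contiguous_sublist(read_lines("f1"), read_lines("f2")). Comparing list slices avoids any question of where one line ends and the next begins.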

OTHER TIPS

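Assuming your file2.txt does not contain characters with special meaning for regular expressions, you can use the command below. Be aware that grep treats each line of a multi-line pattern as a separate alternative, so it prints the lines of file1.txt that match any line of file2.txt; it does not check that all of them appear, or in which order: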

grep "$(<file2.txt)" file1.txt

This should work even if your file2.txt contains special characters:

cp file1.txt file_read.txt

while IFS= read -r a_line ; do
   # fgrep -n prints "N:text"; keep only the line number N of the first exact match
   first_line_found=$( fgrep -nx -e "${a_line}" file_read.txt 2>/dev/null | head -1 | cut -d: -f1 )
   if [ -z "$first_line_found" ];
   then
        exit 1 # we couldn't find a_line in file_read.txt
   else
        # delete everything up to and including the matched line
        { echo "1,${first_line_found}d" ; echo "w" ; } | ed -s file_read.txt
   fi
done < file2.txt
exit 0

(The "exit 0" is there for readability, so one can easily see that the script exits with 1 only if fgrep can't find a line in file1.txt. It's not needed.)

(fgrep is a literal grep: it searches for a fixed string, not a regexp.)

(I haven't tested the above; it's a general idea. I hope it works, though ^^)

The "-x" forces it to match whole lines exactly, i.e. with no additional characters: "to" can no longer match "toto"; only "toto" will match "toto" when -x is added.
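
The shell loop above looks for each line of file2.txt only in the part of file1.txt that follows the previous match, so it tests that the lines appear in the same order but allows other lines in between. A minimal Python sketch of that same idea, assuming the files have already been read into lists (the name is_ordered_subsequence is hypothetical):

```python
def is_ordered_subsequence(haystack, needle):
    """True if every item of needle appears in haystack in the same order,
    possibly with other items in between."""
    remaining = iter(haystack)
    # "item in remaining" consumes the iterator up to and including the first
    # match, which mirrors deleting lines 1..N from file_read.txt.
    return all(item in remaining for item in needle)


file1 = ['abc', 'def', 'ghi', 'jkl']
print(is_ordered_subsequence(file1, ['def', 'ghi']))  # True
print(is_ordered_subsequence(file1, ['abc', 'ghi']))  # True: order matches, gaps allowed
print(is_ordered_subsequence(file1, ['ghi', 'abc']))  # False: wrong order
```

Note the second result: the question's second example is expected to be false, so this ordered-but-not-adjacent test is looser than what was asked for.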

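Please check whether this awk "one-liner" ^_^ works for your real files; it worked for the example files in your question. It needs GNU awk (gawk), because the three-argument form of match() is a gawk extension, and it uses the content of file2 as a regular expression: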

awk 'FNR==NR{a=a $0;next}{b=b $0}
END{while(match(b,a,m)){
    if(m[0]==a) {print "included";exit}
    b=substr(b,RSTART+RLENGTH)
   }
    print "not included"
}' file2 file1
Licensed under: CC-BY-SA with attribution