Question

I'm trying to match network logon usernames across two files. All.txt is a text file of the names I am (or will be) interested in matching. Currently, I'm doing something like this:

import os
import fnmatch

def find_files(directory, pattern):
    # directory = raw_input("Enter a directory to search for Userlists: ")
    directory = "c:\\TEST"
    os.chdir(directory)
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                filename = os.path.join(root, basename)
                yield filename


for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        with open("c:/All.txt", "r") as file2:
            list1 = file1.readlines()[18:]
            list2 = file2.readlines()
            for i in list1:
                for j in list2:
                    if i == j:
                        pass  # matched; handling code omitted
I'm new to Python and am wondering whether this is the best and most efficient way of doing this. Even as a newbie it seems a little clunky to me, but with my current coding knowledge it's the best I can come up with at the moment. Any help and advice would be gratefully received.


Solution

You want to read one file into memory first, storing its lines in a set. Membership testing in a set is very efficient (average O(1)), much more so than looping over every line of the second file for every line in the first file.

Then you only need to read the second file, and line by line process it and test if lines match.
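A minimal sketch of the idea (the names and lines here are hypothetical stand-ins for your file contents):

```python
# Hypothetical data standing in for All.txt and one userlist file.
# A set lookup is O(1) on average; scanning a list is O(n) per lookup.
all_names = {"alice", "bob", "carol"}            # All.txt, already read into a set
userlist_lines = ["bob\n", "dave\n", "alice\n"]  # lines from the other file

matches = [line.strip() for line in userlist_lines if line.strip() in all_names]
print(matches)  # ['bob', 'alice']
```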

Which file you keep in memory depends on the size of All.txt. If it is < 1000 lines or so, just keep that in memory and compare it to the other files. If All.txt is really large, re-open it for every file1 you process, read only the first 18 lines of file1 into memory, and match those against every line in All.txt, line by line.

To read just 18 lines of a file, use itertools.islice(); files are iterables and islice() is the easiest way to pick a subset of lines to read.
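For example (using StringIO as a stand-in for an open file; both are iterables of lines):

```python
from itertools import islice
from io import StringIO

# StringIO stands in for a real file object here.
fake_file = StringIO("line1\nline2\nline3\nline4\n")
first_two = list(islice(fake_file, 2))  # consumes only the first 2 lines
print(first_two)  # ['line1\n', 'line2\n']
```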

Reading All.txt into memory first:

from itertools import islice

with open("c:/All.txt", "r") as all_file:   # avoid shadowing the built-in all()
    # storing stripped lines makes matching a little more robust
    all_lines = set(line.strip() for line in all_file)

for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        for line in islice(file1, 18):
            if line.strip() in all_lines:
                pass  # matched line

If All.txt is large, store those 18 lines of each file in a set first, then re-open All.txt and process it line by line:

for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        file1_lines = set(line.strip() for line in islice(file1, 18))
    with open("c:/All.txt", "r") as all_file:
        for line in all_file:
            if line.strip() in file1_lines:
                pass  # matched line

Note that you do not have to change directories in find_files(); os.walk() is already passed the directory name. The fnmatch module also has a filter() function; use it to filter the whole file list in one call instead of testing each file individually with fnmatch.fnmatch():

def find_files(directory, pattern):
    directory = "c:\\TEST"
    for root, dirs, files in os.walk(directory):
        for basename in fnmatch.filter(files, pattern):
            yield os.path.join(root, basename)
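A quick illustration of fnmatch.filter() on a hypothetical file list, as os.walk() might yield it:

```python
import fnmatch

# Hypothetical file names from one directory.
files = ["All.txt", "notes.doc", "users1.txt"]
txt_files = fnmatch.filter(files, "*.txt")  # one call instead of a loop
print(txt_files)  # ['All.txt', 'users1.txt']
```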
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow