You want to read one file into memory first, storing its lines in a set. Membership testing in a set is very efficient, much more so than looping over the lines of the second file for every line in the first. Then you only need to read the second file line by line and test whether each line appears in the set.
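To see why the set matters: membership tests on a set use hashing rather than a linear scan, so each lookup is cheap regardless of how many lines are stored. A tiny illustration (the sample words are made up):

```python
# hypothetical sample data; a set gives average O(1) membership tests
haystack = set(["alpha", "beta", "gamma"])
print("beta" in haystack)   # True
print("delta" in haystack)  # False
```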
Which file you keep in memory depends on the size of All.txt. If it is under 1000 lines or so, just keep it in memory and compare it against the other files. If All.txt is really large, re-open it for every file1 you process, read only the first 18 lines of file1 into memory, and match those against every line in All.txt, line by line.
To read just 18 lines of a file, use itertools.islice(); files are iterables, and islice() is the easiest way to pick a subset of lines to read.
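For instance, islice() works on any iterable, not just file objects; here a hypothetical list stands in for a file's lines:

```python
from itertools import islice

# a list standing in for a file object's lines
lines = ["one\n", "two\n", "three\n", "four\n"]
first_two = list(islice(lines, 2))
print(first_two)  # ['one\n', 'two\n']
```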
Reading All.txt into memory first:
from itertools import islice

with open("c:/All.txt", "r") as all_file:
    # store lines without surrounding whitespace to make matching a little more robust
    all_lines = set(line.strip() for line in all_file)

for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        for line in islice(file1, 18):
            if line.strip() in all_lines:
                pass  # matched line; handle it here
If All.txt is large, store those 18 lines of each file in a set first, then re-open All.txt and process it line by line:
for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        file1_lines = set(line.strip() for line in islice(file1, 18))
    with open("c:/All.txt", "r") as all_file:
        for line in all_file:
            if line.strip() in file1_lines:
                pass  # matched line; handle it here
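The matching logic above can be sanity-checked without touching the disk; a minimal sketch using io.StringIO objects as stand-ins for file1 and All.txt (the sample contents are invented):

```python
import io
from itertools import islice

# hypothetical in-memory stand-ins for file1 and All.txt
file1 = io.StringIO("apple\nbanana\ncherry\n")
all_file = io.StringIO("banana\ndate\napple\n")

# take the first 18 lines of file1 (here there are only 3) into a set
file1_lines = set(line.strip() for line in islice(file1, 18))

# scan All.txt line by line; set lookups make each test cheap
matches = [line.strip() for line in all_file if line.strip() in file1_lines]
print(matches)  # ['banana', 'apple']
```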
Note that you do not have to change directories in find_files(); os.walk() is already passed the directory name. The fnmatch module also has a .filter() function; use it to loop over the matching files instead of calling fnmatch.fnmatch() on each file individually:
import fnmatch
import os

def find_files(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in fnmatch.filter(files, pattern):
            yield os.path.join(root, basename)
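As a quick sanity check, the generator can be exercised against a throwaway directory tree; the tempfile setup and file names below are purely illustrative:

```python
import fnmatch
import os
import tempfile

def find_files(directory, pattern):
    # walk the tree rooted at `directory`, yielding paths matching `pattern`
    for root, dirs, files in os.walk(directory):
        for basename in fnmatch.filter(files, pattern):
            yield os.path.join(root, basename)

# hypothetical demo: create a temporary tree and find the .txt files in it
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "notes.md"):
        open(os.path.join(tmp, name), "w").close()
    found = sorted(os.path.basename(p) for p in find_files(tmp, "*.txt"))
    print(found)  # ['a.txt', 'b.txt']
```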