As I see it, you have three options: split the work into several smaller regexes, use something like a Python set, or shell out (to sed or awk). Let's assume you have a document full of words and a list of stopwords, and you want a new document containing words - stopwords.
Regex:
import re

stopwords_regex_list = []
chunk_size = 100  # tweak depending on the size of your stopword list
for i in range(0, len(stopwords), chunk_size):
    stopwords_slice = stopwords[i:i + chunk_size]
    # use a raw string: '\b' in a plain string is a backspace, not a word boundary,
    # and re.escape guards against stopwords containing regex metacharacters
    pattern = r'\b(' + '|'.join(re.escape(w) for w in stopwords_slice) + r')\b'
    stopwords_regex_list.append(re.compile(pattern))

with open('document') as doc:
    words = doc.read()  # can read in chunks if the files are massive

with open('regex_document', 'w') as regex_doc:
    for regex in stopwords_regex_list:
        words = regex.sub('', words)
    regex_doc.write(words)
Sets:
stopwords_set = set(stopwords)

with open('document') as doc:
    words = doc.read()

with open('set_document', 'w') as set_doc:
    for word in words.split(' '):
        if word not in stopwords_set:
            set_doc.write(word + ' ')
Sed:
import subprocess

with open('sed_script', 'w') as sed_script:
    sed_script.writelines(r's/\<{}\>//g'.format(word) + '\n' for word in stopwords)

with open('document') as doc:
    with open('sed_document', 'w') as sed_doc:
        subprocess.call(['sed', '-f', 'sed_script'], stdout=sed_doc, stdin=doc)
I'm not a sed expert, so there may be a better way to write that script. You may want to code up each method and benchmark it to see which works best for your data.
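To compare the in-process approaches, a minimal `timeit` harness like the following can help. The stopword list and document here are made-up placeholders; substitute your real data.

```python
import re
import timeit

# hypothetical data, just for the benchmark
stopwords = ['the', 'a', 'an', 'of', 'to', 'in'] * 200
text = ' '.join(['the', 'quick', 'brown', 'fox'] * 5000)

def regex_filter():
    # chunked-regex approach: compile and apply one regex per 100 stopwords
    out = text
    for i in range(0, len(stopwords), 100):
        chunk = stopwords[i:i + 100]
        pattern = re.compile(r'\b(' + '|'.join(re.escape(w) for w in chunk) + r')\b')
        out = pattern.sub('', out)
    return out

def set_filter():
    # set approach: split on spaces and drop words found in the set
    sw = set(stopwords)
    return ' '.join(w for w in text.split(' ') if w not in sw)

print('regex:', timeit.timeit(regex_filter, number=5))
print('set:  ', timeit.timeit(set_filter, number=5))
```

On word-per-token input like this, the set lookup is usually much faster, but the regex version also strips stopwords embedded in punctuation-adjacent text, so the right choice depends on how clean your document is.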