Removing stopwords with Python - quickly and efficiently

Question 1

As far as I can see, you have three options: split the pattern into smaller regexes, use something like a Python set, or shell out (to sed or awk). Let's assume you have a document full of words and a list of stopwords, and you want a new document containing the words minus the stopwords.

Regex:

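import re

stopwords_regex_list = []
chunk_size = 100  # can tweak depending on size
for i in range(0, len(stopwords), chunk_size):
    stopwords_slice = stopwords[i:i + chunk_size]
    # Raw strings keep \b as a word boundary (a plain '\b' is a backspace);
    # re.escape guards against stopwords containing regex metacharacters.
    alternation = '|'.join(re.escape(word) for word in stopwords_slice)
    stopwords_regex_list.append(re.compile(r'\b(' + alternation + r')\b'))

# Read and rewrite the document once, after all chunks are compiled.
with open('document') as doc:
    words = doc.read()  # can read only a certain size if the files are massive
with open('regex_document', 'w') as regex_doc:
    for regex in stopwords_regex_list:
        words = regex.sub('', words)
    regex_doc.write(words)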

Sets:

stopwords_set = set(stopwords)  # set membership tests are constant time on average
with open('document') as doc:
    words = doc.read()
with open('set_document', 'w') as set_doc:
    for word in words.split(' '):
        if word not in stopwords_set:
            set_doc.write(word + ' ')

Sed:

import subprocess

# Write one substitution per stopword; \< and \> are GNU sed word boundaries.
with open('sed_script', 'w') as sed_script:
    sed_script.writelines(r's/\<{}\>//g'.format(word) + '\n' for word in stopwords)
with open('document') as doc, open('sed_document', 'w') as sed_doc:
    subprocess.call(['sed', '-f', 'sed_script'], stdout=sed_doc, stdin=doc)

I'm not a sed expert, so there may be a better way to do it than that. You may want to code up each method and see which works best for you.
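
To compare the approaches, a rough timing harness along these lines may help. It is only a sketch: the stopword list and the text are made-up in-memory stand-ins for your files, the helper names strip_with_regex and strip_with_set are hypothetical, and the sed variant is left out because it depends on an external program.

```python
import re
import timeit

# Hypothetical in-memory data standing in for the document and stopword list.
stopwords = ['the', 'an', 'of', 'by', 'for', 'but', 'is', 'why']
text = ' '.join(['the', 'quick', 'fox', 'is', 'fond', 'of', 'cheese'] * 1000)


def strip_with_regex(text, stopwords, chunk_size=100):
    """Remove stopwords using one compiled alternation per chunk."""
    for i in range(0, len(stopwords), chunk_size):
        chunk = stopwords[i:i + chunk_size]
        pattern = re.compile(r'\b(' + '|'.join(map(re.escape, chunk)) + r')\b')
        text = pattern.sub('', text)
    return text


def strip_with_set(text, stopwords):
    """Remove stopwords using set membership on space-separated tokens."""
    stopwords_set = set(stopwords)
    return ' '.join(word for word in text.split(' ') if word not in stopwords_set)


if __name__ == '__main__':
    for func in (strip_with_regex, strip_with_set):
        seconds = timeit.timeit(lambda: func(text, stopwords), number=20)
        print('%s: %.3f s for 20 runs' % (func.__name__, seconds))
```

Note that the two functions do not produce identical output: the regex version leaves the surrounding spaces in place, while the set version drops the token entirely. Time them on your real data before drawing conclusions.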

Question 2

This appears to be a hard limit in the implementation of Python's regular expression engine:

~/py27 $ ack -C3 'regular expression code size'
Modules/_sre.c
2756-        if (value == (unsigned long)-1 && PyErr_Occurred()) {
2757-            if (PyErr_ExceptionMatches(PyExc_OverflowError)) {
2758-                PyErr_SetString(PyExc_OverflowError,
2759:                                "regular expression code size limit exceeded");
2760-            }
2761-            break;
2762-        }
2763-        self->code[i] = (SRE_CODE) value;
2764-        if ((unsigned long) self->code[i] != value) {
2765-            PyErr_SetString(PyExc_OverflowError,
2766:                            "regular expression code size limit exceeded");
2767-            break;
2768-        }
2769-    }

To get around the limit, you may need an alternate engine. I recommend using Python to generate a sed script. Here's a rough idea to help you get started:

stopwords = '''
the an of by
for but is why'''.split()

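print('#!/bin/sed -f')
for word in stopwords:
    # Delete each whole-word occurrence; '/%s/ d' would drop every line
    # containing the word. \< and \> are GNU sed word boundaries.
    print(r's/\<%s\>//g' % word)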
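
If you would rather stay in Python than switch engines, another option is to catch the failure and split the alternation into smaller patterns. The following is a minimal sketch, not a tested fix for your exact case: compile_in_chunks and remove_words are hypothetical helper names, and I am assuming the limit surfaces as OverflowError (as in the C source above) or as re.error.

```python
import re


def compile_in_chunks(words, chunk_size=None):
    """Compile word-boundary alternations, halving chunk_size on failure."""
    words = list(words)
    if not words:
        return []
    if chunk_size is None:
        chunk_size = len(words)  # first try a single big pattern
    while True:
        try:
            return [
                re.compile(r'\b(?:'
                           + '|'.join(map(re.escape, words[i:i + chunk_size]))
                           + r')\b')
                for i in range(0, len(words), chunk_size)
            ]
        except (OverflowError, re.error):
            if chunk_size == 1:
                raise  # even a single word will not compile; give up
            chunk_size = max(1, chunk_size // 2)


def remove_words(text, patterns):
    """Apply each compiled pattern in turn, deleting every match."""
    for pattern in patterns:
        text = pattern.sub('', text)
    return text


stopwords = 'the an of by for but is why'.split()
patterns = compile_in_chunks(stopwords)
cleaned = remove_words('the cat is by the door', patterns)
print(cleaned.split())
```

Each extra pattern means another pass over the text, so the fewer chunks you need, the better.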

Question 3

I ran the following, and it worked just fine:

>>> import re
>>> states = ['AL', 'AK', 'AS', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FM', 'FL', 'GA', 'GU', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MH', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'MP', 'OH', 'OK', 'OR', 'PW', 'PA', 'PR', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VI', 'VA', 'WA', 'WV', 'WI', 'WY', 'AE', 'AA', 'AP']
>>> states_string = r'\b(' + '|'.join(states) + r')\b'
>>> states_pattern = re.compile(states_string)
>>> states_pattern
<_sre.SRE_Pattern object at 0x00000000034D3C40>
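
To check that a pattern built this way matches what you expect, a quick test like the one below can help. It is only an illustration: the four-code list and the sample sentence are made up, not taken from your data.

```python
import re

# A short, hypothetical subset of the state codes, to keep the example small.
states = ['CA', 'NY', 'OR', 'TX']
states_pattern = re.compile(r'\b(' + '|'.join(states) + r')\b')

# \b keeps 'CA' from matching inside 'CAN'.
sample = 'Shipped from CA to NY via OR, not CAN.'
matches = states_pattern.findall(sample)
print(matches)  # ['CA', 'NY', 'OR']
```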

This is the best I can do with the information you've given. Please post the entire array in your question; otherwise there is no way for us to know whether you generated your list from anything other than this array of state codes.

PS: credit where credit is due: the array I used here was largely based on this gist comment.