Question

I have around 6 million documents and a fairly large set of stopwords to remove from each of them.

The trick I learnt was to remove these with a compiled pattern using re. However, I am now getting an OverflowError.

I handle my stopwords as follows:

import re

states_string = r'\b(' + '|'.join(states) + r')\b'
states_pattern = re.compile(states_string)

states is a list of strings such as ['NY', 'CA', ...]. I can't paste them all here, as that would far exceed the length limit for a post.

The error I get is: OverflowError: regular expression code size limit exceeded.

Clearly the string I am compiling into a pattern is too long.

Does anyone have any suggestions for dealing with this, or an alternative method?

One I do know of is: [word for word in words if word not in stopwords], but this iterates through every word, so it is not ideal.

Please note that the stopword list has 2,500 entries.

Solution

As far as I can see, you have three options: split the pattern into smaller regexes, use something like a Python set, or shell out (to sed or awk). Let's assume you have a document full of words and a list of stopwords, and you want a new document containing the words minus the stopwords.

Regex:

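import re

stopwords_regex_list = []
chunk_size = 100  # can tweak depending on size
for i in range(0, len(stopwords), chunk_size):
    stopwords_slice = stopwords[i:i + chunk_size]
    # raw strings keep \b as a word boundary (a plain '\b' is a backspace);
    # re.escape protects stopwords that contain regex metacharacters
    alternation = '|'.join(re.escape(word) for word in stopwords_slice)
    stopwords_regex_list.append(re.compile(r'\b(' + alternation + r')\b'))

# read and rewrite the document once, after all chunks are compiled
with open('document') as doc:
    words = doc.read()  # can read only a certain size if the files are massive
with open('regex_document', 'w') as regex_doc:
    for regex in stopwords_regex_list:
        words = regex.sub('', words)
    regex_doc.write(words)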

Sets:

stopwords_set = set(stopwords)
with open('document') as doc:
    words = doc.read()
with open('set_document', 'w') as set_doc:
    for word in words.split():  # split on any whitespace, including newlines
        if word not in stopwords_set:
            set_doc.write(word + ' ')
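
The regex and set ideas can also be combined: match every word with one small generic pattern and decide in a callback whether to drop it. The pattern stays tiny no matter how many stopwords there are, so the code size limit should never come into play, and the original whitespace and punctuation are left alone. This is only a minimal sketch; remove_stopwords, keep_or_drop and the sample sentence are names made up for illustration.

```python
import re

WORD = re.compile(r'\w+')  # one generic pattern, independent of the stopword list


def remove_stopwords(text, stopwords):
    """Drop every whole word of `text` that appears in `stopwords`."""
    stopset = set(stopwords)  # set membership tests are O(1) on average

    def keep_or_drop(match):
        word = match.group(0)
        return '' if word in stopset else word

    return WORD.sub(keep_or_drop, text)


sample = 'Flights from NY to CA, then on to NYC.'
cleaned = remove_stopwords(sample, ['NY', 'CA'])
print(cleaned)
```

Note that NYC survives because the callback compares whole words only. The removed words leave their surrounding spaces behind, so you may want to collapse repeated whitespace afterwards.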

Sed:

import subprocess

# write one substitution per stopword; \< and \> are GNU sed word boundaries
with open('sed_script', 'w') as sed_script:
    sed_script.writelines([r's/\<{}\>//g'.format(word) + '\n' for word in stopwords])
with open('document') as doc:
    with open('sed_document', 'w') as sed_doc:
        subprocess.call(['sed', '-f', 'sed_script'], stdout=sed_doc, stdin=doc)

I'm not a sed expert so there might be a better way to do it than that. You may want to code up each method and see which works best for you.
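
To compare the methods, something along these lines might do as a starting point. It is a rough sketch that times a single compiled alternation against the set approach on a synthetic string; the tiny stopword list, the sample text and the function names with_regex and with_set are all invented for illustration, so the numbers will only be meaningful once you swap in your own data.

```python
import re
import timeit

stopwords = ['NY', 'CA', 'TX', 'WA']  # stand-in for the real 2500-entry list
text = 'NY and CA are far apart but TX is big ' * 1000

# Approach 1: a compiled alternation (chunk it if the list is huge).
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b')


def with_regex():
    return pattern.sub('', text)


# Approach 2: set membership on whitespace-split words.
stopset = set(stopwords)


def with_set():
    return ' '.join(w for w in text.split() if w not in stopset)


for func in (with_regex, with_set):
    seconds = timeit.timeit(func, number=20)
    print('%-12s %.4f s' % (func.__name__, seconds))
```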

OTHER TIPS

This appears to be a hard limit in the implementation of Python's regular expression engine:

~/py27 $ ack -C3 'regular expression code size'
Modules/_sre.c
2756-        if (value == (unsigned long)-1 && PyErr_Occurred()) {
2757-            if (PyErr_ExceptionMatches(PyExc_OverflowError)) {
2758-                PyErr_SetString(PyExc_OverflowError,
2759:                                "regular expression code size limit exceeded");
2760-            }
2761-            break;
2762-        }
2763-        self->code[i] = (SRE_CODE) value;
2764-        if ((unsigned long) self->code[i] != value) {
2765-            PyErr_SetString(PyExc_OverflowError,
2766:                            "regular expression code size limit exceeded");
2767-            break;
2768-        }
2769-    }

To get around the limit, you may need an alternate engine. I recommend using Python to generate a sed script. Here's a rough idea to help you get started:

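stopwords = '''
the an of by
for but is why'''.split()

print('#!/bin/sed -f')
for word in stopwords:
    # substitute the word away rather than deleting every line that contains it;
    # \< and \> are GNU sed word boundaries
    print(r's/\<%s\>//g' % word)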
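
If you would rather stay inside Python, another possibility is to catch the OverflowError and split the list only when the engine actually refuses it, instead of guessing a chunk size up front. The following is a hedged sketch of that idea: compile_in_pieces is a hypothetical helper, and it assumes the failure surfaces as OverflowError from re.compile, as reported in the question. On newer interpreters the whole list may simply compile in one go.

```python
import re


def compile_in_pieces(words):
    """Return a list of compiled patterns that together cover `words`."""
    source = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    try:
        return [re.compile(source)]
    except OverflowError:
        if len(words) < 2:
            raise  # a single word that is still too large cannot be split
        middle = len(words) // 2
        # halve the list and try each half again
        return compile_in_pieces(words[:middle]) + compile_in_pieces(words[middle:])


patterns = compile_in_pieces(['NY', 'CA', 'TX'])
text = 'NY to CA via TX and NYC'
for pattern in patterns:
    text = pattern.sub('', text)
print(text)
```

Each pattern is applied in turn, so the document is scanned once per piece; fewer, larger pieces mean fewer passes.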

I ran the following, and it worked just fine:

>>> states = ['AL', 'AK', 'AS', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FM', 'FL', 'GA', 'GU', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MH', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'MP', 'OH', 'OK', 'OR', 'PW', 'PA', 'PR', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VI', 'VA', 'WA', 'WV', 'WI', 'WY', 'AE', 'AA', 'AP']
>>> import re
>>> states_string = r'\b(' + '|'.join(states) + r')\b'
>>> states_pattern = re.compile(states_string)
>>> states_pattern
<_sre.SRE_Pattern object at 0x00000000034D3C40>

This is the best I could do with the information you've given. Please post the entire list in your question; otherwise there is no way for us to know whether your list contains anything beyond the state codes used here.

PS: credit where credit is due: the array I used here was largely based on this gist comment.

Licensed under: CC-BY-SA with attribution