Is your corpus already tokenized? You should really make sure it is.
Anyway, I think you are interested in the groups of the match object:
output.write(''.join(m.groups()) + '\n')
You will then find out that your groups capture only the last word of each window, because a repeated capturing group keeps only its final iteration. You need to put an extra pair of parentheses around each repetition:
m = re.search(r'((?:\S+\s+){0,8})facility((?:\s+\S+){0,8})', line)
The (?:...) is a non-capturing group: it defines the scope of {0,8}, but it doesn't give you an extra group in the result.
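To see the difference, here is a small comparison of the two patterns. The sample line is made up for illustration; only the second pattern is the one suggested above.

```python
import re

line = 'a b c facility d e f'  # made-up sample line

# Repeated *capturing* groups: each group keeps only its last iteration.
last_only = re.search(r'(\S+\s+){0,8}facility(\s+\S+){0,8}', line)
print(last_only.groups())   # ('c ', ' f')

# Non-capturing repetition wrapped in one capturing group:
# the whole window ends up in the group.
whole = re.search(r'((?:\S+\s+){0,8})facility((?:\s+\S+){0,8})', line)
print(whole.groups())       # ('a b c ', ' d e f')
```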
Have a look at Python's official Regular Expression HOWTO, or search the web for a regex tutorial. In any case, you should consider an off-the-shelf corpus tool instead of reinventing the wheel.
EDIT:
In order to match multiple occurrences of the keyword in one line, use re.findall() (returns a list) or re.finditer() (returns an iterator):
context = re.findall(r'((?:\S+\s+){0,8})facility((?:\s+\S+){0,8})', line)
context will be a list of pairs, i.e. the left and the right window for every occurrence of the keyword. Note, however, that it will still not work if two occurrences of the keyword have fewer than 8 words between them, e.g.
foo bar facility bla foo bar baz facility foo bar
will generate one match only, for the first occurrence of "facility", with the second one inside its right window. The second "facility" will not generate a match of its own, since re.findall() doesn't do overlapping matches, which means that it will look for another "facility" only after the end of the right context. This also means that, if there are between 8 and 15 words in between, the second "facility"'s left window will fall short by whatever the first one's right window already consumed.