Question

For my research I am trying to count, from a corpus, how many times each of a series of compound terms (e.g. Safety Hazard), stored in a file with one phrase per line, co-occurs within a 16-word window of a target keyword (e.g. Facility). I am not a programmer, and have been trying to break the task into two parts: first, extract from the corpus every match on my target keyword together with the 8 words before and after it; then match my 'vocabulary file' against that extract. I am on part 1 and have tried this, but I just get the <_sre.SRE_Match object at 0x028FFE78> message and am struggling to use repr. Any suggestions, or other ways to do this, are appreciated. Ultimately I want an export file that lists my vocabulary words with a count after each, indicating how often it was found in that window with my target word. I used re.search because of what I have found on this message board:

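import re

input=open("Corpus.txt", "r")
output=open("Extract.txt", "w")  # output file was not shown in the post; this name is assumed
matches=[]
lines=input.readlines()
for line in lines:
  m=re.search(r'(\S+\s+){0,8}facility(\s+\S+){0,8}',line)
  if m:
    matches.append(m)
for m in matches:
  output.write(str(m))  # writes the match object's repr, not the matched text
input.close()
output.close()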

Any help appreciated, Paul


Solution

Is your corpus already tokenized? You should really make sure it is.

Anyway, I think you are interested in the groups of the match object:

output.write(''.join(m.groups()) + '\n')

You will then find out that your groups will capture only the last word of each window. You need to put an extra pair of parentheses:

m = re.search(r'((?:\S+\s+){0,8})facility((?:\s+\S+){0,8})', line)

The (?:...) is a non-capturing group: it defines the scope of {0,8}, but it doesn't give you an extra group in the result.
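
To make the difference concrete, here is a small sketch that runs both patterns on one line. The sample sentence is made up for illustration and is not taken from the asker's corpus.

```python
import re

# Hypothetical sample line, invented for this illustration.
line = "the inspector found a safety hazard at the facility near the old loading dock"

# Original pattern: a repeated capturing group keeps only its last repetition.
capturing = re.search(r'(\S+\s+){0,8}facility(\s+\S+){0,8}', line)

# Corrected pattern: the outer parentheses capture the whole window,
# while the inner (?:...) only scopes the {0,8} quantifier.
windowed = re.search(r'((?:\S+\s+){0,8})facility((?:\s+\S+){0,8})', line)

print(capturing.groups())  # ('the ', ' dock')
print(windowed.groups())   # ('the inspector found a safety hazard at the ', ' near the old loading dock')
```

Note that the captured windows include the whitespace next to the keyword, so you may want to call .strip() on them before writing them out.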

Have a look at Python's official Regular Expression HOWTO, or search the web for a regex tutorial. In any case, you might want to look for an off-the-shelf corpus tool instead of reinventing the wheel.

EDIT:
In order to match multiple occurrences of the keyword in one line, use re.findall() (returns a list) or re.finditer() (returns an iterator):

context = re.findall(r'((?:\S+\s+){0,8})facility((?:\s+\S+){0,8})', line)

context will be a list of pairs, i.e. the left and the right window for every occurrence of the keyword. Note, however, that it will still not work if two occurrences of the keyword have fewer than 8 words between them, e.g.

foo bar facility bla foo bar baz facility foo bar

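will generate one match only. Because the left window is greedy and only two words precede the first "facility", the match is anchored on the second occurrence, with the first one swallowed by its left window. (With eight or more words before the first occurrence, the match would be anchored on the first one instead, with the second in its right window.) Either way, the swallowed "facility" will not generate a match of its own, since re.findall() doesn't return overlapping matches: it looks for another "facility" only after the end of the previous match. This also means that, if there are between 8 and 15 words in between, the second "facility"'s left window will be short by whatever the first one's right window already consumed.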
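
One possible way around this limitation, sketched here as an assumption rather than as part of the original answer, is to split each line into tokens and slice the window by index, so that every occurrence gets its own full window even when windows overlap. The function name keyword_windows, the whitespace tokenization, and the case-insensitive whole-token comparison are all choices made for this illustration.

```python
import re

def keyword_windows(line, keyword="facility", size=8):
    """Return a (left, right) pair of token lists for every occurrence of keyword.

    Assumptions: tokens are separated by whitespace, and the keyword must equal
    a whole token (case-insensitive), whereas the regex also matches
    "facility" inside longer strings.
    """
    tokens = line.split()
    result = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = tokens[max(0, i - size):i]    # up to `size` tokens before
            right = tokens[i + 1:i + 1 + size]   # up to `size` tokens after
            result.append((left, right))
    return result

line = "foo bar facility bla foo bar baz facility foo bar"

# re.findall reports only one match for this line, as described above.
pattern = r'((?:\S+\s+){0,8})facility((?:\s+\S+){0,8})'
regex_hits = re.findall(pattern, line)
print(len(regex_hits))  # 1

# The token-based version returns one window per occurrence.
token_hits = keyword_windows(line)
print(len(token_hits))  # 2
```

This only works as intended if the corpus is tokenized consistently; punctuation attached to a word (e.g. "facility,") would not match under the whole-token comparison used here.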

Licensed under: CC-BY-SA with attribution