Question

I want to use a regex in Python that reads text, finds every sentence in which <emotion> markup occurs together with <location> markup, and writes each such sentence to its own line of an output file:

import re

readfile = "<location> Oklahoma </location> where the wind comes <emotion> sweeping </emotion> down <location> the plain </location>. And the waving wheat. It can sure smell <emotion> sweet </emotion>."

# One capturing group, so findall returns one string per matching sentence.
pattern = r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\blocation>(?=\s|\.|$)).*?\.(?=\s|$))'

with open('out.txt', 'w') as out:
    for match in re.findall(pattern, readfile, flags=re.I):
        out.write(match + '\n')

The trouble is that if the text I read in contains line breaks, the regex no longer finds the sentence:

import re

# Same text as above, but with a line break inside the first sentence.
readfile = "<location> Oklahoma </location> where the wind \n comes <emotion> sweeping </emotion> down <location> the plain </location>. And the waving wheat. It can sure smell <emotion> sweet </emotion>."

pattern = r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\blocation>(?=\s|\.|$)).*?\.(?=\s|$))'

with open('out.txt', 'w') as out:
    for match in re.findall(pattern, readfile, flags=re.I):
        out.write(match + '\n')

Is there any way to modify this regular expression so that it won't choke when it hits \n? I would be most grateful for any advice on this.

Solution

Add re.S or re.DOTALL (they are the same flag) to the flags in your call. By default . matches any character except a newline; with this flag it matches newlines as well. The new value for the flags argument would be re.I | re.S.
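As a rough sketch of the difference, the snippet below runs the pattern from the question on the text containing a line break, first with re.I alone and then with re.I | re.S. The names PATTERN, text, without_dotall, with_dotall and one_line are made up for this illustration. The whitespace-collapsing step at the end is an extra assumption, not part of the original answer: a matched sentence still contains its \n, so it may need flattening if each sentence must occupy exactly one line of the output file.

```python
import re

# Pattern copied unchanged from the question.
PATTERN = r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\blocation>(?=\s|\.|$)).*?\.(?=\s|$))'

# Sample text from the question, with a line break inside the first sentence.
text = ("<location> Oklahoma </location> where the wind \n comes "
        "<emotion> sweeping </emotion> down <location> the plain </location>. "
        "And the waving wheat. It can sure smell <emotion> sweet </emotion>.")

# Without DOTALL, '.' cannot cross the newline, so the first sentence is missed.
without_dotall = re.findall(PATTERN, text, flags=re.I)

# With DOTALL (re.S), '.' also matches '\n', so the sentence is found.
with_dotall = re.findall(PATTERN, text, flags=re.I | re.S)

# Assumption: collapse runs of whitespace (including the newline) so that
# each sentence fits on a single output line.
one_line = [' '.join(sentence.split()) for sentence in with_dotall]

print(len(without_dotall), len(with_dotall))
for sentence in one_line:
    print(sentence)
```

Only the first sentence should be reported, since it is the only one containing both kinds of markup.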

OTHER TIPS

Use re.DOTALL / re.S

flags = re.DOTALL | re.I
Licensed under: CC-BY-SA with attribution