Python regex to print all sentences that contain two identified classes of markup

Question 1

Your problem appears to be that your regex expects whitespace (\s), a period, or the end of the text to follow the matching word, as seen with:

emotion(?=\s|\.|$)

When the word is part of a tag, it is followed by a > rather than a space, so the lookahead fails and no match is found. To fix it, add the > after emotion:

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)

In my testing, this seems to solve your problem. Make sure to treat "LOCATION" similarly:

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bLOCATION>(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
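To see the lookahead issue in isolation, here is a minimal sketch that compares just the emotion part of the pattern with and without the >. The sample string and the variable names are my own, chosen only for illustration:

```python
import re

# Made-up sample containing an opening and a closing emotion tag
sample = "the <emotion> best </emotion> singer"

# Original lookahead: "emotion" must be followed by whitespace, a period, or end of text
original = re.compile(r'\bemotion(?=\s|\.|$)', flags=re.I)
# Adjusted pattern: consume the ">" that closes the tag before the lookahead runs
adjusted = re.compile(r'\bemotion>(?=\s|\.|$)', flags=re.I)

original_hits = original.findall(sample)
adjusted_hits = adjusted.findall(sample)

print(original_hits)  # empty, because ">" follows "emotion" in both tags
print(adjusted_hits)  # one hit per tag
```

Note that the adjusted pattern matches the closing tag as well as the opening one, which is harmless here since you only need to know that the sentence contains the markup.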

Question 2

If I understand correctly, what you are trying to do is remove the <emotion>, </emotion>, <LOCATION>, and </LOCATION> tags?

If that is what you want, you can do this:

import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard." 

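def remove_xml_tags(xml):
    # the non-greedy match removes each <...> tag without touching the text between tags
    content = re.compile(r'<.*?>')
    return content.sub('', xml)

data = remove_xml_tags(text)

# the with block closes the file even if the write fails
with open('out.txt', 'w') as out:
    out.write(data + '\n')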
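One thing to be aware of: removing the tags leaves behind the spaces that surrounded them, so you get doubled spaces and a space before the period. If that matters, a possible sketch is below. The helper name strip_tags_and_tidy and the sample string are hypothetical, and the punctuation set is only a guess at what you need:

```python
import re

def strip_tags_and_tidy(text):
    # Remove anything that looks like a tag, e.g. <emotion> or </LOCATION>
    without_tags = re.sub(r'<.*?>', '', text)
    # Collapse the runs of whitespace left behind by the removed tags
    collapsed = re.sub(r'\s+', ' ', without_tags)
    # Drop the space that ends up in front of punctuation
    return re.sub(r'\s+([.,;!?])', r'\1', collapsed).strip()

sample = "He lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer."
print(strip_tags_and_tidy(sample))
# prints: He lives in Omaha. He is the best singer.
```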

Question 3

I have just discovered that the regex may be bypassed altogether. To find (and print) all sentences that contain two identified classes of markup, you can use a simple for loop. In case it might help others who find themselves where I found myself, I'll post my code:

# read in your file; the with block closes it when done
with open('sampleinput.txt', 'r') as f:
    # read() returns the whole file contents as a single string
    readfile = f.read()

#########################
# now use the replace() method to clean data
#########################

# replace all \n with " "
nolinebreaks = readfile.replace('\n', ' ')

# replace all commas with ""
nocommas = nolinebreaks.replace(',', '')

# replace all ? with .
noquestions = nocommas.replace('?', '.')

# replace all ! with .
noexclamations = noquestions.replace('!', '.')

# replace all ; with .
nosemicolons = noexclamations.replace(';', '.')

######################
# now use replace() to get rid of periods that don't end sentences
######################

# replace all Mr. with Mr
nomisters = nosemicolons.replace('Mr.', 'Mr') 

#replace 'Mrs.' with 'Mrs' etc. 

cleantext = nomisters

# now, having cleaned the input, find all sentences that contain your two target words.
# To find markup, just replace 'Toby' and 'pipe' with '<markupclassone>' and '<markupclasstwo>'

periodsplit = cleantext.split('.')
for x in periodsplit:
    if 'Toby' in x and 'pipe' in x:
        print(x)
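For the markup case specifically, the same split-and-check idea can be wrapped in a small function. This is only a sketch: the function name sentences_with_both and the sample string are made up for illustration, and it assumes the text has already been cleaned as above so that every period ends a sentence:

```python
def sentences_with_both(text, first, second):
    # Split on periods, as in the loop above, and keep sentences containing both markers
    return [s.strip() for s in text.split('.') if first in s and second in s]

sample = ("Cello is a <emotion> wonderful </emotion> parakeet who lives in "
          "<LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer.")

matches = sentences_with_both(sample, '<emotion>', '<LOCATION>')
for sentence in matches:
    print(sentence)
```

Only the first sentence is printed, since the second one has no <LOCATION> tag. The in test is case-sensitive, so the markers must be written exactly as they appear in your file.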