Question

I wish to read in an XML file, find all sentences that contain both the markup <emotion> and the markup <LOCATION>, then print each of those sentences on its own line. Here is a sample of the code:

import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer <pronoun> I </pronoun> have ever heard." 

out = open('out.txt', 'w')

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bwonderful(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')

out.close()

The regex here grabs all sentences with "wonderful" and "omaha" in them, and returns:

Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>.

That is exactly right for those two words, but what I really want is to print all sentences that contain both <emotion> and <LOCATION>. For some reason, though, when I replace "wonderful" in the regex above with "emotion", the regex fails to return any output. So the following code yields no result:

import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard." 

out = open('out.txt', 'w')

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')

out.close()

My question is: How can I modify my regular expression in order to grab only those sentences that contain both <emotion> and <LOCATION>? I would be most grateful for any help others can offer on this question.

(For what it's worth, I'm working on parsing my text in BeautifulSoup as well, but wanted to give regular expressions one last shot before throwing in the towel.)


Solution

Your problem appears to be that the lookahead after the matching word only accepts whitespace (\s), a period, or the end of the string, as seen with:

emotion(?=\s|\.|$)

When the word is part of a tag, it is followed by a > rather than any of those, so the lookahead fails and no match is found. To fix it, you can just add the > after emotion, like:

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)

Upon testing, this seems to solve your problem. Make sure to treat "LOCATION" the same way:

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bLOCATION>(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
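
For completeness, here is a minimal runnable sketch of the corrected pattern applied to the sample text from the question. The helper name sentences_with_both_tags is made up for illustration, and the pattern is the same one as above, only split across several lines for readability. It also shows that the original lookahead finds nothing, because every "emotion" in the text is followed by ">".

```python
import re

text = ("Cello is a <emotion> wonderful </emotion> parakeet who lives in "
        "<LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> "
        "singer I have ever heard.")

# Same pattern as above, with "emotion>" and "LOCATION>" as the two targets.
PATTERN = re.compile(
    r'(?:(?<=\.)\s+|^)'
    r'((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))'
    r'(?=(?:(?!\.(?:\s|$)).)*?\bLOCATION>(?=\s|\.|$))'
    r'.*?\.(?=\s|$))',
    flags=re.I,
)

def sentences_with_both_tags(s):
    # Hypothetical helper: with a single capturing group, findall returns
    # the captured sentence as a plain string for each match.
    return PATTERN.findall(s)

# The original lookahead fails because ">" follows "emotion" inside a tag.
print(re.search(r'\bemotion(?=\s|\.|$)', text))               # no match
print(re.search(r'\bemotion>(?=\s|\.|$)', text) is not None)  # match found

for sentence in sentences_with_both_tags(text):
    print(sentence)
```

Only the first sentence is printed, since the second one contains <emotion> but no <LOCATION>.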

Other tips

If I understand correctly, what you are trying to do is remove the <emotion> </emotion> and <LOCATION> </LOCATION> tags?

If that is what you want, you can do this:

import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard." 

out = open('out.txt', 'w')

def remove_xml_tags(xml):
    content = re.compile(r'<.*?>')
    return content.sub('', xml)

data = remove_xml_tags(text)

out.write(data + '\n')

out.close()

I have just discovered that the regex may be bypassed altogether. To find (and print) all sentences that contain two identified classes of markup, you can use a simple for loop. In case it might help others who find themselves where I found myself, I'll post my code:

# read in your file
f = open('sampleinput.txt', 'r')

# use read method to convert the read data object into string
readfile = f.read()

#########################
# now use the replace() method to clean data
#########################

# replace all \n with " "
nolinebreaks = readfile.replace('\n', ' ')

# replace all commas with ""
nocommas = nolinebreaks.replace(',', '')

# replace all ? with .
noquestions = nocommas.replace('?', '.')

# replace all ! with .
noexclamations = noquestions.replace('!', '.')

# replace all ; with .
nosemicolons = noexclamations.replace(';', '.')

######################
# now use replace() to get rid of periods that don't end sentences
######################

# replace all Mr. with Mr
nomisters = nosemicolons.replace('Mr.', 'Mr') 

#replace 'Mrs.' with 'Mrs' etc. 

cleantext = nomisters

# Now, having cleaned the input, find all sentences that contain your two
# target words. To find markup, just replace "Toby" and "pipe" with
# <markupclassone> and <markupclasstwo>.

periodsplit = cleantext.split('.')
for x in periodsplit:
    if 'Toby' in x and 'pipe' in x:
        # print(x) with parentheses works in both Python 2 and Python 3
        print(x)
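
The same idea can be condensed into a short function. This is only a sketch under some assumptions: the name sentences_containing is hypothetical, the sample text is the one from the question, and sentence boundaries are taken to be ., ?, ! and ; just as in the cleanup steps above. It does not handle abbreviations such as "Mr.", which would still need the replace() treatment shown earlier.

```python
import re

def sentences_containing(text, first, second):
    # Flatten line breaks, then split on the same terminators that the
    # cleanup above converts to periods.
    flattened = text.replace('\n', ' ')
    sentences = re.split(r'[.?!;]', flattened)
    # Keep only the pieces that contain both markers (case-sensitive).
    return [s.strip() for s in sentences if first in s and second in s]

sample = ("Cello is a <emotion> wonderful </emotion> parakeet who lives in "
          "<LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> "
          "singer I have ever heard.")

for sentence in sentences_containing(sample, '<emotion>', '<LOCATION>'):
    print(sentence)
```

Note that splitting this way drops the terminating punctuation from each sentence, so add it back if the output needs it.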
License: CC-BY-SA with attribution
Not affiliated with StackOverflow