Question

I am having some trouble with another regular expression. For this one, my code is supposed to look for the pattern:

re.compile(r"kill(?:ed|ing|s)\D*(\d+).*?(?:men|women|children|people)?")

However, it is matching too aggressively. It happens to match a sentence which has the word 'killing' in it. But the pattern continues to collect until it reaches a digit further down in the text. In particular, it is matching:

killed in an apparent u.s. drone attack on a car in yemen on sunday, tribal sources and local officials said.the men's car was driving through the south-eastern province of maareb, a mostly desert region where militants have taken refuge after being driven from southern strongholds.yemen, where al qaeda militants exploited a security vacuum during last year's uprising that ousted president ali abdullah saleh, has seen an in10

This is not the behavior I'm after. I would like this pattern to fail if it cannot be found inside a single sentence.

The solution I'm trying to implement, in pseudocode, is:

find instance of 'kill'
if what follows contains a period (\.) before a digit, do not match.

My failed implementation looks like this:

re.compile(r"kill(?:ed|ing|s)\D*(?!:\..*?)(\d+).*?(?:men|women|children|people)?")

I've tried a 'look-behind', but I have to specify a width. What I'm trying to do with the above is match any ending of 'kill', followed by any non-digit, but NOT match a period, and anything else is free to follow before the digit I'm after.

Sadly, this code behaves exactly the same in my test. Any help would be appreciated.


Solution

A small modification:

r"kill(?:ed|ing|s)[^\d.]*(\d+)[^.]*?(?:men|women|children|people)?"

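Basically, I prevent a full stop (.) from being matched anywhere between kill and the men/women/etc. that follows. [^\d.]* replaces \D*, so the run before the number can no longer cross a period, and [^.]*? replaces .*?, so the run after the number cannot cross one either.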
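As a quick check, here is a minimal sketch comparing the three patterns from this thread. The two test strings and the helper first_count are made up for illustration; they are not from the original post. It also suggests why the attempted fix changed nothing: (?!:\..*?) is a negative lookahead for a literal colon followed by a period, and by the time it runs, \D* has already consumed everything up to the digit, so the assertion always succeeds.

```python
import re

# Patterns copied from the question and the answer.
original = re.compile(r"kill(?:ed|ing|s)\D*(\d+).*?(?:men|women|children|people)?")
failed_attempt = re.compile(r"kill(?:ed|ing|s)\D*(?!:\..*?)(\d+).*?(?:men|women|children|people)?")
fixed = re.compile(r"kill(?:ed|ing|s)[^\d.]*(\d+)[^.]*?(?:men|women|children|people)?")

# Hypothetical test strings (assumptions, not taken from the original text).
cross_sentence = "three were killed in an attack. officials reported 10 arrests."
same_sentence = "the blast killed 12 people on sunday."


def first_count(pattern, text):
    """Return the captured number as a string, or None if there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else None


# The original pattern runs past the period and picks up the later number.
print(first_count(original, cross_sentence))        # 10
# The lookahead version behaves the same: the next character is a digit,
# never ":" followed by ".", so the negative lookahead always passes.
print(first_count(failed_attempt, cross_sentence))  # 10
# The fixed pattern cannot cross the period, so nothing matches here.
print(first_count(fixed, cross_sentence))           # None
# Within a single sentence the number is still captured.
print(first_count(fixed, same_sentence))            # 12
```

Two caveats, which you may or may not care about. Because the trailing group is optional and preceded by a lazy quantifier, the overall match ends right after the digits ("killed 12" above), so the men/women/children/people part is never actually required. And treating every period as a sentence boundary also blocks abbreviations such as "u.s." when they appear between kill and the number.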

Licensed under: CC-BY-SA with attribution