Question

I'm trying to work my way through some regular expressions; I'm using Python.

My task right now is to scrape newspaper articles and look for instances where people have died. Once I have a relevant article, I'm trying to snag the death count for some other things. I'm trying to come up with a few patterns, but I'm having difficulty with one in particular. Take this sample article section:

SANAA, Oct 21 (Reuters) - Three men thought to be al Qaeda militants were killed in an apparent U.S. drone attack on a car in Yemen on Sunday, tribal sources and local officials said.

The code that I'm using to snag the 'three' first does a replace on the entire document, so that the 'three' becomes a '3' before any patterns at all are applied. The pattern relevant to this example is this:

re.compile(r"(\d+)\s(:?men|women|children|people)?.*?(:?were|have been)? killed")

The idea is that this pattern will start with a number, be followed by an optional noun such as one of the ones listed, then have a minimum amount of clutter before finding 'dead' or 'died'. I want to leave room so that this pattern would catch:

3 people have been killed since Sunday

and still catch the instance in the example:

3 men thought to be al qaeda militants were killed

The problem is that the pattern I'm using is collecting the date from the first part of the article, and returning a count of 21. No amount of fiddling so far has enabled me to limit the scope to the digit right beside the word men, followed by the participial phrase, then the relevant 'were killed'.

Any help would be much appreciated. I'm definitely no guru when it comes to RE.

Solution

Don't make the men|women|children|people group optional; that is, take out the question mark after its closing parenthesis. While the noun is optional, the pattern can match starting at the date, because the regex engine returns the leftmost possible match regardless of whether the repetition operators are greedy or lazy.
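As a small illustration of the leftmost-match rule (a sketch using the sample sentence with 'Three' already replaced by '3', and a deliberately simplified pattern that is not the asker's), lazy and greedy quantifiers both start at the date:

```python
import re

# Sample sentence after the number words have been replaced by digits.
text = ("SANAA, Oct 21 (Reuters) - 3 men thought to be al Qaeda militants "
        "were killed in an apparent U.S. drone attack on a car in Yemen")

# Lazy vs. greedy controls how much text a quantifier consumes,
# not where the overall match starts: both searches begin at "21".
lazy = re.search(r"(\d+).*? killed", text)
greedy = re.search(r"(\d+).* killed", text)

lazy_count = lazy.group(1)
greedy_count = greedy.group(1)
print(lazy_count, greedy_count)
```

Both captures are the date, which is why the pattern itself has to rule out a start at that position.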

Alternatively, or additionally, make the "anything here" part match only non-digit characters, i.e. replace .*? with \D*?
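A minimal sketch of the \D*? variant, keeping the noun optional and assuming the sample sentence with 'Three' already replaced by '3':

```python
import re

text = "SANAA, Oct 21 (Reuters) - 3 men thought to be al Qaeda militants were killed"

# \D*? cannot step over the digit "3", so an attempt that starts at "21"
# fails and the engine moves on to the next possible starting position.
pattern = re.compile(
    r"(\d+)\s(?:men|women|children|people)?\D*?(?:were|have been)? killed"
)

count = pattern.search(text).group(1)
short_count = pattern.search("3 people have been killed since Sunday").group(1)
print(count, short_count)
```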

OTHER TIPS

This happens because the ? quantifier matches zero or one occurrence of (:?men|women|children|people) after the digits, so 21 matches with zero occurrences.

Remove that quantifier so that exactly one of the nouns must follow the number:

re.compile(r"(\d+)\s(?:men|women|children|people).*?(?:were|have been)? killed")
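A quick check of this pattern against both sentences from the question (a sketch; the sample text is assumed to have had 'Three' replaced by '3' already):

```python
import re

pattern = re.compile(
    r"(\d+)\s(?:men|women|children|people).*?(?:were|have been)? killed"
)

samples = [
    "SANAA, Oct 21 (Reuters) - 3 men thought to be al Qaeda militants were killed",
    "3 people have been killed since Sunday",
]

# The date can no longer start a match, because "21" is not followed by a noun.
counts = [int(pattern.search(s).group(1)) for s in samples]
print(counts)
```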

UPDATE: To keep the ? quantifier and still get the required result, you can use a negative lookahead to make sure the number is not followed by text containing a hyphen (-), as the date is in your example.

re.compile(r"(\d+)(?!.*?-.*?)\s(?:men|women|children|people)?.*?(?:were|have been)? killed")
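A short sketch of how the lookahead behaves. The second sentence is a made-up example, not from the question, included to show that a hyphen anywhere later on the same line also blocks the match:

```python
import re

pattern = re.compile(
    r"(\d+)(?!.*?-.*?)\s(?:men|women|children|people)?.*?(?:were|have been)? killed"
)

sample = "SANAA, Oct 21 (Reuters) - 3 men thought to be al Qaeda militants were killed"
# Hypothetical sentence with a hyphen after the count.
hyphen_later = "3 men were killed in a drive-by attack"

sample_match = pattern.search(sample)        # "21" is rejected, "3" is accepted
later_match = pattern.search(hyphen_later)   # rejected because of "drive-by"

print(sample_match.group(1), later_match)
```

So this approach works for the example as given, but it depends on no hyphen appearing after the count on that line.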

You are using the wrong syntax, (:?...). You probably meant (?:...), which is a non-capturing group.
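To see the difference, here is a small sketch with a shortened alternation (the two-word list is only for illustration):

```python
import re

# "(:?" opens an ordinary capturing group whose first alternative
# starts with an optional literal colon.
typo = re.compile(r"(:?men|women)")
# "(?:" opens a non-capturing group.
intended = re.compile(r"(?:men|women)")

typo_groups = typo.groups            # number of capturing groups
intended_groups = intended.groups

typo_matches_colon = typo.fullmatch(":men") is not None
intended_matches_colon = intended.fullmatch(":men") is not None

print(typo_groups, intended_groups, typo_matches_colon, intended_matches_colon)
```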


Use a regex pattern that requires one of the nouns directly after the number and allows arbitrary text before 'killed':

(\d+)\s+(?:men|women|children|people)\b.*?\b(?:were|have been|)\b.*?\bkilled\b

or, if only whitespace is allowed between those words:

(\d+)\s+(?:men|women|children|people)\s+(?:(?:were|have been)\s+)?killed\b
Licensed under: CC-BY-SA with attribution