Question

I am trying to extract all the sentences containing a specified word from a text.

import re

txt = "I like to eat apple. Me too. Let's go buy some apples."
txt = "." + txt
re.findall(r"\."+".+"+"apple"+".+"+"\.", txt)

But it returns:

[".I like to eat apple. Me too. Let's go buy some apples."]

instead of:

[".I like to eat apple.", "Let's go buy some apples."]

Any help, please?


Solution 2

In [3]: re.findall(r"([^.]*?apple[^.]*\.)", txt)
Out[3]: ['I like to eat apple.', " Let's go buy some apples."]
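The pattern above can be wrapped in a small helper, shown here as a sketch. The function name sentences_containing is hypothetical, and re.escape is added on the assumption that the search term should be treated as literal text rather than as regex syntax.

```python
import re

def sentences_containing(word, text):
    """Return every '.'-terminated sentence of text that contains word as a substring."""
    # [^.]* cannot cross a period, so each match stays inside one sentence.
    pattern = re.compile(r"[^.]*?" + re.escape(word) + r"[^.]*\.")
    # strip() drops the space left over after the previous sentence's period.
    return [match.strip() for match in pattern.findall(text)]

txt = "I like to eat apple. Me too. Let's go buy some apples."
print(sentences_containing("apple", txt))
```

Note that this still matches substrings, so "apples" counts as a hit for "apple", just as in the original pattern.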

OTHER TIPS

No need for regex:

>>> txt = "I like to eat apple. Me too. Let's go buy some apples."
>>> [sentence + '.' for sentence in txt.split('.') if 'apple' in sentence]
['I like to eat apple.', " Let's go buy some apples."]

In [7]: import re

In [8]: txt=".I like to eat apple. Me too. Let's go buy some apples."

In [9]: re.findall(r'([^.]*apple[^.]*)', txt)
Out[9]: ['I like to eat apple', " Let's go buy some apples"]

But note that @jamylak's split-based solution is faster:

In [10]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
1000000 loops, best of 3: 1.96 us per loop

In [11]: %timeit [s+ '.' for s in txt.split('.') if 'apple' in s]
1000000 loops, best of 3: 819 ns per loop

The speed difference is smaller, but still noticeable, for larger strings:

In [24]: txt = txt*10000

In [25]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
100 loops, best of 3: 8.49 ms per loop

In [26]: %timeit [s+'.' for s in txt.split('.') if 'apple' in s]
100 loops, best of 3: 6.35 ms per loop
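The timings above use IPython's %timeit magic. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The helper names with_regex and with_split and the repetition count are assumptions made for this sketch, and the absolute numbers will differ from machine to machine.

```python
import re
import timeit

txt = ".I like to eat apple. Me too. Let's go buy some apples."

def with_regex(text):
    # Same pattern as in the timing above.
    return re.findall(r'([^.]*apple[^.]*)', text)

def with_split(text):
    # Same list comprehension as in the timing above.
    return [s + '.' for s in text.split('.') if 'apple' in s]

RUNS = 10000  # arbitrary repetition count; raise it for steadier averages

for func in (with_regex, with_split):
    seconds = timeit.timeit(lambda: func(txt), number=RUNS)
    print(f"{func.__name__}: {seconds / RUNS * 1e9:.0f} ns per call")
```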

You can use str.split:

>>> txt="I like to eat apple. Me too. Let's go buy some apples."
>>> txt.split('. ')
['I like to eat apple', 'Me too', "Let's go buy some apples."]

>>> [ t for t in txt.split('. ') if 'apple' in t]
['I like to eat apple', "Let's go buy some apples."]

r"\."+".+"+"apple"+".+"+"\."

This line is a bit odd; why concatenate so many separate strings? You could just use r'\..+apple.+\.'.

Anyway, the problem with your regular expression is its greediness. By default, x+ matches x as many times as it possibly can. So your .+ will match as many characters (any characters) as possible, including dots and apples.

What you want to use instead is a non-greedy expression; you can usually get one by adding a ? after the quantifier: .+?

That gives the following result:

['.I like to eat apple. Me too.']

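As you can see, you no longer get both apple sentences in one match, but you still get the "Me too." sentence. That is because the .+? after apple has to match at least one character, so it consumes the period that ends the first sentence, and the match can then only finish at the next period, after "Me too".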

A working regular expression would be this: r'\.[^.]*?apple[^.]*?\.'

Here you no longer match arbitrary characters, only characters that are not dots. The * quantifier also allows matching no characters at all (needed because there are no non-dot characters after the apple in the first sentence). Using that expression results in this:

['.I like to eat apple.', ". Let's go buy some apples."]
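To see the three behaviours side by side, here is a short sketch that runs the greedy, non-greedy, and negated-character-class patterns against the same input. The dictionary labels are only illustrative names chosen for this sketch.

```python
import re

txt = ".I like to eat apple. Me too. Let's go buy some apples."

patterns = {
    "greedy": r"\..+apple.+\.",                  # .+ runs to the last period
    "lazy": r"\..+?apple.+?\.",                  # .+? must still consume one character after apple
    "negated class": r"\.[^.]*?apple[^.]*?\.",   # [^.] can never cross a period
}

results = {name: re.findall(pattern, txt) for name, pattern in patterns.items()}
for name, found in results.items():
    print(f"{name:>13}: {found}")
```

Only the negated character class keeps each match inside a single sentence.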

Strictly speaking, the sample in the question extracts sentences containing a substring rather than sentences containing a word. Here is how to solve the "sentence containing a word" problem in Python.

A word can be at the beginning, in the middle, or at the end of a sentence. Not limited to the example in the question, here is a general function that searches for a word in a sentence:

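import re

def searchWordinSentence(word, sentence):
    # Match the word only when it is delimited by spaces or by the start/end of the sentence
    pattern = re.compile('(^| )' + re.escape(word) + '( |$)')
    return re.search(pattern, sentence) is not None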

Limited to the example in the question, we can solve it like this:

txt="I like to eat apple. Me too. Let's go buy some apples."
word = "apple"
print([t for t in txt.split('. ') if searchWordinSentence(word, t)])

The corresponding output is:

['I like to eat apple']
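As an alternative to space-delimited matching, the regex word-boundary anchor \b can express the same idea and also copes with punctuation next to the word. This is a sketch, not the answer's original method: the function name contains_word is hypothetical, and it assumes the search term starts and ends with a letter, digit, or underscore, since \b is defined relative to those characters.

```python
import re

def contains_word(word, sentence):
    """Return True if word occurs in sentence as a whole word."""
    # \b matches between a word character and a non-word character,
    # so "apple" is found before "." or "," but not inside "apples".
    return re.search(r"\b" + re.escape(word) + r"\b", sentence) is not None

txt = "I like to eat apple. Me too. Let's go buy some apples."
matches = [s for s in txt.split('. ') if contains_word("apple", s)]
print(matches)
```

With the sample text this selects only the first sentence, because "apples" is a different word.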
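Another option is NLTK's sentence tokenizer:

import nltk  # sent_tokenize requires NLTK's Punkt tokenizer data to be downloaded first
search = "test"
text = "This is a test text! Best text ever. Cool"
# Keep only the tokenized sentences that contain the search string
contains = [s for s in nltk.sent_tokenize(text) if search in s]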
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow