open file and read sentence

Question 1

Your regex works on the text above if you do this:

import re

with open(filename) as f:
    text = f.read()

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The only problem is that the regex also splits on the dot in "Mr." in your text above, so you need to handle that case.

One solution, though not perfect, is to remove every occurrence of a dot after Mr:

text = re.sub(r'(M\w{1,2})\.', r'\1', text) # one re.sub call handles the whole text, so the earlier for loop is not needed

This matches an 'M' followed by at least one and at most two word characters (\w{1,2}, i.e. letters, digits or underscore), followed by a dot. The parenthesised part of the pattern is captured as a group and referenced in the replacement as '\1' (group 1; further parenthesised groups would be numbered 2, 3 and so on). So Mr. or Mrs. is matched in full, but only the Mr or Mrs part is captured, and the whole match is then replaced by the captured part, which drops the dot. It is not perfect because the pattern also matches any other capital M followed by one or two word characters and a dot, such as "Me." at the end of a sentence.
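To see what the substitution does on its own, here is a small sketch. The sample strings are made up for illustration and are not from your file:

```python
import re

# Hypothetical sample text, not taken from the original question.
sample = "Mr. Smith met Mrs. Jones. They spoke to Ms. Lee."

# Capture 'M' plus one or two word characters, then drop the dot that follows.
cleaned = re.sub(r'(M\w{1,2})\.', r'\1', sample)
print(cleaned)  # Mr Smith met Mrs Jones. They spoke to Ms Lee.

# The known weakness: a short word starting with 'M' at the end of a
# sentence loses its dot as well.
side_effect = re.sub(r'(M\w{1,2})\.', r'\1', "Give it to Me.")
print(side_effect)  # Give it to Me
```

The dots that really end a sentence are kept, while the ones after the titles are gone.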

And then:

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

will work the way you want.
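Putting the two steps together, a minimal sketch could look like the following. The function name split_sentences and the sample text are my own choices for illustration, and filtering out empty strings is an extra step I added, because re.split returns an empty string after the final terminator:

```python
import re

def split_sentences(text):
    # Drop the dot after titles such as Mr./Mrs./Ms. so it is not
    # treated as the end of a sentence.
    text = re.sub(r'(M\w{1,2})\.', r'\1', text)
    # Split on '.', '?' or '!', plus any closing quotes/brackets and
    # the spaces around them.
    parts = re.split(r' *[\.\?!][\'"\)\]]* *', text)
    # re.split leaves an empty string after the last terminator; drop it.
    return [p for p in parts if p]

# Hypothetical sample text, not taken from the original question.
sample = "Mr. Smith went home. Did he sleep? He did!"
print(split_sentences(sample))
# ['Mr Smith went home', 'Did he sleep', 'He did']
```

To use it on a file, pass the result of f.read() as text. Note that the split pattern only consumes spaces, so newlines in the file stay inside the returned sentences.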

Question 2

You may want to try out the text-sentence tokenizer module.

From their example code:

>>> from text_sentence import Tokenizer
>>> t = Tokenizer()
>>> list(t.tokenize("This is first sentence. This is second one!And this is third, is it?"))
[T('this'/sent_start), T('is'), T('first'), T('sentence'), T('.'/sent_end),
 T('this'/sent_start), T('is'), T('second'), T('one'), T('!'/sent_end),
 T('and'/sent_start), T('this'), T('is'), T('third'), T(','/inner_sep),
 T('is'), T('it'), T('?'/sent_end)]

I've never actually tried it though; I'd prefer using NLTK/punkt.


Thanks for your attention.