Question

I'm trying to split a paragraph into sentences. I chose '.', '?' and '!' as the delimiters. I tried:

format = r'((! )|(. )|(? ))'
delimiter = re.compile(format)
s = delimiter.split(line)

but it gives me sre_constants.error: unexpected end of pattern

I also tried

format = [r'(! )',r'(? )',r'(. )']
delimiter = re.compile(r'|'.join(format))

This also raises an error.

What's wrong with my method?

Solution

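. (wildcard) and ? (a quantifier meaning "zero or one") are special regex characters; to match them literally you need to escape them with a backslash (\. and \?). In your pattern, (? is parsed as the start of an extension group such as (?:...), which is why compilation fails.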
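As a minimal sketch of the escaped version, assuming the same three delimiters, each followed by a space (the sample text in `line` is made up for illustration):

```python
import re

# Hypothetical sample paragraph, for illustration only.
line = "Hello there. How are you? Fine! Bye."

# Escape '.' and '?' so they match literally; '!' is not special.
delimiter = re.compile(r'! |\. |\? ')
sentences = delimiter.split(line)
print(sentences)  # ['Hello there', 'How are you', 'Fine', 'Bye.']
```

The capturing groups from the question are left out here, because `split` would otherwise include each matched delimiter in the result list.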

However, in your case it would be much simpler to use a character class (inside which these characters aren't special anymore):

re.split(r'[!.?] ', line)

A character class [...] stands for "one character, any of the ones included inside the character class".
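A short sketch of the character-class approach on a made-up paragraph (the `line` text is an assumption for illustration). The second pattern is an optional variant that goes beyond the answer above: it uses a lookbehind so that only the space is consumed and each sentence keeps its punctuation.

```python
import re

# Hypothetical sample paragraph, for illustration only.
line = "Hello there. How are you? Fine! Bye."

# Inside [...] the characters '.', '?' and '!' are literal, so no escaping is needed.
sentences = re.split(r'[!.?] ', line)
print(sentences)  # ['Hello there', 'How are you', 'Fine', 'Bye.']

# Variant: split on a space that is preceded by one of the delimiters,
# so the punctuation stays attached to its sentence.
with_punct = re.split(r'(?<=[!.?]) ', line)
print(with_punct)  # ['Hello there.', 'How are you?', 'Fine!', 'Bye.']
```

Note that the final "Bye." is not followed by a space, so neither pattern splits after it.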

Licensed under: CC-BY-SA with attribution