Question

I am writing a Python script to split text into sentences. However, I am not very good at writing more complex regular expressions.

There are five rules by which I want to split the text. Sentences should be split if they:

* end with "!"  or
* end with "?"  or
* end with "..."  or
* end with "." and the full stop is not followed by a number  or
* end with "." and the full stop is followed by a whitespace

What would the regular expression for this be in Python?


Solution

You can literally translate your five bullet points into a regular expression:

!|\?|\.{3}|\.\D|\.\s

Note that I'm simply creating an alternation consisting of five alternatives, each of which represents one of your bullet points:

  • !
  • \?
  • \.{3}
  • \.\D
  • \.\s

Since the dot (.) and the question mark (?) are special characters within a regular expression pattern, they need to be escaped by a backslash (\) to be treated as literals. The pipe (|) is the delimiting character between two alternatives.
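To illustrate the escaping, here is a small sketch on a made-up sample string (the variable names are mine, chosen for illustration). It compares an unescaped dot with the escaped forms, and shows that re.escape produces the same escaped text.

```python
import re

# Hypothetical sample string, chosen only to show the escaping behaviour.
sample = "a.b?c"

# Unescaped, "." is a wildcard: it matches any character except a newline.
any_char = re.findall(r".", sample)

# Escaped, "\." matches only a literal full stop.
literal_dot = re.findall(r"\.", sample)

# Escaped, "\?" matches only a literal question mark.
literal_qmark = re.findall(r"\?", sample)

# re.escape adds the backslashes for you.
escaped = re.escape(".") + "|" + re.escape("?")

print(any_char)       # ['a', '.', 'b', '?', 'c']
print(literal_dot)    # ['.']
print(literal_qmark)  # ['?']
print(escaped)        # \.|\?
```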

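Using the above regular expression, you can then split your text into sentences using re.split. Be aware that \.\D and \.\s also consume the character after the full stop, so re.split removes it from the start of the next sentence, and neither alternative matches a full stop at the very end of the text. Replacing both with the negative lookahead \.(?!\d) avoids both problems.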
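As a rough sketch, this is how the pattern behaves with re.split on a made-up sample text. It is shown next to a variant that uses a negative lookahead, which is my assumption about what the two full-stop rules are meant to achieve. The helper split_sentences and the sample text are hypothetical and not part of the question.

```python
import re

# Hypothetical sample text, not taken from the question.
text = "Hmm... It costs 3.50 today. Done.Next one!"

# The pattern as written above: \.\D and \.\s match the full stop
# together with the character that follows it.
literal = re.compile(r"!|\?|\.{3}|\.\D|\.\s")

# Variant (an assumption about the intent): the lookahead (?!\d) only
# checks that no digit follows, without consuming anything.
lookahead = re.compile(r"!|\?|\.{3}|\.(?!\d)")

def split_sentences(pattern, text):
    """Split text on the pattern, dropping empty and whitespace-only parts."""
    return [part.strip() for part in pattern.split(text) if part.strip()]

literal_result = split_sentences(literal, text)
lookahead_result = split_sentences(lookahead, text)

print(literal_result)
# ['Hmm', 'It costs 3.50 today', 'Done', 'ext one']
print(lookahead_result)
# ['Hmm', 'It costs 3.50 today', 'Done', 'Next one']
```

In both cases re.split discards the matched punctuation itself, and the decimal point in 3.50 is left alone because a digit follows it.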

Licensed under: CC-BY-SA with attribution