How to match a paragraph using regex

Question 1

You can split on double-newline like this:

paragraphs = re.split(r"\n\n", DATA)
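As a minimal sketch of the split approach, assuming a made-up DATA string (not the asker's text) with three paragraphs separated by blank lines:

```python
import re

# Hypothetical sample text: three paragraphs separated by blank lines.
DATA = "First paragraph,\nsecond line.\n\nSecond paragraph.\n\nThird paragraph."

# Split wherever two newlines appear in a row.
paragraphs = re.split(r"\n\n", DATA)

# A variant that also tolerates runs of more than two newlines.
loose = re.split(r"\n{2,}", "a\n\n\nb")

print(paragraphs)
print(loose)
```

With the plain `\n\n` pattern, three newlines in a row leave a stray leading newline on the next paragraph, which is why the `\n{2,}` variant may be preferable for messy input.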

Edit: To capture the paragraphs as matches, so you can get their start and end points, do this:

for match in re.finditer(r'(?s)((?:[^\n][\n]?)+)', DATA):
    print(match.start(), match.end())

# Prints:
# 0 214
# 215 298
# 299 589
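To see where those offsets come from, here is a small sketch on a short made-up DATA string (an assumption, not the text from the question). Each match covers one paragraph plus the single newline that ends its last line, so consecutive spans are one character apart.

```python
import re

# Hypothetical sample: "ab\ncd" is a two-line paragraph, then "ef", then "gh".
DATA = "ab\ncd\n\nef\n\ngh"

spans = []
for match in re.finditer(r'(?s)((?:[^\n][\n]?)+)', DATA):
    spans.append((match.start(), match.end()))
    # rstrip drops the trailing newline that the pattern includes in the match.
    print(match.start(), match.end(), repr(match.group().rstrip("\n")))
```

Slicing DATA[start:end] with any of these spans gives back the paragraph text, including that trailing newline where one exists.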

Question 2

Splitting is one way; you can also do this with a regular expression, like this:

paragraphs = re.findall(r'(.+?\n\n|.+?$)', TEXT, re.DOTALL)

The .+? is a lazy match: it matches the shortest substring that still lets the whole regex match. A greedy .+ would instead run as far as it can, up to the last blank line or the end of the string.

So basically we want to find a sequence of characters (.+?) that ends with a blank line (\n\n) or the end of the string ($). The re.DOTALL flag makes the dot match newlines as well, because a paragraph may consist of several lines with no blank line between them.
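The difference between the lazy and greedy forms can be checked with a short sketch. TEXT here is a made-up sample, and re.findall is used so that every paragraph is collected rather than only the first match.

```python
import re

# Hypothetical sample text with three paragraphs.
TEXT = "First paragraph,\nstill first.\n\nSecond paragraph.\n\nThird paragraph."

# Lazy: each match stops at the first blank line it can reach.
lazy = re.findall(r"(.+?\n\n|.+?$)", TEXT, re.DOTALL)

# Greedy: the first match runs on to the last blank line in the text.
greedy = re.findall(r"(.+\n\n|.+$)", TEXT, re.DOTALL)

print(len(lazy), len(greedy))
print([p.strip() for p in lazy])
```

The lazy matches keep the trailing blank line, so a strip() on each result is usually wanted.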

Question 3

Try

^(.+?)\n\s*\n

or

^(.+?)\r\n\s*\r\n

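Just do not forget to append an extra newline to the end of the text; otherwise the last paragraph has no blank line after it and will not match. In most engines these patterns also need multiline mode, so that ^ matches at the start of every line, and a mode in which the dot matches newlines if a paragraph can span several lines.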
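A sketch of the first pattern in Python, under two assumptions that the answer does not state: re.MULTILINE so that ^ matches at every line start, and re.DOTALL so that a paragraph may span several lines. The sample text is made up.

```python
import re

# Hypothetical sample; the second separator line contains only spaces.
text = "First paragraph,\nstill first.\n\nSecond paragraph.\n   \nThird paragraph."

pattern = re.compile(r"^(.+?)\n\s*\n", re.MULTILINE | re.DOTALL)

# Without the extra newlines the last paragraph has nothing to end on.
without_padding = pattern.findall(text)

# Append a blank line so the last paragraph is terminated like the others.
paragraphs = pattern.findall(text + "\n\n")

print(without_padding)
print(paragraphs)
```

The \s* between the two newlines is what lets a separator line containing only spaces or tabs still count as a blank line.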

Question 4

I tried to use the recommended regex with the default Java regex engine. It threw a StackOverflowError several times, so in the end I rewrote the regex and optimized it a little more.

So this is working fine for me in Java:

(?s)(.*?[^\:\-\,])(?:$|\n{2,})

This also handles the end of the document when there is no trailing newline, and it tries to join lines that end with ':', '-' or ',' to the next paragraph.

To prevent trailing blanks (spaces or tabs) from breaking that feature, I strip them beforehand with the following regex (shown in POSIX bracket syntax; in java.util.regex the equivalent class is \p{Blank}):

(?m)[[:blank:]]+$
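For readers who want to try the two steps together, here is a rough port to Python's re module, which stands in for the Java engine purely as an illustration. Python has no [[:blank:]] class, so [ \t] is substituted for it, and the sample text is made up.

```python
import re

# Hypothetical sample; note the trailing blanks after "Items:".
text = "Items:  \n\napple,\n\nbanana\n\nSecond paragraph"

paragraph_re = re.compile(r"(?s)(.*?[^:\-,])(?:$|\n{2,})")

# Step 1: strip trailing spaces and tabs from every line.
stripped = re.sub(r"(?m)[ \t]+$", "", text)

# Step 2: a paragraph may only end after a character other than ':', '-' or ','.
paragraphs = [m.group(1) for m in paragraph_re.finditer(stripped)]

# For comparison: without step 1 the trailing blanks end the paragraph early.
unstripped_first = paragraph_re.search(text).group(1)

print(paragraphs)
print(repr(unstripped_first))
```

In this sample the lines ending in ':' and ',' stay attached to the text that follows them, so the whole list ends up in a single paragraph.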

Question 5

The following regex:

\w*\s*|\w|\D

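consumes all of the text in the paragraphs below, though as a series of small matches (roughly one per word or punctuation mark) rather than one match per paragraph: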

The brown fox aged 23 and a half, jumped over the lazy dog! The dog was furry, but not cute, and his mullet was greasy and black.

The next day the dog jumped over the fox - but the fox didn't enjoy it (or did he).

You can test this at https://regex101.com/r/Bvyuaq/1
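A quick way to inspect what this pattern returns is to list its matches. The sketch below assumes Python 3.7 or later (where a non-empty match may start right after an empty one) and uses a shortened, made-up version of the sample text.

```python
import re

# Hypothetical shortened sample with two paragraphs.
TEXT = "The brown fox aged 23 and a half, jumped!\n\nThe next day the dog jumped over the fox."

pieces = [m.group(0) for m in re.finditer(r"\w*\s*|\w|\D", TEXT)]
non_empty = [p for p in pieces if p]

# Show the first few matches and check that together they cover the text.
print(non_empty[:4])
print("".join(pieces) == TEXT)
```

The matches come out word by word, so getting whole paragraphs still requires grouping the pieces or using a pattern anchored on blank lines.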

Question 6

Which newline sequence does your text use? Suppose it is '\r\n'. To match the paragraphs that start with Lorem, you can do this:

import re

pattern = re.compile('\r\nLorem.*\r\n')
text = '...'    # your source text
matchlist = pattern.findall(text)

matchlist will contain every paragraph that starts with Lorem. The other two words can be handled the same way.
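One caveat with a pattern that consumes '\r\n' on both sides: a paragraph at the very start or end of the text, or one directly after another match, can be missed. The sketch below compares it with a swapped-in alternative that anchors on line starts instead. The sample text is made up, and each paragraph is assumed to sit on a single line.

```python
import re

# Hypothetical sample: each paragraph is one line, separated by '\r\n'.
text = "Lorem one.\r\nLorem two.\r\nOther.\r\nLorem three."

# Pattern from the answer: needs '\r\n' before and after the paragraph.
consuming = re.findall('\r\nLorem.*\r\n', text)

# Alternative: anchor at line starts and stop before the line ending.
anchored = re.findall(r"(?m)^Lorem[^\r\n]*", text)

print(consuming)
print(anchored)
```

The anchored form also returns the paragraphs without their surrounding line endings, which saves a strip() afterwards.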