Question

I have a text that is broken across many lines in no particular format, so I call line.strip('\n') on each line. I then want to split the text into sentences at the sentence-ending period (.), with these rules:

  1. Split at a period that is followed by whitespace (\s) and then by a capital letter [A-Z] or a non-whitespace character such as " or '.
  2. Do not split at a period between a digit and a letter ([0-9]\.[A-Za-z]), as in "1.stackoverflow real time solution".

My program handles only half of rule 1: a period followed by \s and [A-Z]. Here is the code:

# -*- coding: utf-8 -*-
import re
import sys

# Period, whitespace, then a capital letter or an opening curly quote.
BOUNDARY = re.compile(r'\.\s+([A-Z“])')

with open(sys.argv[1], encoding='utf-8') as source, \
        open(sys.argv[2], 'w', encoding='utf-8') as dest:
    for line in source:
        line1 = line.strip('\n')
        # Replace the whitespace after the period with a newline.
        dest.write(BOUNDARY.sub(r'.\n\g<1>', line1))

I'd also like to know the best way to master regex. I find it confusing.

Solution

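To include the single quote in the character class, escape it with a \. The escape is needed only because the pattern sits inside a single-quoted string literal; the regex engine itself does not require it. The regex should be: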

\.\s+[A-Z"\']

That's really all you need. You only have to tell a regex what to match; you don't have to specify what you don't want to match. Anything that doesn't fit the pattern won't match.

This regex will match any period followed by whitespace followed by a capital letter or a quote. Since a period immediately preceded by a number and immediately followed by a letter doesn't meet those criteria, it won't match.
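A quick way to check this claim is to list what the pattern matches in a sample string. This is a minimal Python 3 sketch; the sample text and the name BOUNDARY are made up for illustration:

```python
import re

# Pattern from the answer: period, whitespace, then a capital letter or a quote.
BOUNDARY = re.compile(r'\.\s+[A-Z"\']')

# Hypothetical sample text, not taken from the question's input file.
text = 'Try 1.stackoverflow real time solution. It works. "Really," he said.'

# Each match is a sentence boundary the pattern found.
matches = BOUNDARY.findall(text)
print(matches)

# The digit-period-letter case has no whitespace after the period, so no match.
print(BOUNDARY.search('1.stackoverflow'))
```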

This assumes that the regex you had was already splitting at a period followed by whitespace followed by a capital, as you stated. Note, however, that splitting on this pattern consumes everything it matches, so "I am Sam. Sam I am." would split into "I am Sam" and "am I am." Is that really what you want? If not, use zero-width assertions (lookbehind and lookahead) for the parts you want to test but also keep. Here are your options, ordered from most to least likely to be what you want.
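The lost characters are easy to see by splitting on the plain pattern, sketched here with Python's re.split:

```python
import re

# The plain pattern consumes the period, the whitespace AND the first
# character of the next sentence, so all three vanish from the result.
parts = re.split(r'\.\s+[A-Z"\']', 'I am Sam. Sam I am.')
print(parts)  # ['I am Sam', 'am I am.']
```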

1) Keep the period and the first letter or opening quote of the next sentence; lose the whitespace:

(?<=\.)\s+(?=[A-Z"\'])

This will split the example above into "I am Sam." and "Sam I am."
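Applied with Python's re.split, option 1 might look like the sketch below. The helper name split_sentences and the second sample string are assumptions for illustration, not part of the original question:

```python
import re

# Lookbehind keeps the period with the sentence before the gap; lookahead
# keeps the capital letter or quote with the sentence after it.
SENTENCE_GAP = re.compile(r'(?<=\.)\s+(?=[A-Z"\'])')

def split_sentences(text):
    """Split text at whitespace that sits between two sentences."""
    return SENTENCE_GAP.split(text.strip())

sentences = split_sentences('I am Sam. Sam I am.')
print(sentences)  # ['I am Sam.', 'Sam I am.']

# A period between a digit and a letter is left alone (rule 2 of the question).
numbered = split_sentences('See 1.stackoverflow real time solution. Then stop.')
print(numbered)  # ['See 1.stackoverflow real time solution.', 'Then stop.']
```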

2) Keep the first letter of the next sentence; lose the period and whitespace:

\.\s+(?=[A-Z"\'])

This will split into "I am Sam" and "Sam I am", provided more sentences follow. Otherwise the period stays with the second sentence, because it is not followed by whitespace and a capital letter or quote. If this is the option you want (sentences without their periods), you may also want to match a period followed by the end of the string, with optional whitespace in between, so that the final period and any trailing whitespace are dropped:

\.(?:\s+(?=[A-Z"\'])|\s*$)

Note the ?:. You need non-capturing parentheses because, if the pattern passed to split contains capture groups, whatever each group captures is added to the results as an extra element (e.g. re.split(r'(\+)', 'a+b+c') gives you ['a', '+', 'b', '+', 'c'] rather than just ['a', 'b', 'c']).
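A short Python sketch of option 2, together with the capture-group behaviour described above. One detail to be aware of: because the pattern also matches the final period, re.split returns an empty string as the last element, which the sketch filters out.

```python
import re

text = 'I am Sam. Sam I am.'

# Drop periods and inter-sentence whitespace, including the final period.
raw = re.split(r'\.(?:\s+(?=[A-Z"\'])|\s*$)', text)
print(raw)      # ['I am Sam', 'Sam I am', '']

# The match at the very end leaves a trailing empty string; discard it.
cleaned = [piece for piece in raw if piece]
print(cleaned)  # ['I am Sam', 'Sam I am']

# A capture group makes split return the separators as well.
with_group = re.split(r'(\+)', 'a+b+c')
without_group = re.split(r'\+', 'a+b+c')
print(with_group)     # ['a', '+', 'b', '+', 'c']
print(without_group)  # ['a', 'b', 'c']
```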

3) Keep everything; whitespace goes with the preceding sentence:

(?<=\.\s+)(?=[A-Z"\'])

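This will give you "I am Sam. " (with its trailing space) and "Sam I am." Note that this pattern needs a regex engine that supports variable-length lookbehind; Python's re module accepts only fixed-width lookbehind and rejects the \s+ inside (?<=...).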
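For Python's built-in re module, two possible workarounds are sketched below. Both are assumptions about how to adapt option 3, not part of the original answer. The first uses a fixed-width lookbehind, so it only looks back over a single whitespace character, and it relies on re.split splitting at empty matches, which requires Python 3.7 or later. The second collects the sentences with re.findall instead of splitting; it handles runs of whitespace but drops any trailing text that does not end in a period.

```python
import re

# Workaround 1: fixed-width lookbehind (exactly one whitespace character).
pieces_fixed = re.split(r'(?<=\.\s)(?=[A-Z"\'])', 'I am Sam. Sam I am.')
print(pieces_fixed)  # ['I am Sam. ', 'Sam I am.']

# Workaround 2: match each sentence plus the whitespace that follows it.
SENTENCE = re.compile(r'.+?\.(?:\s+(?=[A-Z"\'])|\s*$)', re.DOTALL)
pieces_found = SENTENCE.findall('I am Sam.  Sam I am.')
print(pieces_found)  # ['I am Sam.  ', 'Sam I am.']
```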

Regarding the last part of your question, the best resource for regex syntax I've seen is http://www.regular-expressions.info. Start with this summary: http://www.regular-expressions.info/reference.html Then go to the Tutorial page for more advanced details: http://www.regular-expressions.info/tutorial.html

Licensed under: CC-BY-SA with attribution