Sentence matching with regex

Question

To include the single quote in the character class, escape it with a \. The regex should be:

\.\s+[A-Z"\']

That's really all you need. A regex only has to describe what you want to match; you don't need to spell out what you don't want. Anything that doesn't fit the pattern simply won't match.

This regex will match any period followed by whitespace followed by a capital letter or a quote. Since a period immediately preceded by a number and immediately followed by a letter doesn't meet those criteria, it won't match.
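As a quick check of that claim, here is a minimal sketch in Python. The language and the sample string are my assumptions, since the question doesn't name either; the pattern is the one above.

```python
import re

# The pattern from above: period, whitespace, then a capital letter or a quote.
BOUNDARY = re.compile(r'\.\s+[A-Z"\']')

# Made-up sample text with decimals and a quoted sentence.
sample = 'Pi is about 3.14 today. "Nice," she said. It rounds to 3.1.'

# findall returns each matched boundary; the periods inside 3.14 and 3.1
# are not followed by whitespace, so they never match.
matches = BOUNDARY.findall(sample)
print(matches)  # ['. "', '. I']
```

The final period doesn't match either, because nothing follows it.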

This is assuming that the regex you had was working to split on a period followed by whitespace followed by a capital, as you stated. Note, however, that a split discards everything the pattern matches, including the capital letter, so I am Sam. Sam I am. would split into I am Sam and am I am. Is that really what you want? If not, use zero-width assertions (lookahead and lookbehind) for the parts that must be present but should stay in the output. Here are your options, ordered by what I think you most likely want.

1) Keep the period and the first letter or opening quote of the next sentence; lose the whitespace:

(?<=\.)\s+(?=[A-Z"\'])

This will split the example above into I am Sam. and Sam I am.
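Here is how that looks as a rough Python sketch. Python is my assumption, and the second sample string is made up to show that decimals and runs of whitespace behave as described.

```python
import re

# Option 1: the lookbehind keeps the period, the lookahead keeps the next
# sentence's first character; only the whitespace in between is consumed.
SPLIT_KEEP_PERIOD = re.compile(r'(?<=\.)\s+(?=[A-Z"\'])')

sentences = SPLIT_KEEP_PERIOD.split("I am Sam. Sam I am.")
print(sentences)  # ['I am Sam.', 'Sam I am.']

# A decimal point is left alone, and two spaces are dropped together.
quoted = SPLIT_KEEP_PERIOD.split("It costs 3.50 now.  'Cheap,' he said.")
print(quoted)  # ['It costs 3.50 now.', "'Cheap,' he said."]
```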

2) Keep the first letter of the next sentence; lose the period and whitespace:

\.\s+(?=[A-Z"\'])

This will split into I am Sam and Sam I am. This presumes that there are more sentences afterward; otherwise the period will stay with the second sentence, because it's not followed by whitespace and a capital letter or quote. If this is the option you want (the sentences without their periods), then you might also want to match a period followed by the end of the string, with optional intervening whitespace, so that the final period and any trailing whitespace are dropped too:

\.(?:\s+(?=[A-Z"\'])|\s*$)

Note the ?:. You need non-capturing parentheses, because if you have capture groups in a split, anything captured by the group is added as an element in the results (e.g. split('(\+)', 'a+b+c') gives you an array of a + b + c rather than just a b c).
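To make option 2 and the capture-group point concrete, here is a small Python sketch (again, Python is my assumption). One caveat specific to Python: when the pattern matches at the very end of the string, re.split returns a trailing empty string, which some other languages drop automatically, so I filter it out by hand.

```python
import re

text = "I am Sam. Sam I am."

# Option 2, simple form: the last period stays because nothing follows it.
simple = re.split(r'\.\s+(?=[A-Z"\'])', text)
print(simple)  # ['I am Sam', 'Sam I am.']

# Option 2 with the end-of-string alternative: the last period is consumed too.
with_end = re.split(r'\.(?:\s+(?=[A-Z"\'])|\s*$)', text)
print(with_end)  # ['I am Sam', 'Sam I am', '']

# Drop the empty trailing element that Python's re.split leaves behind.
cleaned = [part for part in with_end if part]
print(cleaned)  # ['I am Sam', 'Sam I am']

# Capture groups leak into the result; a plain pattern does not.
captured = re.split(r'(\+)', 'a+b+c')
plain = re.split(r'\+', 'a+b+c')
print(captured)  # ['a', '+', 'b', '+', 'c']
print(plain)     # ['a', 'b', 'c']
```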

3) Keep everything; whitespace goes with the preceding sentence:

(?<=\.\s+)(?=[A-Z"\'])

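This will give you I am Sam. (with its trailing whitespace still attached) and Sam I am. Be aware that not every regex engine accepts a variable-width lookbehind like (?<=\.\s+); .NET does, but many others require the lookbehind to have a fixed width.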
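If you happen to be using Python (an assumption on my part), the built-in re module rejects the lookbehind in option 3 because its width varies. One workaround, which is a different technique from the split above, is to match each sentence together with its trailing whitespace instead of splitting between them. A rough sketch:

```python
import re

text = "I am Sam. Sam I am."

# Python's re module raises an error for a variable-width lookbehind.
try:
    re.compile(r'(?<=\.\s+)(?=[A-Z"\'])')
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Workaround: match a sentence plus the whitespace after it. Each match is
# a lazy run of characters ending at a period that is followed either by
# whitespace and a capital or quote, or by the end of the string.
SENTENCE = re.compile(r'.+?\.(?:\s+(?=[A-Z"\'])|\s*$)', re.DOTALL)

pieces = SENTENCE.findall(text)
print(pieces)  # ['I am Sam. ', 'Sam I am.']
```

Unlike a true split, this drops any trailing text that doesn't end in a period, so treat it as a starting point rather than a drop-in replacement.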

Regarding the last part of your question, the best resource for regex syntax I've seen is http://www.regular-expressions.info. Start with this summary: http://www.regular-expressions.info/reference.html Then go to the Tutorial page for more advanced details: http://www.regular-expressions.info/tutorial.html