String patternStr2 = "(?<!\\..)(?<![A-Z].)[\\.\\?!](?!.\\.)";
Then, by calling Java's Matcher find() method repeatedly, you can locate the end of each sentence and extract all the sentences.
Include the period in a sentence - regular expression
29-05-2022
Question
I have 40,000 lines and need to divide each line into separate sentences. I'm currently using a pattern like this:
String patternStr2 = "\\s*[\"']?\\s*([A-Z0-9].*?[\\.\\?!]\\s)['\"]?\\s*";
It handles almost all sentences, but a sentence like "U.S. Navy, World War I." gets divided into two parts: "U.S." and "Navy, World War I."
Is there any way to handle this problem?
Solution 3
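The solution's pattern, (?<!\..)(?<![A-Z].)[\.\?!](?!.\.), matches only the punctuation that ends a sentence, so the sentences themselves have to be cut out between successive find() hits. The following is a minimal sketch of that loop, not the original poster's code: the class name SentenceFinder, the method sentences, and the sample line are made up for illustration. Note that the (?<![A-Z].) lookbehind also rejects a period after an all-caps word such as "III.", so the sketch keeps any unmatched trailing text as a final sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class SentenceFinder {
    // Pattern from the solution: one sentence-ending '.', '?' or '!'.
    private static final Pattern SENTENCE_END =
            Pattern.compile("(?<!\\..)(?<![A-Z].)[\\.\\?!](?!.\\.)");

    static List<String> sentences(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = SENTENCE_END.matcher(line);
        int start = 0;
        while (m.find()) {
            // Each sentence runs from the previous boundary through the matched punctuation.
            String sentence = line.substring(start, m.end()).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
            start = m.end();
        }
        // Keep any text after the last match (assumption: it is still a sentence).
        String tail = line.substring(start).trim();
        if (!tail.isEmpty()) {
            result.add(tail);
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "U.S. Navy, World War I. He joined in 1917. Was it cold?";
        for (String sentence : sentences(line)) {
            System.out.println(sentence);
        }
    }
}
```

Because the punctuation is part of each match, the sentences keep their closing period, question mark, or exclamation mark.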
Other hints
Ok I think you should not use regex for this, but I couldn't resist throwing in some.
If this is hard to understand let me know and I'll add some comments...
package test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Group 2 is the first sentence; group 5 is the rest of the line, if any.
    // The negative lookbehind (?<!\.[A-Z]) rejects a ". " that directly
    // follows a dot and a capital letter, as in "U.S. ".
    private static final Pattern SENTENCE_DELIMITER =
            Pattern.compile("((.+?)((?<!\\.[A-Z])(\\.\\s)(.+))?)");

    public static void main(String[] args) {
        String lineWithOneSentence =
                "U.S. Navy, World War I";
        String lineWithTwoSentences =
                "U.S. Navy, World War I. U.S. Air Force, World War III.";
        printSentences(lineWithOneSentence);
        printSentences(lineWithTwoSentences);
    }

    // Prints the whole match and both sentence groups for one line.
    private static void printSentences(String line) {
        Matcher matcher = SENTENCE_DELIMITER.matcher(line);
        if (matcher.matches()) {
            System.out.println("WHOLE MATCH: " + matcher.group(0));
            System.out.println("FIRST SENTENCE: " + matcher.group(2));
            // group(5) is null when the line holds only one sentence.
            System.out.println("SECOND SENTENCE: " + matcher.group(5));
        }
    }
}
The workaround here is to:
- Use groups
- Use a negative lookbehind on each dot followed by a space, to ensure it is not preceded by a dot and a capital letter (as in "U.S. ")
This is rather overkill and will probably cause problems at some point, e.g. if the punctuation in your text is inconsistent.
Output:
WHOLE MATCH: U.S. Navy, World War I
FIRST SENTENCE: U.S. Navy, World War I
SECOND SENTENCE: null
WHOLE MATCH: U.S. Navy, World War I. U.S. Air Force, World War III.
FIRST SENTENCE: U.S. Navy, World War I
SECOND SENTENCE: U.S. Air Force, World War III.
Why are you trying to match when you want to split?
Use the following regex:
(?<!\..)\.(?!.\.)
Explanation:
(?<!\..) : Negative lookbehind; checks that there is no period two characters behind.
\. : Matches a period.
(?!.\.) : Negative lookahead; checks that there is no period two characters ahead.
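Note: I'm not sure how to do this in Java, but I think you should try (?<!\\..)\\.(?!.\\.), which is the same regex with each backslash doubled for a Java string literal. Also, don't forget to add the period back to each split sentence, since splitting removes it.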
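In Java, the split approach could look roughly like the sketch below. It passes the regex from this answer to String.split and then puts the period back on each piece. The class name SentenceSplitter and the method split are hypothetical, and the sketch assumes every sentence in the line ends with a period; a piece that still ends in a dot (for example a line ending in "U.S.") is left as it is.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class SentenceSplitter {
    // Regex from the answer, with backslashes doubled for a Java string literal.
    private static final String SPLIT_REGEX = "(?<!\\..)\\.(?!.\\.)";

    static List<String> split(String line) {
        List<String> result = new ArrayList<>();
        for (String part : line.split(SPLIT_REGEX)) {
            String sentence = part.trim();
            if (sentence.isEmpty()) {
                continue;
            }
            // split() consumed the period, so add it back unless the piece
            // already ends with a dot (assumption: sentences end with '.').
            result.add(sentence.endsWith(".") ? sentence : sentence + ".");
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "U.S. Navy, World War I. U.S. Air Force, World War III.";
        for (String sentence : split(line)) {
            System.out.println(sentence);
        }
    }
}
```

This regex only splits on periods, so sentences ending in "?" or "!" would need the character class from the question's pattern instead.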