Question

I have 40,000 lines of text and need to split each line into sentences. I'm currently using a pattern like this:

String patternStr2 = "\\s*[\"']?\\s*([A-Z0-9].*?[\\.\\?!]\\s)['\"]?\\s*";

It handles almost all sentences, but a sentence like "U.S. Navy, World War I." is split into two parts: "U.S." and "Navy, World War I."

Is there any solution to handle this problem?
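
For reference, a minimal reproduction of the behaviour described above. It assumes the pattern is applied with Matcher.find() and that group 1 is taken as the sentence (the question does not say how the pattern is applied); the class name OriginalPatternDemo and the helper sentences() are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OriginalPatternDemo {
    // The pattern from the question, unchanged.
    private static final Pattern ORIGINAL =
            Pattern.compile("\\s*[\"']?\\s*([A-Z0-9].*?[\\.\\?!]\\s)['\"]?\\s*");

    // Collects group 1 of every match (assumed usage, see lead-in).
    static List<String> sentences(String line) {
        List<String> result = new ArrayList<>();
        Matcher matcher = ORIGINAL.matcher(line);
        while (matcher.find()) {
            result.add(matcher.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // The trailing space is needed because the pattern requires
        // whitespace after the terminator.
        for (String s : sentences("U.S. Navy, World War I. ")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

The lazy .*? stops at the first dot that is followed by whitespace, which is the second dot of "U.S.", so the abbreviation ends up as a sentence of its own.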


Solution 3

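String patternStr2 = "(?<!\\..)(?<![A-Z].)[\\.\\?!](?!.\\.)";

This pattern matches only the sentence-ending punctuation. Iterating over the matches with the Java Matcher find() method gives the end of each sentence, from which all the sentences can be extracted.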
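
A minimal sketch of how this pattern could be used with find(), assuming each sentence is the text between the end of the previous match and the end of the current one. The class name FindSentences and the helper sentences() are made up for illustration; they are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindSentences {
    // The pattern from the answer; it matches only the terminator character.
    private static final Pattern TERMINATOR =
            Pattern.compile("(?<!\\..)(?<![A-Z].)[\\.\\?!](?!.\\.)");

    static List<String> sentences(String line) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TERMINATOR.matcher(line);
        int start = 0;
        while (matcher.find()) {
            // matcher.end() is just past the terminator, so it stays attached.
            result.add(line.substring(start, matcher.end()).trim());
            start = matcher.end();
        }
        // Keep any text after the last terminator (or the whole line if none matched).
        String rest = line.substring(start).trim();
        if (!rest.isEmpty()) {
            result.add(rest);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("U.S. Navy, World War I. He joined in 1917! Did he?")) {
            System.out.println(s);
        }
    }
}
```

One limitation worth knowing: because of the (?<![A-Z].) lookbehind, a terminator is skipped whenever the character two positions before it is a capital letter, so sentences ending in "III." or "No." are not split at that point.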

OTHER TIPS

OK, I think you should not use regex for this, but I couldn't resist throwing one in.

If this is hard to understand let me know and I'll add some comments...

package test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Group 2: the first sentence (lazy, so it stops at the first valid delimiter).
    // Group 4: the delimiter, a dot followed by whitespace. The negative lookbehind
    //          rejects it when it is preceded by a dot and a capital letter (".S").
    // Group 5: everything after the delimiter (null if there is no delimiter).
    private static final Pattern SENTENCE_DELIMITER =
            Pattern.compile("((.+?)((?<!\\.[A-Z])(\\.\\s)(.+))?)");

    public static void main(String[] args) {
        String lineWithOneSentence =
                "U.S. Navy, World War I";
        String lineWithTwoSentences =
                "U.S. Navy, World War I. U.S. Air Force, World War III.";
        printSentences(lineWithOneSentence);
        printSentences(lineWithTwoSentences);
    }

    // Matches the whole line and prints the groups of interest.
    private static void printSentences(String line) {
        Matcher matcher = SENTENCE_DELIMITER.matcher(line);
        if (matcher.matches()) {
            System.out.println("WHOLE MATCH: " + matcher.group(0));
            System.out.println("FIRST SENTENCE: " + matcher.group(2));
            System.out.println("SECOND SENTENCE: " + matcher.group(5));
        }
    }
}

The workaround here is to:

  • Use groups
  • Use a negative lookbehind on each dot followed by a space, to ensure it is not preceded by a dot followed by a capital letter (as in "U.S. ")

This is rather overkill and will probably cause problems at some point, e.g. if the punctuation in your text is inconsistent.


Output:

WHOLE MATCH: U.S. Navy, World War I
FIRST SENTENCE: U.S. Navy, World War I
SECOND SENTENCE: null
WHOLE MATCH: U.S. Navy, World War I. U.S. Air Force, World War III.
FIRST SENTENCE: U.S. Navy, World War I
SECOND SENTENCE: U.S. Air Force, World War III.

Why are you trying to match when you want to split?

Use the following regex:

(?<!\..)\.(?!.\.)

Explanation:

  1. (?<!\..): Negative lookbehind, asserts that there is no dot two characters behind.

  2. \.: Matches a dot.

  3. (?!.\.): Negative lookahead, asserts that there is no dot two characters ahead.


Note: I'm not sure how to do this in Java, but I think you should try (?<!\\..)\\.(?!.\\.). Also, don't forget to add the dot back to each sentence after splitting, since the split consumes it.
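
A minimal Java sketch of the split approach, assuming String.split() with the escaped pattern and a dot appended to every non-empty piece. The class name SplitSentences and the helper split() are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSentences {
    // Java-escaped form of (?<!\..)\.(?!.\.)
    private static final String DELIMITER = "(?<!\\..)\\.(?!.\\.)";

    static List<String> split(String line) {
        List<String> result = new ArrayList<>();
        for (String part : line.split(DELIMITER)) {
            String sentence = part.trim();
            if (!sentence.isEmpty()) {
                // split() consumes the dot, so put it back.
                result.add(sentence + ".");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "U.S. Navy, World War I. U.S. Air Force, World War III.";
        for (String s : split(line)) {
            System.out.println(s);
        }
    }
}
```

Note that this sketch only splits on dots, not on "?" or "!", and it appends a dot to the last piece even if the line did not end with one.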

Licensed under: CC-BY-SA with attribution