Question

I have a set of article descriptions where I have to split the texts into sentences. My first implementation uses the OpenNLP tool sentdetect, which works very well but is too slow for my purpose. Is there anything similar that performs faster and produces output of similar or only slightly worse quality?

Note: I'm working with (a huge number of) short editorial German texts.

Solution

Yes, it helps to mention that you're working with German :)

A regex-based sentence detector with a list of abbreviations can be found in GATE. It uses the three files located here. The regular expressions are pretty simple:

//two or more consecutive line breaks (an empty line)
(?:[\u00A0\u2007\u202F\p{javaWhitespace}&&[^\n\r]])*(\n\r|\r\n|\n|\r)(?:(?:[\u00A0\u2007\u202F\p{javaWhitespace}&&[^\n\r]])*\1)+

//between 1 and 3 full stops, optionally followed by a double quote
\.{1,3}"?

//up to 4 ! or ? in sequence, optionally followed by a double quote
(!|\?){1,4}"?

The code that uses these 3 files can be found here.

I would enhance the regular expressions with others that can be found on the web, like this one.

Then I would add the German equivalents of the words in the GATE list. If that's not enough, I would go through a few of these abbreviation lists: 1, 2, and compile the list on my own.
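
For illustration, a blocking pattern for common German abbreviations could be built the same way the GATE code quoted further down builds its patterns (one alternative per list entry). The abbreviations here are my own picks, not the actual GATE list:

    import java.util.regex.Pattern;

    // illustrative German abbreviations (my own picks, not the actual GATE list);
    // each entry becomes one alternative of the blocking ("non split") pattern
    String[] abbreviations = { "z\\.\\s?B\\.", "bzw\\.", "usw\\.", "Dr\\.", "Prof\\.", "Nr\\.", "ca\\." };
    StringBuilder patternString = new StringBuilder();
    for (String abbr : abbreviations) {
        if (patternString.length() > 0) patternString.append("|");
        patternString.append("(?:\\b" + abbr + ")");
    }
    Pattern nonSplitsPattern = Pattern.compile(patternString.toString());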

EDIT:

If performance is so important, I wouldn't use the whole of GATE just for a sentence splitter - it would take time and memory to convert your texts to their documents, create annotations, then parse them back, etc.

I think the best way for you is to take the code from the RegexSentenceSplitter class (the link above) and adjust it to your context.

I think the code is too long to paste here. Have a look at the execute() method. In general, it finds all matches for the internal, external and blocking regular expressions, then iterates and keeps only those internal and external matches that don't overlap with any of the blocking ones.

Here are some fragments you should look at/reuse:

  • How the files are parsed

    // for each line
    if(patternString.length() > 0) patternString.append("|");
    patternString.append("(?:" + line + ")");
    
    //...
    return Pattern.compile(patternString.toString());
    
  • In the execute method, how the blocking (non split) regions are collected:

    Matcher nonSplitMatcher = nonSplitsPattern.matcher(docText);
    //store all non split locations in a list of pairs
    List<int[]> nonSplits = new LinkedList<int[]>();
    while(nonSplitMatcher.find()){
       nonSplits.add(new int[]{nonSplitMatcher.start(), nonSplitMatcher.end()});
    }
    

Also check the veto method, which "Checks whether a possible match is being vetoed by a non split match. A possible match is vetoed if it has any overlap with a veto region."
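
If you don't want to depend on GATE at all, here is a minimal condensed sketch of that match-then-veto loop (my own simplification, not the original class; the method name is hypothetical):

    import java.util.LinkedList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // splitsPattern: potential sentence ends (e.g. \.{1,3}"? or (!|\?){1,4}"?)
    // nonSplitsPattern: blocking regions such as abbreviations
    static List<Integer> findSplitOffsets(String docText, Pattern splitsPattern, Pattern nonSplitsPattern) {
        // collect all blocking (veto) regions first
        List<int[]> nonSplits = new LinkedList<int[]>();
        Matcher nonSplitMatcher = nonSplitsPattern.matcher(docText);
        while (nonSplitMatcher.find()) {
            nonSplits.add(new int[]{ nonSplitMatcher.start(), nonSplitMatcher.end() });
        }
        // keep only the split candidates that do not overlap any veto region
        List<Integer> splitOffsets = new LinkedList<Integer>();
        Matcher splitMatcher = splitsPattern.matcher(docText);
        while (splitMatcher.find()) {
            boolean vetoed = false;
            for (int[] region : nonSplits) {
                if (splitMatcher.start() < region[1] && splitMatcher.end() > region[0]) {
                    vetoed = true;
                    break;
                }
            }
            if (!vetoed) splitOffsets.add(splitMatcher.end());
        }
        return splitOffsets;
    }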

Hope this helps.

OTHER TIPS

Maybe String.split("\\. |\\? |! "); does it?
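
A quick illustration of what this naive split does with German text (the sample sentences are my own):

    String text = "Ich wurde am 17. Dezember geboren. Kurze Texte sind schwierig!";
    String[] parts = text.split("\\. |\\? |! ");
    // also breaks after "17." ->
    // ["Ich wurde am 17", "Dezember geboren", "Kurze Texte sind schwierig!"]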

In general, I think OpenNLP will be better (in terms of output quality) than rule-based segmenters like the Stanford segmenter or a hand-written regular expression. Rule-based segmenters are bound to miss some exceptions. For example, the German sentence "Ich wurde am 17. Dezember geboren" (I was born on 17th December) will be mistakenly broken into two sentences after "17." by a lot of rule-based segmenters, especially if they are built on English rules rather than German ones. Sentences like these will occur even if your text quality is really good, since they are grammatically correct German. It is therefore very important to check which language the segmenter you want to use was modelled on.
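
For reference, the statistical approach looks roughly like this (a minimal sketch; the class name is mine, and the path to the pre-trained German sentence model, here "de-sent.bin", is an assumption to adjust to your setup):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class OpenNlpSentenceExample {

    public static void main(String[] args) throws Exception {
        // the German sentence model file is an assumption; adjust the path to your setup
        try (InputStream modelIn = new FileInputStream("de-sent.bin")) {
            SentenceModel model = new SentenceModel(modelIn);
            SentenceDetectorME detector = new SentenceDetectorME(model);
            // the statistical model decides the boundaries, so ordinals like "17."
            // are usually not treated as sentence ends
            String[] sentences = detector.sentDetect(
                    "Ich wurde am 17. Dezember geboren. Das ist der zweite Satz.");
            for (String sentence : sentences) {
                System.out.println(sentence);
            }
        }
    }
}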

PS: Among OpenNLP, the BreakIterator segmenter and the Stanford segmenter, OpenNLP worked best for me.

It's probably worth mentioning that the Java standard API library provides locale-dependent functionality for detecting text boundaries. A BreakIterator can be used to determine sentence boundaries.
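
A minimal sketch of that approach for German (the class name and sample text are my own):

import java.text.BreakIterator;
import java.util.Locale;

public class BreakIteratorSentenceExample {

    public static void main(String[] args) {
        String text = "Das ist der erste Satz. Und hier kommt der zweite!";
        // locale-dependent sentence boundary detection from the standard library
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.GERMAN);
        iterator.setText(text);
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            System.out.println(text.substring(start, end).trim());
        }
    }
}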

There is one more solution. I don't know how it compares performance-wise to your current solution, but it is certainly the most comprehensive: you can use the ICU4J library and SRX files (the example below also uses the Okapi segmentation library). You can download ICU4J here: http://site.icu-project.org/download/52#TOC-ICU4J-Download. It works like a charm and is multilingual.

package srx;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

import net.sf.okapi.common.ISegmenter;
import net.sf.okapi.common.LocaleId;
import net.sf.okapi.common.Range;
import net.sf.okapi.lib.segmentation.LanguageMap;
import net.sf.okapi.lib.segmentation.Rule;
import net.sf.okapi.lib.segmentation.SRXDocument;

public class Main {

    /**
     * @param args args[0] = path to the SRX rules file, args[1] = the text to segment
     */
    public static void main(String[] args) {

        if (args.length != 2) return;

        SRXDocument doc = new SRXDocument();

        String srxRulesFilePath = args[0];
        String text = args[1];

        // load the segmentation rules from the SRX file
        doc.loadRules(srxRulesFilePath);

        // optional: inspect the rules and language mappings defined in the file
        LinkedHashMap<String, ArrayList<Rule>> rules = doc.getAllLanguageRules();
        ArrayList<LanguageMap> languages = doc.getAllLanguagesMaps();
        ArrayList<Rule> plRules = doc.getLanguageRules(languages.get(0).getRuleName());

        // compile the rules for the wanted locale (here Polish; use a German locale for your texts)
        LocaleId locale = LocaleId.fromString("pl_PL");
        ISegmenter segmenter = doc.compileLanguageRules(locale, null);

        // compute the sentence boundaries of the input text
        segmenter.computeSegments(text);

        List<Range> ranges = segmenter.getRanges();

        System.out.println(ranges.size());
        for (Range range : ranges) {
            System.out.println(range.start);
            System.out.println(range.end);
        }
    }
}