Question

I am not able to split sentences on \n or \r using the Stanford NLP WordsToSentencesAnnotator. I am following the code described at http://nlp.stanford.edu/software/sutime.shtml, but with a custom sentence splitter:

public static void main(String[] args) {
Properties props = new Properties();
AnnotationPipeline pipeline = new AnnotationPipeline();
pipeline.addAnnotator(new PTBTokenizerAnnotator(false));
pipeline.addAnnotator(new WordsToSentencesAnnotator(false,"\n"));
pipeline.addAnnotator(new POSTaggerAnnotator(false));
pipeline.addAnnotator(new TimeAnnotator("sutime", props));

...

I am using version 1.3.5 of the corenlp jar. I also tried \r, \r\n, etc. in place of \n, but nothing seems to work. Any help?

Solution

Well, that is not the way I would build a pipeline, but have you tried:

WordsToSentencesAnnotator.newlineSplitter(false, "\n");

Instead, I would try something more like:

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

to interact with the pipeline. According to the Stanford NLP page, "SUTime annotations are provided automatically with the StanfordCoreNLP pipeline by including the ner annotator", so you should be able to accomplish the same thing. Your sentence-splitting annotator is ssplit. The following options are available for ssplit (again taken from the Stanford NLP page):

  • ssplit.eolonly: only split sentences on newlines. Works well in conjunction with "-tokenize.whitespace true", in which case StanfordCoreNLP will treat the input as one sentence per line, only separating words on whitespace.
  • ssplit.isOneSentence: each document is to be treated as one sentence, no sentence splitting at all.
  • ssplit.newlineIsSentenceBreak: Whether to treat newlines as sentence breaks. This property has 3 legal values: "always", "never", or "two". The default is "two". "always" means that a newline is always a sentence break (but there still may be multiple sentences per line). This is often appropriate for texts with soft line breaks. "never" means to ignore newlines for the purpose of sentence splitting. This is appropriate when just the non-whitespace characters should be used to determine sentence breaks. "two" means that two or more consecutive newlines will be treated as a sentence break. This option can be appropriate when dealing with text with hard line breaking, and a blank line between paragraphs.
  • ssplit.boundaryMultiTokenRegex: Value is a multi-token sentence boundary regex.
  • ssplit.boundaryTokenRegex:
  • ssplit.boundariesToDiscard:
  • ssplit.htmlBoundariesToDiscard:
  • ssplit.tokenPatternsToDiscard:
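
As a rough sketch of how the options above could be set, the snippet below only builds the Properties objects, so it runs without the CoreNLP jar. The class name SsplitConfig and its method names are made up for illustration. The property keys come from the list above, and whether version 1.3.5 honours all of them is an assumption you would need to check against your jar.

```java
import java.util.Properties;

// Hypothetical helper: builds Properties for newline-based sentence splitting.
public class SsplitConfig {

    // Every newline is a sentence break; there may still be several
    // sentences on one line, as described for the "always" value above.
    static Properties newlineAlways() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        props.setProperty("ssplit.newlineIsSentenceBreak", "always");
        return props;
    }

    // One sentence per line, with words separated only on whitespace.
    static Properties oneSentencePerLine() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        props.setProperty("tokenize.whitespace", "true");
        props.setProperty("ssplit.eolonly", "true");
        return props;
    }

    public static void main(String[] args) {
        // Show the sentence-splitting settings chosen by each variant.
        System.out.println(newlineAlways().getProperty("ssplit.newlineIsSentenceBreak"));
        System.out.println(oneSentencePerLine().getProperty("ssplit.eolonly"));
    }
}
```

Either Properties object would then be passed to new StanfordCoreNLP(props), as in the snippet earlier in this answer.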
Licensed under: CC-BY-SA with attribution