Finding dates and their positions within a string using Stanford NLP

https://stackoverflow.com/questions/17634675

03-06-2022

Question

I need to find the dates within a string, along with their positions. Consider the example string:

"The interesting date is 4 days from today and it is 20th july of this year, another date is 18th Feb 1997"

I need the following output (assuming today is 2013-07-14):
2013-07-18, position 25
2013-07-20, position 53
1997-02-18, position 93

I have managed to write code that extracts the parts of the string that are recognized as dates. I need to enhance or change it to produce the output above. Any hints or help are appreciated:

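    import edu.stanford.nlp.ling.CoreAnnotations
    import edu.stanford.nlp.pipeline.*
    import edu.stanford.nlp.time.*
    import edu.stanford.nlp.util.CoreMap

    // Minimal pipeline: tokenize, split sentences, POS-tag, then run SUTime
    Properties props = new Properties();
    AnnotationPipeline pipeline = new AnnotationPipeline();
    pipeline.addAnnotator(new PTBTokenizerAnnotator(false));
    pipeline.addAnnotator(new WordsToSentencesAnnotator(false));
    pipeline.addAnnotator(new POSTaggerAnnotator(false));
    pipeline.addAnnotator(new TimeAnnotator("sutime", props));

    Annotation annotation = new Annotation("The interesting date is 4 days from today and it is 20th july of this year, another date is 18th Feb 1997");
    // Reference date against which relative expressions such as "today" are resolved
    annotation.set(CoreAnnotations.DocDateAnnotation.class, "2013-07-14");
    pipeline.annotate(annotation);
    // One CoreMap per recognized temporal expression
    List<CoreMap> timexAnnsAll = annotation.get(TimeAnnotations.TimexAnnotations.class);
    timexAnnsAll.each(){
        println it
    }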

With the above code I get the output as:
4 days from today
20th july of this year
18th Feb 1997

EDIT:
I managed to get the date part with the following change:

    timexAnnsAll.each(){it ->
        Timex timex = it.get(TimeAnnotations.TimexAnnotation.class);
        println timex.val + " from : $it"
    }

Now the output is:
2013-07-18 from : 4 days from today
2013-07-20 from : 20th july of this year
1997-02-18 from : 18th Feb 1997

All I need to solve now is to find the position of the date within the original string.

Solution

Each CoreMap in the list returned by annotation.get(TimeAnnotations.TimexAnnotations.class) is an Annotation, so you can read other attributes from it, such as its list of tokens, each of which stores character offset information. So you can finish off your example like this:

    List<CoreMap> timexAnnsAll = annotation.get(TimeAnnotations.TimexAnnotations.class);
    for (CoreMap cm : timexAnnsAll) {
      List<CoreLabel> tokens = cm.get(CoreAnnotations.TokensAnnotation.class);
      System.out.println(cm +
              " [from char offset " +
              tokens.get(0).get(CoreAnnotations.CharacterOffsetBeginAnnotation.class) +
              " to " +
              tokens.get(tokens.size() - 1).get(CoreAnnotations.CharacterOffsetEndAnnotation.class) +
              ']');
      /* -- This shows printing out each token and its character offsets
      for (CoreLabel token : tokens) {
        System.out.println(token +
                ", start: " + token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class) +
                ", end: " + token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class));
      }
      */
    }

Then the output is:

4 days from today [from char offset 24 to 41]
20th july of this year [from char offset 52 to 74]
18th Feb 1997 [from char offset 92 to 105]
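
Note that these offsets are zero-based and the end offset is exclusive, which is why the first one is 24 rather than the 25 in the question (that figure looks one-based). If you want to sanity-check the numbers without CoreNLP, here is a minimal sketch using only String.indexOf. The class name OffsetCheck and the helper span are made up for illustration and are not part of any library:

```java
// Hypothetical helper, not part of CoreNLP: locates a phrase with plain
// String.indexOf to cross-check the offsets reported by the tokens.
class OffsetCheck {

    // Returns {begin, end} with a zero-based begin and an exclusive end,
    // or null if the phrase does not occur in the text.
    static int[] span(String text, String phrase) {
        int begin = text.indexOf(phrase);
        if (begin < 0) {
            return null;
        }
        return new int[] { begin, begin + phrase.length() };
    }

    public static void main(String[] args) {
        String text = "The interesting date is 4 days from today and it is "
                + "20th july of this year, another date is 18th Feb 1997";
        String[] phrases = { "4 days from today", "20th july of this year", "18th Feb 1997" };
        for (String phrase : phrases) {
            int[] s = span(text, phrase);
            // Add 1 to the begin offset to get a one-based position.
            System.out.println(phrase + " [from char offset " + s[0] + " to " + s[1]
                    + "], one-based position " + (s[0] + 1));
        }
    }
}
```

indexOf only finds the first occurrence of a phrase, so treat this purely as a check for this example. When the same expression can appear more than once, the token offsets from the annotation are the thing to rely on.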

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow