Question

I'm using the Stanford CoreNLP parsers to run through some text that contains date phrases, such as 'the second Monday in October' and 'the past year'. The library correctly tags each token as a DATE named entity, but is there a way to programmatically get the whole date phrase? This isn't limited to dates: ORGANIZATION named entities behave the same way ("The International Olympic Committee", for example, could be one identified in a given text).

String content = "Thanksgiving, or Thanksgiving Day (Canadian French: Jour de"
        + " l'Action de grâce), occurring on the second Monday in October, is"
        + " an annual Canadian holiday which celebrates the harvest and other"
        + " blessings of the past year.";

Properties p = new Properties();
p.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(p);

Annotation document = new Annotation(content);
pipeline.annotate(document);

for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {

        String word = token.get(CoreAnnotations.TextAnnotation.class);
        String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);

        // Constant-first comparison stays safe if a token carries no NE tag.
        if ("DATE".equals(ne)) {
            System.out.println("DATE: " + word);
        }

    }
}

After the Stanford annotators and classifiers have loaded, this yields the following output:

DATE: Thanksgiving
DATE: Thanksgiving
DATE: the
DATE: second
DATE: Monday
DATE: in
DATE: October
DATE: the
DATE: past
DATE: year

I feel like the library has to be recognizing the phrases and using them for the named entity tagging, so the question is: is that data kept and made available somehow through the API?

Thanks, Kevin


Solution

After discussions on the mailing list, I found that the API does not support this. My solution was to keep track of the previous token's NE tag and build up a string whenever consecutive tokens share the same tag. John B. from the NLP mailing list was helpful in answering my question.
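The approach described above could look roughly like the following sketch. It is not the original poster's code: the class `EntityPhraseGrouper`, the nested `Phrase` type, and the `group` method are hypothetical names made up for illustration. To keep it self-contained it takes plain lists of words and tags; in a CoreNLP pipeline these would presumably come from each token's `TextAnnotation` and `NamedEntityTagAnnotation`, as in the question. It also assumes that "O" marks tokens outside any entity.

```java
import java.util.ArrayList;
import java.util.List;

public class EntityPhraseGrouper {

    /** A run of adjacent tokens that share one named-entity tag. */
    public static final class Phrase {
        public final String tag;
        public final String text;

        Phrase(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }

        @Override
        public String toString() {
            return tag + ": " + text;
        }
    }

    /**
     * Groups consecutive tokens with the same non-"O" tag into phrases.
     * Assumes words and tags are parallel lists of equal length.
     */
    public static List<Phrase> group(List<String> words, List<String> tags) {
        List<Phrase> phrases = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String lastTag = "O"; // state carried over from the previous token

        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i) == null ? "O" : tags.get(i);

            // A tag change closes the phrase that was being built, if any.
            if (!tag.equals(lastTag)) {
                if (!"O".equals(lastTag)) {
                    phrases.add(new Phrase(lastTag, current.toString()));
                }
                current.setLength(0);
            }

            // Tokens inside an entity are appended to the current phrase.
            if (!"O".equals(tag)) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
            lastTag = tag;
        }

        // Flush a phrase that runs up to the end of the input.
        if (!"O".equals(lastTag)) {
            phrases.add(new Phrase(lastTag, current.toString()));
        }
        return phrases;
    }
}
```

One limitation of this kind of grouping: two distinct entities of the same type that sit directly next to each other, with no untagged token between them, are merged into a single phrase.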

OTHER TIPS

The named entity tagger and part-of-speech tagger are distinct algorithms within the CoreNLP pipeline and it seems the API consumer is tasked with integrating them.

Please forgive my C# but here is a simple class:

    public class NamedNounPhrase
    {
        public NamedNounPhrase()
        {
            Phrase = string.Empty;
            Tags = new List<string>();
        }

        public string Phrase { get; set; }

        public IList<string> Tags { get; set; }

    }

and some code to find all the top-level noun phrases and their associated named entity tags:

    private void _monkey()
    {

        ...

        var nounPhrases = new List<NamedNounPhrase>();

        foreach (CoreMap sentence in sentences.toArray())
        {
            var tree =
                (Tree)sentence.get(new TreeCoreAnnotations.TreeAnnotation().getClass());

            if (null != tree)
                _walk(tree, nounPhrases);
        }

        foreach (var nounPhrase in nounPhrases)
            Console.WriteLine(
                "{0} ({1})",
                nounPhrase.Phrase,
                string.Join(", ", nounPhrase.Tags)
                );
    }

    private void _walk(Tree tree, IList<NamedNounPhrase> nounPhrases)
    {
        if ("NP" == tree.value())
        {
            var nounPhrase = new NamedNounPhrase();

            foreach (Tree leaf in tree.getLeaves().toArray())
            {
                var label = (CoreLabel) leaf.label();
                nounPhrase.Phrase += (string) label.get(new CoreAnnotations.TextAnnotation().getClass()) + " ";
                nounPhrase.Tags.Add((string) label.get(new CoreAnnotations.NamedEntityTagAnnotation().getClass()));
            }

            nounPhrases.Add(nounPhrase);
        }
        else
        {
            foreach (var child in tree.children())
            {
                _walk(child, nounPhrases);
            }
        }
    }

Hope that helps!

Thanks a lot, I was going to do the same. The Stanford NER API, however, supports classifyToCharacterOffsets, which returns the character offsets of each entity so you can get the whole phrase. I don't know, maybe it is just an implementation of your idea :D.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow