Question

I'm performing POS tagging with the Stanford POS Tagger. The tagger only returns one possible tagging for the input sentence. For instance, when provided with the input sentence "The clown weeps.", the POS tagger produces the (erroneous) tagging "The_DT clown_NN weeps_NNS ._.".

However, my application will try to parse the result, and may reject a POS tagging because there is no way to parse it. Hence, in this example, it would reject "The_DT clown_NN weeps_NNS ._." but would accept "The_DT clown_NN weeps_VBZ ._.", which I assume the tagger considers a lower-confidence tagging.

I would therefore like the POS tagger to provide multiple hypotheses for the tagging of each word, annotated by some kind of confidence value. In this way, my application could choose the POS tagging with highest confidence that achieves a valid parsing for its purposes.

I have found no way to ask the Stanford POS Tagger to produce multiple (n-best) tagging hypotheses for each word (or even for the whole sentence). Is there a way to do this? (Alternatively, I would also be OK with using another POS tagger of comparable performance that supports this.)


Solution

OpenNLP supports retrieving the n-best tag sequences for POS tagging:

Some applications need to retrieve the n-best POS tag sequences and not only the best sequence. The topKSequences method is capable of returning the top sequences. It can be called in a similar way to tag.

Sequence[] topSequences = tagger.topKSequences(sent);

Each Sequence object contains one tag sequence. The tags can be retrieved via Sequence.getOutcomes(), which returns the array of tags, and Sequence.getProbs() returns the array of the corresponding probabilities.
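For instance, here is a minimal sketch of how this fits together, assuming OpenNLP 1.x; the class name TopKExample and the model path are placeholders of mine (en-pos-maxent.bin is the standard downloadable English maxent model):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.util.Sequence;

public class TopKExample {
    public static void main(String[] args) throws Exception {
        // Load a pre-trained English POS model (path is a placeholder).
        try (InputStream in = new FileInputStream("en-pos-maxent.bin")) {
            POSTaggerME tagger = new POSTaggerME(new POSModel(in));
            String[] sent = {"The", "clown", "weeps", "."};

            // n-best tag sequences for the whole sentence, best first.
            Sequence[] topSequences = tagger.topKSequences(sent);

            for (Sequence seq : topSequences) {
                List<String> tags = seq.getOutcomes();
                double[] probs = seq.getProbs();
                for (int i = 0; i < sent.length; i++) {
                    System.out.printf("%s_%s (%.3f) ", sent[i], tags.get(i), probs[i]);
                }
                System.out.println();
            }
        }
    }
}

Your application could then walk the returned sequences in order and keep the first tagging that it manages to parse.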

There is also a way to make spaCy do something like this:

import numpy
from spacy.language import Language
from spacy.pipeline import Tagger
from spacy.tokens import Doc, Token

# Store the per-token tag probabilities on the Doc, and expose each
# token's row of scores through a Token extension.
Doc.set_extension('tag_scores', default=None)
Token.set_extension('tag_scores', getter=lambda token: token.doc._.tag_scores[token.i])

class ProbabilityTagger(Tagger):
    def predict(self, docs):
        tokvecs = self.model.tok2vec(docs)
        scores = self.model.softmax(tokvecs)
        guesses = []
        for i, doc_scores in enumerate(scores):
            # Keep the full probability distribution over tags for each token.
            docs[i]._.tag_scores = doc_scores
            doc_guesses = doc_scores.argmax(axis=1)

            if not isinstance(doc_guesses, numpy.ndarray):
                doc_guesses = doc_guesses.get()  # copy from GPU if needed
            guesses.append(doc_guesses)
        return guesses, tokvecs

# Register the subclass so pipelines use it in place of the stock tagger.
Language.factories['tagger'] = lambda nlp, **cfg: ProbabilityTagger(nlp.vocab, **cfg)

Then each token will have a tag_scores extension holding the probabilities for each part of speech in spaCy's tag map. (Note that this targets spaCy v2's pipeline API.)

Source: https://github.com/explosion/spaCy/issues/2087

OTHER TIPS

I don't know of a tagger that offers several POS interpretations for English phrases (my experience is with Spanish). Another option for you could be to change or combine taggers: using your own example, in FreeLing I got your expected result.

[Screenshot: FreeLing's analysis of the example sentence, showing the expected tagging]

Additionally, you can see that FreeLing also shows you other possible POS interpretations for certain words in their context.

Note: if you have used FreeLing, you may know that for machine readability you can use the XML output (shown below your results), and for automation you can integrate FreeLing with Python/Java, although I usually prefer to just call it via the command line.

We found that the default model for POS tagging wasn't good enough. It turned out that using a different model yields much better tags. We are currently using wsj-0-18-bidirectional-distsim, and its performance is good enough for most tasks. I include it like so:

props.put("pos.model",
    "edu/stanford/nlp/models/pos-tagger/wsj-bidirectional/wsj-0-18-bidirectional-distsim.tagger");
props.put("annotators", "tokenize, ssplit, pos, ...");
pipeline = new StanfordCoreNLP(props);
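As a quick sanity check, the tags can then be read back out with the standard CoreNLP annotation API; a minimal sketch using the asker's example sentence:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;

// Run the sentence through the pipeline configured above.
Annotation doc = new Annotation("The clown weeps.");
pipeline.annotate(doc);
for (CoreLabel tok : doc.get(CoreAnnotations.TokensAnnotation.class)) {
    System.out.print(tok.word() + "_" + tok.tag() + " ");
}
System.out.println();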
Licensed under: CC-BY-SA with attribution