Stanford POS Tagger: How to preserve newlines in the output?

https://stackoverflow.com/questions/12140683

28-06-2021
|

Question

My input.txt file contains the following sample text:

you have to let's
come and see me.

Now if I invoke the Stanford POS tagger with the default command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile input.txt > output.txt

I get the following in my output.txt file:

you_PRP have_VBP to_TO let_VB 's_POS come_VB and_CC see_VB me_PRP ._.

The problem with the above output is that I have lost my original newline delimiter used in the input file.

Now, if I use the following command to preserve my newline sentence delimiter in the output file I have to set -tokenize option to false:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -tokenize false -textFile input.txt > output.txt

The problem with this code is that it totally messed up the output:

you_PRP have_VBP to_TO let's_NNS  
come_VB and_CC see_VB me._NN

Here let's and me. are tagged inappropriately.

My question is how can I preserve the newline delimiters in the output file without messing up the tokenization?

Solution

The answer should have been to use the command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -textFile input.txt > output.txt

But there was a bug and it didn't work (ignored the newlines) in version 3.1.3 (and perhaps all earlier versions). It will work in version 3.1.4+.

In the meantime, if the amount of text is small, you might try using the Stanford Parser (where the corresponding flag is named differently so it's -sentences newline).

OTHER TIPS

One thing you can do is use xml input instead of plain text. Your input in that case will be:

<xml version="1.0" encoding="UTF-8">
<text>
    <line>you have to let's</line>
    <line>come and see me.</line>
</text>

Here each line is enclosed in a line tag. You can now issue the following command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -xmlInput line -textFile sample.xml > ouput.xml

Note that the argument '-xmlInput' specifies the tag used for POS tagging. In our case, this tag is line. When you run the above command the output will be:

<?xml version="1.0" encoding="UTF-8"?>
<text>
    <line>
        you_PRP have_VBP to_TO let_VB &apos;s_POS 
    </line>
    <line>
        come_VB and_CC see_VB me_PRP ._. 
    </line>
</text>

Thus you can separate out your lines by reading content enclosed in the line tags.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow