Output results in conll format (POS-tagging, stanford pos tagger)

Question

When it comes to the CONLL format, i presume you mean the CONLL2000 chunking task format as such:

   He        PRP  B-NP
   reckons   VBZ  B-VP
   the       DT   B-NP
   current   JJ   I-NP
   account   NN   I-NP
   deficit   NN   I-NP
   will      MD   B-VP
   narrow    VB   I-VP
   to        TO   B-PP
   only      RB   B-NP
   #         #    I-NP
   1.8       CD   I-NP
   billion   CD   I-NP
   in        IN   B-PP
   September NNP  B-NP
   .         .    O

There are three columns in the CONLL chunking task format:

token (i.e. word)
POS tag
BIO (begin, inside, outside) of chunk/phrase tag

Sadly, if you use the stanford MaxEnt tagger, it only give you the token and POS information but has no BIO chunk information.

java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/left3words-wsj-0-18.tagger -textFile short.txt -outputFormat tsv 2> /dev/null

Using the above command the Stanford POS tagger already give you the tab separated format, just that it's without the 3rd column (see http://nlp.stanford.edu/software/pos-tagger-faq.shtml):

   He        PRP
   reckons   VBZ
   the       DT
   ...

To get the BIO colum, you would require either:

a statistical chunker or
a full parser

see http://www-nlp.stanford.edu/links/statnlp.html for a list of chunker/parser, if you want to stick with stanford tools, i suggest the stanford parser but it gives you the bracketed parse format, which you have to do some post-processing to get it into CONLL2000 format, see http://nlp.stanford.edu/software/lex-parser.shtml