By the CoNLL format, I presume you mean the CoNLL-2000 chunking task format, which looks like this:
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O
There are three columns in the CoNLL chunking task format:
- token (i.e. the word)
- POS tag
- BIO (begin, inside, outside) chunk/phrase tag
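To make the three-column layout concrete, here is a minimal sketch of reading such data in Python. The function name `read_conll2000` is just illustrative, and it assumes whitespace-separated columns with a blank line between sentences (as in the CoNLL-2000 data files):

```python
def read_conll2000(lines):
    """Group CoNLL-2000 style lines into sentences of (token, pos, chunk) tuples."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: exactly three whitespace-separated columns per line.
        token, pos, chunk = line.split()
        current.append((token, pos, chunk))
    if current:
        sentences.append(current)
    return sentences


sample = """He PRP B-NP
reckons VBZ B-VP
the DT B-NP
deficit NN I-NP
. . O
"""

sentences = read_conll2000(sample.splitlines())
for token, pos, chunk in sentences[0]:
    print(token, pos, chunk)
```

Each sentence comes back as a list of tuples, so the chunk tag is always the third element.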
Sadly, if you use the Stanford MaxEnt tagger, it only gives you the token and POS information, with no BIO chunk information.
java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/left3words-wsj-0-18.tagger -textFile short.txt -outputFormat tsv 2> /dev/null
With the above command, the Stanford POS tagger already gives you the tab-separated format, just without the third column (see http://nlp.stanford.edu/software/pos-tagger-faq.shtml):
He PRP
reckons VBZ
the DT
...
To get the BIO column, you need either:
- a statistical chunker or
- a full parser
See http://www-nlp.stanford.edu/links/statnlp.html for a list of chunkers and parsers. If you want to stick with Stanford tools, I suggest the Stanford parser (http://nlp.stanford.edu/software/lex-parser.shtml). It gives you the bracketed parse format, though, so you will have to do some post-processing to get it into CoNLL-2000 format.
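The post-processing of a bracketed parse could look roughly like the sketch below. This is a simplified heuristic of my own, not the official CoNLL-2000 conversion: each token takes the label of its immediate parent phrase, and a new chunk starts whenever that parent node changes. The names `parse_bracketed`, `tree_to_conll` and `CHUNK_LABELS` are made up for illustration, and the input is assumed to be a single Penn Treebank style tree with a labelled root.

```python
# Assumption: phrase labels treated as chunk types; everything else becomes "O".
CHUNK_LABELS = {"NP", "VP", "PP", "ADVP", "ADJP", "SBAR", "PRT"}


def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read_node()


def tree_to_conll(tree):
    """Flatten a tree into (token, pos, chunk) rows using the parent-phrase heuristic."""
    rows = []
    prev_parent = None

    def walk(node, parent):
        nonlocal prev_parent
        label, children = node
        if len(children) == 1 and isinstance(children[0], str):
            # Preterminal node such as (PRP He): label is the POS tag.
            parent_label = parent[0] if parent is not None else None
            if parent_label in CHUNK_LABELS:
                prefix = "I" if parent is prev_parent else "B"
                chunk = f"{prefix}-{parent_label}"
                prev_parent = parent
            else:
                chunk = "O"
                prev_parent = None
            rows.append((children[0], label, chunk))
        else:
            for child in children:
                walk(child, node)

    walk(tree, None)
    return rows


parse = "(ROOT (S (NP (PRP He)) (VP (VBZ reckons) (NP (DT the) (NN deficit))) (. .)))"
rows = tree_to_conll(parse_bracketed(parse))
for token, pos, chunk in rows:
    print(token, pos, chunk)
```

Note that this will not reproduce the CoNLL-2000 annotation exactly. For example, a nested VP such as "will narrow" comes out as two separate B-VP chunks here, whereas the CoNLL data tags "narrow" as I-VP, so treat it only as a starting point.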