The first piece of advice is to use the wsj-0-18-left3words-distsim.tagger model (or, probably better for general text, the english-left3words-distsim.tagger model in recent versions) rather than wsj-0-18-bidirectional-distsim.tagger. While the tagging accuracy of the bidirectional tagger is fractionally better, it is about 6 times slower and uses about twice as much memory. As a rough figure: on a 2012 MacBook Pro, once it has been given enough text to "warm up", the left3words tagger will tag text at about 35,000 words per second.
The other piece of advice, on memory use, is that if you have a large amount of text, you should pass it to tagString() in reasonably sized chunks rather than as one huge String, since the whole String is tokenized at once, which adds to the memory requirements.
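
One way to follow the chunking advice is sketched below. This is only an illustration, not code from the tagger distribution: the class name ChunkedTagging, the methods splitIntoChunks and tagInChunks, and the chunk-size parameter are all hypothetical. The sketch uses the JDK's BreakIterator as a stand-in sentence splitter, so that chunks end on sentence boundaries, and takes the tagger as a plain Function so that it compiles without the tagger jar on the classpath.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical helper: not part of the Stanford tagger distribution.
class ChunkedTagging {

    // Splits text into chunks of at most roughly maxChars characters,
    // cutting only at sentence boundaries found by the JDK BreakIterator.
    // A single sentence longer than maxChars is kept whole as its own chunk.
    public static List<String> splitIntoChunks(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        StringBuilder current = new StringBuilder();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            String sentence = text.substring(start, end);
            // Flush the current chunk if adding this sentence would exceed the limit.
            if (current.length() > 0 && current.length() + sentence.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            current.append(sentence);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    // Tags the text one chunk at a time and hands each tagged chunk to the sink,
    // so that only one chunk is passed to the tagger at any moment.
    // The tagger is any String -> String function (assumption: in real use,
    // a method reference to tagString on a loaded tagger instance).
    public static void tagInChunks(String text, int maxChars,
                                   Function<String, String> tagger,
                                   Consumer<String> sink) {
        for (String chunk : splitIntoChunks(text, maxChars)) {
            sink.accept(tagger.apply(chunk));
        }
    }
}
```

With the tagger library available, a call might look like ChunkedTagging.tagInChunks(text, 100000, tagger::tagString, System.out::println), where tagger is a loaded MaxentTagger and 100000 is an arbitrary chunk size to tune for your heap. Note that BreakIterator's sentence boundaries may not match the tagger's own sentence splitting exactly, and that for very large inputs it is better still to read the text from a file piece by piece than to hold it all in one String first.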