Speeding up OpenNLP's POSTagging when using it for several texts

https://stackoverflow.com/questions/4368463

09-10-2019

Question

I'm currently working on a keyphrase extraction tool that should provide tag suggestions for texts or documents on a website. Because I am following the method proposed in the paper "A New Approach to Keyphrase Extraction Using Neural Networks", I am using the OpenNLP toolkit's POSTagger for the first step, i.e. candidate selection.

In general the keyphrase extraction works pretty well. My problem is that I have to perform this expensive loading of the models from their corresponding files every time I want to use the POSTagger:

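// Loading the two models from disk is the expensive part; the streams are closed once the models are built
try (FileInputStream posIn = new FileInputStream(new File(modelDir + "/en-pos-maxent.bin"));
     FileInputStream tokIn = new FileInputStream(new File(modelDir + "/en-token.bin"))) {
    posTagger = new POSTaggerME(new POSModel(posIn));
    tokenizer = new TokenizerME(new TokenizerModel(tokIn));
}
// ...
String[] tokens = tokenizer.tokenize(text);
String[] tags = posTagger.tag(tokens);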

This is because the code does not run in the scope of the web server itself but inside a "handler" whose lifecycle covers only one specific request. My question is: how can I load the files only once? (I don't want to spend 10 seconds waiting for the models to load and then use them for just 200 ms.)

My first idea was to serialize the POSTaggerME (and the TokenizerME, respectively) with Java's built-in mechanism and deserialize it every time I need it. Unfortunately this doesn't work: it raises an exception. (I do serialize the classifier from the WEKA toolkit, which classifies my candidates at the end, so that I don't have to build (or train) the classifier every time. I therefore thought this might be applicable to the POSTaggerME as well, but it is not.)

In the case of the Tokenizer I could fall back on a simple WhitespaceTokenizer, which is an inferior solution but not that bad:

tokenizer = WhitespaceTokenizer.INSTANCE;

But I don't see this option for a reliable POSTagger.

Solution

Just wrap your tokenization/POS-tagging pipeline in a singleton.
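
A minimal sketch of what such a load-once holder could look like in Java. `LoadOnce` and `Models` are hypothetical names, not part of OpenNLP, and the string returned by the loader is only a placeholder for the real model loading from the question (for example `new POSModel(...)`). The sketch assumes the handlers run inside the same JVM and class loader across requests, so that a `static` field outlives any single handler instance.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Runs the expensive loader at most once and hands the same instance to every caller. */
final class LoadOnce<T> implements Supplier<T> {
    private final Supplier<T> loader;  // the expensive step, e.g. reading a model file
    private volatile T value;          // stays null until the first successful load

    LoadOnce(Supplier<T> loader) {
        this.loader = loader;
    }

    @Override
    public T get() {
        T result = value;
        if (result == null) {              // fast path: already loaded, no locking
            synchronized (this) {
                result = value;
                if (result == null) {      // re-check under the lock
                    result = loader.get();
                    value = result;
                }
            }
        }
        return result;
    }
}

/** Hypothetical holder that request handlers would share. */
final class Models {
    // Counts how often the loader actually ran (for illustration only).
    static final AtomicInteger LOADS = new AtomicInteger();

    // Static field: tied to the lifetime of the class, not of one handler.
    static final LoadOnce<String> POS_MODEL = new LoadOnce<>(() -> {
        LOADS.incrementAndGet();
        return "pos-model";  // placeholder for the real, slow model loading
    });
}
```

Each handler would then call `Models.POS_MODEL.get()`; only the first call pays the loading cost, and later requests reuse the already loaded instance.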

If the underlying OpenNLP code isn't thread safe, put the calls in synchronized blocks, e.g.:

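// tokenizer and posTagger are the shared singleton instances of the pipeline
String[] tokens;
synchronized (tokenizer) {
    // only one thread at a time may use the shared tokenizer
    tokens = tokenizer.tokenize(text);
}
String[] tags;
synchronized (posTagger) {
    // likewise for the shared tagger
    tags = posTagger.tag(tokens);
}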
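
If locking on a single shared tagger turns out to be a bottleneck, one possible alternative is to share only the loaded model and give each worker thread its own tagger. The sketch below assumes that the model object is safe to share between threads while the tagger built from it is not; this is commonly reported for OpenNLP but should be checked against the version in use. `StatefulTagger` and `PerThreadTaggers` are hypothetical stand-ins, not OpenNLP classes, and the string `"MODEL"` is a placeholder for the model that was loaded once.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a tagger that keeps mutable state and must not be shared across threads. */
final class StatefulTagger {
    private final String model;  // stand-in for the shared, read-only model
    private int calls;           // mutable per-instance state

    StatefulTagger(String model) {
        this.model = model;
    }

    String[] tag(String[] tokens) {
        calls++;
        String[] tags = new String[tokens.length];
        Arrays.fill(tags, model);  // dummy tagging: every token gets the same tag
        return tags;
    }
}

/** Hands out one tagger per thread, all built from the same shared model. */
final class PerThreadTaggers {
    // Loaded once; placeholder for the expensive model object.
    private static final String SHARED_MODEL = "MODEL";

    // Each thread lazily creates its own tagger the first time it asks for one.
    private static final ThreadLocal<StatefulTagger> TAGGER =
            ThreadLocal.withInitial(() -> new StatefulTagger(SHARED_MODEL));

    static StatefulTagger get() {
        return TAGGER.get();
    }
}
```

A handler would call `PerThreadTaggers.get().tag(tokens)` without any locking. The trade-off is one tagger instance per worker thread instead of one per JVM, while the expensive model is still loaded only once.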

Licensed under: CC-BY-SA with attribution
