Question

I just implemented a program that uses the Stanford POS tagger in Java.

I used an input file of a few KB in size, consisting of a few hundred words. I even set the heap size to 600 MB.

But it is still slow and sometimes runs out of heap memory. How can I increase its execution speed and memory performance? I would like to be able to use a few MB as input.

    public static void postag(String args) throws ClassNotFoundException
    {
        try
        {
            File filein = new File("c://input.txt");
            String content = FileUtils.readFileToString(filein);

            MaxentTagger tagger = new MaxentTagger("postagging/wsj-0-18-bidirectional-distsim.tagger");
            String tagged = tagger.tagString(content);

            try
            {
                File file = new File("c://output.txt");
                if (!file.exists())
                {
                    file.createNewFile();
                }

                FileWriter fw = new FileWriter(file.getAbsoluteFile());
                BufferedWriter bw = new BufferedWriter(fw);
                bw.write("\n" + tagged);
                bw.close();
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
        }
        catch (IOException e1)
        {
            e1.printStackTrace();
        }
    }

Solution

The first and most important piece of advice is to use the wsj-0-18-left3words-distsim.tagger (or, probably better for general text, the english-left3words-distsim.tagger in recent versions) rather than the wsj-0-18-bidirectional-distsim.tagger. While the tagging accuracy of the bidirectional tagger is fractionally better, it is about 6 times slower and uses about twice as much memory. A figure, FWIW: on a 2012 MacBook Pro, when given enough text to "warm up", the left3words tagger will tag text at about 35000 words per second.

The other piece of advice on memory use is that if you have a large amount of text, make sure you pass it to tagString() in reasonable-sized chunks, not all as one huge String, since that whole String will be tokenized at once, adding to the memory requirements.
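
As a rough illustration of the chunking advice, here is a minimal sketch that reads the input line by line, groups lines into chunks of roughly a given size, and tags each chunk separately. The class name ChunkedTagging, the method tagInChunks, and the maxChars parameter are made up for this example and are not part of the Stanford tagger. The tagging step is passed in as a function so that the sketch runs without the tagger on the classpath; main uses a stand-in that just upper-cases the text. It also assumes that sentences do not span line breaks, since a sentence cut at a chunk boundary would be tagged in two pieces.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.function.UnaryOperator;

public class ChunkedTagging {

    /**
     * Reads text line by line, groups whole lines into chunks of at least
     * maxChars characters, applies tagFn to each chunk, and writes each
     * result followed by a newline. Returns the number of chunks processed.
     * (Hypothetical helper, not part of the Stanford tagger API.)
     */
    public static int tagInChunks(Reader in, Writer out,
                                  UnaryOperator<String> tagFn,
                                  int maxChars) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        StringBuilder chunk = new StringBuilder();
        int chunks = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            chunk.append(line).append('\n');
            // Flush once the chunk is large enough; only this chunk is
            // held in memory and tokenized at one time.
            if (chunk.length() >= maxChars) {
                out.write(tagFn.apply(chunk.toString()));
                out.write('\n');
                chunk.setLength(0);
                chunks++;
            }
        }
        // Tag whatever is left after the last full chunk.
        if (chunk.length() > 0) {
            out.write(tagFn.apply(chunk.toString()));
            out.write('\n');
            chunks++;
        }
        out.flush();
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real tagger, used only so the sketch is runnable.
        UnaryOperator<String> fakeTagger = s -> s.trim().toUpperCase();
        StringWriter out = new StringWriter();
        int n = tagInChunks(new StringReader("one two\nthree four\nfive\n"),
                            out, fakeTagger, 10);
        System.out.println(n + " chunks");
        System.out.print(out);
    }
}
```

With the real tagger, you would create one MaxentTagger up front, reuse it for every chunk, and pass tagger::tagString as tagFn, with file-backed readers and writers in place of the string ones. A chunk size in the tens of kilobytes is probably a reasonable starting point, but that number is a guess to tune, not a measured recommendation.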

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow