POS tagging in Scala

Question 1

I found a very simple way to do POS tagging in Scala

Step 1

Download stanford tagger version 3.2.0 form the link below

http://nlp.stanford.edu/software/stanford-postagger-2013-06-20.zip

Step 2

Add stanford-postagger jar present in the folder to your project and also place the english-left3words-distsim.tagger file present in the models folder in your project

Then, with the code below you can pos tag a sentence in Scala

              val tagger = new MaxentTagger(
                "english-left3words-distsim.tagger")
              val art_con = "My name is Rahul"
              val tagged = tagger.tagString(art_con)
              println(tagged)

Output: My_PRP$ name_NN is_VBZ Rahul_NNP

Question 2

You might like to consider the FACTORIE toolkit (http://github.com/factorie/factorie). It is a general library for machine learning and graphical models that happens to include an extensive suite of natural language processing components (tokenization, token normalization, morphological analysis, sentence segmentation, part-of-speech tagging, named entity recognition, dependency parsing, mention finding, coreference).

Furthermore it is written entirely in Scala, and it is released under the Apache License.

Documentation is currently sparse, but will be improving in the coming months.

For example, once Maven-based installation is finished you can type at the command line:

bin/fac nlp --pos1 --parser1 --ner1

to launch a socket-listening multi-threaded NLP server. Then query it by piping plain text to its socket number:

echo "Mr. Jones took a job at Google in New York.  He and his Australian wife moved from New South Wales on 4/1/12." | nc localhost 3228

The output is then

1       1       Mr.             NNP     2       nn      O
2       2       Jones           NNP     3       nsubj   U-PER
3       3       took            VBD     0       root    O
4       4       a               DT      5       det     O
5       5       job             NN      3       dobj    O
6       6       at              IN      3       prep    O
7       7       Google          NNP     6       pobj    U-ORG
8       8       in              IN      7       prep    O
9       9       New             NNP     10      nn      B-LOC
10      10      York            NNP     8       pobj    L-LOC
11      11      .               .       3       punct   O

12      1       He              PRP     6       nsubj   O
13      2       and             CC      1       cc      O
14      3       his             PRP$    5       poss    O
15      4       Australian      JJ      5       amod    U-MISC
16      5       wife            NN      6       nsubj   O
17      6       moved           VBD     0       root    O
18      7       from            IN      6       prep    O
19      8       New             NNP     9       nn      B-LOC
20      9       South           NNP     10      nn      I-LOC
21      10      Wales           NNP     7       pobj    L-LOC
22      11      on              IN      6       prep    O
23      12      4/1/12          NNP     11      pobj    O
24      13      .               .       6       punct   O

Of course there is a programmatic API to all this functionality as well.

import cc.factorie._
import cc.factorie.app.nlp._
val doc = new Document("Education is the most powerful weapon which you can use to change the world.")
DocumentAnnotatorPipeline(pos.POS1).process(doc)
for (token <- doc.tokens)
  println("%-10s %-5s".format(token.string, token.posLabel.categoryValue))

will output:

Education  NN   
is         VBZ  
the        DT   
most       RBS  
powerful   JJ   
weapon     NN   
which      WDT  
you        PRP  
can        MD   
use        VB   
to         TO   
change     VB   
the        DT   
world      NN   
.          .

Question 3

I believe the API of the Stanford Parser has changed somewhat, as it does sometimes. apply has the signature, public Tree apply(java.util.List<? extends HasWord> words), and this is what you see in the error message.

What you should use now is parse, which has the signature public Tree parse(java.lang.String sentence).