Question

I am using Stanford's NLP parser (http://nlp.stanford.edu/software/lex-parser.shtml) to split a block of text into sentences and then see which sentences contain a given word.

Here is my code so far:

import java.io.FileReader;
import java.io.IOException;
import java.util.List;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.process.*;

public class TokenizerDemo {

    public static void main(String[] args) throws IOException {
        DocumentPreprocessor dp = new DocumentPreprocessor(args[0]);
        for (List sentence : dp) {
            for (Object word : sentence) {
                System.out.println(word);
                System.out.println(word.getClass().getName());
                if (word.equals(args[1])) {
                    System.out.println("yes!\n");
                }
            }
        }
    }
}

I run the code from the command line using "java TokenizerDemo testfile.txt wall"

The contents of testfile.txt are:

Humpty Dumpty sat on a wall. Humpty Dumpty had a great fall.

So I want the program to detect "wall" in the first sentence ("wall" entered as the second argument on the command line). But the program doesn't detect "wall", because it never prints "yes!". The output of the program is:

Humpty
edu.stanford.nlp.ling.Word
Dumpty
edu.stanford.nlp.ling.Word
sat
edu.stanford.nlp.ling.Word
on
edu.stanford.nlp.ling.Word
a
edu.stanford.nlp.ling.Word
wall
edu.stanford.nlp.ling.Word
.
edu.stanford.nlp.ling.Word
Humpty
edu.stanford.nlp.ling.Word
Dumpty
edu.stanford.nlp.ling.Word
had
edu.stanford.nlp.ling.Word
a
edu.stanford.nlp.ling.Word
great
edu.stanford.nlp.ling.Word
fall
edu.stanford.nlp.ling.Word
.
edu.stanford.nlp.ling.Word

DocumentPreprocessor from the Stanford parser correctly splits the text into two sentences. The problem appears to be with the use of the equals method. Each word has type "edu.stanford.nlp.ling.Word". I've tried accessing the underlying string of the word, so I can then check if the string equals "wall", but I can't figure out how to access it.

If I write the second for loop as "for (Word word : sentence) {" then I get an incompatible types error message on compilation.


Solution

The String content can be accessed by calling the word() method on edu.stanford.nlp.ling.Word, e.g.:

import java.util.List;

import edu.stanford.nlp.ling.Word;

List<Word> words = ...
for (Word word : words) {
  // word() returns the underlying String of the token
  if (word.word().equals(args[1])) {
    System.err.println("Yes!");
  }
}

Also note that it is better to use generics when defining the List as it means the compiler or IDE will typically warn you if you attempt to compare classes of incompatible types (e.g. Word versus String).
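To illustrate why the original comparison never matches, here is a minimal, self-contained sketch. The Token class is a hypothetical stand-in written for this example, not the Stanford Word class; it only assumes a wrapper object that holds its text and exposes it through word().

```java
public class WordEqualsSketch {

    // Hypothetical stand-in for a token wrapper; NOT edu.stanford.nlp.ling.Word.
    public static final class Token {
        private final String text;

        public Token(String text) {
            this.text = text;
        }

        // Returns the wrapped text.
        public String word() {
            return text;
        }

        @Override
        public String toString() {
            return text;
        }
    }

    // Compares the wrapper object itself with the String.
    public static boolean matchesObject(Object token, String target) {
        return token.equals(target);
    }

    // Compares the wrapped text with the String.
    public static boolean matchesText(Token token, String target) {
        return token.word().equals(target);
    }

    public static void main(String[] args) {
        Token wall = new Token("wall");
        System.out.println(wall);                        // wall
        System.out.println(matchesObject(wall, "wall")); // false
        System.out.println(matchesText(wall, "wall"));   // true
    }
}
```

The token prints as "wall", yet comparing the object to a String is false because the two are different types. With a raw List and Object elements the compiler has no chance to flag this; with a typed element the mismatch is much easier to spot.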

EDIT

Turns out I was looking at an older version of the NLP API. Looking at the most recent DocumentPreprocessor documentation, I see that it implements Iterable<List<HasWord>>, where HasWord defines the word() method. Hence your code should look something like this:

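DocumentPreprocessor dp = ...
// Each element of dp is one sentence, i.e. a List<HasWord>
for (List<HasWord> sentence : dp) {
  for (HasWord hw : sentence) {
    if (hw.word().equals(args[1])) {
      System.err.println("Yes!");
    }
  }
}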
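Since the original goal was to find which sentences contain a given word, the iteration over an Iterable<List<HasWord>> can be sketched as a nested loop that records the index of each matching sentence. This is only a sketch that runs without the Stanford jar: the HasWord interface below is a hypothetical stand-in declared locally, and sentence() and sentencesContaining() are helper names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceSearchSketch {

    // Hypothetical stand-in for an interface exposing word(); not the Stanford type.
    public interface HasWord {
        String word();
    }

    // Builds one "sentence" from plain strings (helper for this example only).
    public static List<HasWord> sentence(String... words) {
        List<HasWord> result = new ArrayList<>();
        for (String w : words) {
            result.add(() -> w);
        }
        return result;
    }

    // Returns the zero-based indexes of the sentences that contain target.
    public static List<Integer> sentencesContaining(
            Iterable<List<HasWord>> document, String target) {
        List<Integer> hits = new ArrayList<>();
        int index = 0;
        for (List<HasWord> sentence : document) {   // outer loop: sentences
            for (HasWord hw : sentence) {           // inner loop: words
                if (hw.word().equals(target)) {
                    hits.add(index);
                    break;                          // one hit per sentence is enough
                }
            }
            index++;
        }
        return hits;
    }

    public static void main(String[] args) {
        List<List<HasWord>> document = new ArrayList<>();
        document.add(sentence("Humpty", "Dumpty", "sat", "on", "a", "wall", "."));
        document.add(sentence("Humpty", "Dumpty", "had", "a", "great", "fall", "."));

        System.out.println(sentencesContaining(document, "wall")); // [0]
        System.out.println(sentencesContaining(document, "fall")); // [1]
    }
}
```

With the real library, the same nested loop should apply with dp as the document, assuming the Iterable<List<HasWord>> signature described in the documentation.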

Other tips

Since a Word prints as its underlying text, a simple word.toString().equals(args[1]) should suffice.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow