Finding a “subject” from an array of part-of-speech tags
Question
I know this is more of a grammar question, but how do you determine the "subject" of a sentence if you have an array of Penn Treebank tags like:
[WP][VBZ][DT][NN]
Is there any Java library that can take such tags as input and determine which one (or which ones) is the subject?
Solution
I have been successfully classifying subjects for Portuguese using OpenNLP. I created a shallow parser by slightly modifying the OpenNLP Chunker component.
You can use the existing OpenNLP models for POS tagging and chunking, but you will need to train a new chunk model that takes the POS tags plus the chunk tags as input and classifies subjects.
The data format used to train the Chunker is based on CoNLL-2000:
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
...
I then created a new corpus in which the POS tag and the chunk tag are merged into a single column and the last column holds the subject label:
He PRP+B-NP B-SUBJ
reckons VBZ+B-VP B-V
the DT+B-NP O
current JJ+I-NP O
account NN+I-NP O
deficit NN+I-NP O
will MD+B-VP O
narrow VB+I-VP O
If you have access to the Penn Treebank, you can create such data by looking for subject nodes in the corpus. Maybe you can start with this Perl script used to generate the data for the CoNLL-2000 Shared Task.
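The column merge itself is mechanical and can be sketched in plain Java. This is only an illustration, not the code used for the Portuguese model: the class and method names (SubjectCorpusBuilder, mergeLine, mergeSentence) are made up here, and the subject labels are assumed to come from somewhere else, for example from subject nodes found in a treebank.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: names are illustrative, not part of OpenNLP.
class SubjectCorpusBuilder {

    // Turns one CoNLL-2000 style line ("word POS chunk") into the merged
    // format ("word POS+chunk label"). The label is supplied by the caller.
    static String mergeLine(String conllLine, String subjectLabel) {
        String[] fields = conllLine.trim().split("\\s+");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 'word POS chunk': " + conllLine);
        }
        return fields[0] + " " + fields[1] + "+" + fields[2] + " " + subjectLabel;
    }

    // Merges a whole sentence; the two lists must be aligned token by token.
    static List<String> mergeSentence(List<String> conllLines, List<String> labels) {
        if (conllLines.size() != labels.size()) {
            throw new IllegalArgumentException("One label per token is required");
        }
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < conllLines.size(); i++) {
            merged.add(mergeLine(conllLines.get(i), labels.get(i)));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("He PRP B-NP", "reckons VBZ B-VP", "the DT B-NP");
        // Labels assumed to be known already (e.g. extracted from a treebank).
        List<String> labels = Arrays.asList("B-SUBJ", "B-V", "O");
        for (String line : mergeSentence(sentence, labels)) {
            System.out.println(line);
        }
    }
}
```

The resulting lines have the same three-column shape as the original chunker training data, which is what lets the existing training setup be reused for the new model.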
The evaluation results for Portuguese are 87.07% precision, 75.48% recall, and 80.86% F1.
OTHER TIPS
The standard way to label syntactic units of a sentence, including the subject, is with a constituent parser. A constituent tree labels substrings of the input with syntactic labels. See http://en.wikipedia.org/wiki/Parse_tree for an example.
If such a structure looks like it would serve your needs, I'd recommend you grab an off-the-shelf parser and extract the relevant phrase(s) from the output.
Most parsers I'm aware of include part-of-speech (POS) tagging during parsing, but if you're confident in the POS labels you have, you could constrain the parser to use yours.
Note that constituent parsing can be quite expensive computationally. To my knowledge, all state-of-the-art constituent parsers run at 4-80 sentences per second, although you might be able to achieve higher speeds if you're willing to sacrifice some accuracy.
A couple of recommendations (more details in the question "Simple Natural Language Processing Startup for Java"):
The Berkeley Parser ( http://code.google.com/p/berkeleyparser/ ). State-of-the-art accuracy and reasonably fast (3-5 sentences per second).
The BUBS Parser ( http://code.google.com/p/bubs-parser/ ) can also run with the high-accuracy Berkeley grammar, giving up a bit of accuracy (about 1.5 points in F1-score for those who care) but improving efficiency to around 50-80 sentences/second. Full disclosure - I'm one of the primary researchers working on this parser.
Warning: both of these parsers are research code. But we're glad to have people using BUBS in the real world. If you give it a try, please contact me with problems, questions, comments, etc.
The free, Java-based Stanford Dependency Parser (part of the Stanford Parser) does this trivially. It produces a dependency parse tree with dependencies such as nsubj(makes-8, Bell-1), telling you that Bell is the subject of makes. All you'd have to do is scan the list of dependencies the parser gives you, looking for nsubj or nsubjpass entries; those are the subjects of verbs.
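The scanning step can be sketched without touching the parser API at all, assuming you already have the dependencies as text in the relation(governor-index, dependent-index) form shown above. The class name SubjectScanner and its method are hypothetical, and the input list in main is illustrative rather than real parser output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: works on dependency strings, not on Stanford Parser objects.
class SubjectScanner {

    // Matches "relation(governor-index, dependent-index)", e.g. "nsubj(makes-8, Bell-1)".
    private static final Pattern DEPENDENCY =
            Pattern.compile("^([\\w:]+)\\((.+)-(\\d+)'*, (.+)-(\\d+)'*\\)$");

    // Returns {subject, verb} pairs for every nsubj or nsubjpass dependency.
    static List<String[]> findSubjects(List<String> dependencies) {
        List<String[]> subjects = new ArrayList<>();
        for (String dep : dependencies) {
            Matcher m = DEPENDENCY.matcher(dep.trim());
            if (!m.matches()) {
                continue; // skip lines that are not in the expected form
            }
            String relation = m.group(1);
            if (relation.equals("nsubj") || relation.equals("nsubjpass")) {
                subjects.add(new String[] { m.group(4), m.group(2) });
            }
        }
        return subjects;
    }

    public static void main(String[] args) {
        // Illustrative input; a real run would take these lines from the parser.
        List<String> deps = Arrays.asList(
                "nsubj(makes-8, Bell-1)",
                "conj_and(makes-8, distributes-10)");
        for (String[] pair : findSubjects(deps)) {
            System.out.println(pair[0] + " is the subject of " + pair[1]);
        }
    }
}
```

If you work with the parser's own dependency objects instead of strings, the same filter applies: keep only the entries whose relation name is nsubj or nsubjpass and read off the dependent word.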