Finding a “subject” from an array of part of speech tags

https://stackoverflow.com/questions/10280239

02-06-2021

Question

I know this is more of a grammar question, but how do you determine the "subject" of a sentence if you have an array of Penn Treebank part-of-speech tags like:

[WP][VBZ][DT][NN]

Is there a Java library that can take such tags and determine which one (or which ones) is the subject?

Solution

I have been successfully classifying subjects for Portuguese using OpenNLP. I built a shallow parser by slightly tweaking the OpenNLP Chunker component.

You can use the existing OpenNLP models for POS tagging and chunking, but you will need to train a new chunker model that takes the POS tags plus the chunk tags as input and classifies subjects.

The training data format for the Chunker is based on CoNLL-2000:

He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
...

I then created a new corpus in which the POS tag and chunk tag are merged into one column and the last column holds the subject tag:

He        PRP+B-NP  B-SUBJ
reckons   VBZ+B-VP  B-V  
the       DT+B-NP   O
current   JJ+I-NP   O
account   NN+I-NP   O
deficit   NN+I-NP   O
will      MD+B-VP   O
narrow    VB+I-VP   O
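
As a rough illustration of that transformation, the sketch below merges the POS and chunk columns of each CoNLL-2000 line and appends a new target tag. The class and method names are made up for this example and are not part of OpenNLP. The target tags (B-SUBJ, B-V, O) are assumed to come from your own annotated data; the code does not guess them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an OpenNLP class.
class SubjectCorpusBuilder {

    /** Turns "word POS chunk" plus a target tag into "word POS+chunk target". */
    static String mergeLine(String conllLine, String targetTag) {
        String[] cols = conllLine.trim().split("\\s+");
        if (cols.length != 3) {
            throw new IllegalArgumentException("Expected 'word POS chunk': " + conllLine);
        }
        return cols[0] + " " + cols[1] + "+" + cols[2] + " " + targetTag;
    }

    /** Merges a whole sentence; the two lists must be aligned token by token. */
    static List<String> mergeSentence(List<String> conllLines, List<String> targetTags) {
        if (conllLines.size() != targetTags.size()) {
            throw new IllegalArgumentException("Need exactly one target tag per token");
        }
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < conllLines.size(); i++) {
            merged.add(mergeLine(conllLines.get(i), targetTags.get(i)));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> sentence = List.of("He PRP B-NP", "reckons VBZ B-VP", "the DT B-NP");
        List<String> tags = List.of("B-SUBJ", "B-V", "O");
        // Prints one merged training line per token, e.g. "He PRP+B-NP B-SUBJ"
        mergeSentence(sentence, tags).forEach(System.out::println);
    }
}
```

Columns are written with a single space here; the padded layout shown above is only for readability.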

If you have access to the Penn Treebank, you can create such data by looking for subject nodes in the corpus. A good starting point may be the Perl script used to generate the data for the CoNLL-2000 Shared Task.

The evaluation results for Portuguese are 87.07% precision, 75.48% recall, and 80.86% F1.
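
For reference, F1 is the harmonic mean of precision and recall, so the third figure follows from the first two. A minimal check (the class name is just for this example):

```java
import java.util.Locale;

class F1Check {

    /** Harmonic mean of precision and recall; both given in percent. */
    static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Prints 80.86
        System.out.println(String.format(Locale.ROOT, "%.2f", f1(87.07, 75.48)));
    }
}
```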

OTHER TIPS

The standard way to label syntactic units of a sentence, including the subject, is with a constituent parser. A constituent tree labels substrings of the input with syntactic labels. See http://en.wikipedia.org/wiki/Parse_tree for an example.

If such a structure looks like it would serve your needs, I'd recommend you grab an off-the-shelf parser and extract the relevant phrase(s) from the output.
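
To give an idea of what "extract the relevant phrase(s)" could look like, here is a small sketch that works on the Penn Treebank style bracketed string most constituent parsers can print. It does not call any parser API; the class is hypothetical. The subject rule is a simplifying assumption of mine: take the first NP (or NP-SBJ, if the tree keeps function tags) that is a direct child of an S node and is followed by a VP sibling. Questions, inverted clauses, and other constructions will need extra rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reads a bracketed parse string; no parser library involved.
class TreeSubjectExtractor {

    /** Tree node: a label plus children. A leaf stores the word as its label. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    private final String[] tokens;
    private int pos = 0;

    private TreeSubjectExtractor(String bracketed) {
        // Put spaces around brackets so a whitespace split yields clean tokens.
        tokens = bracketed.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
    }

    /** Parses "(" label child* ")" recursively; assumes well-formed input. */
    private Node parseNode() {
        if (!tokens[pos].equals("(")) {
            throw new IllegalArgumentException("Expected '(' at token " + pos);
        }
        pos++;
        Node node = new Node(tokens[pos++]);
        while (!tokens[pos].equals(")")) {
            node.children.add(tokens[pos].equals("(") ? parseNode() : new Node(tokens[pos++]));
        }
        pos++; // consume ")"
        return node;
    }

    private static void collectWords(Node node, List<String> words) {
        if (node.children.isEmpty()) {
            words.add(node.label);
            return;
        }
        for (Node child : node.children) collectWords(child, words);
    }

    private static boolean hasVpAfter(Node parent, int index) {
        for (int i = index + 1; i < parent.children.size(); i++) {
            if (parent.children.get(i).label.equals("VP")) return true;
        }
        return false;
    }

    private static void collectSubjects(Node node, List<String> subjects) {
        if (node.label.equals("S")) {
            for (int i = 0; i < node.children.size(); i++) {
                Node child = node.children.get(i);
                boolean isNp = child.label.equals("NP") || child.label.startsWith("NP-SBJ");
                if (isNp && !child.children.isEmpty() && hasVpAfter(node, i)) {
                    List<String> words = new ArrayList<>();
                    collectWords(child, words);
                    subjects.add(String.join(" ", words));
                    break; // heuristic: at most one subject per S node
                }
            }
        }
        for (Node child : node.children) collectSubjects(child, subjects);
    }

    /** Returns the subject phrases of all clauses, outermost clause first. */
    static List<String> subjectsOf(String bracketed) {
        Node root = new TreeSubjectExtractor(bracketed).parseNode();
        List<String> subjects = new ArrayList<>();
        collectSubjects(root, subjects);
        return subjects;
    }

    public static void main(String[] args) {
        String tree = "(S (NP (PRP He)) (VP (VBZ reckons) (SBAR (S "
                + "(NP (DT the) (JJ current) (NN account) (NN deficit)) "
                + "(VP (MD will) (VP (VB narrow)))))))";
        // Prints [He, the current account deficit]
        System.out.println(subjectsOf(tree));
    }
}
```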

Most parsers I'm aware of include part-of-speech (POS) tagging during parsing, but if you're confident in the POS labels you have, you could constrain the parser to use yours.

Note that constituent parsing can be computationally expensive. To my knowledge, all state-of-the-art constituent parsers run at 4-80 sentences per second, although you might be able to achieve higher speeds if you're willing to sacrifice some accuracy.

A couple of recommendations (more details at Simple Natural Language Processing Startup for Java):

The Berkeley Parser ( http://code.google.com/p/berkeleyparser/ ). State-of-the-art accuracy and reasonably fast (3-5 sentences per second).

The BUBS Parser ( http://code.google.com/p/bubs-parser/ ) can also run with the high-accuracy Berkeley grammar, giving up a bit of accuracy (about 1.5 points in F1-score for those who care) but improving efficiency to around 50-80 sentences/second. Full disclosure - I'm one of the primary researchers working on this parser.

Warning: both of these parsers are research code. But we're glad to have people using BUBS in the real world. If you give it a try, please contact me with problems, questions, comments, etc.

The free, Java-based Stanford Dependency Parser (part of the Stanford Parser) does this trivially. It produces a dependency parse with dependencies such as nsubj(makes-8, Bell-1), telling you that Bell is the subject of makes. All you have to do is scan the list of dependencies the parser gives you for nsubj or nsubjpass entries; those are the subjects of verbs.
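
A minimal sketch of that scan, assuming you have the dependencies as plain strings in the relation(governor-index, dependent-index) form shown above. It deliberately avoids the parser's own classes, so the class and method names are hypothetical and the sample list is written by hand for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper working on the printed dependency strings, not on parser objects.
class SubjectFinder {

    // relation(governor-index, dependent-index), e.g. nsubj(makes-8, Bell-1)
    private static final Pattern DEP =
            Pattern.compile("([\\w:]+)\\((.+)-(\\d+), (.+)-(\\d+)\\)");

    /** Returns {subject, verb} pairs for every nsubj or nsubjpass dependency. */
    static List<String[]> findSubjects(List<String> dependencies) {
        List<String[]> subjects = new ArrayList<>();
        for (String dep : dependencies) {
            Matcher m = DEP.matcher(dep.trim());
            if (!m.matches()) {
                continue; // skip lines that are not in the expected format
            }
            String relation = m.group(1);
            if (relation.equals("nsubj") || relation.equals("nsubjpass")) {
                subjects.add(new String[] {m.group(4), m.group(2)});
            }
        }
        return subjects;
    }

    public static void main(String[] args) {
        // Hand-written sample in the textual format, for illustration only.
        List<String> deps = List.of(
                "nsubj(makes-8, Bell-1)",
                "nsubj(distributes-10, Bell-1)",
                "dobj(makes-8, products-16)");
        for (String[] pair : findSubjects(deps)) {
            System.out.println(pair[0] + " is the subject of " + pair[1]);
        }
    }
}
```

If you work with the parser's object model instead of its printed output, the same filter on the relation name applies; only the way you read the relation, governor, and dependent changes.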

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow