Question

I have a POS-tagged parallel corpus text file in which I would like to reorder words so that the separable phrasal-verb particle appears next to the verb of the phrasal verb ('make up a plan' instead of 'make a plan up'). This is used as preprocessing for a statistical machine translation system. Here are some example lines from the POS-tagged text file:

  1. you_PRP mean_VBP we_PRP should_MD kick_VB them_PRP out_RP ._.
  2. don_VB 't_NNP take_VB it_PRP off_RP until_IN I_PRP say_VBP so_RB ._.
  3. please_VB help_VB the_DT man_NN out_RP ._.
  4. shut_VBZ it_PRP down_RP !_.

I would like to move all the particles (in the examples: out_RP, off_RP, out_RP, down_RP) right next to the closest preceding verb (i.e. the verb that, in combination with the particle, makes up the phrasal verb). Here's what the lines should look like after the word order has been changed:

  1. you_PRP mean_VBP we_PRP should_MD kick_VB out_RP them_PRP ._.
  2. don_VB 't_NNP take_VB off_RP it_PRP until_IN I_PRP say_VBP so_RB ._.
  3. please_VB help_VB out_RP the_DT man_NN ._.
  4. shut_VBZ down_RP it_PRP !_.

So far I've tried to solve the problem with Python and regular expressions, using re.findall:

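import re

# Read the whole corpus into one string.
with open('first100k.txt') as corpus:
    text = corpus.read()

# Word order 1: verb + determiner + singular noun + particle.
matchline3 = r'\w*_VB.?\s\w*_DT\s\w*_NN\s\w*_RP'
wordorder1 = re.findall(matchline3, text)
print(wordorder1)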

This finds all the phrasal verbs in word order 1 (see below), but that's as far as I've got, since I can't figure out how to move the particle next to the verb. Any ideas on how to solve this problem properly (not necessarily with Python and regex)? I would like to be able to search for all phrasal verbs and move the particles in the following word orders:

(The tags are taken from the Penn Treebank tagset (http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). The x denotes an optional character, so that all verb forms are included, and * denotes a wildcard word.)

  1. *_VBx+*_DT+*_NN+*_RP
  2. *_VBx+*_DT+*_NNS+*_RP
  3. *_VBx+*_DT+*_.JJ+*_NN+*_RP
  4. *_VBx+*_DT+*_.JJ+*_NNS+*_RP
  5. *_VBx+*_PRP$+*_NN+*_RP
  6. *_VBx+*_PRP$+*_NNS+*_RP
  7. *_VBx+*_PRP$+*_.JJ+*_NN+*_RP
  8. *_VBx+*_PRP$+*_.JJ+*_NNS+*_RP
  9. *_VBx+*_NNP+*_RP
  10. *_VBx+*_JJ+*_NNP+*_RP
  11. *_VBx+*_NNPS+*_RP
  12. *_VBx+*_PRP+*_RP

In advance, thanks for your help!


Solution

I wouldn't recommend using regular expressions here. They are less intuitive than simply splitting each line on whitespace, rearranging the resulting list where needed, and joining it back together. You can try something like this:

with open('corpus.txt') as corpus, \
        open('reordered_corpus.txt', 'w') as reordered_corpus:
    for line in corpus:
        phrase = line.split()                   # split on whitespace
        vb_index = rp_index = -1                # indices of verb and particle
        for i, word_pos in enumerate(phrase):
            pos = word_pos.rsplit('_', 1)[-1]   # POS tag follows the last _
            if pos in ('VB', 'VBZ'):            # can add more verb POS tags
                vb_index = i
            elif vb_index >= 0 and pos == 'RP': # or more particle POS tags
                rp_index = i
                break                           # found both so can stop
        if vb_index >= 0 and rp_index >= 0:     # move particle after verb
            phrase = (phrase[:vb_index + 1] + [phrase[rp_index]] +
                      phrase[vb_index + 1:rp_index] + phrase[rp_index + 1:])
        reordered_corpus.write(' '.join(phrase) + '\n')

Using this code, if corpus.txt reads,

you_PRP mean_VBP we_PRP should_MD kick_VB them_PRP out_RP ._.
don_VB 't_NNP take_VB it_PRP off_RP until_IN I_PRP say_VBP so_RB ._.
please_VB help_VB the_DT man_NN out_RP ._.
shut_VBZ it_PRP down_RP !_.

after running, reordered_corpus.txt will be,

you_PRP mean_VBP we_PRP should_MD kick_VB out_RP them_PRP ._.
don_VB 't_NNP take_VB off_RP it_PRP until_IN I_PRP say_VBP so_RB ._.
please_VB help_VB out_RP the_DT man_NN ._.
shut_VBZ down_RP it_PRP !_.
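
The loop above stops at the first particle on a line and only recognises the VB and VBZ tags. A possible extension, offered as a sketch rather than a tested solution for the full corpus, is shown below. It accepts every tag starting with VB, handles several particles per line, and only moves a particle when nothing but object-like tags sits between it and the verb. The names move_particles, split_tag and OBJECT_TAGS are hypothetical, and the tag set is an assumption derived from the twelve word orders listed in the question.

```python
# Tags allowed between the verb and its particle. This set is an assumption
# drawn from the twelve word orders listed in the question.
OBJECT_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS", "PRP", "PRP$"}


def split_tag(token):
    """Return the POS tag of a 'word_TAG' token ('' if there is no tag)."""
    word, sep, tag = token.rpartition("_")
    return tag if sep else ""


def move_particles(line):
    """Move each RP token to just after the closest preceding VB* token."""
    result = []
    verb_at = None  # index in `result` of the most recent candidate verb
    for token in line.split():
        tag = split_tag(token)
        if tag.startswith("VB"):
            result.append(token)
            verb_at = len(result) - 1
        elif tag == "RP" and verb_at is not None:
            result.insert(verb_at + 1, token)
            verb_at = None  # this verb now has its particle
        else:
            result.append(token)
            if tag not in OBJECT_TAGS:
                verb_at = None  # something other than an object intervened
    return " ".join(result)


if __name__ == "__main__":
    print(move_particles("pick_VB it_PRP up_RP and_CC put_VB it_PRP down_RP ._."))
```

Calling move_particles on each line of corpus.txt and writing the result out would replace the body of the loop above. Particles that are separated from the verb by anything outside OBJECT_TAGS are left where they are, which is a deliberately conservative choice.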
Licensed under: CC-BY-SA with attribution