Question

I'm doing a project on statistical machine translation in which I need to find the lines of a POS-tagged text file that match a regular expression (any non-separated phrasal verb with the particle 'out') and write their line numbers to a file, using Python.

I have this regular expression: '\w*_VB.?\sout_RP' and my POS-tagged text file: 'Corpus.txt'. I would like to get an output file with the line numbers that match the above-mentioned regular expression, and the output file should just have one line number per line (no empty lines), e.g.:

2
5
44

So far all I have in my script is the following:

OutputLineNumbers = open('OutputLineNumbers', 'w')
with open('Corpus.txt', 'r') as textfile:
    phrase='\w*_VB.?\sout_RP'
    for phrase in textfile: 

OutputLineNumbers.close()

Any idea how to solve this problem?

In advance, thanks for your help!


Solution

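This should solve your problem, assuming the pattern passed to re.compile is the correct regex for your phrasal verbs:

import re

# compile the regex (raw string, so the backslashes reach the regex engine unchanged)
regex = re.compile(r'\w*_VB.?\sout_RP')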

# open the files
with open('Corpus.txt','r') as inputFile:
    with open('OutputLineNumbers', 'w') as outputLineNumbers:
        # loop through each line in corpus
        for line_i, line in enumerate(inputFile, 1):
            # check if we have a regex match
            if regex.search(line):
                # if so, write its line number to the output file
                outputLineNumbers.write("%d\n" % line_i)
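To check the pattern before running it over the whole corpus, the same enumerate loop can be tried on a few lines held in memory. This is only a minimal sketch: the function name matching_line_numbers and the sample sentences are invented for illustration and are not part of the original answer or of any real corpus.

```python
import re

# The asker's pattern: a verb tag (_VB, _VBD, _VBZ, ...) directly followed by "out_RP".
PHRASAL_OUT = re.compile(r'\w*_VB.?\sout_RP')


def matching_line_numbers(lines, regex=PHRASAL_OUT):
    """Return the 1-based numbers of the lines that contain a match."""
    return [line_no for line_no, line in enumerate(lines, 1) if regex.search(line)]


# Hypothetical POS-tagged sample, made up for this example.
sample = [
    "He_PRP went_VBD home_NN ._.",
    "She_PRP found_VBD out_RP the_DT truth_NN ._.",
    "They_PRP found_VBD it_PRP out_RP ._.",
    "Figure_VB out_RP the_DT answer_NN ._.",
]

numbers = matching_line_numbers(sample)
# One line number per line, with no empty lines in between.
print("\n".join(str(n) for n in numbers))
```

The third sample line has the verb and the particle separated by "it_PRP", so it is skipped, which matches the "non-separated" requirement in the question. Any iterable of lines works as input, including an open file object.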

OTHER TIPS

You can also do this directly in bash with grep, provided your regular expression is grep-friendly. The "-n" option prefixes each matching line with its line number.

For example:

grep -n  "[1-9][0-9]" tags.txt

will output the matching lines, each prefixed with its line number:

2569:vote2012
2570:30
2574:118
2576:7248
2578:2293
2580:9594
2582:577
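
Since the question asks for the line numbers only, the text after the colon still has to be dropped. A minimal sketch of that step in Python, assuming the grep output has already been captured as a string (the variable names are illustrative, and the three entries are copied from the output above):

```python
# Lines in the "number:text" shape that grep -n prints.
grep_output = """2569:vote2012
2570:30
2574:118"""

# Everything before the first colon is the line number.
line_numbers = [int(entry.split(":", 1)[0]) for entry in grep_output.splitlines()]
print(line_numbers)
```

Splitting on the first colon only keeps the approach safe when the matched text itself contains colons.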
Licensed under: CC-BY-SA with attribution