I want to extract certain information from a large file using Python.
I have 3 input files.
The first input file (input_file) is the data file, which is a 3-column tab-separated file that looks like this:
engineer-n imposition-n 2.82169386609e-05
motor-n imposition-n 0.000102011705117
creature-n imposition-n 0.000121321951973
bomb-n imposition-n 0.000680302090112
sedation-n oppression-n 0.000397074586994
roadblock-n oppression-n 5.96190620847e-05
liability-n oppression-n 0.012845281978
currency-n oppression-n 0.000793989880202
The second input file (colA_file) is a 1-column list, which looks like this:
bomb-n
sedation-n
roadblock-n
surrender-n
The third input file (colB_file) is also a 1-column list (identical in format to colA_file, but with different contents), which looks like this:
adjective-n
homeless-n
imposition-n
oppression-n
I want to extract the lines of the input file whose first column is found in colA_file and whose second column is found in colB_file.
With the example data that I have provided, this would mean filtering out every line except the following:
bomb-n imposition-n 0.000680302090112
sedation-n oppression-n 0.000397074586994
roadblock-n oppression-n 5.96190620847e-05
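To make the intended rule concrete, here is a minimal sketch that applies it to the sample data held in memory rather than read from files. It is written for Python 3, and the names `rows`, `col_a`, `col_b` and `kept` are hypothetical, introduced only for this illustration. Sets are used for the two lists on the assumption that only membership matters, not order.

```python
# Sample rows from input_file, already split into (first column, second column, score).
rows = [
    ("engineer-n", "imposition-n", "2.82169386609e-05"),
    ("motor-n", "imposition-n", "0.000102011705117"),
    ("creature-n", "imposition-n", "0.000121321951973"),
    ("bomb-n", "imposition-n", "0.000680302090112"),
    ("sedation-n", "oppression-n", "0.000397074586994"),
    ("roadblock-n", "oppression-n", "5.96190620847e-05"),
    ("liability-n", "oppression-n", "0.012845281978"),
    ("currency-n", "oppression-n", "0.000793989880202"),
]

# Contents of colA_file and colB_file as sets (assumption: only membership matters).
col_a = {"bomb-n", "sedation-n", "roadblock-n", "surrender-n"}
col_b = {"adjective-n", "homeless-n", "imposition-n", "oppression-n"}

# Keep a row only if its first column is in col_a AND its second column is in col_b.
kept = [row for row in rows if row[0] in col_a and row[1] in col_b]

for row in kept:
    print("\t".join(row))
```

On the sample data this keeps exactly the three lines listed above, in their original order.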
I have written the following code in Python to solve this task:
def test_fnc(input_file, colA_file, colB_file, output_file):
    nounA = []
    with open(colA_file, "rb") as opened_colA:
        for aLine in opened_colA:
            nounA.append(aLine.strip())
            #print nounA
    nounB = []
    with open(colB_file, "rb") as opened_colB:
        for bLine in opened_colB:
            nounB.append(bLine.strip())
            #print nounB
    with open(output_file, "wb") as outfile:
        with open(input_file, "rb") as opened_input:
            for cLine in opened_input:
                splitted_cLine = cLine.split()
                #print splitted_cLine
                if splitted_cLine[0] in nounA and splitted_cLine[1] in nounB:
                    outstring = "\t".join(splitted_cLine)
                    outfile.write(outstring + "\n")

test_fnc(input_file, colA_file, colB_file, output_file)
However, it outputs only one line, as if it were not iterating over the lists provided as input.
It also seems that my lists are being built up cumulatively: when printed, each one starts with a single item and grows by one with every item appended.
Thus, I have also tried to reference the lists as follows:
for bLine in opened_colB:
    nounB = bLine
with the same result as above.
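To show how the two attempts differ, here is a small self-contained sketch. It is written for Python 3 and uses an in-memory `io.StringIO` as a stand-in for colB_file, which is an assumption made purely for illustration. The names `col_b_text`, `appended` and `rebound` are hypothetical.

```python
import io

# Hypothetical in-memory stand-in for the contents of colB_file.
col_b_text = "adjective-n\nhomeless-n\nimposition-n\noppression-n\n"

# Attempt 1: append each stripped line, so the list grows by one item per line.
appended = []
for b_line in io.StringIO(col_b_text):
    appended.append(b_line.strip())

# Attempt 2: rebind the name on every iteration, so only the last line survives
# (still carrying its trailing newline, because it was never stripped).
rebound = None
for b_line in io.StringIO(col_b_text):
    rebound = b_line

# repr() makes trailing newlines and other invisible characters visible.
print(appended)
print(repr(rebound))
```

With the second form, `nounB` ends up as a single string rather than a list, so `in` becomes a substring test against the last line only. Printing `repr()` of each line read from the real files may also help check whether they are actually being split into separate lines as expected.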