Extracting information from 2 separate lists

https://stackoverflow.com/questions/23540309

18-07-2023

Question

I want to extract certain information from a large file using Python. I have 3 input files. The first input file (input_file) is the data file, a 3-column tab-separated file that looks like this:

engineer-n imposition-n 2.82169386609e-05
motor-n imposition-n 0.000102011705117
creature-n imposition-n 0.000121321951973
bomb-n imposition-n 0.000680302090112
sedation-n oppression-n 0.000397074586994
roadblock-n oppression-n 5.96190620847e-05
liability-n oppression-n 0.012845281978
currency-n oppression-n 0.000793989880202

The second input file (colA_file) is a 1-column list, which looks like this:

bomb-n
sedation-n
roadblock-n
surrender-n

The third input file (colB_file) is also a 1-column list (identical in format to colA_file, but with different contents), which looks like this:

adjective-n
homeless-n
imposition-n
oppression-n

I want to extract the lines of the input file whose first column is found in colA and whose second column is found in colB. With the example data that I have provided, this would mean filtering out everything except the following lines:

bomb-n imposition-n 0.000680302090112
sedation-n oppression-n 0.000397074586994
roadblock-n oppression-n 5.96190620847e-05

I have written the following code in Python to solve this task:

def test_fnc(input_file, colA_file, colB_file, output_file):
    nounA = []
    with open(colA_file, "rb") as opened_colA:
        for aLine in opened_colA:
            nounA.append(aLine.strip())
            #print nounA

    nounB = []
    with open(colB_file, "rb") as opened_colB:
        for bLine in opened_colB:
            nounB.append(bLine.strip())
            #print nounB

    with open(output_file, "wb") as outfile:
        with open(input_file, "rb") as opened_input:
            for cLine in opened_input:
                splitted_cLine = cLine.split()
                #print splitted_cLine
                if splitted_cLine[0] in nounA and splitted_cLine[1] in nounB:
                    outstring = "\t".join(splitted_cLine)
                    outfile.write(outstring + "\n")

test_fnc(input_file, colA_file, colB_file, output_file)

However, it only outputs one line, as if it were not iterating over the list inputs provided. It also seems that my lists are being appended onto each other, starting with one item and growing by one with each appended item. Thus, I have also tried assigning to the lists as follows:

    for bLine in opened_colB:
        nounB = bLine

with the same result as above.

Solution

If you don't mind the dependency, I would use pandas or numpy. With a pandas.DataFrame you can perform isin checks on its columns. Otherwise I'd recommend using sets, since a regex should be much slower. Something like this:

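# Sets use hashing, so membership tests stay fast even for long lists.
with open(colA_file, "rb") as file_h:
    noun_a = set(line.strip() for line in file_h)

with open(colB_file, "rb") as file_h:
    noun_b = set(line.strip() for line in file_h)

with open(output_file, "wb") as outfile:
    with open(input_file, "rb") as opened_input:
        for line in opened_input:
            split_line = line.split()
            # Skip blank or malformed lines that lack two columns.
            if len(split_line) < 2:
                continue
            if split_line[0] in noun_a and split_line[1] in noun_b:
                # Write the original line unchanged, newline included.
                outfile.write(line)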
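
The pandas route mentioned at the start of this answer could look roughly like the sketch below. It is only an illustration: the column names a, b and score are made up here, the sample rows from the question are inlined through io.StringIO so the snippet runs without any files, and the two word lists are written out as sets rather than read from colA_file and colB_file.

```python
import io

import pandas as pd

# Sample rows from the question, inlined so the sketch needs no files.
data = io.StringIO(
    "engineer-n\timposition-n\t2.82169386609e-05\n"
    "motor-n\timposition-n\t0.000102011705117\n"
    "creature-n\timposition-n\t0.000121321951973\n"
    "bomb-n\timposition-n\t0.000680302090112\n"
    "sedation-n\toppression-n\t0.000397074586994\n"
    "roadblock-n\toppression-n\t5.96190620847e-05\n"
    "liability-n\toppression-n\t0.012845281978\n"
    "currency-n\toppression-n\t0.000793989880202\n"
)

# Stand-ins for the contents of colA_file and colB_file.
col_a = {"bomb-n", "sedation-n", "roadblock-n", "surrender-n"}
col_b = {"adjective-n", "homeless-n", "imposition-n", "oppression-n"}

# Column names are hypothetical; dtype=str keeps the numbers exactly as written.
df = pd.read_csv(data, sep="\t", header=None,
                 names=["a", "b", "score"], dtype=str)

# A row is kept only when column a is in col_a AND column b is in col_b.
mask = df["a"].isin(col_a) & df["b"].isin(col_b)
filtered = df[mask]

print(filtered.to_string(index=False, header=False))
```

For real data, the StringIO object would be replaced by the path to input_file, and filtered.to_csv(output_file, sep="\t", header=False, index=False) would write the result back out as a tab-separated file.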

Other tips

import re

# Read the first word list (text mode, so each line is a str).
nounA = []
with open('col1.txt', "r") as opened_colA:
    for aLine in opened_colA:
        nounA.append(aLine.strip())

# One alternation pattern that matches any listed word as a whole word.
patterns = [r'\b%s\b' % re.escape(s) for s in nounA]
col1 = re.compile('|'.join(patterns))

# Same again for the second word list.
nounB = []
with open('col2.txt', "r") as opened_colB:
    for bLine in opened_colB:
        nounB.append(bLine.strip())

patterns = [r'\b%s\b' % re.escape(s) for s in nounB]
col2 = re.compile('|'.join(patterns))

with open('test1.txt', "r") as opened_input:
    for cLine in opened_input:
        # Keep the line only if it contains a word from each list.
        if col1.search(cLine) and col2.search(cLine):
            print(cLine, end='')  # or write cLine to your output file

Explanation: first I take all the words in colA and build a regular expression from them; I do the same with colB. Then I search each line of the input file with both regular expressions and print the lines that match.

'\b' is a word boundary. If you search for the word 'cat', a plain pattern may also match 'catch'; wrapping the word in '\b' makes the pattern match only the whole word 'cat'.
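
A small, self-contained check of the word-boundary behaviour is sketched below. The word list and test strings are made up for illustration. The last line also shows a caveat of this approach: the pattern is searched across the whole line, so it cannot tell which column a word appears in.

```python
import re

# Hypothetical word list, standing in for the contents of one list file.
words = ["cat", "bomb-n"]

# Same construction as in the answer: each word escaped and wrapped in \b.
pattern = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words))

# 'cat' as a whole word matches.
print(pattern.search("the cat sat") is not None)

# 'catch' does not match, because no word boundary follows 'cat'.
print(pattern.search("catch the ball") is not None)

# Caveat: 'bomb-n' is found even though it sits in the second column.
print(pattern.search("imposition-n\tbomb-n\t0.1") is not None)
```

If the column matters, as it does in the question, comparing the split fields against sets is the more direct test.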

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow