Question

How can I parse a TSV file by column index? After reading the file, I need to compare the column 0 value of each line with the column 0 value of the other lines. Wherever they match, the column 1 values of the matching lines should be appended to the column 1 value of the first matching line.

For example, the file SystemType.tsv contains:

Actrius  1990s drama films 
Actrius  Catalan language films 
Actrius  Spanish films 
Actrius  Barcelona in fiction 
Actrius  Films directed by Ventura Pons 
Actrius  1996 films 
An_American_in_Paris     Compositions by George Gershwin 
An_American_in_Paris     Symphonic poems 
An_American_in_Paris     Grammy Hall of Fame Award recipients 

Column 0 of line 1 contains "Actrius", so we need to compare it against column 0 of every other line and join the column 1 values of all matching lines into a comma-separated list, as below.

Output:

Actrius   1990s drama films,Catalan language films,Spanish films,Barcelona in fiction,Films directed by Ventura Pons,1996 films
An_American_in_Paris    Compositions by George Gershwin,Symphonic poems,Grammy Hall of Fame Award recipients

I tried the following, but it doesn't work for me:

def finalextract():
    lines_seen = set()
    outfile = open("Output.txt","w+")
    infile = open("SystemType.tsv","r+")
    for line in infile:
        if line[0] == lines_seen[0]:
            string = line[1]+','+lines_seen[1]
            outfile.write(string)
            lines_seen.add(string)
    infile.close()
    outfile.close()


Solution

Here's what I came up with. It's written for Python 3, but I think the only difference from Python 2 should be my use of the print function. In Python 2 you can add from __future__ import print_function if you want to use print to write to the output file.

import collections

# I used a variable named "input" to hold the string with your example .tsv
# contents (note that this name shadows the built-in input() function);
# you'd really want to read it in from a file.

D = collections.OrderedDict()
for line in input.splitlines():
    key, value = line.split('\t')
    if key not in D:
        D[key] = []
    D[key].append(value.strip())

for key, values in D.items():
    print(key, ','.join(values), sep='\t')

My output is:

Actrius 1990s drama films,Catalan language films,Spanish films,Barcelona in fiction,Films directed by Ventura Pons,1996 films
An_American_in_Paris    Compositions by George Gershwin,Symphonic poems,Grammy Hall of Fame Award recipients
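
The snippet above works on a string held in memory. As a rough sketch of the same grouping applied directly to files, something like the following might work. The function name group_tsv and its two path parameters are made up for illustration and are not part of the original answer. It assumes Python 3, tab-separated columns, and that blank lines or lines without a tab can simply be skipped.

```python
from collections import OrderedDict


def group_tsv(in_path, out_path):
    """Group column 1 values by their column 0 key, keeping first-seen key order.

    Hypothetical helper: the name and parameters are assumptions made for
    this sketch.
    """
    groups = OrderedDict()
    with open(in_path) as infile:
        for line in infile:
            # Split on the first tab only, so the value may itself contain tabs.
            parts = line.rstrip('\n').split('\t', 1)
            if len(parts) < 2:
                continue  # skip blank lines and lines without a tab
            key, value = parts
            groups.setdefault(key, []).append(value.strip())

    with open(out_path, 'w') as outfile:
        for key, values in groups.items():
            # One line per key: key, a tab, then the comma-joined values.
            outfile.write(key + '\t' + ','.join(values) + '\n')
    return groups
```

With the file names from the question, the call would be group_tsv("SystemType.tsv", "Output.txt"). The with blocks close both files even if an error occurs, which replaces the explicit close() calls in the question's attempt.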
Licensed under: CC-BY-SA with attribution