Question

I have a text file that looks like this (output of cat path_to_file in IPython):

0   0.25    truth fact 
1   0.25    train home find travel
........
199 0.25    video box store office

I also have a list:

vec = [(76, 0.04334748761500331),
 (128, 0.03697806086341099),
 (81, 0.03131634819532892),
 (1, 0.03131634819532892)]

Now I want to match the first column of vec against the first column of the text file and, for each match, output the first and second columns of vec together with the third column of the text file.

If the text file were in the same format as vec, I could have used set(a) & set(b). But the values in the text file are tab-separated (that is what it looks like when I do the following):

with open( path_to_file ) as f: lines = f.read().splitlines()

Output is :

['0\t0.25\ttruth fact lie
.........................
 '198\t0.25\tfan genre bit enjoy ',
 '199\t0.25\tvideo box store office  ']

Solution

Using NumPy:

import numpy as np
import numpy.lib.recfunctions as rfn

dtype = [('index', int), ('text', object)]
table = np.loadtxt(path_to_file, dtype=dtype, usecols=(0,2), delimiter='\t')

dtype = [('index', int), ('score', float)]
array = np.array(vec, dtype=dtype)

joined = rfn.join_by('index', table, array)

for row in joined:
    # print() as a function, so this also runs on Python 3
    print(row['index'], row['score'], row['text'])

If you care a lot about performance you can use np.savetxt() to do the output too, but I thought it was easier to understand this way.

OTHER TIPS

Converting vec to a dict and splitting the lines using "\t" as the delimiter should work:

vecdict = dict(vec)

output = []
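with open('path_to_file') as f:
    for l in f:
        # drop the trailing newline before splitting on tabs
        words = l.rstrip('\n').split('\t')
        key = int(words[0])  # the keys in vec are ints
        if key in vecdict:   # dict.has_key() no longer exists in Python 3
            output.append("%s %f %s" % (words[0], vecdict[key], ' '.join(words[2:])))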

output should then be a list of strings.
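To make the dict lookup easier to try out, here is a minimal self-contained sketch of the same idea. It assumes the file really is tab-separated with at least three columns; the function name match_rows and the in-memory sample built with io.StringIO are illustrative only and stand in for the real file.

```python
import io

def match_rows(lines, vec):
    """Return 'index score text' strings for rows whose first column is a key in vec."""
    scores = dict(vec)  # maps index -> score for O(1) lookups
    output = []
    for line in lines:
        words = line.rstrip("\n").split("\t")
        if len(words) < 3:
            continue  # skip blank or malformed lines
        key = int(words[0])
        if key in scores:
            output.append("%d %f %s" % (key, scores[key], words[2].strip()))
    return output

# Hypothetical stand-in for the real file, using rows shown in the question.
sample = io.StringIO(
    "0\t0.25\ttruth fact \n"
    "1\t0.25\ttrain home find travel\n"
    "199\t0.25\tvideo box store office  \n"
)
vec = [(76, 0.04334748761500331), (128, 0.03697806086341099),
       (81, 0.03131634819532892), (1, 0.03131634819532892)]

result = match_rows(sample, vec)
print(result)
```

Note that the matches come out in file order, not in the order of vec. If the order of vec matters, one option is to build a dict from the file instead and iterate over vec.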

PS: If the file has several different delimiters, or you are not sure which one it uses, you could either use repeated calls to split or use the re module, e.g.

import re
print(re.findall(r"\w+", "this has    multiple delimiters\tHere"))

>> ['this', 'has', 'multiple', 'delimiters', 'Here']
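The findall call above breaks the line into individual words, which would also break up the multi-word third column. If that column should stay together, a possible alternative is re.split with a pattern that treats a single tab, or any run of two or more tabs or spaces, as the column separator. The helper name split_fields is made up for this sketch, and it assumes single spaces only ever occur inside a column.

```python
import re

def split_fields(line):
    """Split a line into columns on a tab or on runs of 2+ tabs/spaces."""
    # Single spaces are left alone so multi-word columns stay intact.
    return re.split(r"[\t ]{2,}|\t", line.strip())

spaced = split_fields("1   0.25    train home find travel")
tabbed = split_fields("199\t0.25\tvideo box store office  ")
print(spaced)
print(tabbed)
```

Both the space-aligned form shown by cat and the tab-separated form shown by splitlines() end up as three columns this way.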
Licensed under: CC-BY-SA with attribution