Question

I have a huge dump of records in a file.

Filename     Col1   Col2   Col3   Col4
SE01_inf     name   []     NA     []
SE01_loc     NA     loc    NA     []
SE01_id      NA     []     123    []
SE01_1_inf   name1  []     NA     []
SE01_1_loc   NA     loc    NA     []

I want a consolidated output like this:

Filename     Col1   Col2   Col3   Col4
SE01         name   loc    123    []
SE01_1       name1  loc    NA     []

I do not want to do this in Excel: the data is huge, and Excel freezes the moment I write a function. Can I achieve this using Python? I am not sure how to start.

Solution

How huge is the data? If memory isn't a problem and you already have the rows in a list, the following works. Example input:

rows = [['SE01_inf', 'name', [], 'NA', []],
        ['SE01_loc', 'NA', 'loc', 'NA', []],
        ['SE01_id', 'NA', [], '123', []],
        ['SE01_1_inf', 'name1', [], 'NA', []],
        ['SE01_1_loc', 'NA', 'loc', 'NA', []]]

output = {}
for row in rows:
    # Group key: everything before the last underscore ('SE01_1_inf' -> 'SE01_1')
    key = row[0][:row[0].rfind('_')]
    if key not in output:
        output[key] = [key] + row[1:]
    else:
        # Keep the stored value unless it is still a placeholder ('NA' or [])
        output[key] = [new if old in ('NA', []) else old
                       for new, old in zip(row, output[key])]

print(list(output.values()))

Produces the output:

[['SE01', 'name', 'loc', '123', []], ['SE01_1', 'name1', 'loc', 'NA', []]]
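
Since the question describes a file rather than a list, the same merge can be fed one line at a time. This is only a minimal sketch: it assumes the dump is whitespace-delimited text with a single header line, and that empty cells appear literally as `NA` or `[]`. The function name `consolidate` and the in-memory sample file are hypothetical.

```python
import io

EMPTY = ('NA', '[]')  # assumed placeholders for a missing cell

def consolidate(lines):
    """Merge rows that share a filename prefix (the text before the last '_')."""
    merged = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key = fields[0].rsplit('_', 1)[0]
        if key not in merged:
            merged[key] = [key] + fields[1:]
        else:
            current = merged[key]
            # Fill only the cells that are still empty; index 0 holds the key
            for i in range(1, min(len(current), len(fields))):
                if current[i] in EMPTY:
                    current[i] = fields[i]
    return list(merged.values())

# Stand-in for a real file; an object returned by open() is iterated the same way
sample = io.StringIO(
    "Filename Col1 Col2 Col3 Col4\n"
    "SE01_inf name [] NA []\n"
    "SE01_loc NA loc NA []\n"
    "SE01_id NA [] 123 []\n"
    "SE01_1_inf name1 [] NA []\n"
    "SE01_1_loc NA loc NA []\n"
)
next(sample)  # skip the header line
result = consolidate(sample)
for row in result:
    print(row)
```

Only one merged row per group is held in memory, so memory use grows with the number of distinct prefixes rather than with the number of lines in the file.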

OTHER TIPS

@Akinakes's answer is excellent. Here is another way to approach it:

If you have

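rows = [['name', '[]', 'NA', '[]'],
        ['NA', 'loc', 'NA', '[]'],
        ['NA', '[]', '123', '[]']]

you can try

def fil(column):
    # Drop the placeholder entries from this column
    values = [x for x in column if x not in ('NA', '[]')]
    # If only placeholders remain, fall back to the column's last entry
    if not values:
        values = [column[-1]]
    # Take the first real value
    return str(values[0])

# zip(*rows) transposes the rows so that fil sees one column at a time
output = list(map(fil, zip(*rows)))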

the output will be:

['name', 'loc', '123', '[]']
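
The column-wise function above handles a single group of rows. To apply the same idea to the whole dump, the rows could first be grouped by filename prefix. The following is a sketch under the same assumptions (string placeholders `'NA'` and `'[]'`, prefix taken as the text before the last underscore); the names `first_value` and `merge_groups` are made up for illustration.

```python
from collections import defaultdict

def first_value(column):
    """Return the first real value in a column, else the column's last entry."""
    values = [x for x in column if x not in ('NA', '[]')]
    return values[0] if values else column[-1]

def merge_groups(rows):
    # Collect the data cells of every row under its filename prefix
    groups = defaultdict(list)
    for name, *cells in rows:
        groups[name.rsplit('_', 1)[0]].append(cells)
    # Transpose each group and reduce every column to a single value
    return [[key] + [first_value(col) for col in zip(*cell_rows)]
            for key, cell_rows in groups.items()]

rows = [['SE01_inf', 'name', '[]', 'NA', '[]'],
        ['SE01_loc', 'NA', 'loc', 'NA', '[]'],
        ['SE01_id', 'NA', '[]', '123', '[]'],
        ['SE01_1_inf', 'name1', '[]', 'NA', '[]'],
        ['SE01_1_loc', 'NA', 'loc', 'NA', '[]']]

merged = merge_groups(rows)
for row in merged:
    print(row)
```

Unlike the streaming merge, this version keeps every row in memory until its group is reduced, so it suits data that fits comfortably in RAM.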
Licensed under: CC-BY-SA with attribution