Question

My data (the doc-topics output from a MALLET topic model) has the following shape:

0   000cbac90fcc47efad081a929d74ab4c    0   0.3571185904395461  19  0.3113396339935042  4   0.12325835735304397 10  0.10409710001928904 8   0.04929547912982593 1   0.026833654459159112    2   0.01605333272832067 6   0.0048677134975701405   11  0.0019476615596546681   17  0.001908921062167065    14  0.0016877826048433426   5   5.256520178505095E-4    12  2.6155079718746636E-4   15  2.1354275518902175E-4   16  1.4885824861864537E-4   18  1.0362646846573807E-4   3   9.802082006786611E-5    9   9.641623707017035E-5    13  9.348016118039801E-5    7   5.0625647445916464E-5   
1   002d2e8fbf8f40399fb52c12fe6d79a1    2   0.4893657941273363  19  0.31683989601254264 5   0.14250187050440621 4   0.020578111489117222    8   0.012792452390528172    10  0.012232805681991418    11  9.110846196785881E-4    17  8.170410538631972E-4    12  6.349213666458032E-4    15  5.18380595356186E-4 0   4.750799583739948E-4    6   3.65757686209981E-4 16  3.613572723378424E-4    14  2.8022169390809563E-4   1   2.5184798205700335E-4   18  2.5155594892638147E-4   3   2.3794809156182638E-4   9   2.3405292457801712E-4   13  2.2692552394855585E-4   7   1.228950766326711E-4    
2   0046e05d3731491da4d9bab51d6ea36a    16  0.652945776661391   8   0.07953971245258269 0   0.06617734607073089 19  0.059407148715209045    4   0.02302855863782211 5   0.019674895033989195    11  0.014047819199510685    17  0.01259778154009113 2   0.012346407436534609    10  0.012058449719318621    12  0.009789716972385079    15  0.0079927997057698  6   0.005639539660456337    14  0.004320678460350115    1   0.0038831902561877263   18  0.0038786874597068794   3   0.0036688708127993316   9   0.003608812064842682    13  0.0034989161965088794   7   0.001894892943813293    
etc...

Each row (representing a document) has an index column, an id column, and 40 more columns holding 20 (topic number, proportion) pairs, one pair per topic, sorted by descending proportion.

The shape I'm trying to get is one where each topic number is its own column and rows contain just the index, id, and topic proportions, e.g.:

i   id                                  0                   1                   2
0   000cbac90fcc47efad081a929d74ab4c    0.3571185904395461  0.2346339935042     0.1884010001928904
etc...

It seems like this could be accomplished with pivots, but the topic numbers being out of order (sorted by relevance in rows) makes this a head scratcher...

How would one accomplish this?


Solution

One way to do it is to use Python's slice notation to pull the topic numbers and the proportions out of each line separately (every other field), zip them into pairs, and then append the document id so that each pair becomes a 3-tuple, e.g.:

data = []
with open('doc-topics', 'r') as f:  # the file is closed automatically
    malletOutput = f.readlines()

for line in malletOutput:
    # strip the trailing tab/newline, then drop the useless leading index column
    line = line.rstrip().split('\t')[1:]
    _id = line[0]

    tIndices = map(int, line[1::2])   # topic numbers
    tVals = map(float, line[2::2])    # topic proportions
    topics = sorted(zip(tIndices, tVals))  # order the pairs by topic number
    topics = [t + (_id,) for t in topics]  # add _id: (topic, proportion, id)
    data.extend(topics)
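
To see what the two slices pick out, here is a minimal sketch of just the slicing step on a single shortened row. The id and values are truncated from the sample data purely for illustration, and the names `row`, `fields`, `topic_ids`, `proportions` and `pairs` are made up for this example:

```python
# A shortened row in the doc-topics layout: index, id, then (topic, proportion) pairs.
# Values are truncated from the sample data for readability (illustrative only).
row = "0\t000cbac9\t0\t0.357\t19\t0.311\t4\t0.123\t\n"

fields = row.rstrip().split("\t")[1:]  # strip trailing tab/newline, drop the index column
doc_id = fields[0]

# Every other field starting at position 1 is a topic number;
# every other field starting at position 2 is the matching proportion.
topic_ids = [int(x) for x in fields[1::2]]
proportions = [float(x) for x in fields[2::2]]

# Sorting the pairs orders them by topic number instead of by relevance.
pairs = sorted(zip(topic_ids, proportions))
print(doc_id, pairs)
```

The sorted pairs are what the loop above extends with the document id before appending them to `data`.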

Now you have a list of (topic, proportion, id) 3-tuples which you can load into a DataFrame. A simple pivot from there results in the desired shape:

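import pandas as pd
# name the columns so the pivot is self-explanatory
df = pd.DataFrame(data, columns=['topic', 'proportion', 'id'])
df = df.pivot(index='id', columns='topic', values='proportion')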
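
As an alternative to the pivot (not part of the original answer), the wide shape can also be built directly, since every row already lists all of its topics. The sketch below assumes tab-separated input in the layout shown in the question. The function name `doc_topics_to_wide` and the ids `doc_a` and `doc_b` are hypothetical, and the sample values are made up:

```python
import io

def doc_topics_to_wide(lines):
    """Map each document id to a {topic_number: proportion} dict."""
    wide = {}
    for line in lines:
        fields = line.rstrip().split("\t")[1:]  # drop the leading index column
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        doc_id = fields[0]
        topic_ids = (int(x) for x in fields[1::2])
        proportions = (float(x) for x in fields[2::2])
        # Sort by topic number so every document lists topics in the same order.
        wide[doc_id] = dict(sorted(zip(topic_ids, proportions)))
    return wide

# Made-up two-document sample with three topics each (hypothetical ids and values).
sample = io.StringIO(
    "0\tdoc_a\t2\t0.6\t0\t0.3\t1\t0.1\t\n"
    "1\tdoc_b\t1\t0.7\t2\t0.2\t0\t0.1\t\n"
)
wide = doc_topics_to_wide(sample)
print(wide["doc_a"])
```

If pandas is available, passing the result to `pd.DataFrame.from_dict(wide, orient='index')` should give one row per id and one column per topic number, without the intermediate long-format list.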
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange