Read a term-document matrix from CSV using Python

https://stackoverflow.com/questions/16446357

14-04-2022

Question

The reason the classic csv reader doesn't work on term-document matrices is that the first column of the CSV file contains terms, not values. The file therefore has the following layout:

"";"label1";"label2";"label3" ...
"term1";1;0;8;...
"term2";0;0;3;...
.................................

I need to build a dictionary whose keys are label1, label2, etc. and whose values are the column vectors (here: dict[label1] -> 1,0, dict[label2] -> 0,0, etc.), so the terms themselves are of no use to me.

I have implemented a custom solution which goes something like this:

....
keys = f.readline().split('";"') #1st line of the csv
keys = keys[1:]                  #skipping ""
zeros = [0] * len(keys)          #dicts initial values will be 0
d = OrderedDict(zip(keys, zeros))
lines = f.readlines()
for line in lines:
    ...
    # split and strip the line to get a list of values (e.g. 1, 0, 8; see the example above)
    ...
    for value in values:
        ....

However, reading 8 CSV files (12 MB in total) takes over 90 minutes on my laptop.

Does anyone know a more efficient way to deal with this?

Solution

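You can still use the csv module to read the CSV files into memory, then transpose the rows with zip(*rows) (or itertools.izip(*rows) on Python 2):

import csv

# Python 3: open in text mode with newline=''; on Python 2, open with 'rb' instead
with open(somecsv, newline='') as infile:
    reader = csv.reader(infile, delimiter=';')
    headers = next(reader)                 # first row: '', 'label1', 'label2', ...
    data = list(reader)                    # remaining rows: a term followed by its counts
    data = dict(zip(headers, zip(*data)))  # transpose rows into columns, keyed by header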

This creates a data dictionary with the headers as keys and the columns as values. The terms column ends up under the empty-string key ''; you can delete that key from the dictionary if you don't need it.

For your input example, the data dictionary looks like this after executing the above code:

{'': ('term1', 'term2'), 'label1': ('1', '0'), 'label2': ('0', '0'), 'label3': ('8', '3')}
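
Two follow-up steps are often wanted at this point: dropping the terms column and converting the count strings to integers. The following is a minimal sketch, assuming every cell after the first column holds an integer. The helper name read_term_document_csv and the temporary sample file are illustrative and not part of the original answer:

```python
import csv
import os
import tempfile


def read_term_document_csv(path, delimiter=';'):
    """Return {label: [int, ...]} per column, ignoring the terms column.

    Hypothetical helper; assumes every value cell is an integer.
    """
    with open(path, newline='') as infile:
        reader = csv.reader(infile, delimiter=delimiter)
        headers = next(reader)          # '', 'label1', 'label2', ...
        columns = zip(*reader)          # transpose rows into columns
        next(columns, None)             # skip the first column (the terms)
        return {label: [int(v) for v in col]
                for label, col in zip(headers[1:], columns)}


# Write the question's sample to a temporary file and read it back.
sample = '"";"label1";"label2";"label3"\n"term1";1;0;8\n"term2";0;0;3\n'
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='') as tmp:
    tmp.write(sample)
try:
    result = read_term_document_csv(tmp.name)
finally:
    os.remove(tmp.name)

print(result)  # {'label1': [1, 0], 'label2': [0, 0], 'label3': [8, 3]}
```

To process several files, the same helper can be called once per path and the results stored in a dictionary keyed by filename.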

Other tips

pandas is a good fit for this. Load the file into a DataFrame and convert that to a dictionary; the code is short and fast:

import pandas as pd
# sep=';' matches the file's delimiter; index_col=0 uses the terms as the row index
data = pd.read_csv(filename, sep=';', index_col=0)
my_dict = dict(data)  # maps each label to its column (a pandas Series)
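
Calling dict() on a DataFrame maps each column label to a pandas Series. If plain Python lists are preferred, DataFrame.to_dict(orient='list') returns them directly. Here is a small sketch on the question's sample data, read from an in-memory string instead of a file. It assumes the semicolon delimiter shown in the question and treats the first column (the terms) as the index so that it stays out of the result; the variable names are illustrative:

```python
import io

import pandas as pd

# The question's sample, held in memory so the example needs no file on disk.
sample = io.StringIO(
    '"";"label1";"label2";"label3"\n'
    '"term1";1;0;8\n'
    '"term2";0;0;3\n'
)

# sep=';' matches the file format; index_col=0 turns the terms into the row index.
df = pd.read_csv(sample, sep=';', index_col=0)

# One list of values per label.
columns = df.to_dict(orient='list')
print(columns)  # {'label1': [1, 0], 'label2': [0, 0], 'label3': [8, 3]}
```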

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow