Question

I am trying to find an efficient way to read a very large text file (about 2,000,000 lines). About 90% of these lines (the last 90% actually) have a three-column format and are used for storing a sparse matrix.

Here is what I did. First of all, I deal with the first 10% of the file:

import fileinput
import shlex

i = 1
cpt = 0      # counts the header lines so we know how many to skip later
skip = 0
finnum = 0   # set to 1 once the header section ends
vec = []

for line in fileinput.input("MY_TEXT_FILE.TXT"):
    if i == 1:
        # skip the first line
        skip = 1
    if finnum == 0 and skip == 0:
        # special reading operation for the first 10% (approximately):
        # keep every non-zero integer found on the line
        tline = shlex.split(line)
        ind_loc = 0
        while ind_loc < len(tline):
            if int(tline[ind_loc]) != 0:
                vec.append(int(tline[ind_loc]))
            ind_loc = ind_loc + 1
    if finnum == 1 and skip == 0:
        # the header section is over: stop the first pass
        break
    if ' 0' in line:
        finnum = 1
    if skip == 0:
        i = i + 1
    else:
        skip = 0
        i = i + 1
    cpt = cpt + 1

Then I extract the remaining 90% into a list:

matrix = []
with open('MY_TEXT_FILE.TXT') as f:
    # skip the header lines counted in the first pass
    for i in range(cpt):
        next(f)
    for line in f:
        matrix.append(line)

This allows for a very fast read-through of the text file with low memory consumption. The drawback is that matrix is a list of strings, each string being something like:

>>> matrix[23]
'           5          11  8.320234929063493E-008\n'
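For reference, the same header skip can also be written with itertools.islice from the standard library; a minimal equivalent sketch:

import itertools

with open('MY_TEXT_FILE.TXT') as f:
    # islice consumes the first cpt lines and yields the rest lazily
    matrix = list(itertools.islice(f, cpt, None))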

I have tried iterating over the lines of matrix and using shlex.split to go from a list of strings to an array, but this is extremely time-consuming.

Would you be aware of fast strategies to go from a list of strings to an array?

What I would like to know is whether there is something faster than this procedure:

A = [0] * len(matrix)
B = [0] * len(matrix)
C = [0] * len(matrix)
for i in range(len(matrix)):
    line = shlex.split(matrix[i])
    A[i] = float(line[0])
    B[i] = float(line[1])
    C[i] = float(line[2])

Alain

Was it helpful?

Solution

I came up with a mixed solution that seems to work much faster. I generated 1,000,000 lines of random sample data in the format you show above and timed your code: it took 77 seconds on my Mac, which is a fairly fast machine. Using numpy instead of shlex to split each string brought the processing down to 5 seconds:

import numpy as np

A = [0] * len(matrix)
B = [0] * len(matrix)
C = [0] * len(matrix)
for i in range(len(matrix)):
    # parse the three whitespace-separated numbers in a single call
    full_array = np.fromstring(matrix[i], dtype=float, sep=" ")
    A[i] = full_array[0]
    B[i] = full_array[1]
    C[i] = full_array[2]

I ran a couple of tests and it seems to work correctly, and it is about 14 times faster. I hope it helps.
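If you want to push this further, one possible variant (a sketch of mine, assuming every line of matrix holds exactly three numbers) is to join the whole list and parse it with a single np.fromstring call, then slice out the columns:

import numpy as np

# Sketch: one parse over the joined text instead of one call per line.
# Assumes every line of `matrix` contains exactly three numbers.
data = np.fromstring("".join(matrix), dtype=float, sep=" ").reshape(-1, 3)
A, B, C = data[:, 0], data[:, 1], data[:, 2]

Here A, B, and C come out as NumPy arrays rather than Python lists.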

Other suggestions

When you are working with this amount of numerical data, you should really be working with NumPy, not pure Python. It is typically more than a factor of 10 faster and gives you access to MATLAB-style vectorized calculations. I don't have time right now to convert your code (and it would be easiest to have a sample file), but reading the second part of your file can certainly be done quickly and efficiently using numpy.loadtxt. The whole second part of your code, both skipping the first part and converting to float, can probably be replaced with something like this:

A, B, C = np.loadtxt('MY_TEXT_FILE.TXT', skiprows=cpt, unpack=True)

You might want to play with the data format (for instance by passing a dtype argument, though I don't know the exact syntax offhand), since I guess the first two columns are integers.
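For what it's worth, one way to spell that is a structured dtype; a sketch with hypothetical field names ('row', 'col', 'val'):

import numpy as np

# Sketch, assuming two integer index columns followed by one float column.
data = np.loadtxt('MY_TEXT_FILE.TXT', skiprows=cpt,
                  dtype={'names': ('row', 'col', 'val'),
                         'formats': ('i8', 'i8', 'f8')})
rows, cols, vals = data['row'], data['col'], data['val']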

Also note that sparse matrix data types are available in scipy.sparse (NumPy itself does not provide one).
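For example, the three parsed columns map directly onto scipy.sparse.coo_matrix; a sketch assuming A and B hold 1-based row/column indices (as is common in Fortran-generated files) and C holds the values:

import numpy as np
from scipy.sparse import coo_matrix

rows = np.asarray(A, dtype=int) - 1   # convert 1-based indices to 0-based
cols = np.asarray(B, dtype=int) - 1
S = coo_matrix((C, (rows, cols)))     # shape inferred from the largest indices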
