Question

I have written code that generates a large 3D NumPy array of data observations (floats). The dimensions are (33,000 x 2016 x 53), which corresponds to (#obs_locations x 5min_intervals_per_week x weeks_in_leap_year). It is very sparse (about 1.5% of entries are filled).

Currently I do this by calling:

my3Darray = np.zeros((33000, 2016, 53))

or

my3Darray = np.empty((33000, 2016, 53))

My loop then indexes into the array one entry at a time and fills in the 1.5% with floats (this part is actually very fast). I then need to:

  1. Save each 2D (33000 x 2016) slice as a CSV or other 'general format' data file
  2. Take the mean over the 3rd dimension (so I should get a 33000 x 2016 matrix)

I have tried saving with:

for slice_2d_week_i in xrange(nweeks):
    weekfile = str(slice_2d_week_i)
    np.savetxt(weekfile, my3Darray[:, :, slice_2d_week_i], delimiter=",")

However, this is extremely slow and the empty entries in the output show up as

0.000000000000000000e+00

which makes the file sizes huge.

Is there a more efficient way to save, possibly leaving blanks for entries that were never updated? Is there a better way to allocate the array besides np.zeros or np.empty? And how can I take the mean over the 3rd dimension while ignoring non-updated entries? (np.mean(my3Darray, axis=2) does not ignore the 0 entries.)


Solution

You can save in one of NumPy's binary formats; here's one I use: np.savez. It stores the data compactly and loads back far faster than parsing CSV text.
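A minimal sketch of the round trip, using a small stand-in array and hypothetical filenames for illustration; for very sparse data, np.savez_compressed usually shrinks the file substantially at the cost of some CPU time:

```python
import numpy as np

# Small stand-in for the (33000, 2016, 53) array.
a = np.zeros((4, 5, 3))
a[0, 1, 2] = 1.5
a[3, 4, 0] = 2.5

# Write one or more arrays to a single binary .npz file,
# keyed by the keyword-argument names.
np.savez("weeks.npz", data=a)

# For sparse data, compression typically gives much smaller files.
np.savez_compressed("weeks_compressed.npz", data=a)

# Load it back; arrays are looked up by the names used when saving.
restored = np.load("weeks.npz")["data"]
```

If you truly need a "general format" file for other tools, you could still export individual slices later, but keeping the working copy in .npz avoids both the slow text writing and the huge files full of literal zeros.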

You can average with np.sum(a, axis=2) / np.sum(a != 0, axis=2), which divides the sum of each cell's values by the count of nonzero entries rather than by the full length of the axis. Keep in mind that this will still give you NaNs wherever a cell was never updated, since the denominator is zero there.
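A short sketch of that mean, again on a small stand-in array; the np.errstate context just silences the expected divide-by-zero warnings for empty cells:

```python
import numpy as np

# Stand-in sparse array: two values at cell (0, 1), one at (2, 3).
a = np.zeros((4, 5, 3))
a[0, 1, 0] = 2.0
a[0, 1, 1] = 4.0
a[2, 3, 2] = 6.0

counts = np.sum(a != 0, axis=2)   # filled entries per (location, interval) cell
with np.errstate(divide="ignore", invalid="ignore"):
    means = np.sum(a, axis=2) / counts   # NaN where counts == 0
```

Here means[0, 1] is 3.0 (the mean of 2.0 and 4.0), means[2, 3] is 6.0, and every never-updated cell is NaN rather than being dragged toward zero.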

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow