Question

question

I wrote a small Python batch processor that loads binary data, performs numpy operations, and stores the results. It consumes much more memory than it should. I looked at similar Stack Overflow discussions and would like to ask for further recommendations.

background

I convert spectral data to RGB. The spectral data is stored in a Band Interleaved by Line (BIL) image file, which is why I read and process the data line by line. I read the data using the Spectral Python library, which returns numpy arrays. hyp is a descriptor of a large spectral file: hyp.ncols=1600, hyp.nrows=3430, hyp.nbands=160
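
For context, hyp is created roughly like this (a minimal sketch; the header file name is a placeholder):

import spectral

# Open the image via its ENVI header; the path is a placeholder.
hyp = spectral.open_image('scene.hdr')

# Reading one line returns a (1, ncols, nbands) numpy array,
# i.e. (1, 1600, 160) for the file described above.
spec_line = hyp.read_subregion((0, 1), (0, hyp.ncols))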

code

import spectral
import numpy as np
import scipy.interpolate

class CIE_converter(object):
    def __init__(self, cie):
        # cie: array whose first column holds wavelengths and whose
        # columns 1-3 hold the CIE colour matching functions
        self.cie = cie

    def interpolateBand_to_cie_range(self, hyp, hyp_line):
        # Resample the sensor bands onto the CIE wavelength grid
        interp = scipy.interpolate.interp1d(hyp.bands.centers, hyp_line,
                                            kind='cubic', bounds_error=False, fill_value=0)
        return interp(self.cie[:, 0])

    #@profile
    def spectrum2xyz(self, hyp):
        out = np.zeros((hyp.ncols, hyp.nrows, 3))
        spec_line = hyp.read_subregion((0, 1), (0, hyp.ncols)).squeeze()
        spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
        for ii in xrange(hyp.nrows):
            spec_line = hyp.read_subregion((ii, ii + 1), (0, hyp.ncols)).squeeze()
            spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
            out[:, ii, :] = np.dot(spec_line_int, self.cie[:, 1:4])
        return out
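
For completeness, a hypothetical invocation (cie is assumed to be an (N, 4) array holding wavelengths in column 0 and the CIE matching functions in columns 1-3, loaded elsewhere):

# Loading cie is out of scope here; shape (N, 4) is assumed as above.
converter = CIE_converter(cie)
xyz = converter.spectrum2xyz(hyp)  # (ncols, nrows, 3) XYZ image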

memory consumption

All the big data is initialised outside the loop. My naive interpretation was that the memory consumption should not increase (have I used too much Matlab?). Can someone explain the factor-of-10 increase to me? This is not linear, as hyp.nrows = 3430. Are there any recommendations for improving the memory management?

  Line #    Mem usage    Increment   Line Contents
  ================================================
  76                                 @profile
  77     60.53 MB      0.00 MB       def spectrum2xyz(self, hyp):
  78    186.14 MB    125.61 MB           out = np.zeros((hyp.ncols,hyp.nrows,3))
  79    186.64 MB      0.50 MB           spec_line = hyp.read_subregion((0,1), (0,hyp.ncols)).squeeze()
  80    199.50 MB     12.86 MB           spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
  81                             
  82   2253.93 MB   2054.43 MB           for ii in xrange(hyp.nrows):
  83   2254.41 MB      0.49 MB               spec_line = hyp.read_subregion((ii,ii+1), (0,hyp.ncols)).squeeze()
  84   2255.64 MB      1.22 MB               spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
  85   2235.08 MB    -20.55 MB               out[:,ii,:] = np.dot(spec_line_int,self.cie[:,1:4])
  86   2235.08 MB      0.00 MB           return out
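
Two of the increments can be checked by hand (a quick sanity check; the 2-byte sample size is an assumption inferred from the 0.49 MB per-line reads):

# Output buffer: 1600 cols x 3430 rows x 3 channels of 8-byte float64
print 1600 * 3430 * 3 * 8 / 1024.0 ** 2  # ~125.61 MB, matches line 78

# One BIL line: 1600 cols x 160 bands, assuming 2-byte samples
print 1600 * 160 * 2 / 1024.0 ** 2       # ~0.49 MB, matches lines 79 and 83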

notes

I replaced range with xrange without drastic improvement. I'm aware that cubic interpolation is not the fastest, but this is not about CPU consumption.

No correct solution

OTHER TIPS

Thanks for the comments. They all helped me to reduce the memory consumption a little, but eventually I figured out the main reason for it:

Spectral Python images contain a numpy memmap object with the same layout as the hyperspectral data cube on disk (in the case of BIL format: (nrows, nbands, ncols)). When calling:

spec_line = hyp.read_subregion((ii,ii+1), (0,hyp.ncols)).squeeze()

the data is not only returned as a numpy array, but also cached in hyp.memmap. A second call would be faster, but in my case the memory just keeps increasing until the OS complains. Since the memmap is actually a great implementation, I will take direct advantage of it in future work.
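
That caching also matches the numbers: assuming 2-byte samples, the full cube is 3430 × 160 × 1600 × 2 bytes ≈ 1.6 GB, which accounts for most of the ~2 GB growth in the loop. To use the memmap directly without the per-call caching, one option is to map the file yourself; a minimal sketch, where the file path, dtype, and shape are assumptions that must match the ENVI header:

import numpy as np

# BIL layout on disk is (nrows, nbands, ncols); path, dtype and any
# header offset are placeholders that must match the ENVI header.
cube = np.memmap('scene.bil', dtype=np.uint16, mode='r',
                 shape=(3430, 160, 1600))

for ii in xrange(cube.shape[0]):
    # One line as (ncols, nbands), like read_subregion(...).squeeze()
    spec_line = cube[ii].T  # a lazy view; pages are loaded on access
    # ... interpolate and convert as before ...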

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow