I am looking for an efficient way to load a 20 GB raster file (GeoTIFF) into PyTables for further out-of-core computation.

Currently I read it into a numpy array using GDAL and write that array into PyTables with the code below:

import gdal, numpy as np, tables as tb

# reads the entire raster into memory in one go -- this is where it fails for a large file
inraster = gdal.Open('infile.tif').ReadAsArray().astype(np.float32)
f = tb.openFile('myhdf.h5', 'w')
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(), shape=np.shape(inraster))
dataset[:] = inraster
dataset.flush()
dataset.close()
f.close()
inraster = None

Unfortunately, since my input file is extremely large, my PC raises a MemoryError while reading it into a numpy array. Is there an alternative way to load the data into PyTables, or any suggestion to improve my code?


Solution

I do not have a GeoTIFF file, so I fiddled around with a normal TIFF file. You may have to omit the 3 in the shape and the band slice when writing the data to the PyTables file. Essentially, I loop over the array without reading everything into memory in one go. You have to adjust n_chunks so that the chunk that gets read in one go does not exceed your system memory.

import gdal, numpy as np, tables as tb

ds = gdal.Open('infile.tif')

x_total, y_total = ds.RasterXSize, ds.RasterYSize

n_chunks = 100

f = tb.openFile('myhdf.h5', 'w')
# my test image has 3 bands; drop the leading 3 for a single-band raster
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(), shape=(3, y_total, x_total))

# prepare the chunk indices
x_offsets = np.linspace(0, x_total, n_chunks).astype(int)
x_offsets = list(zip(x_offsets[:-1], x_offsets[1:]))
y_offsets = np.linspace(0, y_total, n_chunks).astype(int)
y_offsets = list(zip(y_offsets[:-1], y_offsets[1:]))

# read and write one window at a time, so the full raster never sits in memory
for x1, x2 in x_offsets:
    for y1, y2 in y_offsets:
        dataset[:, y1:y2, x1:x2] = ds.ReadAsArray(xoff=x1, yoff=y1, xsize=x2 - x1, ysize=y2 - y1)
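
After the loop you will want to flush and close the file; the array can then be reopened and processed one slice at a time without pulling everything back into memory. A minimal sketch, sticking to the same (older camelCase) PyTables calls as above; the 1000x1000 window is just an arbitrary example:

# flush what was written and close the file
dataset.flush()
f.close()

# reopen read-only and work on one block at a time, out of core
f = tb.openFile('myhdf.h5', 'r')
arr = f.root.mydata              # CArray handle, data stays on disk
block = arr[:, 0:1000, 0:1000]   # only this window is read into memory
print(block.mean())
f.close()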