Any solution for accelerating the reading of data from disk and converting it into numpy arrays for further processing?

StackOverflow https://stackoverflow.com/questions/22087044

Is there any solution for accelerating the reading of raster data from disk and converting it into numpy arrays for further processing? I am getting really tired, since the following code has taken days to read (and convert into numpy arrays) thousands of files.

import glob

import numpy as np
from osgeo import gdal  # on older GDAL installs this is simply `import gdal`

tiff_files = glob.glob('*.tif')
all_data = []
for f in tiff_files:
    data_open = gdal.Open(f)
    data_array = data_open.ReadAsArray().astype(np.float32)
    all_data.append(data_array)

How can I apply multiprocessing to the above case?


Solution

This is not too hard, since your tiff_files are already a list. An important question is whether order matters: do the results have to be in the same order as the original files? If not:

from multiprocessing import Pool, cpu_count
import glob

import numpy as np
from osgeo import gdal  # or `import gdal` on older GDAL installs


def handle_tiff(some_file):
    # Each worker process opens and reads one raster on its own.
    data_open = gdal.Open(some_file)
    data_array = data_open.ReadAsArray().astype(np.float32)
    return data_array


if __name__ == '__main__':  # required so worker processes can be spawned safely
    tiff_files = glob.glob('*.tif')
    p = Pool(cpu_count() - 1)  # leave one core free; see the note below
    all_data = p.map(handle_tiff, tiff_files)

In the above code you can also just use cpu_count() without subtracting anything; subtracting 1 merely leaves one core free for other work.
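Note that p.map itself already returns the results in the same order as tiff_files. If order genuinely does not matter, a variant using imap_unordered hands back each array as soon as its worker finishes, which can help when file sizes vary a lot. A minimal sketch, reusing handle_tiff from above:

from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    tiff_files = glob.glob('*.tif')
    with Pool(cpu_count() - 1) as p:
        # Results arrive in completion order, not input order.
        all_data = list(p.imap_unordered(handle_tiff, tiff_files))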

In response to your question, some_file is a path from the list tiff_files. Note that p.map maps each item in the list tiff_files to the function handle_tiff and spawns some number of worker processes. The list is broken into discrete chunks, each chunk is assigned to a different worker process, and the file paths in each chunk are then submitted sequentially to the function. This chunking is illustrated below.
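For reference, Pool.map exposes this chunking through its optional chunksize parameter; with thousands of small files a larger chunk can cut down on inter-process overhead. A sketch, where the value 32 is an arbitrary illustrative choice:

# Each worker receives 32 file paths per dispatch instead of the
# default, pool-computed chunk size; 32 is illustrative only.
all_data = p.map(handle_tiff, tiff_files, chunksize=32)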

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow