Problem

I am working on an image classification script in R using the rgdal package. The raster in question is a PCIDSK file with 28 channels: a training data channel, a validation data channel, and 26 spectral data channels. The objective is to populate a data frame containing the value of each non-zero pixel in the training data channel, plus the associated spectral values from the 26 bands.

In Python/Numpy, I can easily import all the bands for the entire image into a multi-dimensional array, however, due to memory limitations the only option in R seems to be importing this data block by block, which is very slow:

library(rgdal)

raster = "image.pix"
training_band = 2
validation_band = 1
BlockWidth = 500
BlockHeight = 500

# Get some metadata about the whole raster
myinfo = GDALinfo(raster)
ysize = myinfo[["rows"]]
xsize = myinfo[["columns"]]
numbands = myinfo[["bands"]]


# Iterate through the image in blocks and retrieve the training data
column = 0
training_data = NULL
while(column < xsize){
 blockwidth = min(BlockWidth, xsize - column)    # clip the block at the right edge
 row = 0
 while(row < ysize){
  blockheight = min(BlockHeight, ysize - row)    # clip the block at the bottom edge

  # Read one block: the training band plus the 26 spectral bands
  myblock = readGDAL(raster, region.dim = c(blockheight, blockwidth),
                     offset = c(row, column),
                     band = c(training_band, 3:numbands), silent = TRUE)

  # The band values are already a data.frame in the object's data slot;
  # keep only the rows with a non-zero training class
  blockdata = myblock@data
  blockdata = blockdata[blockdata[, 1] > 0, , drop = FALSE]
  if (nrow(blockdata) > 0){
   training_data = rbind(training_data, blockdata)
  }

  row = row + blockheight
 }
 column = column + blockwidth
}
remove(blockdata, myblock, blockwidth, blockheight, row, column)

Is there a faster/better way of doing the same thing without running out of memory?

The next step after this training data is collected is to create the classifier (randomForest package) which also requires a lot of memory, depending on the number of trees requested. This brings me to my second problem, which is that creating a forest of 500 trees is not possible given the amount of memory already occupied by the training data:

# Note: in a formula, "V3:V28" means an interaction between two terms, not a
# range of variables, so the predictor names have to be pasted together explicitly
predictors = paste(names(training_data)[3:ncol(training_data)], collapse = " + ")
myformula = as.formula(paste("as.factor(", names(training_data)[1], ") ~ ", predictors, sep = ""))
r_tree = randomForest(formula = myformula, data = training_data, ntree = 500, keep.forest = TRUE)

Is there a way to allocate more memory? Am I missing something? Thanks...

[EDIT] As suggested by Jan, using the "raster" package is much faster; however, as far as I can tell, it does not solve the memory problem for gathering the training data, because that data eventually needs to sit in a data frame in memory:

library(raster)
library(randomForest)

# Set some user variables
fn = "image.pix"
training_band = 2
validation_band = 1

# Get the training data: find the cell numbers of the non-zero pixels
# in the training band, then extract all band values for those cells
myraster = stack(fn)
training_class = subset(myraster, training_band)
training_cells = Which(training_class != 0, cells = TRUE)   # zeros drop out here
training_data = extract(myraster, training_cells)
training_data = as.data.frame(training_data)

So while this is much faster (and takes less code), it still does not solve the issue of not having enough free memory to create the classifier... Is there some "raster" package function that I have not found that can accomplish this? Thanks...

Solution

Check out the raster package. It provides a handy wrapper around rgdal so you can work with rasters without loading them entirely into memory.

http://raster.r-forge.r-project.org/

Hopefully this helps.

The 'raster' package deals with basic spatial raster (grid) data access and manipulation. It defines raster classes; can deal with very large files (stored on disk); and includes standard raster functions such as overlay, aggregation, and merge.

The purpose of the 'raster' package is to provide easy to use functions for raster manipulation and analysis. These include high level functions such as overlay, merge, aggregate, projection, resample, distance, polygon to raster conversion. All these functions work for very large raster datasets that cannot be loaded into memory. In addition, the package provides lower level functions such as row by row reading and writing (to many formats via rgdal) for building other functions.
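To illustrate, here is a minimal sketch of that row-by-row style of access, assuming the file name and band layout from the question (training classes in band 2); this is an assumption-laden sketch, not code from the package documentation:

library(raster)

# Read the image one row at a time; only a single row of all 28 bands
# is ever held in memory
myraster = stack("image.pix")

training_data = NULL
for (r in 1:nrow(myraster)) {
  vals = getValues(myraster, row = r)          # matrix: one row per cell, one column per band
  vals = vals[vals[, 2] != 0, , drop = FALSE]  # keep cells with a non-zero training class (band 2)
  if (nrow(vals) > 0) training_data = rbind(training_data, vals)
}
training_data = as.data.frame(training_data)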

By using the raster package you can avoid filling your memory before using randomForest.

[EDIT] To solve the memory problem with randomForest, it may help to learn the individual trees of the random forest on subsamples (of size << n) rather than on bootstrap samples (of size n).
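randomForest's sampsize argument lets each tree draw a smaller sample directly; alternatively, several small forests grown on explicit subsamples can be merged with randomForest::combine(). A rough sketch of the latter, where the subsample size and the 10 forests x 50 trees split are illustrative assumptions:

library(randomForest)

n = nrow(training_data)
subsample_size = 10000        # assumed; pick whatever fits in memory
forests = vector("list", 10)  # 10 forests x 50 trees = 500 trees in total

for (i in seq_along(forests)) {
  idx = sample(n, min(subsample_size, n))
  forests[[i]] = randomForest(myformula, data = training_data[idx, ],
                              ntree = 50, keep.forest = TRUE)
}

# Merge the small forests into a single 500-tree classifier
r_tree = do.call(combine, forests)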

Other tips

I think the key here is this: "a data frame containing the values of each pixel which is not zero in the training data channel". If the resulting data.frame is small enough to hold in memory, you could determine its size by reading just that band, trimming it to the non-zero values, and then creating a data.frame with that many rows and the total number of columns you want.

Can you run this?

 training_band = 2

 # Read only the training band, then keep the non-zero pixels
 df = readGDAL("image.pix", band = training_band)
 df = as.data.frame(df)
 df = df[df[, 1] != 0, ]

Then you could populate the data.frame's columns one by one by reading each band separately and trimming as for the training band.
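For example, something along these lines; this sketch assumes the layout from the question (training classes in band 2, spectral data in bands 3 to 28), and the column names are made up for illustration:

library(rgdal)

# Read the training band once and record which cells to keep
train = readGDAL("image.pix", band = 2, silent = TRUE)
keep = which(train[[1]] != 0)

# Build the full-size data.frame first, then fill one column per band,
# so only one complete band is in memory at any time
training_data = data.frame(class = train[[1]][keep])
for (b in 3:28) {
  band_grid = readGDAL("image.pix", band = b, silent = TRUE)
  training_data[[paste0("B", b)]] = band_grid[[1]][keep]
  rm(band_grid)  # release the band before reading the next one
}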

If that data.frame is too big then you're stuck - I don't know if randomForest can use the memory-mapped data objects in "ff", but it might be worth trying.

EDIT: see the example code above. Note that raster gives you memory-mapped access, but the open question is whether randomForest can use memory-mapped data structures. You can possibly read in only the data you need, one band at a time - you would want to build the full data.frame up front rather than appending to it as you go.

Also, if you can create the full data.frame from the start then you'll know immediately whether it fits. By rbind()ing your way through, as your code does, you need increasingly large chunks of contiguous memory, and that is avoidable.
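For instance, a first pass over just the training band gives the row count, after which the full-size data.frame can be allocated once (27 columns assumed here: one class column plus the 26 spectral bands):

library(rgdal)

# First pass: count the non-zero training pixels, then allocate once
n_keep = sum(readGDAL("image.pix", band = 2, silent = TRUE)[[1]] != 0)
training_data = as.data.frame(matrix(NA_real_, nrow = n_keep, ncol = 27))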
