Question

Imagine you have a rather large dataset of 2,000,000 points randomly distributed over some polygonal space. A density function has to be estimated each time from a randomly chosen sample of 4,000 points, and this process has to be repeated 200 times. My code doesn't handle this problem very well. Any suggestions on how to improve it?

# coord is a SpatialPoints object
library(sp)
library(maptools)
library(maps)

You can get the polygon shapefile from the following link: https://www.dropbox.com/sh/65c3rke0gi4d8pb/LAKJWhwm-l

germG <- readShapePoly("vg250_gem.shp")
coord <- spsample(germG, 2e06, "random") # this command takes a few minutes to run

# R is the number of simulations
R <- 200
M <- matrix(NA, R, 256)
ptm <- proc.time()
for(r in 1:R) {
  ix <- sample(1:2e06, size=4000)        # draw 4,000 of the 2,000,000 points
  Dg <- spDists(coord[ix])               # full pairwise distance matrix
  Dg <- as.vector(Dg[Dg!=0])             # drop the zero self-distances
  kg <- density(Dg, bw="nrd0", n=256)    # kernel density of the distances
  M[r,] <- kg$y
}
runningtime <- proc.time() - ptm
cat("total run time (sec) =", round(runningtime[3],1), "\n")

The code above reports a total run time (sec) = 964.8 on a Core i3, 2.27 GHz, with 4 processors and 4 GB RAM. How can I speed up this for-loop simulation? I'd be very thankful for any comments, criticism and suggestions.


Solution

Not really an answer, just some observations:

  1. If R = # iterations and S = sample size per iteration (e.g., R = 200 and S = 4000), then your run time will be ~O(R × S²). So doubling the runs and halving the sample size will reduce run time by a factor of 2.
  2. The default distance metric in spDists(...) is Euclidean. If this is what you want, then you are much better off using the dist(...) function - it is much more efficient (see code below). If you want geographic distance (e.g., Great Circle) you have to use spDists(..., longlat=T).
  3. spDists(...) calculates the full distance matrix, not just the lower triangle. This means that all non-zero distances show up twice, which has an effect on your densities. This is why M1 and M2 in the code below are not equal (see the sketch after this list).
  4. For large S, profiling your code (using Rprof) shows that about 46% of the time is spent in density(...), and another 38% is spent in spDists(...). So this is one case where replacing the for loop with calls to apply, lapply etc. will not help much.
  5. There are other functions for calculating a geographic distance matrix - assuming that's what you want, but none were faster than what you are already using. I tried earth.dist(...) in the fossil package, distm(...) from the geosphere package, and rdist.earth(...) in the fields package, but none of these improved run times.
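
To make point 3 concrete, here is a minimal, self-contained sketch (on a small made-up coordinate matrix, not the OP's shapefile) showing that taking only the lower triangle of the spDists() matrix recovers each pairwise distance exactly once, i.e., the same set of values that dist() returns:

# small self-contained check; xy is an arbitrary set of 10 2-D points
library(sp)
set.seed(42)
xy <- matrix(runif(20), ncol=2)
d_full  <- spDists(SpatialPoints(xy))    # full symmetric matrix (every distance twice)
d_lower <- d_full[lower.tri(d_full)]     # each distance exactly once
all.equal(sort(d_lower), sort(as.vector(dist(xy))))
# expected: TRUE (Euclidean case)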

Code:

library(sp)
library(maptools)
germG <- readShapePoly("vg250_gem.shp")
coord <- spsample(germG, 1e4, "random") # Just 10,000 points...
R <- 200

# dist(...) and spDists(..., longlat=F) give same result
zz <- coord[sample(1e4,size=200)]
d1 <- spDists(zz)
d2 <- dist(zz@coords)
max(abs(as.matrix(d1)-as.matrix(d2)))
# [1] 0
# but dist(...) is much faster
M1 <- matrix(NA,R, 256)
set.seed(1)
system.time({
  for(r in 1:R) {
    ix <- sample(1e4,size=200)    # S = 200; test case
    Dg <- spDists(coord[ix])      # using spDists(...)
    Dg <- as.vector(Dg[Dg!=0])
    kg <- density(Dg,bw="nrd0",n=256)
    M1[r,] <- kg$y
  }
})
#    user  system elapsed 
#   11.08    0.17   11.28 

M2 <- matrix(NA,R, 256)
set.seed(1)
system.time({
  for(r in 1:R) {
    ix <- sample(1e4,size=200)    # S = 200; test case
    Dg <- dist(coord[ix]@coords)  # using dist(...)
    Dg <- as.vector(Dg[Dg!=0])
    kg <- density(Dg,bw="nrd0",n=256)
    M2[r,] <- kg$y
  }
})
# user  system elapsed 
# 2.67    0.03    2.73 

Edit In response to the OP's request, below is the profiling code with size=200.

R=200
M <- matrix(NA,R, 256)
Rprof("profile")
set.seed(1)
system.time({
  for(r in 1:R) {
    ix <- sample(1e4,size=200)    # S = 200; test case
    Dg <- spDists(coord[ix])      # using spDists(...)
    Dg <- as.vector(Dg[Dg!=0])
    kg <- density(Dg,bw="nrd0",n=256)
    M[r,] <- kg$y
  }
})
Rprof(NULL)
head(summaryRprof("profile")$by.total,10)
#                   total.time total.pct self.time self.pct
# "system.time"          11.52    100.00      0.02     0.17
# "spDists"               7.08     61.46      0.02     0.17
# "matrix"                6.76     58.68      0.24     2.08
# "apply"                 6.58     57.12      0.26     2.26
# "FUN"                   5.88     51.04      0.22     1.91
# "spDistsN1"             5.66     49.13      3.36    29.17
# "density"               3.18     27.60      0.02     0.17
# "density.default"       3.16     27.43      0.06     0.52
# "bw.nrd0"               1.98     17.19      0.00     0.00
# "quantile"              1.76     15.28      0.02     0.17

As S gets bigger, calculating density begins to dominate because the results must be sorted. You can run this code with ix <- sample(1e4,size=4000) to see it.

OTHER TIPS

You might find it marginally faster to define a blank matrix Dg ahead of time, rather than letting it be created anew inside the loop.

Beyond that, you may want to consider a multicore approach, provided you have sufficient RAM.
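
As a rough sketch of that suggestion (not something from the original answers), the loop could be pushed through the base parallel package. This assumes coord has been built as in the question, that each worker has enough memory for its own distance matrix, and that plain Euclidean distances via dist() are acceptable:

library(sp)
library(parallel)

R <- 200
n_cores <- max(1, detectCores() - 1)       # leave one core free

one_run <- function(r) {
  ix <- sample(2e06, size=4000)
  Dg <- dist(coord[ix]@coords)             # Euclidean distances, each pair counted once
  kg <- density(as.vector(Dg), bw="nrd0", n=256)
  kg$y
}

# mclapply() forks on Unix-alikes; on Windows use makeCluster()/parLapply() instead.
# For reproducible parallel streams, set RNGkind("L'Ecuyer-CMRG") and a seed first.
res <- mclapply(seq_len(R), one_run, mc.cores=n_cores)
M <- do.call(rbind, res)                   # R x 256 matrix of density estimates

Whether this pays off depends on how many physical cores are available and on the per-worker memory footprint.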

Licensed under: CC-BY-SA with attribution