Not really an answer, just some observations:
- If R = # iterations and S = sample size per iteration (e.g., R = 200 and S = 4000), then your run time will be ~O(R × S²). So doubling the number of runs and halving the sample size cuts run time in half: 2R × (S/2)² = R × S² / 2.
- The default distance metric in spDists(...) is Euclidean. If that is what you want, then you are much better off using the dist(...) function - it is much more efficient (see code below). If you want geographic distance (e.g., Great Circle), you have to use spDists(..., longlat=TRUE).
- spDists(...) calculates the full distance matrix, not just the lower triangle. This means that every non-zero distance shows up twice, which has an effect on your densities. This is why M1 and M2 in the code below are not equal.
- For large S, profiling your code (using Rprof) shows that about 46% of the time is spent in density(...), and another 38% is spent in spDists(...). So this is one case where replacing the for loop with calls to apply, lapply, etc. will not help much.
- There are other functions for calculating a geographic distance matrix - assuming that is what you want - but none of them improved run times: I tried earth.dist(...) in the fossil package, distm(...) from the geosphere package, and rdist.earth(...) in the fields package (see the timing sketch after this list).
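For reference, here is a minimal sketch for timing those alternatives yourself. The random lon/lat points and the repetition count are purely illustrative; the calls are the documented self-distance forms from each package.

library(sp)         # spDists()
library(fossil)     # earth.dist()
library(geosphere)  # distm()
library(fields)     # rdist.earth()
set.seed(1)
# 200 illustrative lon/lat points in a rough bounding box over Germany
pts <- cbind(lon=runif(200, 6, 15), lat=runif(200, 47, 55))
system.time(for(i in 1:100) spDists(pts, longlat=TRUE))    # km, full matrix
system.time(for(i in 1:100) earth.dist(pts))               # km, dist object
system.time(for(i in 1:100) distm(pts))                    # m,  full matrix
system.time(for(i in 1:100) rdist.earth(pts, miles=FALSE)) # km, full matrix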
Code:
library(sp)
library(maptools)
germG <- readShapePoly("vg250_gem.shp")
coord <- spsample(germG, 1e4, "random") # Just 10,000 points...
R <- 200
# dist(...) and spDists(..., longlat=F) give same result
zz <- coord[sample(1e4,size=200)]
d1 <- spDists(zz)
d2 <- dist(zz@coords)
max(abs(as.matrix(d1)-as.matrix(d2)))
# [1] 0
# but dist(...) is much faster
M1 <- matrix(NA, R, 256)
set.seed(1)
system.time({
for(r in 1:R) {
ix <- sample(1e4,size=200) # S = 200; test case
Dg <- spDists(coord[ix]) # using spDists(...)
Dg <- as.vector(Dg[Dg!=0])
kg <- density(Dg,bw="nrd0",n=256)
M1[r,] <- kg$y
}
})
# user system elapsed
# 11.08 0.17 11.28
M2 <- matrix(NA, R, 256)
set.seed(1)
system.time({
for(r in 1:R) {
ix <- sample(1e4,size=200) # S = 200; test case
Dg <- dist(coord[ix]@coords) # using dist(...)
Dg <- as.vector(Dg[Dg!=0])
kg <- density(Dg,bw="nrd0",n=256)
M2[r,] <- kg$y
}
})
# user system elapsed
# 2.67 0.03 2.73
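The difference between M1 and M2 comes from spDists(...) returning every distance twice. As a minimal sketch of one way to reconcile them: keep only the lower triangle of the full matrix, which is exactly what dist(...) stores (the column-major order even matches, so no sorting is needed).

d1 <- spDists(zz)
Dg1 <- d1[lower.tri(d1)]          # lower triangle: each pairwise distance once
Dg2 <- as.vector(dist(zz@coords)) # dist() stores only the lower triangle
max(abs(Dg1 - Dg2))
# should be 0

Feeding Dg1 (instead of the full, doubled vector) into density(...) would make the two loops produce equal matrices.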
Edit: In response to the OP's request, below is the profiling code with size=200.
R <- 200
M <- matrix(NA, R, 256)
Rprof("profile")
set.seed(1)
system.time({
for(r in 1:R) {
ix <- sample(1e4,size=200) # S = 200; test case
Dg <- spDists(coord[ix]) # using spDists(...)
Dg <- as.vector(Dg[Dg!=0])
kg <- density(Dg,bw="nrd0",n=256)
M[r,] <- kg$y
}
})
Rprof(NULL)
head(summaryRprof("profile")$by.total,10)
# total.time total.pct self.time self.pct
# "system.time" 11.52 100.00 0.02 0.17
# "spDists" 7.08 61.46 0.02 0.17
# "matrix" 6.76 58.68 0.24 2.08
# "apply" 6.58 57.12 0.26 2.26
# "FUN" 5.88 51.04 0.22 1.91
# "spDistsN1" 5.66 49.13 3.36 29.17
# "density" 3.18 27.60 0.02 0.17
# "density.default" 3.16 27.43 0.06 0.52
# "bw.nrd0" 1.98 17.19 0.00 0.00
# "quantile" 1.76 15.28 0.02 0.17
As S gets bigger, calculating the density begins to dominate, because the results must be sorted. You can run this code with ix <- sample(1e4, size=4000) to see it; a sketch of that variant follows.
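For completeness, a sketch of that large-S profile, reusing coord and M from above (the profile file name is arbitrary, and fewer iterations keep the run time manageable):

Rprof("profile4000")
set.seed(1)
for(r in 1:20) {                 # fewer iterations; S = 4000 is slow
  ix <- sample(1e4, size=4000)   # large-S case
  Dg <- spDists(coord[ix])
  Dg <- as.vector(Dg[Dg!=0])
  kg <- density(Dg, bw="nrd0", n=256)
  M[r,] <- kg$y
}
Rprof(NULL)
head(summaryRprof("profile4000")$by.total, 10)
# per the percentages above, density(...) and the sorting inside it
# should now account for the largest share of the time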