Question

What is the best R idiom to compute sums of elements within a sliding window?

Conceptually I want the following:

output <- numeric(length(input) - lag + 1)  # preallocate the result
for (i in 1:(length(input) - lag + 1))
  output[i] <- sum(input[i:(i + lag - 1)])

In other words, every output element should be the sum of a fixed number of input elements (called lag here), resulting in an appropriately shorter result vector. I know that I can theoretically write this as

output <- diff(cumsum(c(0, input)), lag = lag)

but I am worried about the precision here. I have a setup in mind where all the values would have the same sign and the vectors would be pretty large. Summing the values up front might lead to very large partial sums, so there won't be many significant digits left for the individual differences. This feels bad.
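As a toy illustration of the concern (made-up numbers, not my real data), the diff(cumsum(...)) result and direct three-element sums can disagree at the round-off level once the partial sums get large:

x <- rep(1e8, 10) + runif(10)               # large, same-sign values
windowed <- diff(cumsum(c(0, x)), lag = 3)  # window sums via large partial sums
direct   <- x[1:8] + x[2:9] + x[3:10]       # the same sums computed directly
windowed - direct                           # differences at the round-off level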

I would imagine that it should be possible to do better than that, at least by using a single function instead of two. An implementation could maintain the current sum, always adding one element and subtracting another at each iteration. Since that would still accumulate rounding errors along the way, one could perform the computations separately from both ends, and if the results at the center were too far off, compute a fresh result from the center and thus increase precision in a divide-and-conquer approach.
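For reference, here is a minimal sketch of that naive add-one/drop-one running sum (my own illustration, not an existing package function; it accumulates round-off exactly as described above):

running_sum <- function(input, lag) {
  output <- numeric(length(input) - lag + 1)
  s <- sum(input[1:lag])                # sum of the first window
  output[1] <- s
  for (i in seq_len(length(output) - 1)) {
    s <- s + input[i + lag] - input[i]  # add the entering element, drop the leaving one
    output[i + 1] <- s
  }
  output
}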

Do you know of any implementation which does anything like this?
Or is there a reason why this won't work as I think it should?
Or perhaps a reason why the diff(cumsum(…)) approach isn't as bad as it seems?


Edit: I had some off-by-one mistakes in my formulations above, making them inconsistent. They now agree on test data. lag should be the number of elements summed, and I'd expect a shorter vector as a result. I'm not dealing with time series objects, so absolute time alignment is not that relevant to me.

I had seen some noisy-looking artifacts in my real data, which I had assumed to be due to such numeric problems. Since several different approaches to computing these values, using different suggestions from answers and comments, still led to similar results, it may be that the strangeness of my data is not in fact due to numeric issues.

So in order to evaluate answers, I used the following setup:

library(Rmpfr)
library(caTools)

len <- 1024*1024*8
lag <- 3
precBits <- 128
taillen <- 6

set.seed(42) # reproducible
input <- runif(len)
input <- input + runif(len, min=-1e-9, max=1e-9) # use >32 bits

options(digits = 22)

# Reference: sum everything separately using high precision.
output <- mpfr(rep(0, taillen), precBits = precBits)
for (i in 1:taillen)
  output[i] <- sum(mpfr(input[(len-taillen+i-lag+1):(len-taillen+i)],
                        precBits=precBits))
output

addResult <- function(data, name) {
  n <- c(rownames(resmat), name)
  r <- rbind(resmat, as.numeric(tail(data, taillen)))
  rownames(r) <- n
  assign("resmat", r, parent.frame())
}

# reference solution, rounded to nearest double, assumed to be correct
resmat <- matrix(as.numeric(output), nrow=1)
rownames(resmat) <- "Reference"

# my original solution
addResult(diff(cumsum(c(0, input)), lag=lag), "diff+cumsum")

# filter as suggested by Matthew Plourde
addResult(filter(input, rep(1, lag), sides=1)[lag:length(input)], "filter")

# caTools as suggested by Joshua Ulrich
addResult(lag*runmean(input, lag, alg="exact", endrule="trim"), "caTools")

The result for this looks as follows:

                               [,1]                    [,2]
Reference   2.380384891521345469556 2.036472557725210297264
diff+cumsum 2.380384892225265502930 2.036472558043897151947
filter      2.380384891521345469556 2.036472557725210741353
caTools     2.380384891521345469556 2.036472557725210741353
                               [,3]                    [,4]
Reference   1.999147923481302324689 1.998499369297661143463
diff+cumsum 1.999147923663258552551 1.998499369248747825623
filter      1.999147923481302324689 1.998499369297661143463
caTools     1.999147923481302324689 1.998499369297661143463
                               [,5]                    [,6]
Reference   2.363071143676507723796 1.939272651346203080180
diff+cumsum 2.363071143627166748047 1.939272651448845863342
filter      2.363071143676507723796 1.939272651346203080180
caTools     2.363071143676507723796 1.939272651346203080180

The result indicates that diff+cumsum is still surprisingly accurate. (It appeared even more accurate before I thought of adding that second runif vector.) filter and caTools are both almost indistinguishable from the exact result. As for performance, I haven't tested that yet. I only know that the Rmpfr cumsum with 128 bits was slow enough that I didn't feel like waiting for it to complete. Feel free to edit this question if you have a performance benchmark, or new suggestions to add to the comparison.
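For anyone who wants to add timings, a possible harness (my sketch, not run here; results will vary by machine):

system.time(diff(cumsum(c(0, input)), lag = lag))
system.time(filter(input, rep(1, lag), sides = 1))
system.time(runmean(input, lag, alg = "exact", endrule = "trim"))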

Was it helpful?

Solution 2

This answer is based on the comment from Joshua Ulrich.

The package caTools provides a function runmean which computes my partial sum, divided by the window size (or rather by the number of non-NA elements in the window in question). Quoting from its documentation:

In case of runmean(..., alg="exact") function a special algorithm is used (see references section) to ensure that round-off errors do not accumulate. As a result runmean is more accurate than filter(x, rep(1/k,k)) and runmean(..., alg="C") functions.

Note:

Function runmean(..., alg="exact") is based on code by Vadim Ogranovich, which is based on Python code (see last reference), pointed out by Gabor Grothendieck.

References:

The code stores the sum of the current window as a sequence of double-precision floating-point values, where the smaller values represent the round-off error incurred by the larger ones. Therefore rounding errors should not accumulate even though the input data is processed in a single pass, adding one element and removing another at each step. The final result should be as exact as double-precision arithmetic can represent it.
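For illustration, the error-free addition step that such expansion-based algorithms build on can be written in plain R; this is a sketch of the classic two-sum idea, not the actual caTools code:

two_sum <- function(a, b) {
  s   <- a + b                       # rounded sum
  bb  <- s - a                       # the part of b that made it into s
  err <- (a - (s - bb)) + (b - bb)   # exact round-off error of the addition
  c(sum = s, err = err)              # s + err equals a + b exactly
}
two_sum(1e16, 1)                     # the 1 is lost in the sum but preserved in err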

Algorithms other than exact seem to yield somewhat different results, though, so I probably wouldn't suggest these.

It seems a bit unfortunate that the source code contains a runsum_exact function, but it is commented out. The division to obtain the mean, combined with the multiplication to get back to the sum, will introduce rounding errors that could have been avoided. On this, the CHANGES file says:

11) caTools-1.11 (Dec 2010)

  • Fully retired runsum.exact, which was not working for a while, use runmean with "exact" option instead.

At the moment (caTools version 1.14 from 2012-05-22) the package appears to be orphaned.

Other tips

I can't speak to whether this is such an implementation, but there is

filter(input, sides=2, filter=rep(1, lag+1))

Looking at the body of filter, it looks like the hard work gets passed off to a C routine, C_rfilter, so perhaps you could examine that to see whether it satisfies your precision requirements. Otherwise, @JoshuaUlrich's suggestion sounds promising.
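If you want output matching the question's convention (windows of exactly lag elements and a shorter result), the call can be adjusted along the lines of the question's own test setup:

output <- filter(input, rep(1, lag), sides = 1)[lag:length(input)]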

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow