Minimum and maximum of the last 1000 values of the changing list
题
I'm creating an iterative algorithm (Monte Carlo method). The algorithm returns a value on every iteration, creating a stream of values.
I need to analyze these values and stop the algorithm when say 1000
returned values are withing some epsilon
.
I decided to implement it calculation the max
and min
values of the last 1000
values, and then calculate the error
using this formula (max-min)/min
and compare it to epsilon
: error<=epsilon
. And if this condition is reached, stop the iterations and return the result.
The first hare-brained idea was to use a
list
andappend
new values to it, calculating themax
andmin
values for the last1000
values of it after each appending.Then I decided there is no use of keeping more that
1000
last values. So I remembered ofdeque
. It was a very good idea since the complexity on adding and deleting on both ends ofdeque
object isO(1)
. But it didn't solve the problem of needing to go through all the last 1000 values on each iteration to calculatemin
andmax
.Then I remembered there is the
heapq
module. It keeps the data in such a way as to efficiently return the smallest one at every moment. But I need both the smallest and the largest ones. Furthermore I need to preserve the order of the elements so that I can keep1000
last returned elements of the algorithm, and I don't see how I can achieve it withheapq
.
Having all those thoughts in mind I decided to ask here:
How can I solve this task the most efficiently?
解决方案
If you are free / willing to change your definition of error
, you might want to consider using the variance
instead of (max-min)/min
.
You can compute the variance incrementally.
True, using this method, you are not deleting any values from your stream -- the variance will depend on all the values. But so what? With enough values, the first few won't matter a great deal to the variance, and the variance of the average, variance/n
, will become small when enough values cluster around some fixed value.
So, you can choose to halt when the variance/n < epsilon
.
其他提示
As a refinement of @unutbu's excellent idea, you could consider using exponentially-weighted moving variance. It can be computed in O(1)
time per observation, requires O(1)
space, and has the advantage of automatically reducing the observation's weight as the observation gets older.
The following paper has the relevant formulae: link. See equations (140)-(143) therein.
Finally, you might want to work with the standard deviation instead of variance. It is simply the square root of variance, and has the advantage of having the same units as the original data. This should make it easier to formulate a meaningful stopping criterion.
How about numpy?
Just to compare the speed:
import numpy as np
a = range(1000)
b = np.arange(1000)
max(a) # 29.7us
b.max() # 7.29us
and you can write to this array infinitely:
i = 0
b = np.empty([1000]) + np.nan
your loop:
b[i % 1000] = value
i += 1
The values older than 1000 iterations will get overwritten. You get the minimum/maximum with np.nanmin(b)
and np.nanmax(b)
.
The idea behind the nan
is that you initialize this array with 1000 nan's and you overwrite them one after another. The nanmin
and nanmax
methods ignore these nan's.
I'm afraid I'm not in a position to provide a nice Python answer now, but I'll give you the outline of the data structure you'll need to use:
Keep your 1000 items in an FIFO queue. Keep pointers to the largest and smallest items in the queue. If one of these leaves the queue, search the queue for the new largest/smallest (Amortized cost dependent on your data). If a new largest/smallest value enters the queue, just update the pointer (O(1)). Assuming your data is converging, this should work well.
Create a subclass of deque that has minvalue and maxvalue properties. When adding or removing entries, compare them to the current min and maxvalues - then you only need to rescan the deque for min/max if the value you are removing is the current min or max. And when adding, just compare the new value with the current min and max, and update accordingly. This will optimize the scanning of your deque for min/max values.
You could use two fibonacci heaps. Adding values is in O(1), deleting is in O(log(n)). In your question you already suggest the heapq module. I am not sure what kind of heap it provides, but a normal one will also work very smoothly.
The problem that you can only extract the minimum from one heap but not the maximum can be solved by keeping two heaps. Since I do not know the heapq module, you either might be able to provide it your own comparison function, or you can just use -value
instead of value
for the key of the second heap.