Minimum and maximum of the last 1000 values of the changing list

https://stackoverflow.com/questions/7876453

11-02-2021
|

题

I'm creating an iterative algorithm (Monte Carlo method). The algorithm returns a value on every iteration, creating a stream of values.

I need to analyze these values and stop the algorithm when say 1000 returned values are withing some epsilon.

I decided to implement it calculation the max and min values of the last 1000 values, and then calculate the error using this formula (max-min)/min and compare it to epsilon: error<=epsilon. And if this condition is reached, stop the iterations and return the result.

The first hare-brained idea was to use a list and append new values to it, calculating the max and min values for the last 1000 values of it after each appending.
Then I decided there is no use of keeping more that 1000 last values. So I remembered of deque. It was a very good idea since the complexity on adding and deleting on both ends of deque object is O(1). But it didn't solve the problem of needing to go through all the last 1000 values on each iteration to calculate min and max.
Then I remembered there is the heapq module. It keeps the data in such a way as to efficiently return the smallest one at every moment. But I need both the smallest and the largest ones. Furthermore I need to preserve the order of the elements so that I can keep 1000 last returned elements of the algorithm, and I don't see how I can achieve it with heapq.

Having all those thoughts in mind I decided to ask here:

How can I solve this task the most efficiently?

解决方案

If you are free / willing to change your definition of error, you might want to consider using the variance instead of (max-min)/min.

You can compute the variance incrementally. True, using this method, you are not deleting any values from your stream -- the variance will depend on all the values. But so what? With enough values, the first few won't matter a great deal to the variance, and the variance of the average, variance/n, will become small when enough values cluster around some fixed value.

So, you can choose to halt when the variance/n < epsilon.

其他提示

As a refinement of @unutbu's excellent idea, you could consider using exponentially-weighted moving variance. It can be computed in O(1) time per observation, requires O(1) space, and has the advantage of automatically reducing the observation's weight as the observation gets older.

The following paper has the relevant formulae: link. See equations (140)-(143) therein.

Finally, you might want to work with the standard deviation instead of variance. It is simply the square root of variance, and has the advantage of having the same units as the original data. This should make it easier to formulate a meaningful stopping criterion.

How about numpy?

Just to compare the speed:

import numpy as np
a = range(1000)
b = np.arange(1000)

max(a) # 29.7us
b.max() # 7.29us

and you can write to this array infinitely:

i = 0
b = np.empty([1000]) + np.nan

your loop:
    b[i % 1000] = value
    i += 1

The values older than 1000 iterations will get overwritten. You get the minimum/maximum with np.nanmin(b) and np.nanmax(b).

The idea behind the nan is that you initialize this array with 1000 nan's and you overwrite them one after another. The nanmin and nanmax methods ignore these nan's.

I'm afraid I'm not in a position to provide a nice Python answer now, but I'll give you the outline of the data structure you'll need to use:

Keep your 1000 items in an FIFO queue. Keep pointers to the largest and smallest items in the queue. If one of these leaves the queue, search the queue for the new largest/smallest (Amortized cost dependent on your data). If a new largest/smallest value enters the queue, just update the pointer (O(1)). Assuming your data is converging, this should work well.

Create a subclass of deque that has minvalue and maxvalue properties. When adding or removing entries, compare them to the current min and maxvalues - then you only need to rescan the deque for min/max if the value you are removing is the current min or max. And when adding, just compare the new value with the current min and max, and update accordingly. This will optimize the scanning of your deque for min/max values.

You could use two fibonacci heaps. Adding values is in O(1), deleting is in O(log(n)). In your question you already suggest the heapq module. I am not sure what kind of heap it provides, but a normal one will also work very smoothly.

The problem that you can only extract the minimum from one heap but not the maximum can be solved by keeping two heaps. Since I do not know the heapq module, you either might be able to provide it your own comparison function, or you can just use -value instead of value for the key of the second heap.

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow