Question

Let's say that there is a function $r$

$r_n = r(\rho_n, \tau_n)$,

where $n$ denotes a time-step of a system with an evolving state. Both $\rho$ and $\tau$ should influence $r$ equally, and should therefore be scaled. The problem is that the sequence $(\tau_1, \tau_2, \dots, \tau_n)$ grows over time as $n$ grows.

How can one perform a running standardization of $(\tau_1, \tau_2, \dots, \tau_n)$? The running mean is relatively simple to express:

$\text{mean}(\tau)_{n+1} = \frac{1}{n+1}\left[\tau_{n+1} + n \, \text{mean}(\tau)_n\right]$

where $\text{mean}(\tau)_1 = \tau_1$.
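
As a minimal sketch (the function and variable names are illustrative, not part of the original question), this update looks as follows in Python:

```python
def update_mean(mean_n, tau_next, n):
    """Incremental mean: mean_{n+1} = (tau_{n+1} + n * mean_n) / (n + 1),
    where mean_n is the mean of the first n samples."""
    return (tau_next + n * mean_n) / (n + 1)

# mean(tau)_1 = tau_1, then fold in each new sample
mean = 3.0                        # tau_1
mean = update_mean(mean, 5.0, 1)  # mean of (3, 5) -> 4.0
mean = update_mean(mean, 7.0, 2)  # mean of (3, 5, 7) -> 5.0
```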

The standardization requires

$\tilde{\tau}_n = \dfrac{\tau_n - \text{mean}(\tau)_n}{\sigma(\tau)_n}$

where

$\sigma(\tau)_n = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}[\tau_i - \text{mean}(\tau)_n]^2}$ (1)

is the standard deviation of $(\tau_1, \tau_2, \dots, \tau_n)$.

Question: is there an expression for a running standard deviation? Online I've only found links to Stack Overflow and MATLAB functions, but I am not sure which algorithm is best suited for feature scaling. By running (moving) I mean not having to store $(\tau_1, \tau_2, \dots, \tau_n)$ to calculate (1), but instead updating it incrementally.


Solution

I think you want $$S_{n} = S_{n-1} + (x_{n} - \mu_{n-1})(x_{n} - \mu_{n})$$ where $x_n$ is the new value, $\mu_n$ is the running mean, and $S_n$ is the running sum of squared deviations, so the sample variance is $S_n/(n-1)$ and the standard deviation is $\sqrt{S_n/(n-1)}$.

See https://fanf2.user.srcf.net/hermes/doc/antiforgery/stats.pdf for explanation and derivation.
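
A minimal Python sketch of this update rule combined with the standardization from the question (function and variable names are illustrative assumptions, not from the linked notes):

```python
import math

def welford_update(n, mean, s, x):
    """One step of the online update above.
    n    -- number of samples seen so far (before x)
    mean -- running mean of those n samples
    s    -- running sum of squared deviations S_n
    x    -- the new sample, e.g. tau_{n+1}
    """
    n += 1
    delta = x - mean          # x_n - mu_{n-1}
    mean += delta / n         # mu_n = mu_{n-1} + (x_n - mu_{n-1}) / n
    s += delta * (x - mean)   # S_n = S_{n-1} + (x_n - mu_{n-1})(x_n - mu_n)
    return n, mean, s

def standardize(x, n, mean, s):
    """Standardize x with the running statistics; requires n >= 2."""
    sigma = math.sqrt(s / (n - 1))  # sample standard deviation
    return (x - mean) / sigma

n, mean, s = 0, 0.0, 0.0
for tau in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    n, mean, s = welford_update(n, mean, s, tau)
    if n >= 2:
        print(standardize(tau, n, mean, s))
```

Only three scalars ($n$, the mean, and $S_n$) have to be stored, so the full sequence $(\tau_1, \dots, \tau_n)$ never needs to be kept in memory.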

Licensed under: CC-BY-SA with attribution