Incremental entropy computation

Question 1

I derived update formulas and algorithms for entropy and Gini index and made the note available on arXiv. (The working version of the note is available here.) Also see this mathoverflow answer.

For the sake of convenience I am including simple Python code, demonstrating the derived formulas:

from math import log
from random import randint

# maps x to -x*log2(x) for x>0, and to 0 otherwise 
h = lambda p: -p*log(p, 2) if p > 0 else 0

# update entropy if new example x comes in 
def update(H, S, x):
    new_S = S+x
    return 1.0*H*S/new_S+h(1.0*x/new_S)+h(1.0*S/new_S)

# entropy of union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
    S = S1+S2
    return 1.0*H1*S1/S+h(1.0*S1/S)+1.0*H2*S2/S+h(1.0*S2/S)

# compute entropy(L) using only `update' function 
def test(L):
    S = 0.0 # sum of the sample elements
    H = 0.0 # sample entropy 
    for x in L:
        H = update(H, S, x)
        S = S+x
    return H

# compute entropy using the classic equation 
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])

# entry point 
if __name__ == "__main__":
    L = [randint(1,100) for k in range(100)]
    M = [randint(100,1000) for k in range(100)]

    L_ent = entropy(L)
    L_sum = sum(L)

    M_ent = entropy(M)
    M_sum = sum(M)

    T = L+M

    print "Full = ", entropy(T)
    print "Update = ", update(L_ent, L_sum, M_ent, M_sum)

Question 2

You could re-compute the entropy by re-computing the counts and using some simple mathematical identity to simplify the entropy formula

K = count.size();
N = count[0] + ... + count[K - 1];
H = count[0]/N * log2(count[0]/N) + ... + count[K - 1]/N * log2(count[K - 1]/N)
  = F * h
h = (count[0] * log2(count[0]) + ... + count[K - 1] * log2(count[K - 1]))
F = -1/(N * log2(N))

which holds because of log2(a / b) == log2(a) - log2(b)

Now given an old vector count of observations so far and another vector of new 200 observations called batch, you can do in C++11

void update_H(double& H, std::vector<int>& count, int& N, std::vector<int> const& batch)
{
    N += batch.size();
    auto F = -1/(N * log2(N));
    for (auto b: batch)
       ++count[b];
    H = F * std::accumulate(count.begin(), count.end(), 0.0, [](int elem) { 
        return elem * log2(elem);
    });
}

Here I assume that you have encoded your observations as int. If you have some kind of symbol, you would need a symbol table std::map<Symbol, int>, and do a lookup for each symbol in batch before you update the count.

This seems the quickest way of writing some code for a general update. If you know that in every batch only few counts actually change, you can do as @migdal does and keep track of the changing counts, subtract their old contribution to the entropy and add the new contribution.