Store and update huge (and sparse?) multi-dimensional array efficiently to count conditional probabilities

https://stackoverflow.com/questions/4413520

08-10-2019

Question

Just for fun I would like to count the conditional probabilities that a word (from a natural language) appears in a text, depending on the last and next-to-last word. I.e. I would take a huge bunch of e.g. English texts and count how often each combination n(i|jk) and n(jk) appears (where j, k, i are successive words).

The naive approach would be to use a 3-D array (for n(i|jk)), using a mapping of words to position in 3 dimensions. The position look-up could be done efficiently using tries (at least that's my best guess), but already for O(1000) words I would run into memory constraints. But I guess that this array would be only sparsely filled, most entries being zero, and I would thus waste lots of memory. So no 3-D array.

What data structure would be suited better for such a use case and still be efficient to do a lot of small updates like I do them when counting the appearances of the words? (Maybe there is a completely different way of doing this?)

(Of course I also need to count n(jk), but that's easy, because it's only 2-D :) The language of choice is C++ I guess.

Solution

C++ code:

#include <map>
#include <string>
#include <vector>
using namespace std;

struct bigram_key{
    int i, j;// words - indexes of the words in a dictionary

    // a constructor to be easily constructible
    bigram_key(int a_i, int a_j):i(a_i), j(a_j){}

    // keys must be ordered to be used in a map container
    bool operator<(bigram_key const &other) const{
        return i<other.i || (i==other.i && j<other.j);
    }
};

struct bigram_data{
    int count = 0;// n(ij)
    map<int, int> trigram_counts;// n(k|ij) = trigram_counts[k]
};

map<bigram_key, bigram_data> trigrams;

The dictionary could be a vector of all found words like:

vector<string> dictionary;

but for a faster word->index lookup it could be a map:

map<string, int> dictionary;
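
If both directions are needed (word->index while counting, index->word when printing results), the two containers could be kept side by side. The following is only a sketch: the struct `Dictionary` and its member `word_index` are hypothetical names, not part of the answer above.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: keeps both directions of the word <-> index mapping.
struct Dictionary {
    std::map<std::string, int> index_of;  // word -> index
    std::vector<std::string> words;       // index -> word

    // Returns the index of `word`, assigning the next free index if it is new.
    int word_index(const std::string &word) {
        auto it = index_of.find(word);
        if (it != index_of.end()) return it->second;
        int idx = static_cast<int>(words.size());
        index_of.emplace(word, idx);
        words.push_back(word);
        return idx;
    }
};
```

Indexes are handed out in order of first appearance, so `words[idx]` always gives back the word that received `idx`.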

When you read a new word, you add it to the dictionary and get its index k. You already have the indexes i and j of the previous two words, so you just do:

trigrams[bigram_key(i,j)].count++;
trigrams[bigram_key(i,j)].trigram_counts[k]++;

For better performance you can look up the bigram only once:

bigram_data &bigram = trigrams[bigram_key(i,j)];
bigram.count++;
bigram.trigram_counts[k]++;
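
Putting the pieces together, a counting loop and the resulting probability estimate might look roughly like the sketch below. Several things here are assumptions made for illustration: words are split on whitespace only, `std::pair<int, int>` stands in for `bigram_key` (it already provides `operator<`), and the names `TrigramCounter`, `add_text` and `probability` are hypothetical.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>

struct BigramData {
    int count = 0;                      // how often the bigram (i, j) was followed by a word
    std::map<int, int> trigram_counts;  // how often word k followed the bigram (i, j)
};

// Hypothetical wrapper around the dictionary and the sparse trigram table.
struct TrigramCounter {
    std::map<std::string, int> dictionary;                     // word -> index
    std::map<std::pair<int, int>, BigramData> trigrams;        // (i, j) -> counts

    // emplace leaves an existing entry untouched and returns an iterator to it,
    // so a known word keeps its index and a new word gets the next free one.
    int word_index(const std::string &word) {
        int next = static_cast<int>(dictionary.size());
        return dictionary.emplace(word, next).first->second;
    }

    // Assumption: whitespace tokenization, no punctuation or case handling.
    void add_text(const std::string &text) {
        std::istringstream in(text);
        std::string word;
        int i = -1, j = -1;  // -1 means "no previous word yet"
        while (in >> word) {
            int k = word_index(word);
            if (i >= 0 && j >= 0) {
                BigramData &bigram = trigrams[std::make_pair(i, j)];
                bigram.count++;
                bigram.trigram_counts[k]++;
            }
            i = j;
            j = k;
        }
    }

    // Estimates P(wk | wi wj) as n(ijk) / n(ij); returns 0.0 for anything unseen.
    double probability(const std::string &wi, const std::string &wj,
                       const std::string &wk) const {
        auto a = dictionary.find(wi);
        auto b = dictionary.find(wj);
        auto c = dictionary.find(wk);
        if (a == dictionary.end() || b == dictionary.end() || c == dictionary.end())
            return 0.0;
        auto bg = trigrams.find(std::make_pair(a->second, b->second));
        if (bg == trigrams.end()) return 0.0;
        auto tg = bg->second.trigram_counts.find(c->second);
        if (tg == bg->second.trigram_counts.end()) return 0.0;
        return static_cast<double>(tg->second) / bg->second.count;
    }
};
```

Only bigrams and trigrams that actually occur take up memory, which is the point of nesting maps instead of allocating a dense 3-D array. For the text "a b c a b d a b c", the bigram "a b" is followed by "c" twice and by "d" once, so `probability("a", "b", "c")` evaluates 2/3.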

Is it understandable? Do you need more details?

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow