Random variable-length encoded numbers with uniform distribution

https://stackoverflow.com/questions/4815394

25-10-2019
|

Question

Suppose I have data presented with variable-length encoding when I can retrieve the data parsing some virtual b-tree and stopping when I reach the item (similar to Huffman encoding). There is unknown number of items (in the best case only the upper limit is known). Is there an algorithm to generate uniformly distributed numbers? The problem is that a coin-based algorithm will give non-uniform result in this case, for example if there's a number encoded as 101 and there's a number encoded 10010101, the latter will appear very rarely comparing to the former.

UPDATE: In other words, I have a set of maximum N elements (but maybe fewer) when every element can be addressed with arbitrary number of bits (and with accordance with informational theory, so if one is encoded 101 then no other element can be encoded with the same prefix). So it's more like B-Tree when I go left or right depending on a bit and at some moment I get to the data item. I want to get a sequence of random numbers addressed with this technique, but the distribution of them should be uniform (the example why choosing randomly left-right won't work is above, the numbers 101 and 10010101)

Thanks

Max

Solution

I can think of three basic methods, one of which involves frequent reguessing and one of which involves keeping extra information. I think that doing one or the other of these things is unavoidable. I'm going to begin with the extra information one:

In each node, store a number count which represents the number of descendants it has. For every node, you'll need to have a number between 1 and count for that node to tell you whether to go left or right by comparing it to the left child's count. Here's the algorithm:

n := random integer between 1 and root.count
node := route
while node.count != 1
     if n <= node.left.count
          node = node.left
     else
          node = node.right
          n = n - node.left.count

So, essentially, we're imposing a left-to-right ordering on all nodes and selecting the nth one from the left. This is fairly quick, only having a O(depth of tree), which is likely the best we can do without doing something like also building a vector which contains all the node labels. This also adds an overhead of O(depth of tree) to any changes to the tree since counts must be corrected. If you're going the other way and never changing the tree at all but going to be selecting random nodes a lot, just bit the bullet and put all of the node labels in a vector. That way you can select a random one in O(1) after O(N) initial set-up time.

If, however, you don't want to use up any storage space, here's an alternative with a lot of reguessing. First find a bound (which I'll label B) for the depth of the tree (we can use N-1 if needed, but obviously, that's a very loose bound. The tighter the bound which can be found, the faster the algorithm runs). Next we're going to generate a possible node label in a random, but even way. There are 2^(B+1)-1 possibilities. It's not just 2^B because, for example, the string "0011" and "11" are completely different strings. As a result, we need to count all possible binary strings of length between 0 and B. Obviously, we have 2^i strings of length i. So for strings of length i or less, we have sum(i=0 to B){2^i} = 2^(B+1)-1. So, we can just chose a number between 0 and 2^(B+1)-2 and then find the corresponding node label. Of course, the mapping from numbers to node labels isn't trivial, so I'll provide it here.

We convert the number we have chosen into a string of bits in the ordinary way. Then, reading from the left, if the first digit is a 0, then the node label is the remaining string to the right (possibly the empty string, which is a valid node label although not likely to be in use). If the first digit is a 1, then we throw it away and repeat this process. Thus, if B=4, then the node label "0001" would come from the number "00001". The node label "001" would come from the number "10001". The node label "01" would come from the number "11001". The node label "1" would come from the number "11101". And the node label "" would come from the number "11110". We did not include the number 2^(B+1)-1 ("11111" in this case) which has no valid interpretation under this scheme. I'll leave it as an exercise to the reader to prove to themselves that every string from length 0 to B can be represented under this scheme. Rather than trying to prove it, I'll just assert that it will work.

So now we have a node label. The next step is to see if that label exists by traversing the tree. If it does, we're done. If it doesn't, then choose a new number and start over (that's the reguessing part). It's likely to have to reguess a lot, since only a small fraction of legal node labels will be in use, but this won't skew the fairness, just increase the time.

Here's a pseudo-code version of this process in four functions:

function num_to_binary_list(n,digits) =
  if digits == 0 return ()
  if n mod 2 == 0 return 0 :: num_to_digits(n/2,digits-1)
  else return 1 :: num_to_digits((n-1)/2,digits-1)

function binary_list_to_node_label_list(l) =
  if l.head() == 0 return l.tail()
  else return binary_list_to_node_label_list(l.tail())

function check_node_label_list_against_tree(str,node) =
  if node == null return false,null
  if str.isEmpty() 
    if node.isLeaf() return true,node
    else return false,null
  if str.head() == 0 return check_node_label_list_against_tree(str.tail(),node.left)
  else check_node_label_list_against_tree(str.tail,node.right)

function generate_random_node tree b =
  found := false
  while (not found)
    x := random(0,2**(b+1)-2) // We're assuming that this random selects inclusively
    node_label := binary_list_to_node_label(num_to_binary_list(x,b+1))
    found,node := check_node_label_list_against_tree(node_label,tree)
  return node

The timing analysis for this, of course, is pretty horrendous. Basically, the while loop will run an average of (2^(B+1)-1)/N times. So, in the worst case, it's O((2^N)/N) which is terrible. In the best case, B would be on the order of log(N), so it would be roughly O(1), but that requires that the tree be fairly balanced which it may not be. Still, if you really want no extra space, this method does that.

I don't really think that you can do better than this last method without storing some information. It sounds appealing to be able to traverse the tree, making random decisions as you go, but without storing additional information about the structure, you're just not going to be able to do that. Every time you make a branching decision, you could have just one node on the left side and a million nodes on the right side or it could have a million nodes on the left side and just one on the right side. Because those are both possible and you don't know which is the case, there's simply no way to make an even random decision between the two sides. Obviously 50-50 doesn't work and any other choice is going to be similarly problematic.

So, if you don't want extra space, the second method will work, but be slow. If you don't mind adding some extra space, the first method will work and be fast. And, as I said earlier, if you're not going to be changing the tree and you'll be selecting a lot of random nodes, then bite the bullet and just traverse the tree and stick all leaf nodes in a self-growing array or vector and then pick from that.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow