Question

I have a list of elements with weights:

{ id1, weight1 },
{ id2, weight2 },
...
{ idN, weightN }

Weights are small integers (say, less than 1000, often less than 50). Total number of ids in the list is less than 1000 as well. (Each id is listed only once.)

For each query I have to return a "random enough" element from the list. If I do E queries, where E is proportional to the sum of all weights, the number of times each element is returned must be exactly proportional to its weight. Note that this should work for smaller values of E (say, up to 50 * sum of weights). See also the note at the end of the question.

So far so good. I'd solve this task by putting the element ids into a circular list, duplicating each id weight times, then shuffling the list. Each query returns the head of the list and then advances the head position.
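For illustration, here is a minimal Python sketch of that circular-list scheme (the class and function names are mine, not part of the question):

```python
import random

class WeightedCycler:
    """Serve ids round-robin from a shuffled list in which each id is
    repeated `weight` times; after one full pass every id has been
    returned exactly `weight` times."""

    def __init__(self, weights):
        # weights: dict of id -> weight
        self.cycle = [i for i, w in weights.items() for _ in range(w)]
        random.shuffle(self.cycle)
        self.pos = 0

    def query(self):
        item = self.cycle[self.pos]
        self.pos = (self.pos + 1) % len(self.cycle)  # advance head, wrap around
        return item
```

Over any multiple of the cycle length, the counts are exactly proportional to the weights, which is the property the question asks for.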

But in this case I have one additional condition:

I have additional parameter to the query: a filter. A filter is a map of id => is_enabled. If is_enabled is false for a given id, that id should be excluded from the results. The E value in the above restriction is calculated only for enabled elements. That is, disabled element weights are to be excluded from the query.

Filters are "unique" for each query and include entries for each id in the list. (Note that this implies 2^1000 potential filter values.)

Is there a way to solve this efficiently? I need the algorithm to be efficient on a multi-server cluster.

Note 1: I want to stress that, as I believe, selecting elements totally at random (as suggested in one of the answers), without storing any state, will not work. It gives counts exactly proportional to the weights only over an infinite number of queries. A random number generator is perfectly entitled to return unfair values for a long period of time.

Note 2: This task imposes no restrictions on the quality of the randomness. Come to think about it, it is not even necessary to shuffle the list in the simple-case solution above. Good randomness is better, but not necessary at all.

Note 3: Please note that 2^1000 potential filter values means that I cannot store anything associated with a filter value -- it would require too much memory. I can store something for the most recent (or most often used) filters, but I can't store things like the item list offset, as I can't afford to lose that data.

Note 4: We can't return metainformation with the query and let clients store the state for us (a good idea anyway; thanks, Diacleticus). We can't because two clients may accidentally use the same filter (some filters are more popular than others). In that case we must use the same state for both queries. In fact, a client doing more than one query is a relatively rare event.


Solution 3

Perhaps I've found a solution:

  1. Store id->number_of_queries_left, where the initial value of number_of_queries_left is, say, weight * 10 (so the list is not refreshed too often -- the exactly-proportional requirement would be kept, I think).
  2. On each query:
    1. Pick a random id from filter, where is_enabled is true.
    2. Decrement number_of_queries_left for that id.
    3. If result is less than or equal to zero, mark that id as used and pick another one.
    4. If all values are used and none found, reinitialize id->number_of_queries_left for all ids that are enabled in the filter ("recharge").

Looks like it should work. What do you think?
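As a sanity check, here is a rough Python sketch of the scheme above (names and the exact recharge point are my interpretation; in particular, I pick only among ids whose counter is still positive rather than decrementing and retrying, which should be equivalent):

```python
import random

MULTIPLIER = 10  # initial charge = weight * 10, as in step 1

class CounterPicker:
    def __init__(self, weights):
        # weights: dict of id -> weight
        self.weights = weights
        self.left = {i: w * MULTIPLIER for i, w in weights.items()}

    def query(self, enabled):
        # enabled: dict of id -> bool (the filter)
        candidates = [i for i in self.weights if enabled.get(i)]
        usable = [i for i in candidates if self.left[i] > 0]
        if not usable:
            # "recharge": every enabled id is spent, reinitialize them
            for i in candidates:
                self.left[i] = self.weights[i] * MULTIPLIER
            usable = candidates
        choice = random.choice(usable)
        self.left[choice] -= 1
        return choice
```

With a constant filter, every recharge cycle returns each enabled id exactly weight * 10 times, so the proportions are exact over each cycle.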

Update 1:

I'm worried that it looks like I have to keep a separate id->number_of_queries_left value for each filter value. I can't afford that due to memory restrictions (there are 2^1000 potential filter values). Am I right?

Can somebody help me better understand the consequences of a shared number_of_queries_left counter, please?

Update 2:

Credits for the idea go to Diacleticus (see comments to this answer).

What if we don't reset id->number_of_queries_left for all enabled items in the filter, but instead increment them by their respective weights? I think this should fix the proportions. (Or should it?)

The only problem is that with this algorithm each number_of_queries_left counter can go deeply negative. (See above: we decrement it each time we want to look at its value.)

So, in a pessimistic case, even after incrementing all counters, we will not bring any of them above zero. This is probably OK, since we can effectively just run the increment loop until some value becomes positive.

Update 3:

No, we can't just run the increment loop until some value becomes positive.

That would skew the weights: the negative part has no "physical sense" -- it does not represent values returned from the query.

Thus, a hybrid approach:

When doing a "recharge", increment each counter by weight - min(0, current_counter_value), i.e. add the weight and also cancel any negative backlog. This should be done atomically, but that looks doable.

Still, I'm not sure that weight handling would be fair in this case.
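For concreteness, the hybrid recharge rule above could look like this (a sketch; the function name is mine, and atomicity is ignored):

```python
def recharge(counters, weights, enabled):
    """Hybrid recharge: for every enabled id, add its weight and clear
    any negative backlog, so a counter at -5 with weight 3 lands at
    exactly 3, while a counter at +2 lands at 2 + weight."""
    for i, w in weights.items():
        if enabled.get(i):
            counters[i] += w - min(0, counters[i])
```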

Comments?

OTHER TIPS

It seems to me that you must keep track of each different filter. This means that you must build a new shuffled list every time a new filter is introduced, or when all elements are spent for an old filter.

EDIT: Now that we work with proportional values, we can remove the shuffled list altogether and let statistics shuffle it for us. For each query, set a counter to random(0..sum_of_all_enabled_weights_for_the_query). Go from the start of the list and subtract from the counter every weight you come along if that element is enabled for the query; just ignore it if it is disabled. If the counter becomes negative, you have found your element.
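That walk-and-subtract selection can be sketched in a few lines of Python (names are mine):

```python
import random

def weighted_pick(items, enabled):
    """items: list of (id, weight) pairs; enabled: dict of id -> bool.
    Draw r in [0, sum of enabled weights) and walk the list, subtracting
    each enabled weight, until r goes negative."""
    total = sum(w for i, w in items if enabled.get(i))
    r = random.randrange(total)
    for i, w in items:
        if not enabled.get(i):
            continue  # disabled ids contribute nothing
        r -= w
        if r < 0:
            return i
```

Note that this is exactly the stateless scheme the asker rules out in Note 1: the proportions are right only in expectation, not exactly over a finite run.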

Let's see if I understood your question.

I'll post the code in Mathematica step by step, and the commented output to follow it easily.

This answer provides a deterministic and ordered output (i.e. non-shuffled). If you really need a random permutation, you can generate a full filtered sequence in advance with this same algorithm, shuffle it, and consume the values one by one.

The program

First we define two constants:

n = 10; (* nbr of ids *)
m = 3;  (* max weight - 1 *) 

I keep the numbers small so we can check the output step by step.

Now we define a random {id, weight} table to work with. We use prime numbers as ids:

weights = Table[{Prime@k, RandomInteger[m] + 1}, {k, n}]

Output:

{{2, 3}, {3, 2}, {5, 3}, {7, 1}, {11, 1}, 
{13, 3}, {17, 1}, {19, 4}, {23, 1}, {29, 2}}  

Next we accumulate the weight values:

accumulator = Accumulate[Table[k[[2]], {k, weights}]]

Output:

{3, 5, 8, 9, 10, 13, 14, 18, 19, 21}  

And we merge both tables to get the accumulators into the id table:

weightsAcc = MapThread[Append, {weights, accumulator}]

Output:

{{2, 3, 3}, {3, 2, 5}, {5, 3, 8}, {7, 1, 9}, {11, 1, 10}, 
 {13, 3, 13}, {17, 1, 14}, {19, 4, 18}, {23, 1, 19}, {29, 2, 21}}

Now we initialize the filter, with your default values (true or false). I used True:

filter = Table[{k[[1]], True}, {k, weights}]

Output:

{{2, True}, {3, True}, {5, True}, {7, True}, {11, True}, {13, True}, 
 {17, True}, {19, True}, {23, True}, {29, True}}  

The trick is to keep the filter synchronized with the ids vector, so we define a function to update the filter in that way:

updateFilter[filter_, newValuePair_] :=Return@
         ReplaceAll[filter, {newValuePair[[1]], x_} -> newValuePair];  

And use it to change two values:

filter = updateFilter[filter, {2, False}];
filter = updateFilter[filter, {5, False}];
Print@filter

Output:

{{2,False},{3,True},{5,False},{7,True},{11,True},{13,True},
 {17,True},{19,True},{23,True},{29,True}}  

Now we define our query. We'll use two global vars (agrhhhh!) and two functions to get the thing synchronized:

i = 1; j = 0; (* GLOBAL state variables *)

Adjustij[w_] := (                      (* parm w is weightsAcc *)
   j++;                                (* increment accumulator comparator*)
   If[j == w[[i, 3]], i++];            (* if current id exhausted, get next*)
   If[i > Length@w, i = 1; j = 0];     (* wrap around after the last id*)
);   

query[w_, filter_] :=                  (* parm w is weightsAcc *)
 (
  Adjustij[w];
  While[Not@filter[[i, 2]], Adjustij[w]];       (* get non filtered ids only *)
  Return[w[[i, 1]]];
  )

Of course the while loop could be accelerated by just skipping the ids whose filter is False, but I think the intention is clearer this way.

Now we execute the query 30 times:

 Table[query[weightsAcc, filter], {30}]

and get:

{3, 3, 7, 11, 13, 13, 13, 17, 19, 19, 19, 19, 23, 29, 29, 3, 3, 7, \
 11, 13, 13, 13, 17, 19, 19, 19, 19, 23, 29, 29}  

which is our list (cyclically) with the proper weights, excluding the values whose filter entry is False.

HTH!

Edit: Server and client code split to answer comments

It can process concurrent queries with different filters.

The filter state is stored at the client.

Server-Implemented functions and code:

Clear["Global`*"];

(*Server Implemented  Functions follows*)

AdjustFilterState[fs_] := Module[{i, j}, (    (*fs = filterstate, i,j localvars*)
     i = fs[[1]]; (*local vars*)              (*w  = weights with accs*)
     j = fs[[2]];
     j++;                                     (* increment accumulator comparator*)
     If[j == weightsAcc[[i, 3]], i++];        (* if current id exhausted, get next*)
     If[i > Length@weightsAcc, i = 1; j = 0];(* wrap around after the last id*)
     Return[{i, j}];);
   ];


query[filter_, fs_] := Module[{fsTemp},       (*fs = filterstate*)
   (
    fsTemp = AdjustFilterState[fs];           (* local var *)

    While[Not@filter[[fsTemp[[1]], 2]],       (* get non filtered ids only *)
       fsTemp = AdjustFilterState[fsTemp]
    ];

    Return[{weightsAcc[[fsTemp[[1]], 1]], fsTemp}]; (*return[value,{filterState}]*)
   )
   ];

initFilter[] := masterFilter; (*Init filters to your default value*)

(*The trick is to get the filter coordinated with the list value*)
updateFilter[f_, newValuePair_] :=
 Return@ReplaceAll[f, {newValuePair[[1]], x_} -> newValuePair];

(*Server Code - Just initialize the whole thing
   The SERVER stores ONLY the weights vectors and a master filter initialized*)

n = 10; (* nbr of ids *)                                (*init vars*)
m = 3;  (*max weight - 1 *)

weights = Table[{Prime@k, RandomInteger[m] + 1}, {k, n}]; (*random weights to test*)
accumulator = Accumulate[Table[k[[2]], {k, weights}]];    
weightsAcc = MapThread[Append, {weights, accumulator}];   (*add accumulator to list*)
masterFilter= Table[{k[[1]],True}, {k,weights}]; (* only ONE virgin filter in server*)

Client code:

(* Client Code 
   The CLIENT stores only the filter and the filterState*)
(* Set up filter and filterstate *)

filter = initFilter[];
filter = updateFilter[filter, {2, False}];  (*specify particular values*)
filter = updateFilter[filter, {5, False}];

filterState = {1,0}; (* these replace the previous GLOBAL state variables *)

ValuesList = {};  (*for storing results *)

Do[
 q1 = query[filter, filterState]; (* do the query *)
 AppendTo[ValuesList, q1[[1]]];   (* first element of return is the value *)
 filterState = q1[[2]];           (* second element is updated filter state *)
 , {30}  (*do 30 times*)
 ];
Print@ValuesList                 (* print results vector *)
Licensed under: CC-BY-SA with attribution