Pregunta

I've been thinking about how to implement something that, frankly, is beyond my mathematical skills. So here goes, feel free to try and point me in the right direction rather than complete code solutions any help I'd be grateful for.

So, imagine I've done an analysis of text and generated a table of the frequencies of different two-character combinations. I've stored these in a 26x26 array. eg.

  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 15 (frequency of AA, then frequency of AB etc.)
B 12 0 (freq of BA, BB etc..)
... etc.

So I want to randomly choose these two-character combinations but I'd like to 'weight' my choice based on the frequency. ie. the AB from above should be 15 times 'more likely' than AA. And, obviously, the selection should never return something like BB (ie. a frequency of 0 - in this example, obviously BB does occur in words like Bubble!! :-) ). For the 0 case I realise I could loop until I get a non-0 frequency but that's just not elegant because I have a feeling/intuition that there is a way to skew my average.

I was thinking to chose the first char of my pair - ie. the row - (I'm generating a 4-pair-sequence ultimately) I could just use the system random function (Random class.Next) then use the 'weighted' random algorithm to pick the second char.

Any ideas?

¿Fue útil?

Solución

Given your example sample, I would first create a cumulative series of all of the numbers (1, 15, 12, 0 => 1, 16, 28, 28).

Then I would produce a random number between 0 and 27 (let's say 19).

Then I would calculate that 19 was >=16 but <28, giving me bucket 3 (BA).

Otros consejos

There are some good suggestions in the other answers for your specific problem. To solve the general problem of "I have a source of random numbers conforming to a uniform probability distribution, but I would like it to conform to a given nonuniform probability distribution", then you can work out the quantile function, which is the function that performs that transformation. I give a gentle introduction that explains why the quantile function is the function you want here:

Generating Random Non-Uniform Data In C#

How about summing all the frequencies and using that from AA to ZZ to generate your pair.

Lets say you have a total frequency of pairs if the rnd return 0 you get AA if it returns 1-14 then its AB etc

Use your frequency matrix to generate a complete set of values. Order the set by Random.Next(). Store the randomized set in an array. Then you can just select an element out if that array based on Random.Next(randomarray.Length).

If there is a mathematical way to calculate the frequency you could do that as well. But creating a precompiled and cached set will reduce the calculation time if this is called repeatedly.

As a note, depending on the max frequency this could require a good amount of storage. You would also want to create the instance of random before you loop to build the set. This is so you don't reseed the random generator.

...

Another way (similar to what you suggested at the end of your question) would be to do this in two passes with the first selecting the row and the second used your weighted frequency to select the column. That would just be the sum of the row frequencies bounded over a ranges. The first suggestion should give a more even distribution based on weight.

Take the sum of the probabilities. Take a random number between zero and that sum. Add up the probabilities until you get it's greater than or equal to your random number. Then use the item your on.

Eg pseudocode:

b = getProbabilites()
s = sum(b)
r = randomInt() % s
i = 0
acc = 0
while (acc < r) {
    acc += b[i]
    i++
}

return i

If efficiency is not a problem, you could create a key->value hash instead of an array. An upside of this would be that (if you format it well in the text) it would be very easy to update the values should the need arise. Something like

{
    AA => 5, AB => 2, AC => 4,
    BA => 6, BB => 5, BC => 9,
    CA => 2, CB => 7, CC => 8
}

With this, you could easily retrieve the value for the sequence you want, and quickly find the entry to update. If the table is automatically generated and extremely large, it could help to get/be familiar with vim's use of regular expressions.

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top