Pergunta

Given a list of "forbidden" words (substrings), an alphabet, and a desired output string length, how would I efficiently sample output strings containing no forbidden word?

For short output strings with few forbidden words, I would use simple rejection sampling. Pick a string (uniformly) with the specified alphabet and length, return that string if it contains no element of the forbidden list, try again otherwise.

If I use that algorithm for output lengths several times larger than the typical forbidden word, then the probability of rejection will be higher. (Most words are 2 or 3 characters long.)

Assume the requested output length is too long to enumerate and store every possible value. My alphabet size would be 16 to 36 characters, but solutions to large alphabets would be interesting to think about. (In which case I would call these things random sentences, forbidden n-grams, and dictionary words.)

My forbidden word list will have one hundred to one thousand strings. I would like to avoid solutions requiring expensive precomputation or lots of memory.


My first idea was to try to build a random string incrementally, in contrast to the all-or-nothing approach of straightforward rejection sampling. I doubt that my algorithm produces each possible output with equal probability.

The algorithm idea follows:

  1. Initialize a char buffer long enough to fit outlen characters.
  2. Pick a random letter of the alphabet and append it to the buffer.
  3. If the buffer ends with a forbidden word of length k, then remove the last k letters from the char buffer and go to 2.
  4. Otherwise, go to 2 if the buffer has less than outlen characters.
  5. Return the contents of the buffer if it is full.

Step 3 serves to rewind the algorithm, returning the char buffer to a previous legal state.

I understand that clearing the whole buffer in step 3 definitely would produce uniform output just like the straightforward rejection sampling method. However, the average number of rejections before the first valid output is generated will be the same.

I've gotten stuck trying to determine if my proposed algorithm is uniform. I have had no luck finding alternative algorithms either. I haven't yet looked at how this algorithm's performance would compare to basic rejection sampling.

Nenhuma solução correta

Licenciado em: CC-BY-SA com atribuição
Não afiliado a cs.stackexchange
scroll top