Probabilities of duplicate mail detection by comparing notes among servers

https://cs.stackexchange.com/questions/2263

16-10-2019
|

Pergunta

I have the following problem:

We want to implement a filtering strategy in e-mail servers to reduce the number of spam messages. Each server will have a buffer, and before sending an e-mail, it checks whether there is a duplicate of the same message in its own buffer and contacts k distinct neighboring servers at random to check whether the duplicate is in another buffer. In case any duplicate message is detected, it will be deleted as spam, otherwise it will be sent after all negative replies are received.

Let us assume that there are N mail servers, and that a spammer sends M copies of each spam mail. We assume that all copies are sent simultaneously and that each mail is routed to a mail server randomly.

Given M, N and k I need to find out the probabilities that no spam message is deleted (i.e. no server detects spam), all spam messages are deleted (all servers detect spam) and spam messages are deleted from at at least one server.

So far, I have used combinations without repetition to find out the cases that need to be taken into account for an M and N. Now I need to find out the probability that one server receives at least two copies of a message, but I am at complete loss. Could you please provide some insight into the problem?

Solução

If a given server receives $m \leq M$ copies, it does not receive $M-m$ copies. Also, there are many ways to pick $m$ messages out of $M$; you have to consider all of them. Furthermore, important assumptions are that

"randomly" means "uniformly at random", that is the probability that any given copy goes to any given server is $\frac{1}{N}$, and that
each copy is routed independently of its siblings, that is we have independent random events for the individual copies.

This is all you need to piece together the probability $\operatorname{Pr}(S_i = m)$ that server $i$ receives $m$ messages.

If you put the above together, you get

$\qquad \displaystyle \operatorname{Pr}(S_i = m) = \binom{M}{m} \left(\frac{1}{N}\right)^m \left(1 - \frac{1}{N}\right)^{M-m}$

for $S_i$ being the number of copies the $i$-th server receives, $1\leq i \leq N$. You should recognize this kind of probability weight.

If you have doubts that this is correct, run some simulations (not a proof!).

Now it is easy to compute $\operatorname{Pr}(S_i \geq 2) = 1 - \operatorname{Pr}(S_i = 1) - \operatorname{Pr}(S_i = 0)$.

When using the derived weight in other calcuations, keep in mind that the $S_i$ are not independent because $\sum_{i=1}^N S_i = M$. The underlying probability distribution is the well-studied multinomial distribution.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a cs.stackexchange