Question

If I have two different producers that could produce the same message for a Kafka broker, how can I ensure that only one of the two message occurrences gets processed?

Is the only way to have an input topic, then a consumer that dedupes and saves to an output topic?

Or could the relatively new "exactly-once" functionality be utilized somehow for this problem?

Solution

Kafka previously offered only at-least-once message handling. From Kafka's documentation:

When publishing a message we have a notion of the message being "committed" to the log. Once a published message is committed it will not be lost as long as one broker that replicates the partition to which this message was written remains "alive".

In that scenario, nothing prevented the same producer from writing the same message multiple times, for example in case of network errors or latency-induced retries.

The new exactly-once guarantee now ensures that if a producer writes the same message multiple times, it will be committed only once to the log. This works by attaching a sequence number to each message batch, which the broker uses to deduplicate redundant retries.

However, this sequence number is unique only when taken together with a PID, a unique producer ID that the broker maps from the transactional id set in the producer's configuration. The detailed flows are explained at length in this article.
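As a concrete illustration of the per-producer guarantee described above, here is a minimal sketch of the configuration that enables it, assuming the confluent-kafka Python client. The broker address and the transactional id are illustrative placeholders, and the config is shown as a plain dict so it can be inspected without a running broker.

```python
# Sketch: config that turns on the idempotent, exactly-once producer.
# Values below (broker address, transactional id) are illustrative.
producer_config = {
    "bootstrap.servers": "localhost:9092",    # assumed broker address
    # Enables the PID + sequence-number mechanism described above,
    # so broker-side retries of the same batch are committed once:
    "enable.idempotence": True,
    # A stable transactional id lets the broker map this producer
    # back to the same PID across restarts:
    "transactional.id": "my-producer-1",      # illustrative name
    "acks": "all",                            # required for idempotence
}

# With a real broker you would then do (not executed here):
# from confluent_kafka import Producer
# producer = Producer(producer_config)
# producer.init_transactions()
```

Note that the guarantee this config buys is scoped to one producer identity: a second producer with a different transactional id gets a different PID, so its duplicates are not caught.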

This implies that Kafka will only deduplicate identical messages from the same producer. It will not deduplicate messages sent by two different producers. So yes, to deduplicate across several producers, you'll need a consumer that detects the same data arriving in different messages. That consumer then acts as a producer in turn, writing the deduplicated stream either to a topic on an additional broker or to an additional topic on the same broker.
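The consumer-side deduplication described above can be sketched as follows. This assumes each message carries a stable business key (an order id here, purely as an example) that identifies logical duplicates; all names are illustrative, not part of any Kafka API.

```python
# Sketch: cross-producer deduplication in the consumer, keyed on a
# business identifier that both producers attach to the message.
def dedupe(messages, key_of):
    """Yield each logical message once, whichever producer sent it."""
    seen = set()
    for msg in messages:
        key = key_of(msg)
        if key in seen:
            continue          # duplicate from the other producer: drop
        seen.add(key)
        yield msg             # in a real pipeline: produce to output topic

# Example: two producers both emitted the message for order 42.
incoming = [{"order_id": 41}, {"order_id": 42}, {"order_id": 42}]
unique = list(dedupe(incoming, key_of=lambda m: m["order_id"]))
# unique now holds orders 41 and 42 exactly once each
```

In production the `seen` set would need to be a persistent, fault-tolerant state store (for example a Kafka Streams state store), since an in-memory set is lost on restart and would let duplicates through again.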

Licensed under: CC-BY-SA with attribution