Message duplication #3065

davidich · 2021-05-25T16:53:16Z

davidich
May 25, 2021

I'm experiencing a message duplication as if it was possible to bind a queue twice to an exchange. I see a consumer ack rate twice as big as a publish rate while the redelivered rate is 0.
The image shows the problem symptoms before 8:22 (The green line is twice bigger than the yellow one - consumers get the same message twice at the same time).

Setup details:
RabbitMQ version: 3.8.14
One publisher -> Fanout exchange -> One quorum queue -> Four consumers.
Every consumer creates 10 channels with Prefetch Count = 5.

The problem occurs after consumer service restart (it is a docker swarm service with 4 instances). I noticed that the more consumers I have, the easier it to reproduce the issue. With 4 consumer instances, I can reproduce the issue 5 out of 10 times, while with 2 consumers I was able to reproduce it 2 out of 10 restarts. And with 1 consumer instance, I couldn't reproduce it after 10 restarts.

Some other oddities in RabbitMQ behavior I noticed during the issue:

once message duplication happens I always have stuck Unacked messages. The number of unacked messages doesn't change over time and the value is always a multiple of 50 (10 channels * 5 prefetch count). I observed this correlation every time I saw message duplication on the consumer side (10+ times).
when duplicate messages are received, they both have redelivered property = false (as if it was delivered for the first time).
in web management, all stuck unacked messages are shown only at the queue and connection level, but at the channels page, all Unacked counters are 0.

The problem was solved after I renamed exchange+queue. Basically, I appended v2 to the previous names of exchange and queue.
With a new set of exchange+queue I tried restarting my consumers 10 times and didn't see the issue. My "bad" pair of queue&exchange are still on the server and after I re-pointed my producer and consumer to the original queue, I was able to reproduce the issue 3 out of 5 restarts. So I believe this is something specific to a particular instance of a queue or exchange. Also, I believe this happened after we had an incident in our prod environment. One host server in the swarm cluster ran out of available memory. DevOps team just killed the app with a memory leak and RabbitMQ restored itself. At the same moment we started to see stuck unacked messages for the first time.
Also, DevOps team confirmed that the metadata of our Old and New exchange and Old and New queue in DETS file are identical.

Some additional details about Rabbit setup:
CONFIG:
loopback_users.admin = false
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.node_cleanup.only_log_warning = true
cluster_formation.classic_config.nodes.1 = rabbit@rabbitmq-1
cluster_formation.classic_config.nodes.2 = rabbit@rabbitmq-2
cluster_formation.classic_config.nodes.3 = rabbit@rabbitmq-3
cluster_formation.classic_config.nodes.4 = rabbit@rabbitmq-4
cluster_partition_handling = autoheal
#Flow Control is triggered if memory usage above %80.
vm_memory_high_watermark.relative = 0.8
#Flow Control is triggered if free disk size below 5GB.
disk_free_limit.absolute = 5GB
prometheus.return_per_object_metrics = true
Plugins
[rabbitmq_prometheus,
rabbitmq_management,
rabbitmq_federation,
rabbitmq_federation_management,
rabbitmq_shovel,
rabbitmq_shovel_management].

Leader
rabbit@rabbitmq-1
Online
rabbit@rabbitmq-2
rabbit@rabbitmq-1
rabbit@rabbitmq-3
Members
rabbit@rabbitmq-2
rabbit@rabbitmq-1
rabbit@rabbitmq-3

Will be happy to provide any further details.

Thanks,
Alex

michaelklishin · 2021-05-25T17:00:25Z

michaelklishin
May 25, 2021
Maintainer

I will convert this issue to a GitHub discussion. Currently GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is to let you know that we do not intend to ignore this but this is how the current GitHub conversion mechanism makes it seem for the users :(

0 replies

michaelklishin · 2021-05-25T17:03:46Z

michaelklishin
May 25, 2021
Maintainer

We need an executable way to reproduce.

It is possible to bind a queue to an exchange multiple times. It is also possible for a dead-lettered message (you have both a DLX configured and a delivery limit on this queue) to be almost immediately routed back to the same queue, in which case it will
be a brand new message with all metadata (e.g. the redelivery status) suggesting as much.

3 replies

michaelklishin May 25, 2021
Maintainer

There is a mechanism that makes it possible for you to inspect what messages are published (including dead lettering) on the node. A traffic capture allows you to inspect what is delivered to your consumers and what they do in response.

Consider the following scenario:

Your consumers start but cannot process deliveries just yet
They get a set of deliveries and intentionally or by failing, requeue them
Soon enough the redelivery limit of 2 is hit
The message is dead lettered in a way that makes it route back to the same queue as a brand new message without any previous delivery information
You have duplicates via a combination of consumer behaviour, the max redelivery limit, dead lettering and DLX routing

Disabling dead lettering and max delivery limit would be a good way to verify this hypothesis.

davidich May 25, 2021
Author

@michaelklishin I appreciate your prompt response. As a developer, I perfectly understand your need to have a way of reproducing the issue, but I don't know how I can provide it. My current understanding of the issue is that it is due to a specific instance of Exchange+Queue. I believe they were somehow corrupted at the moment when our host ran out of RAM. As we always programmatically create our exchanges and the queues, I'm pretty sure that renaming them and running my app again produced an identical queue. And as renaming fixed my issue I tend to think it has to be something related to a particular instance of the exchange or the queue. So the question is how can I reliably transfer you my corrupted instances?

It is possible to bind a queue to an exchange multiple times.

This is what the management plugin shows me and makes me think I don't have two bindings. Please, correct me if I'm wrong:

It is also possible for a dead-lettered message (you have both a DLX configured and a delivery limit on this queue) to be almost immediately routed back to the same queue, in which case it will be a brand new message with all metadata (e.g. the redelivery status) suggesting as much.

In our case, we have x-delivery-limit set to 2, so I assume, we should see a no-zero redelivered rate if we're experiencing message re-queueing of dead-lettered messages. Am I right?

"Firehose Tracer" & "Wireshark" tracing.

We run rabbit in Docker Swarm cluster and I'm not sure if we can use WireShark in our env, but a "Firehose" option seems feasible to me. Do you think it might be helpful to enable it and see one publish notification alongside 2 deliver notifications?

michaelklishin Jun 28, 2021
Maintainer

We need some evidence to work with. Seeing what is being published is critically important. "Two deliveries for one publish" is perfectly healthy: some exchanges (including direct ones) can route to more than one queue.

michaelklishin · 2021-07-01T09:02:33Z

michaelklishin
Jul 1, 2021
Maintainer

Moved to #3154 and hopefully answered there (both a workaround and a Shovel change to make this behavior at least significantly less likely).

0 replies

davidich · 2021-07-01T20:49:41Z

davidich
Jul 1, 2021
Author

Possible reasons for the issue @michaelklishin mentioned were:
- Multiple binding. But it doesn't correlate with the screenshot that shows queue biding.
- Routing to more than one queue. In our case we had only one queue and consumers were reading for it. At least logically it was one queue, but maybe under the hood Rabbit was creating multiple instances because we have a rabbit cluster. Please clarify what you meant.
- Messages are redelivred. We constantly see 0 redeliver rate. How is that possible?
- Shovel plugin. We don't have it installed on our instances.

Sorry, but I'm not following the logic here.
Could anybody point me in the right direction?

0 replies

lhoguin · 2021-07-02T06:41:09Z

lhoguin
Jul 2, 2021
Maintainer

It sounds like a legitimate bug. If there's no way to get a reproducible test case, another good information would be the state of the queue or the associated processes when that happens. I am not familiar with quorum queues though so I do not know what to ask for. @kjnilsson Thoughts?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Message duplication #3065

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Message duplication #3065

davidich May 25, 2021

Replies: 5 comments · 3 replies

michaelklishin May 25, 2021 Maintainer

michaelklishin May 25, 2021 Maintainer

michaelklishin May 25, 2021 Maintainer

davidich May 25, 2021 Author

michaelklishin Jun 28, 2021 Maintainer

michaelklishin Jul 1, 2021 Maintainer

davidich Jul 1, 2021 Author

lhoguin Jul 2, 2021 Maintainer

davidich
May 25, 2021

Replies: 5 comments 3 replies

michaelklishin
May 25, 2021
Maintainer

michaelklishin
May 25, 2021
Maintainer

michaelklishin May 25, 2021
Maintainer

davidich May 25, 2021
Author

michaelklishin Jun 28, 2021
Maintainer

michaelklishin
Jul 1, 2021
Maintainer

davidich
Jul 1, 2021
Author

lhoguin
Jul 2, 2021
Maintainer