A queue consumer receives messages from an SQS queue. Issues with consumers are detected by monitoring queue metrics like messages received and deleted.
Lowered queue consumption can result in:
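There are three possible scenarios: processing has completely stopped, processing is falling behind, or messages are being reprocessed.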
When processing has completely stopped, no consumers are successfully reading from the queue. Check that consumers are running, that they are configured to read from the correct queue, and that they have the correct permissions on that queue.
Processing is behind when fewer messages are consumed than are sent to the queue, but processing has not completely stopped. The quick fix is to add more consumers immediately so that consumption keeps up with the rate at which messages are sent.
Next, check whether increased traffic is causing more messages to be sent to your queue. Also check whether some consumers are offline or lack permission to read and delete messages from the queue. It is also highly recommended to read more than one message at a time from the queue (up to 10 per receive call) if your application allows it and can process the messages fast enough.
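As an illustration of reading in batches, here is a minimal sketch of one receive-process-delete cycle. It assumes `sqs` is a boto3 SQS client (for example `boto3.client("sqs")`); the function name `drain_batch` and the `handler` callback are hypothetical names chosen for this example.

```python
from typing import Any, Callable, Dict, List


def drain_batch(sqs: Any, queue_url: str, handler: Callable[[str], None]) -> int:
    """Receive up to 10 messages, process each, and delete the ones that succeed.

    Returns the number of messages SQS reports as successfully deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # maximum allowed per ReceiveMessage call
        WaitTimeSeconds=20,      # long polling, reduces empty responses
    )
    messages: List[Dict[str, Any]] = response.get("Messages", [])

    entries = []
    for index, message in enumerate(messages):
        try:
            handler(message["Body"])
        except Exception:
            # Leave failed messages on the queue. They become visible again
            # once the visibility timeout expires and can be retried.
            continue
        entries.append({"Id": str(index), "ReceiptHandle": message["ReceiptHandle"]})

    if not entries:
        return 0

    # One batched delete instead of one API call per message.
    result = sqs.delete_message_batch(QueueUrl=queue_url, Entries=entries)
    return len(result.get("Successful", []))
```

A consumer would typically call `drain_batch` in a loop. Messages whose handler raised are deliberately not deleted, so they are retried after their visibility timeout expires.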
If your queue size is not constant, consider creating a CloudWatch alarm on the number of visible messages in the queue (the ApproximateNumberOfMessagesVisible metric). This alarm can then trigger an action that increases the size of an Auto Scaling group or launches more containers in ECS to handle the increase in messages.
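The alarm described above could be defined roughly as follows. This is a sketch under assumptions: the helper name `backlog_alarm_params`, the alarm name, the 5-minute period, and the two evaluation periods are illustrative choices, and `action_arn` stands for whatever scaling policy ARN you already have.

```python
from typing import Any, Dict


def backlog_alarm_params(queue_name: str, threshold: int, action_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        "AlarmName": f"{queue_name}-backlog",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,           # seconds per data point (assumed value)
        "EvaluationPeriods": 2,  # consecutive breaching periods (assumed value)
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [action_arn],  # e.g. a scaling policy ARN
    }
```

With boto3 installed and credentials configured, the parameters can be passed straight through, for example `boto3.client("cloudwatch").put_metric_alarm(**backlog_alarm_params("orders", 1000, policy_arn))`, where `"orders"` and `policy_arn` are placeholders.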
We detect reprocessing of messages by recognizing that more messages are being read from the queue than are being deleted. When messages are reprocessed, your infrastructure is wasting cycles on already-completed work. The most likely cause is the visibility timeout of the message being exceeded.
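The scenarios in this article can be told apart by comparing three counters over the same time window. The sketch below is only an illustration of that reasoning, not the detection logic our system actually runs. The function name, the labels, and the order in which the checks are applied are all assumptions.

```python
def classify_consumption(sent: int, received: int, deleted: int) -> str:
    """Classify queue health from message counts over one time window.

    sent, received, and deleted are totals such as the SQS metrics
    NumberOfMessagesSent, NumberOfMessagesReceived, and NumberOfMessagesDeleted.
    """
    if sent > 0 and received == 0:
        # Messages are arriving but nothing is reading them.
        return "stopped"
    if received > deleted:
        # More reads than deletes: some messages are being read again.
        return "reprocessing"
    if received < sent:
        # Consumers are working, but slower than producers.
        return "behind"
    return "healthy"
```

For example, 100 messages sent, 150 received, and 90 deleted in a window would be classified as reprocessing, because 60 of the reads did not end in a delete.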
In SQS, the visibility timeout is how long a consumer has to completely process a message before it becomes available to be read by other consumers. This timeout can be set as a default value on the entire queue, overridden by the consumer when it receives messages, or changed for an individual message while it is being processed. If the visibility timeout is exceeded, the message can be read repeatedly until it is deleted from the queue or exceeds the queue's message retention period. Messages that repeatedly fail to be processed will eventually be sent to the dead letter queue (if one is configured) once they have been received more times than the queue's maximum receive count.
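A consumer can request a longer visibility timeout for the messages it receives, and can reset the timeout on a message that is taking longer than expected. The following is a minimal sketch, assuming `sqs` is a boto3 SQS client; the two helper names are hypothetical.

```python
from typing import Any, Dict, List


def receive_with_timeout(sqs: Any, queue_url: str, timeout_seconds: int) -> List[Dict[str, Any]]:
    """Receive one message, overriding the queue's default visibility timeout."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        VisibilityTimeout=timeout_seconds,  # applies only to messages in this response
    )
    return response.get("Messages", [])


def reset_visibility(sqs: Any, queue_url: str, receipt_handle: str, new_timeout_seconds: int) -> None:
    """Give an in-flight message more time before it becomes visible again.

    The new timeout is counted from the moment of this call; it is not
    added on top of the time that was remaining.
    """
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=new_timeout_seconds,
    )
```

A long-running handler could call `reset_visibility` periodically while it works, then delete the message once processing is complete.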
To quickly alleviate queue backup caused by message reprocessing, add more consumers to handle messages that are not being reprocessed. A long term fix will involve:
Lastly, make sure to check the AWS Status Page to see if SQS is experiencing increased error rates.
Note: Our system uses SQS metrics in Amazon CloudWatch to detect possible issues with consumers of your queue. Due to the API limitations of CloudWatch, there can be a delay of as many as 20 minutes before our system can detect these issues.