
    A queue consumer receives messages from an SQS queue. Issues with consumers are detected by monitoring queue metrics like messages received and deleted.

    Effects


    Reduced queue consumption can result in:

    • Delayed processing of data
    • Lost data if messages are never successfully processed

    There are three possible scenarios:

    1. Processing stopped: Messages are being sent to the queue but are no longer being read
    2. Processing behind: Messages are being sent to the queue faster than they are read
    3. Reprocessing messages: Messages are being read multiple times before deletion
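
    The three scenarios can be told apart by comparing message counts over the same period. The sketch below is only an illustration: the function and enum names are hypothetical, the strict comparisons with no tolerance are an assumption, and the inputs are assumed to come from the SQS CloudWatch metrics NumberOfMessagesSent, NumberOfMessagesReceived, and NumberOfMessagesDeleted. It is not a description of how our detection is implemented.

```python
from enum import Enum


class Scenario(Enum):
    HEALTHY = "healthy"
    STOPPED = "processing stopped"
    BEHIND = "processing behind"
    REPROCESSING = "reprocessing messages"


def classify(sent: int, received: int, deleted: int) -> Scenario:
    """Map message counts for one period to a scenario.

    Assumption: strict comparisons, no tolerance for metric noise.
    """
    # Messages arrive but nothing reads them.
    if sent > 0 and received == 0:
        return Scenario.STOPPED
    # More reads than deletes: some messages were read more than once.
    if received > deleted:
        return Scenario.REPROCESSING
    # Reads happen, but slower than messages arrive.
    if sent > received:
        return Scenario.BEHIND
    return Scenario.HEALTHY
```

    A queue can be both behind and reprocessing at the same time; this sketch reports reprocessing first, which is an arbitrary choice.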
      
     

    1. Processing Stopped


     

    When processing has stopped completely, no consumers are reading from the queue. Check that consumers are running, that they are configured to read from the correct queue, and that they have the correct permissions on that queue.

     

    2. Processing Behind


    Processing is behind when consumers read fewer messages than are being sent to the queue, but processing has not stopped completely. The quick fix is to add consumers immediately so that reads keep up with the rate of messages being sent.

    Check whether increased traffic is causing more messages to be sent to your queue. Also check whether some consumers are offline or lack permission to read and delete messages from the queue. If your application allows it and can process the messages fast enough, it is highly recommended to read more than one message at a time from the queue (up to 10 per request).
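
    Reading in batches might look like the sketch below. It assumes `sqs` behaves like a boto3 SQS client (for example `boto3.client("sqs")`); the function name `drain_batch`, the `handler` callback, and the 20-second long-poll wait are hypothetical choices for illustration.

```python
from typing import Any, Callable


def drain_batch(sqs: Any, queue_url: str, handler: Callable[[str], None]) -> int:
    """Receive up to 10 messages, process each, delete the ones that succeeded.

    Returns the number of messages processed and submitted for deletion.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS maximum per receive call
        WaitTimeSeconds=20,      # long polling; assumption, tune as needed
    )
    entries = []
    for index, message in enumerate(response.get("Messages", [])):
        try:
            handler(message["Body"])
        except Exception:
            # Leave the message on the queue; it becomes visible again
            # after its visibility timeout and can be retried.
            continue
        entries.append({"Id": str(index), "ReceiptHandle": message["ReceiptHandle"]})
    if entries:
        # One request deletes every successfully processed message.
        sqs.delete_message_batch(QueueUrl=queue_url, Entries=entries)
    return len(entries)
```

    The sketch does not inspect the `Failed` list that `delete_message_batch` returns; a production consumer would likely want to log or retry those entries.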

    If the number of messages in your queue varies, consider creating a CloudWatch alarm on the number of visible messages in the queue. The alarm can then trigger an action that increases the size of an Auto Scaling group or launches more containers in ECS to handle the increase in messages.
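
    As a rough sketch, the alarm parameters could be built as shown below and passed to the boto3 CloudWatch client's `put_metric_alarm`. The function name, the alarm name suffix, the 5-minute period, the two evaluation periods, and the scaling policy ARN are all assumptions chosen for illustration; `ApproximateNumberOfMessagesVisible` is the SQS metric for visible messages.

```python
from typing import Any, Dict


def backlog_alarm_params(
    queue_name: str, threshold: int, scale_out_policy_arn: str
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{queue_name}-backlog",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,            # seconds; assumption
        "EvaluationPeriods": 2,   # assumption: two consecutive breaches
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # Action to run when the alarm fires, e.g. a scale-out policy.
        "AlarmActions": [scale_out_policy_arn],
    }
```

    With a real client this would be used as `boto3.client("cloudwatch").put_metric_alarm(**backlog_alarm_params(...))`. Pick the threshold from the backlog your consumers can clear in an acceptable time.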

     

    3. Reprocessing Messages


    We detect reprocessing of messages by recognizing that more messages are being read from the queue than are being deleted. When messages are reprocessed, your infrastructure is wasting cycles on already-completed work. The most likely cause is the visibility timeout of the message being exceeded.

    In SQS, the visibility timeout is how long a consumer has to finish processing a message before the message becomes visible to other consumers again. The timeout can be set as a default value on the entire queue, or by the consumer when it receives messages. When the visibility timeout is exceeded, the message can be read again, repeatedly, until it is deleted from the queue or exceeds the queue's retention period. If a dead-letter queue is configured, messages that are received more often than the queue's maximum receive count are moved there.

    To quickly alleviate a queue backup caused by message reprocessing, add more consumers to handle the messages that are not being reprocessed. A long-term fix involves:

    • Figuring out how long you need to process messages.
    • Changing the visibility timeout of all messages to cover the longest time, or dynamically updating the visibility timeout of a message after it has been read.
    • Ensuring that you always delete messages from the queue after they are processed. Do not rely on the message retention period to clean up messages in your queue.
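
    The last two points can be sketched as follows: extend the visibility timeout of a message after it has been read, then delete it once processing succeeds. The sketch assumes `sqs` behaves like a boto3 SQS client; the function name `process_message`, the `handler` callback, and the `timeout_seconds` value are hypothetical.

```python
from typing import Any, Callable, Dict


def process_message(
    sqs: Any,
    queue_url: str,
    message: Dict[str, Any],
    handler: Callable[[str], None],
    timeout_seconds: int,
) -> bool:
    """Extend the visibility timeout, process the message, delete on success."""
    receipt = message["ReceiptHandle"]
    # The new timeout counts from the moment of this call, so the
    # consumer has `timeout_seconds` from now to finish the message.
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt,
        VisibilityTimeout=timeout_seconds,
    )
    try:
        handler(message["Body"])
    except Exception:
        # Not deleted: the message becomes visible again and is retried.
        return False
    # Always delete explicitly after successful processing.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt)
    return True
```

    For work whose duration is hard to predict, the same `change_message_visibility` call could be repeated periodically while the handler runs; that variant is left out here to keep the sketch short.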

    Lastly, check the AWS Status Page to see whether SQS is experiencing increased error rates.

     

    Resources


     Note: Our system uses SQS metrics in Amazon CloudWatch to detect possible issues with consumers of your queue. Due to the API limitations of CloudWatch, there can be a delay of as many as 20 minutes before our system can detect these issues.