I've used AWS's SQS at several companies now. In general, it's a pretty reliable and performant message queue.
Previously, I'd used SQS queues in application code. A typical application asks for 1-10 messages from the SQS API, receives the messages, processes them, and marks them as completed, which removes them from the queue. If the application fails to do so within some timeout (SQS's visibility timeout), it's assumed that the application has crashed/rebooted/etc, and the messages go back onto the queue, to be later fetched by some other instance of the application.
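To make that concrete, here's a minimal sketch of that loop using boto3. The queue URL and the process_message() handler are placeholders for illustration, not anything from our actual system.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

def process_message(body: str) -> None:
    ...  # application-specific work goes here

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,   # ask for 1-10 messages
        WaitTimeSeconds=20,       # long polling
    )
    for msg in resp.get("Messages", []):
        process_message(msg["Body"])
        # Deleting the message marks it as completed. If we crash before this
        # call, the message becomes visible again after the visibility timeout
        # and some other instance picks it up.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```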
To avoid infinite loops (say, if you've got a message that is actually causing your app to crash, or otherwise can't be properly processed), each message has a "receive count" property associated with it. Each time the message is fetched from the queue, its receive count is incremented. If a message is not processed by the time the "maximum receive count" is reached, instead of going back onto the queue, it gets moved into a separate "dead-letter queue" (DLQ) which holds all such messages so they can be inspected and resolved (usually manually, by a human who got alerted about the problem).
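For reference, the DLQ wiring is just a RedrivePolicy attribute on the source queue. Here's a rough sketch with boto3; the queue names and the maxReceiveCount of 5 are made up for illustration.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Create the dead-letter queue and look up its ARN.
dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Create the source queue. After 5 receives without a successful delete,
# SQS moves a message to the DLQ instead of returning it to the queue.
sqs.create_queue(
    QueueName="orders",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```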
That generally works so well that today we were quite surprised to find messages ending up in our DLQs even though the code we had written to handle them showed no errors or log messages about them. After pulling in multiple other developers to investigate, one of them finally gave us the answer, and it came down to the fact that we're using Lambdas as our message processor.
So here's the issue, which you'll run into if you're processing SQS messages with a Lambda, that Lambda is getting throttled (it can't keep up, or it's hitting a concurrency limit), and the queue has a DLQ configured with a maximum receive count:
Whatever Amazon process feeds SQS messages into that lambda will fetch more messages than your throttled lambda can actually process. (I'm not sure if there's a way to tell whether it grabs them in one large batch or in lots of individual fetches in parallel, but either way the result is the same.)
Every time it does this, it increments the messages' receive counts. And of course when they reach their max receive count, they go to the DLQ, without your code ever having seen them.
This happens outside of your control and unbeknownst to you. So when you get around to investigating your DLQ you'll be scratching your head trying to figure out why messages are in there. And there's no configuration you can change that fixes it. Even if you set the SQS batch size for the lambda to 1.
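One way to confirm the symptom, sketched below with a placeholder DLQ URL, is to peek at the messages sitting in the DLQ and look at their ApproximateReceiveCount attribute: it will be at the maximum receive count even though your handler never logged a single attempt.

```python
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq"  # placeholder

resp = sqs.receive_message(
    QueueUrl=DLQ_URL,
    MaxNumberOfMessages=10,
    AttributeNames=["ApproximateReceiveCount"],
    VisibilityTimeout=0,  # peek without hiding the messages from other consumers
)
for msg in resp.get("Messages", []):
    # Receive counts at the max, with no corresponding log lines from your
    # handler, are the signature of this problem.
    print(msg["MessageId"], msg["Attributes"]["ApproximateReceiveCount"])
```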
If you think you might be running into this problem, check two key stats in the AWS console: the lambda's throttle count, and the number of messages in the DLQ. If you see a lambda that suddenly gets heavily throttled at the same time that lots of messages land in your DLQ, but no errors in your logs, this is likely your culprit.
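You can also pull the same two stats from CloudWatch instead of clicking around the console. A rough sketch; the function name ("my-processor") and DLQ name ("orders-dlq") are placeholders.

```python
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch")
now = datetime.utcnow()

def datapoints(namespace, metric, dimensions, stat):
    resp = cw.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=now - timedelta(hours=6),
        EndTime=now,
        Period=300,
        Statistics=[stat],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

# Lambda throttles spiking at the same time the DLQ fills up, with no errors
# in your logs, is the telltale correlation.
throttles = datapoints("AWS/Lambda", "Throttles",
                       [{"Name": "FunctionName", "Value": "my-processor"}], "Sum")
dlq_depth = datapoints("AWS/SQS", "ApproximateNumberOfMessagesVisible",
                       [{"Name": "QueueName", "Value": "orders-dlq"}], "Maximum")
```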
It seems crazy that it works this way, and seemingly has for years. AWS's internal code is doing the wrong thing, and wasting developer hours across the globe. Ethically, there's also the question of whether you're getting billed for all of those erroneous message receives. But I'm mostly worried about having a bad system that is a pain in the ass to detect and work around.