Error handling in AWS Lambda triggered by SQS events

In a recent post I described the new feature of triggering Lambda functions by SQS events.

One implication of using this feature is that you do not need to delete SQS messages in the Lambda function, because AWS will do it for you, but only when the Lambda completes successfully.

If your Lambda fails, the message remains in the queue and becomes visible to the Lambda function again after the “visibility timeout”, a period during which Amazon SQS prevents other consumers from receiving and processing the message (it defaults to 30 seconds).

You can configure this timeout on the SQS queue (with the VisibilityTimeout attribute when creating it) or for a specific message (with a ChangeMessageVisibility call).
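As a rough sketch of the per-message option, the helper below builds the parameters for a ChangeMessageVisibility call from a record of the Lambda event. The helper name, the example queue URL and the policy of doubling the timeout on every receive are assumptions made for illustration only; the 12 hour cap is the SQS maximum for a visibility timeout.

```javascript
'use strict';

// SQS accepts visibility timeouts between 0 seconds and 12 hours.
const MAX_VISIBILITY_TIMEOUT_SEC = 43200;

// Hypothetical helper: builds the parameters for a ChangeMessageVisibility call.
// Doubling the timeout per receive is an assumption, not something SQS requires.
function buildVisibilityParams(queueUrl, record, baseSeconds) {
	// ApproximateReceiveCount arrives as a string in the Lambda event.
	const receiveCount = parseInt(record.attributes.ApproximateReceiveCount, 10) || 1;
	const timeout = Math.min(baseSeconds * Math.pow(2, receiveCount - 1), MAX_VISIBILITY_TIMEOUT_SEC);
	return {
		QueueUrl: queueUrl,
		ReceiptHandle: record.receiptHandle,
		VisibilityTimeout: timeout
	};
}

// Made-up record and queue URL, shaped like the SQS event Lambda receives.
const exampleRecord = {
	receiptHandle: 'example-receipt-handle',
	attributes: { ApproximateReceiveCount: '3' }
};
const exampleParams = buildVisibilityParams(
	'https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue', exampleRecord, 30);
console.log(exampleParams.VisibilityTimeout); // 120
```

Inside a handler, the returned object could be passed to sqs.changeMessageVisibility(params, callback) from the aws-sdk before reporting the failure.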

If you do not handle such situations, then over time, as your Lambda keeps running and messages keep failing, you will waste AWS computing power, lose money, and your SQS queues will grow with dangling unprocessed messages.

Eventually, each message will be deleted from the SQS queue after its expiry time (the MessageRetentionPeriod attribute, which you can specify when creating the SQS queue; it defaults to 4 days), but you should not rely on that. You should control the situation yourself.

How can we control it? At least two possibilities came to my mind. If you know others, please share them in the comments.

Dead-Letter Queue or DLQ

DLQ is the name for a queue dedicated to keeping failed messages. From a technical point of view it does not differ from any other SQS queue; the difference is only in its purpose. When a message cannot be handled properly in the Lambda function, it can be passed to the DLQ, and from there a separate function can handle it, e.g. send it again to the primary Lambda function, send a notification to an admin, or just delete it.

Here I will show you an example of an application using a DLQ.

I wish it were as simple as it should be. According to the Serverless documentation, you only need to specify an onError property on your Lambda function, like this:

functions:
  receiver:
    handler: receiver.handler
    events:
      - sqs:
          arn:
            Fn::GetAtt:
              - MyQueue
              - Arn
    onError: arn:aws:sqs:us-east-1:811338114639:ReceiverDLQ

where you specify the ARN of an existing SQS queue.

Unfortunately, it currently fails with the error:

onError currently only supports SNS topic arns due to a race condition when using SQS queue arns and updating the IAM role. Please check the docs for more info.

This is mentioned in the docs as well.

Fortunately, there is a workaround: do not use the onError property and configure the DLQ in good old CloudFormation style.

First, you need to define a new SQS queue in the resources section of your serverless.yaml if you want Serverless to create it during deployment.

resources:
  Resources:
    ReceiverDeadLetterQueue:
      Type: "AWS::SQS::Queue"
      Properties:
        QueueName: "receiverDLQ"
        MessageRetentionPeriod: 1209600 # 14 days in seconds

Then you need to add error handling configuration. Here you have two choices:

Configure DLQ with DeadLetterConfig on Lambda function
Configure RedrivePolicy on the source queue.

I tried the first option but I could not make it work. My trial project, which fails, is on GitHub together with a full explanation.

Here I will describe the second option, which worked fine.

First, we will define the resources that will be created when a new application stack is created:


resources:
  Resources:
    MyQueue:
      Type: "AWS::SQS::Queue"
      Properties:
        QueueName: "MyQueue"
        VisibilityTimeout: 30
        MessageRetentionPeriod: 60
        RedrivePolicy:
          deadLetterTargetArn:
            "Fn::GetAtt":
              - ReceiverDeadLetterQueue
              - Arn
          maxReceiveCount: 1
    ReceiverDeadLetterQueue:
      Type: "AWS::SQS::Queue"
      Properties:
        QueueName: "receiverDLQ"
        MessageRetentionPeriod: 1209600 # 14 days in seconds

We defined two queues, MyQueue and ReceiverDeadLetterQueue, and configured a redrive policy on the source queue (MyQueue). This way, when a message has been received from the source queue more than maxReceiveCount times, it is redirected to ReceiverDeadLetterQueue.

It is important to remember that viewing messages in MyQueue from the AWS Console also counts as “receiving” a message, so your Lambda function would not get the chance to process it: it would be sent straight to the DLQ after you closed the viewer screen.

Now that the application stack is deployed to AWS, we can run a test.

The source code is available here in branch redrive-policy.

We can first test whether the function works as expected by submitting a JSON message to MyQueue and checking the logs:
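The receive count that the redrive policy relies on is also visible inside the function, as the ApproximateReceiveCount attribute of each record (you can see it in the logged event further down). A minimal sketch, assuming a hypothetical helper name and that you pass in the same maxReceiveCount you configured on the queue, of how a handler could detect that it is looking at the last delivery before the message goes to the DLQ:

```javascript
'use strict';

// Hypothetical helper: true when the current delivery is the last one before
// SQS moves the message to the dead-letter queue.
function isLastAttempt(record, maxReceiveCount) {
	// ApproximateReceiveCount arrives as a string in the Lambda event.
	const received = parseInt(record.attributes.ApproximateReceiveCount, 10);
	return received >= maxReceiveCount;
}

const firstDelivery = { attributes: { ApproximateReceiveCount: '1' } };
console.log(isLastAttempt(firstDelivery, 1)); // true
console.log(isLastAttempt(firstDelivery, 3)); // false
```

With maxReceiveCount set to 1, as in the configuration above, every delivery is the last attempt, which is why a single failure is enough to send a message to the DLQ.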

export QUEUE_URL=`aws sqs get-queue-url --queue-name MyQueue --query 'QueueUrl' --output text --profile=sls`
aws sqs send-message --queue-url ${QUEUE_URL} --message-body "{ \"text\" : \"test2\"}" --profile=sls
sls logs -f receiver -t

It should log something like:

2018-07-19 20:40:19.789 (+02:00)	b0a7c784-eeea-5901-8786-944e33b92c18	event:  {"Records":[{"messageId":"3a0a90f5-8fb5-4cbb-8b46-017039272637","receiptHandle":"AQEBfT4rwo5KGoqL1RGBVp4WlsSumH5ToYEENfhDGPXUv2RPdLNDsUTJ/3QjeevDqK5YPrQ1xoSixvACIYlMD7HtXNh0iRRS8VBqutR7tx8ZMkucagRweZ3WyYxXImrVcsD33NIPFlUeldQjRTTnPKfjo+BCZ5TZCs1ndnSKVrqwMzMi7MwTDtI4a1b0IxEJdhiWot7eJ7EwaPmdfVpgK6K/vUQeW+jep3M1cUhNGwoz7H/2bL2FVX54uzfjoShLY7JymmkAuPzqQ/3KQPPvOyMvuragXmzE8VxA+phHHl66hWBj0nlOSWwYbXAFI765/Ik6zPk+nwKm2PoQs3hvPl1aDzAU1FHtqYqaH6GmJ7AAaHwNnufv83ez2hoZGv2AhhDE","body":"{ \"text\" : \"test2\"}","attributes":{"ApproximateReceiveCount":"1","SentTimestamp":"1532025619465","SenderId":"AIDAJLGFXMWT34E5GOTAS","ApproximateFirstReceiveTimestamp":"1532025619496"},"messageAttributes":{},"md5OfBody":"9be3787fa6959a8ed52b3fd5be1aa95a","eventSource":"aws:sqs","eventSourceARN":"arn:aws:sqs:us-east-1:811338114639:MyQueue","awsRegion":"us-east-1"}]}
2018-07-19 20:40:19.790 (+02:00)	b0a7c784-eeea-5901-8786-944e33b92c18	text:  test2

Then let’s cause an error in Lambda by sending a message with an invalid JSON body, e.g.:

aws sqs send-message --queue-url ${QUEUE_URL} --message-body "test" --profile=sls

Watch the application logs:

sls logs -f receiver -t

And you will see an error like:

SyntaxError: Unexpected token e in JSON at position 1
    at Object.parse (native)
    at exports.handler (/var/task/receiver.js:15:29)

Lambda did not succeed, so the message was returned to MyQueue. It had already been received once, which is the maxReceiveCount we configured, so instead of being delivered a second time it is redirected to the DLQ.

You can list the messages in the DLQ and see that a new message appeared:

export DLQ_QUEUE_URL=`aws sqs get-queue-url --queue-name receiverDLQ --query 'QueueUrl' --output text --profile=sls`
aws sqs receive-message --queue-url ${DLQ_QUEUE_URL} --visibility-timeout 0 --profile=sls

NOTE: by setting --visibility-timeout to 0 you can query the same message every time you execute the command, without waiting for the message to become visible again.

Now you can do what you want with the failed messages in the DLQ, e.g. you could process them and decide whether to retry by sending them back to the main application queue MyQueue, or to delete them.
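As an illustration only, here is a minimal sketch of such a decision, assuming the DLQ is consumed by another SQS-triggered Lambda (so each record has a body field). The function name and the policy itself (redrive messages with a valid JSON body, discard the rest) are made up for this example; your own criteria will likely differ.

```javascript
'use strict';

// Hypothetical policy for a DLQ consumer: messages with a valid JSON body are
// worth sending back to the main queue, anything else is dropped.
function decideDlqAction(record) {
	try {
		JSON.parse(record.body);
		return 'redrive';
	} catch (e) {
		return 'discard';
	}
}

console.log(decideDlqAction({ body: '{ "text" : "test2"}' })); // redrive
console.log(decideDlqAction({ body: 'test' })); // discard
```

The "test" message from the experiment above would be discarded under this policy, because no amount of retrying will make its body parse.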

Error handling in Lambda function

Imagine that you do not want to set up a DLQ and you prefer to handle errors yourself. What would the function look like?

Below is an example:

// receiverErrorHandling.js

'use strict';

var AWS = require('aws-sdk');
var sqs = new AWS.SQS({
	region: 'us-east-1'
});


exports.handler = (event, context, callback) => {

	const NUM_OF_RETRIES = 3;
	try {
		console.log('event: ', JSON.stringify(event));

		throw new Error("simulated error");

		// this will never be reached in our demo
		callback(null, "How did you get here??");

	} catch (e) {
		console.log('Handled error', e);

		// If below the retry limit, send a new message with an incremented count; otherwise exit successfully
		// to allow AWS to remove the SQS message, but return a status message.
		// We've set batchSize=1 in the sls config, so it's safe to get the record by index.
		var message = JSON.parse(event.Records[0].body); // message body arrives as a JSON string
		var retried = message.retryCount || 0; // defaults to 0 on the first attempt
		if (retried > NUM_OF_RETRIES-1) {
			const response = "Failed after retries";
			console.log(response);
			callback(null, response);
		} else {
			retried++;
			message.retryCount = retried;

			// send a new message which is a copy of the received message but with an incremented retry counter.
			var accountId = context.invokedFunctionArn.split(":")[4];
			var queueUrl = 'https://sqs.us-east-1.amazonaws.com/' + accountId + '/MyQueue';

			var params = {
				MessageBody: JSON.stringify(message),
				QueueUrl: queueUrl,
				DelaySeconds: 10
			};

			sqs.sendMessage(params, function (err, data) {
				if (err) {
					console.log(err);
					callback( "Failed to retry after error" );
				} else {
					const response =  "Failed, but will retry " + retried + " time";
					console.log(response);
					callback(null,response);
				}
				
			});
		}
	}
};

This function handles the following main errors:

If an error is thrown in the try block, it is caught and we do not let it propagate outside; otherwise the message would stay in the queue and arrive at the Lambda again.
In the catch block we decide what to do about the error. In this case we let Lambda retry the same message up to 3 more times (NUM_OF_RETRIES). We have to send a new message enriched with the retry count to the queue. In the message we set the DelaySeconds attribute because we do not want to invoke this function again immediately. E.g. if in the try block we connected to some external services and the error resulted from that, we could expect that after the delay they might have healed themselves and the retry would succeed.
It could happen that we cannot send the message, e.g. because SQS is temporarily unavailable. In this case we return an error in the Lambda callback and the same message returns to the queue. Lambda will process it one more time. If sending keeps failing again and again, Lambda would retry until the message expires in the queue, which we defined in serverless.yaml as 60 sec.

resources:
  Resources:
    MyQueue:
      Type: "AWS::SQS::Queue"
      Properties:
        QueueName: "MyQueue"
        VisibilityTimeout: 30 # 30 sec.
        MessageRetentionPeriod: 60 # 60 sec.
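The retry decision made in the catch block can also be pulled out into a small pure function, which is easier to test than the full handler. This is only a sketch of the same logic: the name nextRetryBody is hypothetical, and it assumes, like the handler above, that the body is valid JSON.

```javascript
'use strict';

// Returns the body of the next retry message, or null when the retry limit is reached.
function nextRetryBody(body, maxRetries) {
	const message = JSON.parse(body);
	const retried = message.retryCount || 0; // 0 when the message was never retried
	if (retried >= maxRetries) {
		return null;
	}
	message.retryCount = retried + 1;
	return JSON.stringify(message);
}

console.log(nextRetryBody('{"text":"Hello"}', 3)); // {"text":"Hello","retryCount":1}
console.log(nextRetryBody('{"text":"Hello","retryCount":3}', 3)); // null
```

The handler would then call sqs.sendMessage only when the result is not null, and otherwise complete successfully so that the message is removed from the queue.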

Note that the message body arrives in the event as text, even if we send JSON, so we need to parse it into an object ourselves.

	var message = JSON.parse(event.Records[0].body);

You can also see that Lambda can process multiple SQS messages at once and that we refer only to the first message:

event.Records[0]

In other circumstances this could be unsafe, but we configured Lambda to poll only a single message at a time (batchSize), like this:

  receiver:
    handler: receiverErrorHandling.handler
    events:
      - sqs:
          arn:
            Fn::GetAtt:
              - MyQueue
              - Arn
          batchSize: 1
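If you ever raise batchSize above 1, the handler has to loop over all records instead of reading index 0. The sketch below is one possible shape for that, not part of the project's code: processBatch and handleOne are hypothetical names, and the returned batchItemFailures list is only honoured by Lambda when the ReportBatchItemFailures response type is enabled on the SQS event source mapping. Without that setting, a single failure sends the whole batch back to the queue.

```javascript
'use strict';

// Processes every record in the batch and reports the ones that failed.
// `handleOne` is a hypothetical per-message function that throws on failure.
function processBatch(event, handleOne) {
	const batchItemFailures = [];
	for (const record of event.Records) {
		try {
			handleOne(record);
		} catch (e) {
			console.log('Failed to process message', record.messageId, e.message);
			batchItemFailures.push({ itemIdentifier: record.messageId });
		}
	}
	return { batchItemFailures: batchItemFailures };
}

// Made-up event with one valid and one invalid JSON body.
const exampleEvent = {
	Records: [
		{ messageId: 'a', body: '{"text":"Hello"}' },
		{ messageId: 'b', body: 'test' }
	]
};
const result = processBatch(exampleEvent, (record) => JSON.parse(record.body));
console.log(JSON.stringify(result)); // {"batchItemFailures":[{"itemIdentifier":"b"}]}
```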

You can grab the code from GitHub and test it yourself.

You can send a JSON message:

curl <sender url> -d '{"text" : "Hello"}'

and watch logs:

sls logs -f receiver -t

You will see the following sequence of logs:

START RequestId: 72ddf8e2-a835-5d10-9088-ab37291615f8 Version: $LATEST
2018-07-27 22:22:10.532 (+02:00)	72ddf8e2-a835-5d10-9088-ab37291615f8	event:  {"Records":[{"messageId":"5d6caf38-23ff-4408-bba0-85b538fad092","receiptHandle":"AQEBV5+zoh9qYCqoKGye8D4gYjpXp1cXTUzm2gNLyAvpEV3XBRsDtyPn81C0G4LaMm3urF6yOUi8or6Zg29V+c7MjRq1nvw4XBk9rGEADdkVqd4/YGo2eGdfbuSOoPVQUbfN2Qk36VG/rjHmQHAYNTudfC4dD6upMfLHNsQEvaiUCUViF7ONa7H0ZGqcvEqNSfv6wH992PpRsRHkgJ52KhqVGVkn2wGUzO7djUv4zMfWWtD7ZGI0N6DB15OpeTtZTYtW8qbzcINuj51/B3Ty5oKh864sidQddm4VuqQQkOEOKVu4n+j/fAW+yOX20RfxJno+mCoh04gD7eEvGI/XgpAKAYsInKEpC8Meu6rjl9Icy2GFUp14e3Za1M8u9VlzsOZV","body":"{\"text\" : \"Hello\"}","attributes":{"ApproximateReceiveCount":"1","SentTimestamp":"1532722929910","SenderId":"AROAJ77G56ARJABY7JWOY:sqs-triggers-error-handling-dev-sender","ApproximateFirstReceiveTimestamp":"1532722929911"},"messageAttributes":{},"md5OfBody":"35efbdc1c909ad760d1b85d56e450139","eventSource":"aws:sqs","eventSourceARN":"arn:aws:sqs:us-east-1:811338114639:MyQueue","awsRegion":"us-east-1"}]}
2018-07-27 22:22:10.532 (+02:00)	72ddf8e2-a835-5d10-9088-ab37291615f8	Handled error Error: simulated error
    at exports.handler.e (/var/task/receiverErrorHandling.js:17:9)
2018-07-27 22:22:10.702 (+02:00)	72ddf8e2-a835-5d10-9088-ab37291615f8	Failed, but will retry 1 time
END RequestId: 72ddf8e2-a835-5d10-9088-ab37291615f8
REPORT RequestId: 72ddf8e2-a835-5d10-9088-ab37291615f8	Duration: 174.19 ms	Billed Duration: 200 ms 	Memory Size: 1024 MB	Max Memory Used: 34 MB	
...
2018-07-27 22:22:20.768 (+02:00)	08200246-1f91-5f43-983f-51939ea57388	Failed, but will retry 2 time
...
2018-07-27 22:22:30.830 (+02:00)	4313584b-9c88-5da8-a5ac-0b9e1db69a3a	Failed, but will retry 3 time
...
2018-07-27 22:22:40.876 (+02:00)	7c1ab10d-66ab-58d1-ac12-e2d8e7ed3f43	Failed after retries
...

In the logs you can also see from the timestamps that the retries are 10 seconds apart. This is what we wanted to achieve by setting DelaySeconds on the message which we send to SQS.
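The delay does not have to be constant. As a sketch of one alternative, the hypothetical function below grows the delay with the retry count, giving external services more time to recover on each attempt. The doubling policy is an assumption for illustration; the 900 second cap is the maximum SQS allows for DelaySeconds.

```javascript
'use strict';

// SQS limits DelaySeconds to 15 minutes.
const MAX_DELAY_SEC = 900;

// Hypothetical policy: 10 s before the first retry, doubling for each further retry.
function delayForRetry(retryCount) {
	return Math.min(10 * Math.pow(2, retryCount - 1), MAX_DELAY_SEC);
}

console.log([1, 2, 3].map((n) => delayForRetry(n))); // [ 10, 20, 40 ]
```

In the handler above, the result could replace the fixed value, e.g. DelaySeconds: delayForRetry(retried).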

Takeaways:

Working with SQS events can be tricky. You need to keep in mind several settings, such as message delay, visibility timeout, retention period and batch size.
Error handling in Lambda can be achieved in many ways: by manual handling with try-catch, by using Dead-Letter Queues, or possibly by a mix of both.
Using Dead-Letter Queues lets you decouple the error handling logic from the application logic. You can see that the second example with try-catch is a little bit of a mess :)

Cloudly Tech

in Cloud it all happens
