
How to Autoscale Redis Clusters with AWS Step Functions and Lambda

by DJ Pham | Jul 21, 2021

A graphic representation of our Redis clusters using AWS Lambda.

At Rewind, we’ve got a lot of data to move, store, and secure – nearly 2 petabytes worth across multiple AWS VPCs. As you may know, we use AWS Lambda to execute some of the serverless functions of the Rewind Vault. For our Redis clusters we use Amazon ElastiCache, an in-memory data store, which provides enough memory for our Sidekiq workers to store job and operational data.

The problem:

The issue comes, as issues typically do, during periods of high demand. An unexpected volume of backup jobs can drive higher-than-usual Redis memory use, which in turn affects Sidekiq operationally. As good DevOps folks, we have an alarm to let us know when this occurs so we can respond appropriately. Today, that response means upsizing the Redis cluster to a bigger instance size with more memory, then investigating what caused the extra memory use.

Given backups run continuously at Rewind, the alarms on Redis memory use can occur at any time – day or night. While it doesn’t happen often, this got to be slightly annoying (and difficult for our DevOps team’s sleep cycles). We started to wonder if there was a way to automatically remediate these alarms so we can continue sleeping. What if there was some way to “auto-scale” AWS Elasticache Redis clusters?

The solution:

Because our overall process for upsizing Redis clusters is a fixed sequence of steps, we were looking for a service that could execute AWS Lambda functions in order. AWS Step Functions, an orchestrator that helps design and implement complex workflows, was exactly the tool we needed.

We created the Redis Autoscaler, which can automatically upsize our clusters. It uses AWS Step Functions, AWS Lambda, and an AWS EventBridge rule to automate this process, and all the resources are conveniently defined in a SAM template. Now the team is only alerted if the Redis Autoscaler has failed, resulting in a more stable production environment (and happier DevOps!).

Let’s get into the details of how we implemented this solution.

What are Step Functions?

AWS Step Functions define a workflow as a State Machine, written in Amazon States Language – a JSON-based structured language. Each step in the workflow is a state. Here’s a brief overview of the different state types:

  • ‘Task’ does some work in the state machine.
  • ‘Choice’ makes a choice between branches of execution.
  • ‘Fail’ or ‘Succeed’ stops the execution with a failure or success, respectively.
  • ‘Pass’ passes its input to its output or injects fixed data.
  • ‘Wait’ provides a delay of a certain period of time (or until a specified time).
  • ‘Parallel’ states begin parallel branches of execution.

Since there’s a lot to know about Step Functions and their states, we won’t go into all the details here – but you can find more documentation about states in the AWS documentation.
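To make the state types concrete, here’s a minimal, hypothetical state machine in Amazon States Language (the state names and Lambda ARN are illustrative, not from our actual workflow). It uses a ‘Task’, a ‘Choice’, a ‘Wait’, and a terminal ‘Succeed’ state:

```json
{
  "StartAt": "DoWork",
  "States": {
    "DoWork": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:DoWork",
      "Next": "IsDone"
    },
    "IsDone": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.done", "BooleanEquals": true, "Next": "Done" }
      ],
      "Default": "WaitAndRetry"
    },
    "WaitAndRetry": {
      "Type": "Wait",
      "Seconds": 10,
      "Next": "DoWork"
    },
    "Done": {
      "Type": "Succeed"
    }
  }
}
```

The ‘Choice’ state either exits the loop or falls through to the ‘Wait’ state, which is the same looping pattern we use later for snapshot polling.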

Overview of Rewind’s State Machine

Our manual process of upsizing Redis clusters was modeled into a State Machine. We used AWS Lambda functions that use Python 3.8 and the Boto3 library to define the steps in our State Machine and interact with our AWS resources.

Here’s a diagram detailing the finished product:


In our case, there are other steps beyond simply updating the ElastiCache Redis cluster and picking a new instance size. We have some external processes we need to pause and – most importantly – we want to take a snapshot of the Redis cluster before we modify it. Step Functions let us sequence all of these operations into a single, cohesive workflow.
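As a sketch of the “pick a new instance size” step: the Lambda behind it needs a deterministic way to choose the next-larger node type. The ordering below is a hypothetical example, not Rewind’s actual list:

```python
# Hypothetical ordering of ElastiCache node types, smallest to largest.
# A real implementation would cover the families and sizes you actually use.
NODE_SIZES = [
    "cache.r5.large",
    "cache.r5.xlarge",
    "cache.r5.2xlarge",
    "cache.r5.4xlarge",
]

def next_node_type(current: str) -> str:
    """Return the next-larger node type, or raise if already at the top."""
    index = NODE_SIZES.index(current)
    if index + 1 >= len(NODE_SIZES):
        raise ValueError(f"{current} is already the largest supported node type")
    return NODE_SIZES[index + 1]
```

The upsizing Lambda could then pass the chosen node type to ElastiCache’s ModifyReplicationGroup API via Boto3. Raising on the largest size (rather than silently staying put) means the ‘Parallel’ branch fails and a human gets paged, which is the safer behavior.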

Lessons Learned

Like any new project, there were plenty of kinks to iron out and problems to solve. Here are some of the lessons we learned throughout the process.

Triggering the Step Function:

The goal of the Redis Autoscaler was to automate the process of upsizing our Redis clusters when they were utilizing high amounts of memory. Thus, the goal was to trigger our Step Function when our alarms for high memory usage on our Redis clusters were in an alarm state.

We chose to use an AWS EventBridge rule to trigger our Step Function because of its flexibility (the rule encompasses all alarms within a region that match a defined condition) and its compatibility with AWS Step Functions. The EventBridge rule is defined in a SAM template:

CloudWatchAlarmEventRule:
 Type: AWS::Events::Rule
 Properties:
   EventPattern: {
     "source": ["aws.cloudwatch"],
     "detail-type": ["CloudWatch Alarm State Change"],
     "detail": {
       "state": {
         "value": ["ALARM"]
       },
       "configuration": {
         "description": [{
           "prefix": "Redis memory:"
         }]
       }
     }
   }
   State: ENABLED
   Targets:
     - Id: StepFunction
       Arn: !GetAtt StateMachine.Arn
       RoleArn: !GetAtt CloudWatchAlarmEventRuleIAMRole.Arn

EventBridge supports several types of filtering for event patterns. We chose to leverage prefix filtering by prepending ‘Redis memory:’ to the descriptions of all our CloudWatch alarms for Redis memory usage. In the resource definition above, under EventPattern, you can see how we configured the rule to use prefix filtering. The target of the rule is our Step Function, which is how the Step Function gets triggered when an alarm fires.
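For reference, an EventBridge event for a CloudWatch alarm state change looks roughly like this (abbreviated; the field values are illustrative). Its top-level fields – id, account, time, region, detail – are the same ones our Step Function passes along later:

```json
{
  "id": "c4c1c1c9-6542-e61b-6ef0-8c4d36933a92",
  "detail-type": "CloudWatch Alarm State Change",
  "source": "aws.cloudwatch",
  "account": "123456789012",
  "time": "2021-07-21T13:00:00Z",
  "region": "us-east-1",
  "detail": {
    "alarmName": "redis-memory-high",
    "state": { "value": "ALARM" },
    "configuration": {
      "description": "Redis memory: usage above threshold on production cluster"
    }
  }
}
```

The rule’s prefix filter matches against `detail.configuration.description`, which is why prepending ‘Redis memory:’ to the alarm descriptions is all it takes to opt an alarm into autoscaling.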

Parallel States:

One key piece of functionality we wanted was good error handling. Specifically, we wanted to ensure that if our Step Function failed at any point, the appropriate on-call personnel were notified. However, our Step Function contained numerous tasks and states, and we wondered if there was a way to encapsulate multiple states into a single state. This is where we leveraged the ‘Parallel’ state.

The ‘Parallel’ state is typically used for running branches of execution concurrently, and its own status is determined by the status of its branches: if any state in any branch fails, the entire ‘Parallel’ state fails. We leveraged this by wrapping all of the states involved in upsizing Redis into a single ‘Parallel’ state. That way, we can immediately push an SNS message to our alerting system if the Step Function fails at any point.
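In outline, the wrapper looks something like the following sketch. The branch contents and the ‘PauseExternalProcesses’ name are illustrative; the ‘Catch’ and ‘ResultPath’ handling matches the snippet discussed below:

```json
"UpsizeRedisCluster": {
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "PauseExternalProcesses",
      "States": {
        "PauseExternalProcesses": { "Type": "Task", "Resource": "...", "End": true }
      }
    }
  ],
  "Catch": [
    { "ErrorEquals": ["States.ALL"], "ResultPath": null, "Next": "Failed" }
  ],
  "ResultPath": null,
  "Next": "Succeeded"
}
```

Note that this is a single-branch ‘Parallel’ state – nothing actually runs concurrently. We use it purely as an error-handling boundary around the whole upsizing sequence.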

Input and Output Control:

When going from one state to the next, the default behavior is that the output of one state becomes the input of the next. However, within the State Machine definition, we have full control over the inputs and outputs of any state. One example of where we had to deviate from the defaults was our ‘Parallel’ state. See this snippet of the JSON State Machine definition:

"Catch": [
   {
     "ErrorEquals": [
       "States.ALL"
     ],
     "ResultPath": null,
     "Next": "Failed"
   }
 ],
 "ResultPath": null,
 "Next": "Succeeded"
},
"Failed": {
 "Type": "Pass",
 "InputPath": "$",
 "Parameters": {
   "id.$": "$.id",
   "account.$": "$.account",
   "time.$": "$.time",
   "region.$": "$.region",
   "detail.$": "$.detail",
   "status": "fail"
 },
 "Next": "PushCloudWatchAlarmToPagerDuty"
},
"Succeeded": {
 "Type": "Pass",
 "InputPath": "$",
 "Parameters": {
   "id.$": "$.id",
   "account.$": "$.account",
   "time.$": "$.time",
   "region.$": "$.region",
   "detail.$": "$.detail",
   "status": "success"
 },
 "Next": "PushCloudWatchAlarmToPagerDuty"
},

The snippet above shows the ‘Catch’, ‘ResultPath’, and ‘Next’ fields of the ‘Parallel’ state, and the contents for the ‘Failed’ and ‘Succeeded’ states. Our entire State Machine begins with the input of a CloudWatch alarm payload in JSON format.

For our use case, regardless of whether the ‘Parallel’ state fails or succeeds, we wanted to output the original CloudWatch payload. By default, however, the output of the ‘Parallel’ state would be the output of the last state to execute within the branch. So we set ‘ResultPath’ to null, both on the state itself and inside its ‘Catch’ block. This discards the branch’s result and passes through the original input – our CloudWatch alarm payload – as intended.

Since our ‘Parallel’ branch always outputs the original CloudWatch alarm payload, we needed a way to add a parameter recording whether the ‘Parallel’ state succeeded or failed. In the Parameters blocks of the ‘Failed’ and ‘Succeeded’ states, we pass along all of the original fields (id, account, time, region, detail) and add a ‘status’ parameter of ‘fail’ or ‘success’.

Handling Loops:

If you looked closely at the State Machine diagram above, you might notice that some of the states form a loop.

As part of the process of upsizing, we create a snapshot of the Redis cluster before we upsize the ElastiCache instance. Creating a snapshot can take over 30 minutes. Rather than have a long-running Lambda function poll for the status of the snapshot, we chose to create a loop within the State Machine that polls for the snapshot status every 30 seconds. The execution time of the polling Lambda function is a fraction of a second, so this design pattern minimizes Lambda costs (however small they are).

However, this solution brings the possibility of long-running loops (the quotas on running AWS Step Functions can be found here). Here is how we defined the three states in the State Machine that are involved in this loop:

"RedisCreateSnapshot": {
 "Type": "Task",
 "Resource": "${RedisCreateSnapshotFunctionArn}",
 "Next": "RedisGetSnapshotStatus"
},
"RedisGetSnapshotStatus": {
 "Type": "Task",
 "Resource": "${RedisGetSnapshotStatusFunctionArn}",
 "Retry": [
   {
     "ErrorEquals": [
       "States.TaskFailed"
     ],
     "IntervalSeconds": 15,
     "MaxAttempts": 5,
     "BackoffRate": 1.5
   }
 ],
 "Next": "IsSnapshotAvailable"
},
"IsSnapshotAvailable": {
 "Type": "Choice",
 "Choices": [
   {
     "Variable": "$.snapshot_status",
     "StringEquals": "available",
     "Next": "RedisUpsizeReplicationGroup"
   },
   {
     "Variable": "$.loop_count",
     "NumericGreaterThanEquals": 120,
     "Next": "FailParallelBranch"
   }
 ],
 "Default": "SnapshotCreatingWait30Seconds"
},
"SnapshotCreatingWait30Seconds": {
 "Type": "Wait",
 "Seconds": 30,
 "Next": "RedisGetSnapshotStatus"
},

Looking at the definition above, ‘RedisGetSnapshotStatus’ is a ‘Task’ state that runs the polling Lambda function. Its ‘Next’ state, ‘IsSnapshotAvailable’, is a ‘Choice’ state that checks whether the ‘snapshot_status’ parameter equals ‘available’. If it does, the loop breaks and execution proceeds to the next step; if not, we wait 30 seconds in the ‘Wait’ state before polling again. Notice that the ‘Choices’ block under ‘IsSnapshotAvailable’ contains an additional choice that looks at a parameter called ‘loop_count’. We avoid an infinitely running loop by creating and incrementing ‘loop_count’ in the ‘RedisGetSnapshotStatus’ step. If ‘loop_count’ reaches 120, the next step fails the entire ‘Parallel’ branch as intended. A ‘loop_count’ of 120 means the snapshot has taken over 3600 seconds (120 iterations * 30-second waits = 3600s), or 1 hour, and thus requires immediate investigation.
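As a sketch, the counting logic of the polling Lambda might look like the following. The event field names (‘snapshot_status’, ‘loop_count’) match the State Machine definition above, while the function name is illustrative; the Boto3 call that would supply the status is described in the docstring rather than executed here:

```python
def update_poll_state(event: dict, snapshot_status: str) -> dict:
    """Record the latest snapshot status and increment loop_count.

    In the real Lambda, snapshot_status would come from Boto3, e.g.
    elasticache.describe_snapshots(SnapshotName=...)["Snapshots"][0]["SnapshotStatus"],
    and `event` would be the Step Function input passed to the handler.
    """
    updated = dict(event)  # avoid mutating the caller's event
    updated["snapshot_status"] = snapshot_status
    # First poll creates loop_count at 1; each subsequent poll increments it.
    updated["loop_count"] = updated.get("loop_count", 0) + 1
    return updated
```

Because the Lambda returns the whole event, ‘snapshot_status’ and ‘loop_count’ flow back into the ‘IsSnapshotAvailable’ ‘Choice’ state on every iteration of the loop.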

Conclusion:

AWS Step Functions provide flexible functionality that is a good fit for automating complex processes across AWS resources. Building the Redis Autoscaler taught our DevOps and engineering teams plenty of lessons, and we hope they’re helpful for your future projects. At the very least, our DevOps team is sleeping soundly through the night, which is a win in our books.

Interested in solving problems like this? Want to get paid at the same time? Check out Rewind’s open positions to start your career in DevOps, Security, Engineering, and more.

DJ Pham