Learn how to orchestrate long-running serverless workflows with AWS Step Functions, how they're billed, and how to optimize their costs.
If you were asked to create a workflow that could run for a few months, and you had to use serverless technologies, how would you design it? Would you turn to AWS Fargate, or would you push back on the requirements and spin up an EC2 instance to track the whole process? Well, you wouldn't actually have to do any of that, since this is what AWS Step Functions were created for.
In this article, you'll see what Step Functions are, and how AWS charges you to use them. After that, we'll look at the pros and cons of the billing structure and end with a few strategies to optimize your costs.
Step Functions allow you to orchestrate multi-step processes in a serverless solution. They call resources in other AWS services, branch based on conditions, iterate over arrays, and handle errors. You can think of Step Functions as essentially being AWS's low-code solution for serverless.
You may be thinking, why not just do all of that in a Lambda? As stated in the orchestration anti-pattern section of the Lambda operator guide, that can quickly turn into unmanageable spaghetti code. You'll also get double-billed, since you pay for the orchestrating Lambda to sit idle while it waits for the other Lambdas to finish.
One final benefit of using Step Functions over a Lambda to orchestrate a serverless process is how long a Step Function can run for. As of this writing, a Lambda can only run for 15 minutes, but a Step Function can run for up to 1 year.
AWS Step Functions has two different types of workflows: standard and express. The two workflows operate quite differently, so it's important to understand how they work before picking one. If you want more details about the differences between the two workflow types, AWS has a page in the Step Functions Developer Guide that compares the two.
Standard workflows are the original workflow type. They orchestrate long-running processes and can be left running for up to a year. You're charged per step transition, so you can have a step run for quite a long time, and you won't get charged any more for it.
Express workflows, as the name implies, are for shorter-lived workflows. The longest they're allowed to run is five minutes, and you're charged by the GB-second and per request. If you have experience with Lambda development, this should sound familiar.
Express workflows also transition between states faster, which makes the whole workflow run faster. A state transition takes about 20ms in an express workflow, versus between 350ms and 670ms in a standard workflow.
Learning new languages only to implement a small part of a solution can be a pain, but the Amazon States Language is really just a JSON schema, and it isn't too difficult to learn. Once you get the hang of it, it's actually pretty easy to develop a state machine.
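To give a rough idea of what that JSON looks like, here is a minimal sketch of a two-state machine, built as a Python dictionary and serialized to JSON. The state names and the Lambda ARN are hypothetical placeholders; the field names (`StartAt`, `States`, `Type`, `Resource`, `Next`) come from the Amazon States Language.

```python
import json

# Hypothetical ARN; substitute a function from your own account.
PROCESS_ORDER_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-order"

state_machine = {
    "Comment": "Minimal example: run one task, then finish.",
    "StartAt": "ProcessOrder",
    "States": {
        # A Task state calls a resource, here a Lambda function.
        "ProcessOrder": {
            "Type": "Task",
            "Resource": PROCESS_ORDER_ARN,
            "Next": "Done",
        },
        # A Succeed state ends the execution successfully.
        "Done": {"Type": "Succeed"},
    },
}

# The serialized string is what you would deploy as the state machine definition.
definition = json.dumps(state_machine, indent=2)
print(definition)
```

Every state other than a terminal one names its successor with `Next`, so the whole workflow can be read top to bottom from `StartAt`.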
Once you have your state machine developed and deployed, you can pretty much forget about it. AWS manages the service so you don't have to scale it, upgrade it, or deal with outages.
If you're using a standard workflow, it provides a detailed history view for each execution. You can drill down and see what the inputs and outputs for every step were, and you can see any exceptions that occurred. This makes tracking down errors really easy.
Part of the serverless trilemma is the concept of double billing. This happens when you're paying based on time, and one execution has to wait idly for another execution to finish.
Double billing often happens in express workflows, since you pay for time. You can cut down on some of the double billing by calling AWS services directly from your workflow, but anytime you have to call into a Lambda, you'll be double billed for it. You'll have to pay for cold starts too.
Now that you've seen how Step Function billing works and some of the benefits and drawbacks, we can look at ways to optimize their costs. Like everything else in software development, these strategies aren't one size fits all. You'll have to review the metrics of your system to see if any of these work in your specific situation.
As usual, any cost optimization initiative benefits from a robust tagging strategy - the more you’ve tagged your environment in line with how you think about cost centers for your business, the more valuable insights you’ll get. If you want advice on setting up tags, check out our article on tagging.
In the Step Function Best Practices, AWS recommends using standard workflows for long-running processes and express workflows for high-volume, short processes. But what if your application doesn't neatly fit into one of those two categories?
Because of the two different billing models used, there is a tipping point where the cost of using an express workflow becomes higher than using a standard one. This point changes depending on your application, but you can calculate it by taking the number of state transitions in your workflow and multiplying it by the standard workflow price per transition. Then divide that number by the express workflow price per GB-second to get the break-even run time.
To keep things simple, let's say you have a Step Function with 25 state transitions on average. Currently, that would cost $0.000625 per execution as a standard workflow. As of this writing, an express workflow costs $0.00001667 per GB-second. So if we divide the first value by the second, assuming the express workflow is billed for 1 GB of memory and ignoring its per-request charge, we find that once our express workflow takes more than about 37.5 seconds to complete, it's cheaper to use a standard workflow.
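The same arithmetic can be sketched in a few lines of Python. The prices are the ones quoted above and will drift over time; the `memory_gb` parameter is an assumption added here to make explicit that the 37.5-second figure holds when the execution is billed for 1 GB of memory, and the sketch ignores the express per-request charge.

```python
STANDARD_PRICE_PER_TRANSITION = 0.000025  # USD, i.e. $0.000625 / 25 transitions
EXPRESS_PRICE_PER_GB_SECOND = 0.00001667  # USD, the rate quoted above


def break_even_seconds(transitions: int, memory_gb: float = 1.0) -> float:
    """Run time at which an express execution costs as much as a standard one.

    Assumes the express execution is billed for `memory_gb` of memory and
    ignores the express per-request charge.
    """
    standard_cost = transitions * STANDARD_PRICE_PER_TRANSITION
    return standard_cost / (EXPRESS_PRICE_PER_GB_SECOND * memory_gb)


print(round(break_even_seconds(25), 1))  # 37.5
```

If your workflow is billed for less memory than that, the break-even run time grows proportionally, so it's worth plugging in your own numbers.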
If you're using an express workflow, you're going to get double billed for all of the run time of your Lambdas. This also means any optimizations that speed up your Lambda will save you twice the amount too.
Depending on what you're running in your Lambda, you may be able to speed up execution by assigning more memory to it. This works because assigning more memory gives the Lambda a higher percentage of the CPU clock cycles. Sometimes, this speed up of the Lambda can be great enough to more than offset the increased cost of the extra memory. Again, you'll need to run some experiments to see how your code responds to different sized Lambdas.
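As a back-of-the-envelope sketch of that trade-off, the snippet below compares the compute charge of one invocation at two memory sizes. The per-GB-second rate is an assumed Lambda price that you should check against current pricing, and the durations are hypothetical measurements, not benchmarks.

```python
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667  # USD; assumed rate, check current pricing


def lambda_compute_cost(memory_mb: int, duration_ms: float) -> float:
    """Compute charge for one invocation, ignoring the per-request fee."""
    return (memory_mb / 1024) * (duration_ms / 1000) * LAMBDA_PRICE_PER_GB_SECOND


# Hypothetical measurements: doubling memory cuts the run time by more than half.
small = lambda_compute_cost(memory_mb=512, duration_ms=1200)
large = lambda_compute_cost(memory_mb=1024, duration_ms=500)

print(f"512 MB:  ${small:.10f}")
print(f"1024 MB: ${large:.10f}")
print("larger size is cheaper" if large < small else "smaller size is cheaper")
```

With these made-up numbers the larger Lambda wins, and in an express workflow the shorter run time also trims the workflow's own duration charge. If doubling the memory had only cut the run time by a third, the smaller size would have been cheaper.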
Since express workflows charge based on run time, any reduction in Lambda cold-start times will help reduce your bill too. One way to do that is to use provisioned concurrency.
Lambda Provisioned Concurrency keeps a set number of instances warm, so you don't have to wait for cold starts. You're charged by the GB-second for every instance you keep warm, whether or not it's handling requests. It can get quite costly itself, but depending on the scale of your application, it can lower your overall bill.
AWS Step Functions can call into most services directly using AWS SDK Service Integrations. You can use this feature to avoid some double billing when you need to write something to a DynamoDB Table.
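For illustration, a Task state that writes to DynamoDB without a Lambda in between might look like the sketch below, again expressed as a Python dictionary and serialized to Amazon States Language JSON. The table name, attribute names, and input field are hypothetical. The resource shown is the DynamoDB integration ARN; the AWS SDK integration form would be `arn:aws:states:::aws-sdk:dynamodb:putItem`.

```python
import json

# Hypothetical table name and input fields.
save_order_state = {
    "Type": "Task",
    # Calls DynamoDB PutItem directly, with no Lambda in between.
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
        "TableName": "Orders",
        "Item": {
            # The ".$" suffix tells Step Functions to read the value
            # from the state input using a JSONPath expression.
            "orderId": {"S.$": "$.orderId"},
            "orderStatus": {"S": "RECEIVED"},
        },
    },
    "End": True,
}

print(json.dumps(save_order_state, indent=2))
```

Because the workflow talks to DynamoDB itself, there is no Lambda run time, and no cold start, to pay for on this step.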
While Step Functions are easy to use and a great solution, you may reach a point where they're costing a lot. Maybe you have a standard Step Function that gets called a lot, and it has a bunch of state transitions in it.
One workaround is to use an EventBridge event bus. You can have the event bus route events to different Lambdas depending on the event type. After the Lambda finishes processing the event, it can send a new event back to the same bus.
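A minimal sketch of one such Lambda is shown below, assuming boto3 is available in the runtime. The bus name, event source, detail types, and payload fields are all hypothetical; `put_events` and its `Entries` fields (`Source`, `DetailType`, `Detail`, `EventBusName`) are the EventBridge API. The optional `events_client` argument is an addition made here so the handler can be exercised with a stub.

```python
import json


def build_event(source: str, detail_type: str, detail: dict, bus_name: str) -> dict:
    """Build one entry in the shape that EventBridge's PutEvents call expects."""
    return {
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),  # Detail must be a JSON string
        "EventBusName": bus_name,
    }


def handler(event: dict, context: object, events_client=None) -> dict:
    """Handle one step, then emit the event that triggers the next step."""
    if events_client is None:
        import boto3  # imported lazily so the sketch can be tested without AWS

        events_client = boto3.client("events")

    order_id = event["detail"]["orderId"]
    # ... do the actual work for this step here ...

    next_event = build_event(
        source="orders.workflow",       # hypothetical source name
        detail_type="OrderValidated",   # hypothetical event type for the next step
        detail={"orderId": order_id},
        bus_name="orders-bus",          # hypothetical bus name
    )
    return events_client.put_events(Entries=[next_event])
```

Each Lambda would have an EventBridge rule matching its own event type, so the chain of rules and emitted events takes the place of the state machine definition.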
You'll lose out on the benefits of Step Functions by using this approach, but if your development team can manage it, you can reduce your bill by a fair amount. Currently, an event bus charges $1 per million events. If you used state transitions instead, it would cost $25 per million.
In this article, you've seen how AWS Step Functions solve the need to orchestrate multiple processes in a serverless environment. You learned how Step Function costs are calculated, and some of the benefits and drawbacks of that. Finally, you got to see a few strategies to reduce your Step Function bill. You should now be able to take this knowledge and implement it to improve your serverless costs and workflows.