AWS Step Function Tricks (part 2)

Error handling and Recursive workflows

James Turner
Analytics Vidhya

AWS Step functions in action

If you haven’t already, start with the original AWS Step Function Tricks story, which covers some of the basic tricks you can do.

In this article we’re going to cover how to:

  • Handle errors in a more concise manner, keeping our workflows clean
  • Build recursive workflows for sequential date-based jobs

Error Handling

The AWS Step Functions states that do real work (Task, Parallel and Map) let you specify a Catch section to handle errors. Typically you want to notify yourself and your team when anything unexpected happens during your workflow: a batch job fails, an EMR cluster dies, a Docker container fails to start, a Lambda errors, and so on. As such, your workflow might look like this (trivial) example, where each step (Hello, World, Foo, Bar) catches its own errors and notifies you before failing the workflow.

A workflow where each step has a catch handler

As you can see this gets a little more horrible with each step you have to append to this workflow.

Here’s the step function declaration that would get us there (DON’T COPY THIS):

{
  "Comment": "One handler per state",
  "StartAt": "Hello",
  "States": {
    "Hello": {
      "Type": "Task",
      "Resource": "<LAMBDA_FUNCTION_ARN>",
      "Next": "World",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "Send Failure Message"
        }
      ]
    },
    "World": {
      "Type": "Task",
      "Resource": "<LAMBDA_FUNCTION_ARN>",
      "Next": "Foo",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "Send Failure Message"
        }
      ]
    },
    "Foo": {
      "Type": "Task",
      "Resource": "<LAMBDA_FUNCTION_ARN>",
      "Next": "Bar",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "Send Failure Message"
        }
      ]
    },
    "Bar": {
      "Type": "Task",
      "Resource": "<LAMBDA_FUNCTION_ARN>",
      "Next": "Job Succeeded",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "Send Failure Message"
        }
      ]
    },
    "Job Succeeded": {
      "Type": "Succeed"
    },
    "Send Failure Message": {
      "Type": "Pass",
      "Next": "Fail Workflow"
    },
    "Fail Workflow": {
      "Type": "Fail"
    }
  }
}

This works, except that every additional step we add needs its own Catch block. That adds a lot of bloat to our already horrid-looking workflow.

If we restructure this, we can have a workflow that only needs to catch once. Think of it as the outer exception handler you would normally create in your code, except we’re applying the same principle to workflows. The payoff grows with every state you add beyond the first couple. We can leverage the Parallel state to get what we want: put the real steps in a single branch and attach one Catch to the Parallel state itself. We want something more like this:

Catchall error handling for AWS step function states

The better error handling step function declaration. You can see that the states inside are a lot more concise and easier to manage, and the diagram is much nicer on the eyes too. (DO COPY THIS)

{
  "Comment": "Better error handling",
  "StartAt": "ErrorHandler",
  "States": {
    "ErrorHandler": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "Hello",
          "States": {
            "Hello": {
              "Type": "Pass",
              "Result": "Hello",
              "Next": "World"
            },
            "World": {
              "Type": "Pass",
              "Result": "World",
              "Next": "Foo"
            },
            "Foo": {
              "Type": "Pass",
              "Result": "Foo",
              "Next": "Bar"
            },
            "Bar": {
              "Type": "Pass",
              "Result": "Bar",
              "End": true
            }
          }
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "Send Failure Message"
        }
      ],
      "Next": "Job Succeeded"
    },
    "Job Succeeded": {
      "Type": "Succeed"
    },
    "Send Failure Message": {
      "Type": "Pass",
      "Next": "Fail Workflow"
    },
    "Fail Workflow": {
      "Type": "Fail"
    }
  }
}

No more Catch statements everywhere 🤗; we can add new steps in without having to remember to bind each one to the error handling.
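If you generate your definitions from code, the same wrapping can be done mechanically. Here's a rough sketch in Python that takes a chain of states and wraps it in a single-branch Parallel state with one catch-all. The helper names `wrap_with_error_handler` and `chain` are made up for this example (they're not part of any AWS SDK), and the placeholder Pass states stand in for your real Task states.

```python
import json


def wrap_with_error_handler(start_at, states, failure_state="Send Failure Message"):
    """Wrap a set of states in one Parallel state that carries the only Catch."""
    return {
        "Comment": "Better error handling",
        "StartAt": "ErrorHandler",
        "States": {
            "ErrorHandler": {
                "Type": "Parallel",
                # A single branch holds the real workflow.
                "Branches": [{"StartAt": start_at, "States": states}],
                "Catch": [
                    {
                        "ErrorEquals": ["States.ALL"],
                        "ResultPath": "$.error",
                        "Next": failure_state,
                    }
                ],
                "Next": "Job Succeeded",
            },
            "Job Succeeded": {"Type": "Succeed"},
            failure_state: {"Type": "Pass", "Next": "Fail Workflow"},
            "Fail Workflow": {"Type": "Fail"},
        },
    }


def chain(names):
    """Build a linear chain of placeholder Pass states from a list of names."""
    states = {}
    for i, name in enumerate(names):
        state = {"Type": "Pass", "Result": name}
        if i + 1 < len(names):
            state["Next"] = names[i + 1]  # hand over to the following state
        else:
            state["End"] = True  # last state ends the branch
        states[name] = state
    return states


names = ["Hello", "World", "Foo", "Bar"]
definition = wrap_with_error_handler(names[0], chain(names))
print(json.dumps(definition, indent=2))
```

One thing to keep in mind with this pattern: a Parallel state's output is an array with one element per branch, so any state that runs after ErrorHandler sees the branch result wrapped in a single-element list.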

Recursive workflows

So AWS says “don’t do this”. The graph you’re “supposed” to create as far as they are concerned is a directed acyclic graph. Well that’s fine except when you want to actually do something repeatedly until you’ve completed your task. A good example of this would be backfilling some data from date A to date B.

Recursive date based AWS step function workflow

Handily, we’ve lifted the error handling example from earlier in this article and built on top of it, so any failures will manifest their errors for us 😉. We’re also going to make use of a newer feature of AWS Step Functions (released August 2020), the TimestampLessThanEqualsPath comparator, which lets us compare two different variables in our input. Here we use a startDate and an endDate to bound the range we want our recursive workflow to operate over.

{
  "Comment": "Recursive date-based workflow",
  "StartAt": "ErrorHandler",
  "States": {
    "ErrorHandler": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "Is Date <= X",
          "States": {
            "Is Date <= X": {
              "Type": "Choice",
              "Choices": [
                {
                  "Variable": "$.startDate",
                  "TimestampLessThanEqualsPath": "$.endDate",
                  "Next": "Run the job workflow"
                }
              ],
              "Default": "Backfill complete"
            },
            "Backfill complete": {
              "Type": "Pass",
              "Result": "Backfill complete",
              "End": true
            },
            "Run the job workflow": {
              "Type": "Task",
              "Resource": "arn:aws:states:::states:startExecution.sync",
              "Parameters": {
                "StateMachineArn": "<STATE_MACHINE_ARN>",
                "Input": {
                  "date.$": "$.startDate",
                  "AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id"
                }
              },
              "ResultPath": null,
              "Next": "Add 1 day"
            },
            "Add 1 day": {
              "Type": "Task",
              "Resource": "<LAMBDA_FUNCTION_ARN>",
              "Parameters": {
                "date.$": "$.startDate"
              },
              "ResultPath": "$.startDate",
              "Next": "Is Date <= X"
            }
          }
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "Send Failure Message"
        }
      ],
      "Next": "Job Succeeded"
    },
    "Job Succeeded": {
      "Type": "Succeed"
    },
    "Send Failure Message": {
      "Type": "Pass",
      "Next": "Fail Workflow"
    },
    "Fail Workflow": {
      "Type": "Fail"
    }
  }
}

Just fire up this workflow with {"startDate": "2020-09-01T00:00:00Z", "endDate": "2020-09-09T00:00:00Z"} as its input and you’ll get 9 iterations of the job workflow, one for each day from 1 September to 9 September inclusive.
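To sanity-check the iteration count before kicking off a real backfill, you can dry-run the loop locally. This is only a sketch in Python, assuming second-precision UTC timestamps in the same format as the input above. `dry_run` is a hypothetical helper that mimics the Choice state and the "Add 1 day" step. It doesn't call AWS at all.

```python
from datetime import datetime, timedelta

# Assumed timestamp format, matching the example input.
FMT = "%Y-%m-%dT%H:%M:%SZ"


def dry_run(start_date, end_date):
    """Return the dates the loop would hand to the job workflow, in order."""
    current = datetime.strptime(start_date, FMT)
    end = datetime.strptime(end_date, FMT)
    dates = []
    # Mirrors the Choice state: keep going while startDate <= endDate.
    while current <= end:
        dates.append(current.strftime(FMT))  # "Run the job workflow"
        current += timedelta(days=1)         # "Add 1 day"
    return dates


runs = dry_run("2020-09-01T00:00:00Z", "2020-09-09T00:00:00Z")
print(len(runs))  # 9
print(runs[0], runs[-1])
```

Because the comparison is "less than or equal", the end date itself is included, which is why a 1 September to 9 September range gives nine runs rather than eight.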

This workflow does rely on a Lambda being available to add 1 day to your date input and overwrite $.startDate, but I’ll leave that up to you to implement. I’ve got to leave you something fun to do 😜
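If you want a starting point for that Lambda, here's one possible sketch in Python. It assumes the state passes {"date": "<timestamp>"} as in the definition above, that timestamps are second-precision UTC strings, and that the Task's Resource is the function ARN itself, so the return value lands directly at $.startDate. If you call the function through the lambda:invoke service integration instead, the result comes back wrapped under Payload and the ResultPath handling needs adjusting.

```python
from datetime import datetime, timedelta

# Assumed timestamp format, matching the example input to the workflow.
FMT = "%Y-%m-%dT%H:%M:%SZ"


def lambda_handler(event, context):
    """Return the day after event["date"] as a plain timestamp string."""
    current = datetime.strptime(event["date"], FMT)
    # Returning a bare string lets ResultPath "$.startDate" overwrite the old value.
    return (current + timedelta(days=1)).strftime(FMT)
```

For example, calling `lambda_handler({"date": "2020-09-01T00:00:00Z"}, None)` returns `"2020-09-02T00:00:00Z"`, which the Choice state then compares against $.endDate on the next pass.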

Conclusion

It’s possible to have decent error handling that doesn’t require a lot of effort to implement and catches ALL errors within your workflows. You should now also be able to implement recursive workflows for time-based operations using one of the new Choice operators.

Thanks for reading.
