AWS Fault Injection Simulator (FIS) lets you create and run experiment templates that restart EC2 instances, inject packet loss, or fail over database instances and clusters. Because it is hard to predict how your AWS resources will respond to a hardware interruption, or even a full outage of an Availability Zone, integrating automated resiliency testing into your environment can prove invaluable. Let’s have a look at how we can use AWS FIS to automate a monthly RDS instance failover and prove that our Multi-AZ RDS instance is as fault-tolerant as we expect it to be.
Before you start following this AWS FIS experiment setup, make sure you have a Multi-AZ RDS instance to target, and a CloudWatch Log Group to use.
We will start by creating a new Experiment Template in the AWS FIS Console. Name the experiment as you like and choose the action aws:rds:reboot-db-instances. The Target can remain at its default, since we only need a single target for this test. Don’t forget to check the Force Failover option!
For the Target definition, select the aws:rds:db Resource Type, then pick the database Resource ID from the dropdown list (if you have several RDS instances running in your account, the list shows all of them and lets you select any or all).
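If you would rather keep the template in code than click through the console, the same settings can be sketched with Boto3. This is a minimal sketch, not the only way to do it: the target name failover-db, the action name reboot-with-failover, the description and the three ARNs are placeholders of our own choosing, and the client is passed in (for example boto3.client('fis')). Note that the API spells the action ID aws:rds:reboot-db-instances and expects forceFailover as a string.

```python
def build_failover_template(db_instance_arn, role_arn, log_group_arn):
    """Build the request for the FIS create_experiment_template call."""
    return {
        "description": "Monthly Multi-AZ RDS failover test",  # placeholder text
        "roleArn": role_arn,
        # No CloudWatch alarm guards this experiment; "none" is the explicit opt-out.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "failover-db": {  # hypothetical target name
                "resourceType": "aws:rds:db",
                "resourceArns": [db_instance_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "reboot-with-failover": {  # hypothetical action name
                "actionId": "aws:rds:reboot-db-instances",
                # Equivalent of ticking "Force Failover" in the console.
                "parameters": {"forceFailover": "true"},
                "targets": {"DBInstances": "failover-db"},
            }
        },
        "logConfiguration": {
            "logSchemaVersion": 2,
            "cloudWatchLogsConfiguration": {"logGroupArn": log_group_arn},
        },
    }


def create_failover_template(fis_client, db_instance_arn, role_arn, log_group_arn):
    """Create the template and return its ID (the value the Lambda needs later)."""
    request = build_failover_template(db_instance_arn, role_arn, log_group_arn)
    response = fis_client.create_experiment_template(**request)
    return response["experimentTemplate"]["id"]
```

Keeping the request in a plain function makes it easy to review or diff before anything is sent to AWS.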
The policies created for the default FIS role are shown below: the trust policy that lets FIS assume the role, followed by the RDS permissions policy and the CloudWatch Logs permissions policy. Always make sure to scope these to the RDS Instance or Cluster you are working with. If you let the service create the Role for you, you can do this afterwards by opening the Role in IAM and changing the attached policy so that its Resource points to your specific database.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "fis.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowFISExperimentRoleRDSReboot",
            "Effect": "Allow",
            "Action": ["rds:RebootDBInstance"],
            "Resource": "arn:aws:rds:eu-central-1::db:*"
        },
        {
            "Sid": "AllowFISExperimentRoleRDSFailOver",
            "Effect": "Allow",
            "Action": ["rds:FailoverDBCluster"],
            "Resource": "arn:aws:rds:eu-central-1::cluster:*"
        }
    ]
}

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogDelivery"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:PutResourcePolicy",
                "logs:DescribeResourcePolicies",
                "logs:DescribeLogGroups"
            ],
            "Resource": "arn:aws:logs:eu-central-1::log-group:aws-fis-rds-failovers"
        }
    ]
}
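To make the scoping advice concrete, here is a small sketch that generates the RDS permissions policy for a single instance instead of the db:* wildcard. The region, account ID and instance identifier in the example call are made-up placeholders; substitute your own. Only the reboot statement is included, on the assumption that you are testing a Multi-AZ instance rather than a cluster.

```python
import json


def scoped_rds_policy(region, account_id, db_instance_id):
    """Return the FIS RDS policy restricted to one DB instance."""
    db_arn = f"arn:aws:rds:{region}:{account_id}:db:{db_instance_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowFISExperimentRoleRDSReboot",
                "Effect": "Allow",
                "Action": ["rds:RebootDBInstance"],
                # One specific instance instead of the db:* wildcard.
                "Resource": db_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = scoped_rds_policy("eu-central-1", "123456789012", "my-multi-az-db")
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the policy editor in IAM in place of the wildcard statement.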
We can now run our Experiment. As soon as it has finished, a message will pop up saying that a ‘Terminal’ state has been reached. This can mean either a failed or a successfully completed run, so make sure to check the CloudWatch Log Group.
In CloudWatch, you will see that every step has been logged. The important part here is the Action State conclusion. If you’ve followed things correctly up until now, your RDS instance has successfully failed over to the standby instance in the second AZ.
"action_state":
{
    "status": "completed",
    "reason": "Action was completed."
}
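Checking the log by eye works for a one-off run, but the same check can be scripted. The sketch below reads the messages from the log group and looks for action_state entries like the one above. It makes two assumptions: each log message is a JSON document, and since we do not rely on the exact nesting of the FIS log schema, the code simply searches the whole record for an action_state key. The logs client is passed in (for example boto3.client('logs')), and the set of failure statuses is our own choice.

```python
import json

FAILURE_STATUSES = {"failed", "stopped", "cancelled"}  # assumed set of bad outcomes


def fetch_messages(logs_client, log_group_name):
    """Yield every log message in the group, following pagination tokens."""
    kwargs = {"logGroupName": log_group_name}
    while True:
        page = logs_client.filter_log_events(**kwargs)
        for event in page.get("events", []):
            yield event["message"]
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token


def find_action_states(node):
    """Yield each 'action_state' dict found anywhere in a parsed record."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "action_state" and isinstance(value, dict):
                yield value
            else:
                yield from find_action_states(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_action_states(item)


def failover_succeeded(messages):
    """True if some action completed and none reported a failure status."""
    statuses = []
    for message in messages:
        try:
            record = json.loads(message)
        except ValueError:
            continue  # skip anything that is not JSON
        statuses.extend(s.get("status") for s in find_action_states(record))
    return "completed" in statuses and not FAILURE_STATUSES.intersection(statuses)
```

With real clients this would be called as failover_succeeded(fetch_messages(boto3.client('logs'), 'aws-fis-rds-failovers')). In practice you would also want to restrict the query to the time window of the run you are checking.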
Awesome! AWS FIS definitely provides a simple method to test failovers. This, however, was just a manual run, and what we really want is to prove that our database fails over without issue on a regular, repeatable basis. So let’s see how to automate running our FIS template. There are several ways to go about this, such as starting the AWS FIS experiment from a CodePipeline build, but to keep this deployment simple we will use a Lambda function to run the experiment.
Make sure to grant the Role used by the Lambda function the fis:StartExperiment IAM permission. Look up your FIS Experiment template ID in the console and create a Lambda to run your FIS template automatically. Using Boto3 (see Boto3 documentation), we can write a simple script to start our experiment (don’t blindly use the example below; make sure you add error handling at a minimum).
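import boto3

# Created once, outside the handler, so warm invocations reuse the client.
client = boto3.client('fis')

def lambda_handler(event, context):
    # Lambda always passes event and context, even if we do not use them.
    response = client.start_experiment(
        experimentTemplateId='EXT6aFeLV1Y2hudDz',
    )
    # Return only the ID; the full response contains datetime objects.
    return response['experiment']['id']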
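As for the error handling mentioned above, one possible shape is sketched here. The function name, the exception class and the choice of which statuses count as a bad start are our own assumptions, not something FIS prescribes. The client is passed in so the logic can be exercised without AWS; inside a Lambda you would call it with boto3.client('fis') and your template ID.

```python
import logging

logger = logging.getLogger(__name__)


class ExperimentStartError(RuntimeError):
    """Raised when the experiment is already in a bad state right after starting."""


def start_experiment_checked(fis_client, template_id):
    """Start the experiment, log what happened and return its ID and status."""
    try:
        response = fis_client.start_experiment(experimentTemplateId=template_id)
    except Exception:
        # Log with traceback, then re-raise so the Lambda invocation is marked failed.
        logger.exception("Could not start FIS experiment from template %s", template_id)
        raise

    experiment = response["experiment"]
    status = experiment["state"]["status"]
    if status in ("failed", "stopped"):
        reason = experiment["state"].get("reason", "no reason given")
        raise ExperimentStartError(f"Experiment {experiment['id']} is {status}: {reason}")

    logger.info("Started FIS experiment %s (status: %s)", experiment["id"], status)
    return {"experimentId": experiment["id"], "status": status}
```

Re-raising rather than swallowing the error is deliberate: a failed invocation shows up in the Lambda error metrics, which is where you would hang an alarm.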
Lastly, create an EventBridge Rule of the type Schedule to run your Lambda on a scheduled basis, and you’re done! You now have a scheduled, automated RDS instance failover that tests your cluster’s resiliency and your application’s failover handling.
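The schedule itself can also be created with Boto3 instead of the console. The following is a sketch under a few assumptions: the rule name and target ID are invented, the cron expression (03:00 UTC on the first day of every month) is just one way to say "monthly", and both clients are passed in (boto3.client('events') and boto3.client('lambda')).

```python
MONTHLY_AT_3AM_UTC = "cron(0 3 1 * ? *)"  # minute hour day-of-month month day-of-week year


def schedule_monthly_failover(events_client, lambda_client, function_arn,
                              rule_name="monthly-rds-failover"):
    """Create a monthly EventBridge rule that invokes the given Lambda function."""
    rule_arn = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=MONTHLY_AT_3AM_UTC,
        State="ENABLED",
    )["RuleArn"]

    # Allow EventBridge to invoke the function. Running this twice with the same
    # StatementId raises a conflict error, so treat this as a one-time setup step.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "fis-failover-lambda", "Arn": function_arn}],
    )
    return rule_arn
```

Pick a time outside business hours for the first few runs, until you trust how your application handles the failover.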