While setting up a customer project, I encountered an AWS corner case involving an IAM role (`example-sqs-sender-role`) and the queue policy of an SQS queue (`example-queue`). The policy allowed the role to send messages to the queue by naming the role explicitly as the principal. The role was deployed separately from the SQS queue and its policy.
You might wonder why the resources were deployed separately. This was due to the use of layers in Terraform, which will be described in an upcoming blog post and is out of scope here.
After deleting and recreating the role, I observed some unexpected behavior: users assuming the new role could no longer send messages to the queue. In this post, I share the lessons learned from this corner case.
Details
The SQS queue policy explicitly named the role as the principal that is allowed to send messages to the queue:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowSender",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789123:role/example-sqs-sender-role"
},
"Action": "sqs:SendMessage",
"Resource": "arn:aws:sqs:eu-central-1:123456789123:example-queue"
}
]
}
When the role was deleted and later recreated, the queue policy was neither redeployed nor modified. However, users assuming the new role were no longer able to send messages. Upon investigation, I discovered that AWS had automatically replaced the role ARN in the policy with an identifier prefixed by AROA, effectively invalidating the permission:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowSender",
"Effect": "Allow",
"Principal": {
"AWS": "AROATCKAOIHV3OY4IAMTV"
},
"Action": "sqs:SendMessage",
"Resource": "arn:aws:sqs:eu-central-1:123456789123:example-queue"
}
]
}
Initially, I thought this might be an issue with how I referenced the role in my infrastructure as code. But after testing it by manually creating the role and queue, then deleting the role, I observed the same behavior.
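A quick way to spot this situation in an exported policy is to look for principals that are bare unique IDs instead of ARNs. The following is a minimal Python sketch, not an official AWS tool: the function name `find_orphaned_principals` is hypothetical, and it assumes that only the `AROA` (role) and `AIDA` (user) prefixes are of interest.

```python
import re

# Assumption: only role (AROA) and user (AIDA) unique IDs are checked here.
UNIQUE_ID_PATTERN = re.compile(r"^(AROA|AIDA)[A-Z0-9]+$")


def find_orphaned_principals(policy: dict) -> list[tuple[str, str]]:
    """Return (Sid, principal) pairs whose AWS principal is a bare unique ID."""
    orphaned = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be an object
        statements = [statements]
    for statement in statements:
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):  # e.g. "Principal": "*"
            continue
        aws_principals = principal.get("AWS", [])
        if isinstance(aws_principals, str):  # one principal given as a string
            aws_principals = [aws_principals]
        for value in aws_principals:
            if UNIQUE_ID_PATTERN.match(value):
                orphaned.append((statement.get("Sid", ""), value))
    return orphaned


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSender",
            "Effect": "Allow",
            "Principal": {"AWS": "AROATCKAOIHV3OY4IAMTV"},
            "Action": "sqs:SendMessage",
            "Resource": "arn:aws:sqs:eu-central-1:123456789123:example-queue",
        }
    ],
}

print(find_orphaned_principals(policy))
# prints [('AllowSender', 'AROATCKAOIHV3OY4IAMTV')]
```

A policy that still references the role by its ARN yields an empty list, so a check like this could run against exported policies after every role redeployment.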
Why does this happen?
Every AWS principal has an Amazon Resource Name (ARN). In addition to the ARN, some principals, such as users, roles, and user groups, also have an “internal” unique identifier (1). For roles, this identifier is a string that begins with AROA.
To get the unique ‘RoleId’ for a deployed role, we can run the following CLI command:
$ aws iam get-role --role-name example-sqs-sender-role --output json | jq .Role.RoleId
"AROATCKAOIHV7CDAXPK5U"
Now, when the role is removed and recreated with the same name and properties, the `RoleId` changes:
$ aws iam get-role --role-name example-sqs-sender-role --output json | jq .Role.RoleId
"AROATCKAOIHV3OY4IAMTV"
Every time a role is recreated, it receives a new, different AROA-prefixed `RoleId`.
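In the SQS case, the value of the principal element in the policy changed to the `RoleId` of the deleted role. At first sight, it was not apparent why this happened.
It turned out to be a well-documented security feature (see appendix). As the IAM documentation describes it, AWS maps a role ARN in a principal element to the role's unique ID when the policy is saved and only translates it back to the ARN for display. Once the role is deleted, the ID can no longer be resolved and shows up as is. This prevents privilege escalation: a newly created role with the same name cannot automatically inherit permissions granted to the old role.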
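To make the mechanism more concrete, here is a toy model in Python. It is not how AWS implements IAM. `ToyIam`, `create_role`, `delete_role`, and `is_allowed` are hypothetical names, and the generated IDs are made up. The only point it illustrates is that a permission pinned to a unique ID does not carry over to a new role with the same name.

```python
import itertools


class ToyIam:
    """Toy model: roles are looked up by name but identified by a unique ID."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)
        self.roles: dict[str, str] = {}  # role name -> unique RoleId

    def create_role(self, name: str) -> str:
        # Every creation yields a fresh ID, even if the name was used before.
        role_id = f"AROAEXAMPLE{next(self._counter):010d}"
        self.roles[name] = role_id
        return role_id

    def delete_role(self, name: str) -> None:
        del self.roles[name]


def is_allowed(iam: ToyIam, role_name: str, allowed_role_id: str) -> bool:
    """The policy stores the ID it was granted to, so compare IDs, not names."""
    return iam.roles.get(role_name) == allowed_role_id


iam = ToyIam()
granted_id = iam.create_role("example-sqs-sender-role")  # the policy pins this ID
print(is_allowed(iam, "example-sqs-sender-role", granted_id))  # prints True

iam.delete_role("example-sqs-sender-role")
iam.create_role("example-sqs-sender-role")  # same name, new ID
print(is_allowed(iam, "example-sqs-sender-role", granted_id))  # prints False
```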
How to mitigate the issue
If, for whatever reason, resources like our role and queue are deployed separately, is there a way to circumvent this behavior?
Instead of setting the role as the principal, you could allow any principal and restrict access with a condition that checks `aws:PrincipalArn` (2) against the ARN of the role:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowSender",
"Effect": "Allow",
"Principal": {
"AWS": "*"
},
"Action": "sqs:SendMessage",
"Resource": "arn:aws:sqs:eu-central-1:123456789123:example-queue",
"Condition": {
"StringLike": {
"aws:PrincipalArn": "arn:aws:iam::123456789123:role/example-sqs-sender-role"
}
}
}
]
}
Now, when the role is deleted, the condition remains unchanged, and the recreated role can still send messages to the queue after a redeploy. AWS doesn’t rewrite the content of the Condition element the way it does for the Principal element. Of course, this reopens the privilege escalation risk described earlier: any role that is later created with the same name and path automatically gets the permission. Use this approach with caution.
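If you generate such policies from code, a small helper can keep the wildcard principal and the condition together and refuse role ARNs that contain wildcards, since combined with `"AWS": "*"` those would widen access considerably. This is a minimal sketch under my own assumptions: `build_queue_policy` is a hypothetical helper, and the ARN pattern only covers the standard `aws` partition.

```python
import json
import re

# Assumption: standard "aws" partition, 12-digit account ID, no wildcards.
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/[A-Za-z0-9+=,.@_/-]+$")


def build_queue_policy(role_arn: str, queue_arn: str) -> dict:
    """Build a queue policy that matches the sender via aws:PrincipalArn."""
    if not ROLE_ARN_PATTERN.match(role_arn):
        # Reject wildcards and malformed ARNs: the principal is "*",
        # so the condition is the only thing restricting access.
        raise ValueError(f"not a concrete role ARN: {role_arn!r}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowSender",
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {"StringLike": {"aws:PrincipalArn": role_arn}},
            }
        ],
    }


policy = build_queue_policy(
    "arn:aws:iam::123456789123:role/example-sqs-sender-role",
    "arn:aws:sqs:eu-central-1:123456789123:example-queue",
)
print(json.dumps(policy, indent=2))
```

The printed JSON has the same shape as the policy above and can be used as the `Policy` attribute of the queue.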
Lessons Learned
- AWS tracks the role behind an ARN in a policy's principal field by its unique `RoleId`, not by its name. When the role is deleted, the policy shows the `RoleId` instead of the ARN, and a recreated role with the same name no longer matches.
- Directly referencing an ARN in policies for independently deployed resources can cause issues upon redeployment.
- To avoid these issues, either:
  - Deploy the role and queue together, ensuring the ARN and reference remain consistent.
  - Use a condition with `aws:PrincipalArn` instead of specifying the ARN directly.