The chicken-egg of cloud bootstrap, and how to crack it

A single-account, admin-key starting point — and the local-backend trick that lets one Terraform module manage its own state from zero.

The first time you wire GitHub Actions into a fresh AWS account, you hit a small riddle. You want Terraform to manage everything, including the S3 bucket that holds Terraform state. But that bucket has to exist before Terraform can read its own state from it. Most write-ups dodge the problem by hand-creating the bucket in the console and waving at it later. There is a cleaner answer that fits in one Terraform module, and it works just as well on a single-account project as it does on a fleet.

The single-account case is its own problem

Most "AWS plus GitHub OIDC" guides land you in Organizations, Control Tower, IAM Identity Center, and a four-account hub-and-spoke before you have shipped a single workflow. That audience is real. It is not everyone. If you are running one AWS account for one project, with an admin IAM user you created from the root sign-up flow, the standard enterprise scaffolding is overkill — and importing it makes the project look more complex than it is.

Overkill is not the same as licence to skip the discipline. The shape of the bootstrap problem is the same on a small account as on a large one: three artifacts must exist before the first CI run, and ideally all three live in version control. What changes for the small case is the size of the answer, not its rigour.

The chicken-egg, stated plainly

The three artifacts are an S3 bucket for Terraform state, a DynamoDB table for state locking, and an OIDC trust plus IAM role that GitHub Actions can assume. The catch is that the bucket holds Terraform's own state, which means Terraform needs the bucket to exist before it can manage anything — including the bucket.

The three answers I see most often are: (1) click the bucket together in the console and forget it ever happened, (2) write a shell script that creates the bucket with the AWS CLI and tells Terraform to leave it alone, or (3) split the account in two and bootstrap the second from the first. The first answer is undocumented infrastructure. The second is fine until somebody needs to recreate the account from scratch and the shell script has rotted. The third is back to Organizations.

The answer I keep landing on is to write a single bootstrap/ module whose main.tf declares an empty backend "s3" {} block but is initialized with terraform init -backend=false the first time. The first apply runs against a local terraform.tfstate on disk, creates the S3 bucket and the lock table, and then the same module migrates its own state into the bucket it just made.
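As a rough sketch of the placeholder described above, the top of bootstrap/main.tf might look like the following. The version constraints and the region variable are assumptions for illustration, not taken from the original module; the only part the text confirms is the empty backend "s3" {} block.

```hcl
# bootstrap/main.tf (sketch; version pins and var.aws_region are assumptions)
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  # Partial backend configuration: bucket, key, region and lock table are
  # deliberately absent here and are supplied later through -backend-config.
  backend "s3" {}
}

variable "aws_region" {
  type        = string
  description = "Region for the state bucket and lock table (hypothetical input)"
}

provider "aws" {
  region = var.aws_region
}
```

Leaving the block empty keeps account-specific values out of version control; everything environment-specific arrives from the per-environment backend file at init time.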

What the module contains

The module is deliberately small. The state backend is an S3 bucket with versioning on, AES-256 server-side encryption, public access fully blocked, and prevent_destroy set on the resource so a careless terraform destroy cannot wipe history. The lock table is a single DynamoDB table with one LockID partition key and pay-per-request billing, because there is no read or write volume to optimise for.
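A minimal sketch of those state-backend resources, assuming AWS provider 4.x or later (where versioning, encryption, and public-access settings are separate resources). The resource addresses aws_s3_bucket.terraform_state and aws_dynamodb_table.terraform_locks and the two output names match the ones the Makefile refers to; the two input variables are hypothetical.

```hcl
variable "state_bucket_name" {
  type = string # hypothetical input
}

variable "lock_table_name" {
  type = string # hypothetical input
}

resource "aws_s3_bucket" "terraform_state" {
  bucket = var.state_bucket_name

  lifecycle {
    prevent_destroy = true # terraform destroy errors out instead of deleting
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = var.lock_table_name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the attribute name the S3 backend expects

  attribute {
    name = "LockID"
    type = "S"
  }
}

output "state_bucket_name" {
  value = aws_s3_bucket.terraform_state.bucket
}

output "dynamodb_table_name" {
  value = aws_dynamodb_table.terraform_locks.name
}
```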

The OIDC side is the GitHub identity provider (token.actions.githubusercontent.com with sts.amazonaws.com as the audience), one IAM role with an AssumeRoleWithWebIdentity trust policy, and one IAM policy attached to that role. The policy is broad — it grants near-full access to the services this project actually touches: ECS, ECR, IAM (with a permissions boundary), VPC reads and security groups, ELB, CloudWatch, Secrets Manager, scoped S3, KMS, ElastiCache, SSM, RDS, Route 53, and the state-backend resources themselves.
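To make the shape concrete, here is a hedged sketch of the OIDC side. Only the address aws_iam_openid_connect_provider.github comes from the Makefile; the other resource labels, the policy name, the two ARN variables, and the heavily abbreviated action list are illustrative assumptions, not the module's real policy.

```hcl
variable "state_bucket_arn" {
  type = string # hypothetical input
}

variable "lock_table_arn" {
  type = string # hypothetical input
}

resource "aws_iam_openid_connect_provider" "github" {
  url            = "https://token.actions.githubusercontent.com"
  client_id_list = ["sts.amazonaws.com"]
  # Assumption: a recent AWS provider release. Older releases also require
  # thumbprint_list on this resource.
}

resource "aws_iam_role" "github_actions" {
  name = "github-actions"

  # Trust document: who may assume the role, and under which token claims.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Condition = {
        StringLike = {
          "token.actions.githubusercontent.com:sub" = [
            "repo:<your-org>/<infra-repo>:*",
            "repo:<your-org>/<app-repo>:*",
          ]
        }
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

resource "aws_iam_policy" "github_actions" {
  name = "github-actions-deploy" # hypothetical name

  # Abbreviated on purpose: the real policy covers many more services.
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "ProjectServices"
        Effect   = "Allow"
        Action   = ["ecs:*", "ecr:*", "elasticloadbalancing:*", "logs:*"]
        Resource = "*"
      },
      {
        Sid    = "StateBackend"
        Effect = "Allow"
        Action = [
          "s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject",
          "dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem",
        ]
        Resource = [
          var.state_bucket_arn,
          "${var.state_bucket_arn}/*",
          var.lock_table_arn,
        ]
      },
    ]
  })
}

resource "aws_iam_role_policy_attachment" "github_actions" {
  role       = aws_iam_role.github_actions.name
  policy_arn = aws_iam_policy.github_actions.arn
}
```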

I am not going to pretend that policy is tight. It is broad on purpose because the account contains exactly one project. In a multi-tenant account it would be wrong; in a single-account project it is honest, easier to reason about than a maze of narrowly-scoped statements, and not a security tradeoff anyone is actually making — the IAM user that ran the bootstrap already had administrator access. The role just inherits a slightly smaller version of the same trust.

[Figure: 01 / State backend: S3 terraform-state bucket (versioned, encrypted, prevent_destroy) and DDB terraform-locks table (LockID partition key, pay-per-request). 02 / CI trust: the github-actions role (trust: sub StringLike repo:org/repo:*) trusts the OIDC provider token.actions.githubusercontent.com (account-level, imported if it already exists) and attaches a policy (broad, single-tenant on purpose: ECS, ECR, IAM, VPC, S3, DDB, KMS) that allows read/write on the state backend. 03 / Account-wide: service-linked roles for ECS, ELB, RDS, ElastiCache, imported if pre-existing.]
What the bootstrap module creates. The trust chain on the right gates role assumption; the policy reaches back into the state backend on the left.

The trust policy is where security actually lives

The interesting line is not in the policy attached to the role. It is in the role's trust document, where two conditions narrow who can assume it:

"Condition": {
  "StringLike": {
    "token.actions.githubusercontent.com:sub": [
      "repo:<your-org>/<infra-repo>:*",
      "repo:<your-org>/<app-repo>:*"
    ]
  },
  "StringEquals": {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
  }
}

The aud check is the cheap one: the tokens GitHub issues for AWS role assumption normally carry sts.amazonaws.com as the audience (it is the default for the official aws-actions/configure-aws-credentials action), and verifying it costs nothing. The sub check is the real boundary. `repo:<org>/<repo>:*` means any workflow in that repository, on any branch, tag, or pull request, in any environment, can assume this role. For a single-account project that ships from main after review, this is the loosest scope I would accept, and it is also where I would push back on "tighter is better".

You can scope further — `repo:<org>/<repo>:ref:refs/heads/main` ties role assumption to one branch, and `repo:<org>/<repo>:environment:prod` ties it to a GitHub Environment with approvals. Both are upgrades worth making, but only after the workflow is stable enough that branch and environment names are not still in flux. Tightening the trust policy on every workflow rename is a chore I have seen teams skip after the third time, which is worse than starting wider and tightening once on a quiet Friday.
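For when that quiet Friday arrives, a tightened trust document could be sketched as below. This is an illustration rather than the module's code: the data source label and the provider-ARN variable are hypothetical, and the `<org>/<repo>` placeholders are left unfilled on purpose.

```hcl
variable "github_oidc_provider_arn" {
  type = string # hypothetical input: ARN of the account's GitHub OIDC provider
}

data "aws_iam_policy_document" "github_trust_tight" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.github_oidc_provider_arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # StringEquals rather than StringLike: no wildcard is left in the subject.
    # Either listed value is accepted: a run on main, or a job bound to the
    # prod GitHub Environment.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values = [
        "repo:<org>/<repo>:ref:refs/heads/main",
        "repo:<org>/<repo>:environment:prod",
      ]
    }
  }
}
```

The role would then take `assume_role_policy = data.aws_iam_policy_document.github_trust_tight.json`. Note that a job which declares a GitHub Environment presents the environment form of the subject instead of the branch form, which is why both values are listed.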

The migration trick

The piece that makes the whole module self-bootstrapping is two lines of terraform init plus a sed on the backend block. On a clean clone, the first run is terraform init -backend=false. That tells Terraform to ignore the declared backend "s3" {} block, use a local state file, and proceed. The first terraform apply then creates the bucket and the lock table. After that, the module points the backend at the new bucket by writing a per-environment envs/<ENV>/bootstrap-backend.hcl (gitignored, so it can't accidentally point at another account's bucket) and runs terraform init -backend-config=envs/<ENV>/bootstrap-backend.hcl -migrate-state -force-copy. Terraform reads the local state, copies it into the bucket it just created, stops using the on-disk file, and from then on the module manages itself from S3 with DynamoDB locking, the same as everything else.
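For orientation, the generated per-environment backend file is just a handful of key/value pairs. The keys below are the ones the Makefile writes; every value is a made-up placeholder, not a name from the real project.

```hcl
# envs/<ENV>/bootstrap-backend.hcl (generated, gitignored; values are placeholders)
bucket         = "example-project-terraform-state"
key            = "example-project/dev/bootstrap.tfstate"
region         = "eu-west-1"
dynamodb_table = "example-project-terraform-locks"
encrypt        = true
```

Passed via -backend-config, these settings fill in the empty backend "s3" {} block at init time, so nothing account-specific has to live in main.tf.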

The reason this matters: the entire bootstrap is reproducible. A new engineer can clone the repo, set two environment variables, run one make target, and end up with a fully-bootstrapped account with no human steps in between. Hand-created buckets do not have this property.

Here is what the load-bearing pieces look like in the Makefile that drives them — the init switch, the import guard, and the migration flip:

# bootstrap/Makefile (excerpt — the chicken-egg-cracking targets)

# main.tf declares `backend "s3" {}` from the start.
# We init with -backend=false until the per-env backend HCL has been written.
init:
	@if [ -f envs/$(ENV)/bootstrap-backend.hcl ]; then \
	  terraform init -backend-config=envs/$(ENV)/bootstrap-backend.hcl -reconfigure; \
	else \
	  terraform init -backend=false -reconfigure; \
	fi

# Step 1 — targeted apply, only the resources that BECOME the backend
apply-backend: init
	terraform apply $(TF_VARS) \
	  -target=aws_s3_bucket.terraform_state \
	  -target=aws_dynamodb_table.terraform_locks \
	  -auto-approve

# Step 2 — import any pre-existing OIDC provider so apply doesn't collide
import-oidc: init
	@ARN="arn:aws:iam::$$(aws sts get-caller-identity --query Account --output text):oidc-provider/token.actions.githubusercontent.com"; \
	if aws iam get-open-id-connect-provider --open-id-connect-provider-arn "$$ARN" >/dev/null 2>&1 \
	   && ! terraform state show aws_iam_openid_connect_provider.github >/dev/null 2>&1; then \
	  terraform import $(TF_VARS) aws_iam_openid_connect_provider.github "$$ARN"; \
	fi

apply-oidc: init import-oidc
	terraform apply $(TF_VARS) $(OIDC_TARGETS) -auto-approve

# Step 5 — write the per-env backend HCL, then move local state into S3
migrate-bootstrap-state:
	@mkdir -p envs/$(ENV)
	@{ echo "bucket         = \"$$(terraform output -raw state_bucket_name)\""; \
	   echo "key            = \"$(PROJECT)/$(ENV)/bootstrap.tfstate\""; \
	   echo "region         = \"$(AWS_REGION)\""; \
	   echo "dynamodb_table = \"$$(terraform output -raw dynamodb_table_name)\""; \
	   echo "encrypt        = true"; \
	 } > envs/$(ENV)/bootstrap-backend.hcl
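# Safety net: main.tf already declares `backend "s3" {}`, so this sed is a
# no-op unless a `backend "local" {}` placeholder was left in place.
	@sed -i.bak 's/backend "local" {}/backend "s3" {}/' main.tf && rm -f main.tf.bak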
	terraform init -backend-config=envs/$(ENV)/bootstrap-backend.hcl \
	  -migrate-state -force-copy

Nothing in there is clever — that is the point. It is forty lines of shell glue around three Terraform commands, and it gives a fresh AWS account the same one-shot story as a Control-Tower spoke.

[Figure: the bootstrap sequence, by Makefile target and effect.
01 / apply-backend: targeted apply of S3 + DDB; creates the state backend; terraform.tfstate still on disk
02 / apply-oidc: import-then-apply; OIDC + role + policy; imports pre-existing provider and SLRs
03 / generate-backend: reads bootstrap outputs; writes terraform/backend.hcl (bucket, key, region, lock table, encrypt)
04 / init-main: terraform init in ../terraform; main module wired to S3 backend; first plan uses remote state
05 / migrate-bootstrap-state: sed flip + init -migrate-state; bootstrap state moves into S3; self-managing from here on
06 / From CI: GitHub Actions assumes the role via OIDC and runs terraform against the same backend]
Five make targets, one apply per step, one migration. After step five the bootstrap module reads its own state from S3; step six is what every CI run looks like forever after.

Failure modes worth designing for

Two things bite the first time you run the module against an account that is not perfectly clean.

The OIDC provider for token.actions.githubusercontent.com is an account-level resource. If anything else in the account has already created it (another stack, an AWS solution, a colleague's experiment), creating a second aws_iam_openid_connect_provider for the same URL will fail with an EntityAlreadyExists error. The fix is to check via the AWS CLI before applying, and terraform import the existing ARN into state if it is found. The Makefile in the module encodes this check so it runs automatically.
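If you already know the provider exists in the account, Terraform 1.5 and later also offer a declarative alternative to the CLI import: an import block. This is a sketch of that different technique, not what the Makefile does, and the account ID is a placeholder.

```hcl
# Declarative import (Terraform >= 1.5). Include this block only when the
# provider is known to exist: planning an import of a resource that does not
# exist fails, which is why the Makefile checks with the AWS CLI first.
import {
  to = aws_iam_openid_connect_provider.github
  id = "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com" # placeholder account ID
}
```

The trade-off is that the block is unconditional, so it suits a repository tied to one known account better than a bootstrap that has to work against both clean and used accounts.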

The same pattern repeats for AWS service-linked roles: the AWSServiceRoleForECS, ...ForElasticLoadBalancing, ...ForRDS, and ...ForElastiCache roles. They are account-wide, often pre-created by the console the first time you open one of those services, and a Terraform aws_iam_service_linked_role resource will refuse to create one that already exists. Same import-before-apply trick, same Makefile target, four resources.
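A compact way to declare the service-linked roles for the four services this project uses (ECS, ELB, RDS, ElastiCache) is sketched below. The for_each layout and the labels are my assumption; the original module may well declare four separate resources.

```hcl
locals {
  # Service principals whose service-linked roles the project depends on.
  service_linked_role_services = toset([
    "ecs.amazonaws.com",
    "elasticloadbalancing.amazonaws.com",
    "rds.amazonaws.com",
    "elasticache.amazonaws.com",
  ])
}

resource "aws_iam_service_linked_role" "this" {
  for_each         = local.service_linked_role_services
  aws_service_name = each.value
}

# A pre-existing role is imported by its ARN, for example (placeholder account ID):
#   address: aws_iam_service_linked_role.this["ecs.amazonaws.com"]
#   id:      arn:aws:iam::123456789012:role/aws-service-role/ecs.amazonaws.com/AWSServiceRoleForECS
```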

The counterargument to managing service-linked roles in Terraform at all is reasonable: they are AWS-owned, they cannot be customised, and importing them is busywork. I keep them in the module because the alternative is a tribal-knowledge list of "things you also have to make sure exist before this account works." The import is cheap; the documentation it replaces is not.

So what

A single-account AWS project — the kind that ships from one repo, has one engineer most weeks, and never sees Control Tower — can have the same audit-friendly, reproducible-from-zero CI story as a multi-account org. It does not require a different tool. It requires one Terraform module with a placeholder backend, a Makefile that knows how to migrate state and import the resources AWS pre-creates, and an IAM policy that is honest about being broad because there is only one tenant.

The whole thing is roughly four hundred lines of HCL plus a Makefile. If you are starting a small project, that is the entire keyless-CI investment. Spend it once, and never put a long-lived AWS key in a GitHub secret again.
