The four-layer split that keeps multi-brand IaC sane

Multi-brand infrastructure drifts into copy-paste by default. A new brand starts as a clone of the last one, the clone accumulates local edits, and six months later a fix to a shared module has to be applied by hand to every clone — or quietly isn't. The four-layer Terragrunt hierarchy I keep landing on exists to make that drift structurally impossible, not just discouraged.

The argument is narrow: most teams treat "share modules, override at the leaf" as the rule and call it a day. That rule is necessary but not sufficient. What actually prevents the drift is being explicit about which layer is allowed to know what, and refusing to let any layer reach down.

The four layers

Each layer is narrower than the one above. Reads go up; writes never go down.

Root (root.hcl) owns remote state and provider generation. It knows the S3 bucket, the DynamoDB lock table, and which AWS provider version is approved. Nothing below it duplicates this. If you ever feel the urge to override the provider version in a leaf, you have already lost.
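A minimal sketch of what such a root.hcl might contain, assuming an S3 backend with DynamoDB locking. The bucket name, lock table name, region, and provider version constraint below are placeholders for illustration, not values from a real setup:

```hcl
# root.hcl (sketch): remote state and provider generation live here and nowhere else.
remote_state {
  backend = "s3"

  # Write a backend.tf into every leaf's working directory.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket         = "example-tfstate"   # placeholder bucket name
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"         # placeholder region
    encrypt        = true
    dynamodb_table = "example-tf-locks"  # placeholder lock table
  }
}

# Write a provider.tf into every leaf so the approved version is pinned once.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # placeholder constraint
    }
  }
}

provider "aws" {
  region = "us-east-1" # placeholder region
}
EOF
}
```

Because the state key is derived from each leaf's path relative to the root, no leaf needs to name its own state location.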

Brand (brand.hcl) centralises module sources and image URIs. It knows which modules a brand uses and which ECR repositories hold its container images. Adding a new brand is exactly one new file at this layer — that is the test for whether the layer is doing its job.
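One way to keep full module sources, and not only version pins, at the brand layer is to derive them in brand.hcl. This is a sketch under assumptions: the repository URL is hypothetical, and it assumes each module lives in a subdirectory named after its key in the map:

```hcl
# brand.hcl (sketch): versions are pinned once, full sources are derived from them.
locals {
  # Hypothetical module repository; substitute your own.
  module_repo = "git::https://example.com/infra-modules.git"

  module_versions = {
    ecs_service = "v1.42.0"
    rds_aurora  = "v0.9.3"
  }

  # Map of module name => full source string with the pinned ref.
  # Assumes the module directory in the repository matches the map key.
  module_sources = {
    for name, ver in local.module_versions :
    name => "${local.module_repo}//${name}?ref=${ver}"
  }
}
```

A leaf that includes this file with expose = true could then set its source to include.brand.locals.module_sources.ecs_service and never spell out a URL itself.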

Environment (env.hcl) holds the size knobs: instance types, min/max replica counts, RDS storage, retention. It knows nothing about modules or images — only about scale parameters that differ between staging and production. When somebody slips a module override into env.hcl, you have just merged two concerns that should never share a file.

Leaf (terragrunt.hcl) wires dependency outputs into typed inputs. It calls a module once, passes the right variables, and declares which other leaves it depends on. If it needs to know the VPC ID, it reads it from the VPC leaf's output — not from a hardcoded string, and not from a re-declared variable somewhere in the brand file.

What each layer is allowed to know

The constraint is strict, and the strictness is the point: each layer may reference the layers above it, never below. A leaf may read env.hcl and brand.hcl to discover its inputs. A brand file may not reach into an individual leaf to override its behaviour. An environment file may not reach into a leaf to special-case staging.

This one-way dependency is what makes the hierarchy composable rather than entangled. The moment you allow a higher layer to reach into a lower one, you have re-invented the global mutable state you were trying to escape — except now it is hidden inside HCL.
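To make the rule concrete, here is a sketch contrasting a downward reach with the upward-only alternative. The leaf_overrides map is a hypothetical anti-pattern invented for illustration, and the three fragments belong to three different files, as the comments indicate:

```hcl
# Anti-pattern, shown commented out: brand.hcl keyed by leaf path.
# The brand layer now has to know that a "staging/api" leaf exists.
#
# locals {
#   leaf_overrides = {
#     "staging/api" = { max_size = 10 }
#   }
# }

# Upward-only alternative, part 1: the knob lives in staging/env.hcl.
locals {
  api_max_size = 10
}

# Upward-only alternative, part 2: staging/api/terragrunt.hcl reads it from above.
#
# include "env" {
#   path   = find_in_parent_folders("env.hcl")
#   expose = true
# }
#
# inputs = {
#   max_size = include.env.locals.api_max_size
# }
```

In the second form, deleting or renaming the leaf never requires an edit to a file above it.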

The practical effect compounds:

A new environment is one directory copied and one env.hcl file edited.
A new brand is one brand.hcl file created and one directory copied.
A module version bump happens in brand.hcl and reaches every leaf in that brand the next time each leaf is planned. Under Atlantis, that assumes each project's autoplan patterns include brand.hcl.
A new region for an existing brand is cp -r of an environment directory and a single line change.

None of these operations should require touching shared code. If they do, a layer is leaking.

A worked example: adding a new brand

Concretely, here is what "add a new brand" looks like in a well-shaped hierarchy. I have done this enough times that the diff fits in my head:

# live/brand-c/brand.hcl
locals {
  brand           = "brand-c"
  module_versions = {
    ecs_service = "v1.42.0"
    rds_aurora  = "v0.9.3"
    waf_rules   = "v2.1.0"
  }
  image_repos = {
    api    = "123456789012.dkr.ecr.us-east-1.amazonaws.com/brand-c-api"
    worker = "123456789012.dkr.ecr.us-east-1.amazonaws.com/brand-c-worker"
  }
}

# live/brand-c/staging/env.hcl
locals {
  env             = "staging"
  api_min_size    = 1
  api_max_size    = 3
  rds_instance    = "db.t4g.medium"
  rds_storage_gb  = 50
  retention_days  = 7
}

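# live/brand-c/staging/api/terragrunt.hcl
include "root" {
  path = find_in_parent_folders("root.hcl")
}

# expose = true is required for include.<name>.locals to resolve in this file.
include "brand" {
  path   = find_in_parent_folders("brand.hcl")
  expose = true
}

include "env" {
  path   = find_in_parent_folders("env.hcl")
  expose = true
}

dependency "vpc" {
  config_path = "../vpc"
}

terraform {
  # Placeholder repository URL; only the ref comes from brand.hcl.
  source = "git::https://example.com/infra-modules.git//ecs-service?ref=${include.brand.locals.module_versions.ecs_service}"
}

inputs = {
  service_name = "api"
  # image_tag is not declared in the env.hcl shown above; it is assumed to be set there per environment.
  image_uri    = "${include.brand.locals.image_repos.api}:${include.env.locals.image_tag}"
  min_size     = include.env.locals.api_min_size
  max_size     = include.env.locals.api_max_size
  vpc_id       = dependency.vpc.outputs.vpc_id
  subnet_ids   = dependency.vpc.outputs.private_subnet_ids
}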

Three small files and a directory copy. No shared module is touched. No production brand is touched. No environment override leaks into a leaf. The PR diff is small enough that a reviewer can hold the entire change in their head, which is the only review style that scales when you have several brands across several regions.

The trade-off: indirection tax

The honest counter is that this layering imposes an indirection tax on first-time readers. A junior engineer opening brand-c/staging/api/terragrunt.hcl for the first time will not see the actual values flowing into the module. They will see include.brand.locals.module_versions.ecs_service and have to chase the include up the tree. Compared with a flat single-file Terraform module — which a newcomer can read top to bottom in five minutes — this is a real cost.

I think it is worth it, because the cost is paid once per onboarding and the savings compound across every change after that. But I would not adopt this layout for a single-brand single-environment setup. Two brands and two environments is roughly where the layered approach starts to dominate; below that, the indirection is overhead.

Counter-argument: why some teams reject this

Some teams I respect refuse layered Terragrunt entirely. Their argument is not weak: every layer of indirection is a place where the magic can hide, and "the magic is hidden" is a bad property in an outage at three in the morning. They prefer fully expanded Terraform per environment, generated by a script if needed, with no Terragrunt run-time merging. The values you read in the file are the values that go to the API.

That stance is defensible in two cases. The first is small fleets where copy-paste is genuinely cheaper than indirection. The second is teams where the on-call engineer is not the author of the IaC, and the debugging cost of include-chasing exceeds the maintenance cost of duplication. If either describes your environment, do not adopt this.

For everyone else: the layered split moves the question from "did everyone remember to copy the fix?" to "did the fix land in the right layer?". The second question has a definite answer; the first only ever has a hopeful one.

So what

If you maintain Terragrunt for more than one brand or more than two environments, audit your repository against the one-way-knowledge rule this weekend. Grep for any reference inside brand.hcl that points at a specific leaf. Grep for any env.hcl that overrides a module source. Each hit is drift you have already incurred; the bill just has not arrived yet.

If you are starting fresh, pick the four-layer split, declare the upward-only knowledge rule in the repo README, and refuse the first three PRs that violate it. After the third one, nobody violates it again.
