IaC and Kubernetes: The Two-Layer Control Plane for AI Native Infrastructure
This is Part 3 of a three-part series on AI Native Infrastructure. Part 1 covers GPU cluster management. Part 2 covers agent platform engineering. This post covers IaC and Kubernetes as the two-layer control plane that makes both work at scale.
At hyperscale, managing GPU infrastructure without infrastructure as code (IaC) is not a workflow; it's a liability. Companies like Meta operate GPU clusters at a scale where configuration drift, firmware inconsistency, or an undocumented network topology change can silently degrade a week-long training run. IaC is how you make infrastructure state auditable, reviewable, and reproducible.
But IaC alone isn't sufficient. It's worth asking: what exactly is an IaC tool such as Terraform managing? And what is it not managing?
The answer to that question reveals something important about how AI Native infrastructure actually needs to be governed — and why Kubernetes, despite not being designed for GPU workloads, remains the right runtime control plane for both the infrastructure layer and the application layer above it.
