<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="atom.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://austinxyz.github.io/blogs/blog</id>
    <title>Austin Xu Blog</title>
    <updated>2026-04-26T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://austinxyz.github.io/blogs/blog"/>
    <subtitle>Austin Xu Blog</subtitle>
    <icon>https://austinxyz.github.io/blogs/img/favicon.svg</icon>
    <entry>
        <title type="html"><![CDATA[DevOps Is a Culture, Not a Team: What I've Learned Building at Scale]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2026/04/26/devops-at-scale</id>
        <link href="https://austinxyz.github.io/blogs/blog/2026/04/26/devops-at-scale"/>
        <updated>2026-04-26T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[After 20 years running infrastructure at scale, here's what DevOps actually means — and why the most common implementation is an anti-pattern.]]></summary>
        <content type="html"><![CDATA[<p>Every organization that has gone through a "DevOps transformation" in the last decade has a story. Most of those stories end the same way: they hired a DevOps team, bought a set of tools, and then wondered why things didn't meaningfully change.</p>
<p>I've been building and running infrastructure at scale for 20 years — from private cloud on OpenStack at eBay to managing 200+ Kubernetes clusters, 50,000 nodes, and 5,000+ applications. If there's one thing I've learned, it's that the most common implementation of DevOps is actually an anti-pattern.</p>
<p>Let me explain what I mean.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="devops-is-a-culture-not-a-team">DevOps Is a Culture, Not a Team<a href="https://austinxyz.github.io/blogs/blog/2026/04/26/devops-at-scale#devops-is-a-culture-not-a-team" class="hash-link" aria-label="Direct link to DevOps Is a Culture, Not a Team" title="Direct link to DevOps Is a Culture, Not a Team" translate="no">​</a></h2>
<p>The first and most important thing to understand about DevOps is that it is a <strong>cultural movement</strong>, not a tool, a team, or a job title.</p>
<p>DevOps was born in 2009 out of a simple frustration: "dev ships, ops runs" creates systemic accidents. When the team that builds a service doesn't run it, they build it differently — less observably, with less concern for operational burden, with less skin in the game when 3am pages fire. The solution isn't tooling. The solution is ownership.</p>
<p>The core mental model is: <strong>"you build it, you run it."</strong> Teams own their services in production end-to-end. That changes everything about how software gets built.</p>
<p>The classic anti-pattern is creating a "DevOps team." At a surface level, this seems reasonable — put some experienced engineers in a room, give them the title, and let them drive the transformation. But what you've actually done is recreate the original dev-vs-ops silo as a new gatekeeper. The DevOps team becomes the bottleneck, the org breathes a sigh of relief at having "done DevOps," and the culture doesn't change at all.</p>
<p>The CALMS framework captures the full scope: <strong>Culture, Automation, Lean, Measurement, Sharing.</strong> Most organizations focus on Automation (which is important) and skip Culture and Measurement (which are harder). That's why most DevOps transformations fail — not because the tools are wrong, but because the work that matters most isn't about tools.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="platform-team-the-operationalized-version-of-devops-at-scale">Platform Team: The Operationalized Version of DevOps at Scale<a href="https://austinxyz.github.io/blogs/blog/2026/04/26/devops-at-scale#platform-team-the-operationalized-version-of-devops-at-scale" class="hash-link" aria-label="Direct link to Platform Team: The Operationalized Version of DevOps at Scale" title="Direct link to Platform Team: The Operationalized Version of DevOps at Scale" translate="no">​</a></h2>
<p>In <a href="https://austinxyz.github.io/blogs/blog/2026/03/16/platform-engineer-vs-ops-engineer" target="_blank" rel="noopener noreferrer" class="">a previous post</a>, I wrote about the difference between ops thinking and platform thinking. Platform Engineering is where DevOps becomes operationalized at scale.</p>
<p>At small scale — under 50 engineers — every team can do DevOps from scratch. Everyone knows each other, context is shared, incidents are rare. You can get by without formal DevOps investment.</p>
<p>But at larger scale, the "every team does DevOps from scratch" approach breaks down fast. Each team is reinventing CI/CD, building their own observability stack, figuring out their own on-call process, and doing all of it slightly differently. The cognitive load multiplies across the organization. Engineers burn out on infrastructure concerns instead of building product.</p>
<p>That's where Platform Engineering comes in. A Platform team builds the <strong>paved path</strong> — a curated, opinionated set of internal tools that other teams consume as a product. CI/CD that just works. Terraform modules that encode your organization's security and cost policies. Observability out of the box. On-call runbooks that don't require each team to rediscover what APF misconfiguration looks like at 2am.</p>
<p>Critical point: the platform team is not doing application ops. They're running the platform itself — the K8s clusters, the CI/CD system, the observability stack — and treating other engineering teams as their customers. Their metric isn't uptime; it's the productivity and autonomy of the teams they serve. Developer experience is the NPS score.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-ops-lifecycle-more-than-provisioning-and-deploying">The Ops Lifecycle: More Than Provisioning and Deploying<a href="https://austinxyz.github.io/blogs/blog/2026/04/26/devops-at-scale#the-ops-lifecycle-more-than-provisioning-and-deploying" class="hash-link" aria-label="Direct link to The Ops Lifecycle: More Than Provisioning and Deploying" title="Direct link to The Ops Lifecycle: More Than Provisioning and Deploying" translate="no">​</a></h2>
<p>When people describe Ops work, they usually reach for the visible tools: Terraform for provisioning, CI/CD for deployment, monitoring for observability. That's a useful shorthand, but it misses where the real complexity lives.</p>
<p>The full service lifecycle looks like this:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">Design → Provision → Deploy → Operate → Evolve → Retire</span><br></span></code></pre></div></div>
<p>Provisioning and deployment are the starting line. The sharp edge is <strong>Operate</strong> — managing cognitive load across hundreds of services, sustaining a healthy on-call rotation, and keeping engineers in "engineering mode" rather than "firefighting mode."</p>
<p>At scale, without the right practices, Operate becomes a tar pit. Incidents are handled ad hoc by whoever happens to be available. The same problems recur because postmortems don't produce real follow-up. On-call engineers burn out, institutional knowledge walks out the door, and you're back to square one. That's why Platform Engineering exists — not primarily to make provisioning easier, but to make operating sustainable.</p>
<p>This also changes how you think about automation. Automating provisioning is table stakes. The higher-leverage automation is in Operate: AI-assisted incident triage, automated rollback, error budget dashboards that make reliability debates into math problems rather than politics.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="iac-not-just-terraform-but-a-mindset-shift">IaC: Not Just Terraform, But a Mindset Shift<a href="https://austinxyz.github.io/blogs/blog/2026/04/26/devops-at-scale#iac-not-just-terraform-but-a-mindset-shift" class="hash-link" aria-label="Direct link to IaC: Not Just Terraform, But a Mindset Shift" title="Direct link to IaC: Not Just Terraform, But a Mindset Shift" translate="no">​</a></h2>
<p>I should be transparent about something: eBay didn't use Terraform extensively for our Kubernetes fleet. Our tooling was homegrown — a GitOps-based system with per-cluster YAML configurations, a custom controller (which we called Releaser) that reconciled desired state toward actual state, and eventually ArgoCD as the community-supported path.</p>
<p>But the IaC <em>mindset</em> was the same: infrastructure as code means infrastructure is reviewable, auditable, and rollback-able. When a configuration change causes an incident, you have a git history that tells you what changed, who changed it, and when. When a new cluster needs to be built, you're not reconstructing it from institutional memory — you're applying a known-good spec.</p>
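<p>The reconcile pattern at the heart of that tooling fits in a few lines. Here's a minimal Python sketch of the idea — illustrative only, with made-up names, not Releaser's actual code:</p>

```python
# Minimal sketch of the GitOps reconcile pattern: diff desired state
# (what's in git) against actual state, emit converging actions.
# All names and data here are illustrative.

def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to converge actual state to desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions

desired = {"api-server": {"replicas": 5}, "scheduler": {"replicas": 3}}
actual = {"api-server": {"replicas": 3}, "old-addon": {"replicas": 1}}
print(reconcile(desired, actual))
```

A real controller runs this loop continuously, but the property that matters is already visible: the git-side spec is the single source of truth, and every divergence becomes an explicit, reviewable action.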
<p>The bigger shift at scale was establishing what I'd call shared automation contracts. Our cluster fleet — 200+ clusters, 20+ types including Kubernetes control planes, Hadoop, API gateways — had no common interface across the component teams responsible for each lifecycle phase: network, security, app lifecycle, Hadoop. Every cluster type had its own manual process with no reuse.</p>
<p>We fixed this by defining a common interface each component team could implement independently. Once that contract existed, automation could be built once and applied across all cluster types. The decommission process that used to take several weeks dropped to a few days, and the capacity team gained self-service control over the entire flow. The cluster build pipeline for the API gateway hit a one-week target. The value wasn't the automation itself — it was the shared contract that made the automation generalizable.</p>
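<p>To make the idea concrete, here's a toy version of such a contract in Python — every name here is hypothetical, but the shape is the point: each component team implements the same interface, and the pipeline is written once against it:</p>

```python
from abc import ABC, abstractmethod

# Hypothetical "shared automation contract": each component team
# (network, security, app lifecycle, ...) implements the same interface,
# so one decommission pipeline works across every cluster type.

class LifecyclePhase(ABC):
    @abstractmethod
    def decommission(self, cluster: str) -> str: ...

class NetworkPhase(LifecyclePhase):
    def decommission(self, cluster: str) -> str:
        return f"released IPs for {cluster}"

class SecurityPhase(LifecyclePhase):
    def decommission(self, cluster: str) -> str:
        return f"revoked certs for {cluster}"

def decommission_cluster(cluster: str, phases: list[LifecyclePhase]) -> list[str]:
    # Written once against the contract, not once per cluster type.
    return [phase.decommission(cluster) for phase in phases]

print(decommission_cluster("cluster-42", [NetworkPhase(), SecurityPhase()]))
```

Adding a new cluster type means implementing the interface, not writing a new pipeline — that's where the weeks-to-days compression came from.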
<p>This is what IaC thinking really means: not just "write Terraform," but "make infrastructure an agreed interface that multiple teams can reason about, review, and automate against."</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="cicd-what-good-actually-looks-like">CI/CD: What Good Actually Looks Like<a href="https://austinxyz.github.io/blogs/blog/2026/04/26/devops-at-scale#cicd-what-good-actually-looks-like" class="hash-link" aria-label="Direct link to CI/CD: What Good Actually Looks Like" title="Direct link to CI/CD: What Good Actually Looks Like" translate="no">​</a></h2>
<p>CI/CD is where DevOps investment becomes most visible — and where the gap between CI/CD theater and real practice is most apparent.</p>
<p>CI/CD theater looks like this: the pipeline exists, tests run, deployments happen. But teams still batch commits for a weekly deploy. Change review is a rubber stamp. Canary deployments are disabled because they're "too complicated." Nobody has agreed on what a rollback means. DORA metrics? Never heard of them.</p>
<p>Real CI/CD practice is measured. The DORA framework provides four metrics that correlate with high-performing engineering organizations: <strong>Deployment Frequency</strong>, <strong>Lead Time for Changes</strong> (commit to production), <strong>Change Failure Rate</strong>, and <strong>Time to Restore Service</strong> (MTTR). Elite-tier teams deploy on demand with sub-one-hour lead time and sub-one-hour MTTR. Those numbers aren't goals; they're outcomes of mature practice.</p>
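<p>The four metrics are simple arithmetic over deployment records — which is exactly why there's no excuse for not tracking them. A toy computation (field names and data are made up):</p>

```python
from datetime import timedelta

# Toy DORA computation over deployment records. Illustrative data only.
deploys = [
    {"lead_time": timedelta(minutes=45), "failed": False, "restore": None},
    {"lead_time": timedelta(minutes=30), "failed": True,
     "restore": timedelta(minutes=20)},
    {"lead_time": timedelta(hours=2),    "failed": False, "restore": None},
]
weeks = 1

deploy_freq = len(deploys) / weeks                      # Deployment Frequency
lead_time = sum((d["lead_time"] for d in deploys),
                timedelta()) / len(deploys)             # Lead Time for Changes
failures = [d for d in deploys if d["failed"]]
change_failure_rate = len(failures) / len(deploys)      # Change Failure Rate
mttr = sum((d["restore"] for d in failures),
           timedelta()) / len(failures)                 # Time to Restore

print(deploy_freq, lead_time, change_failure_rate, mttr)
```

The hard part was never the math — it's instrumenting the pipeline so these records exist at all.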
<p>At eBay, my team owned the Cloud Control Plane CI/CD pipeline — Prow for CI with mandatory e2e test gates, Releaser for GitOps-based CD across 200+ clusters. The more interesting piece was the Federated Deployment Controller we built: a custom Kubernetes controller that orchestrated progressive rollouts cluster by cluster, queried an AI-based health detector for automated go/no-go signals, and triggered automatic rollback on degradation.</p>
<p>We built it for our own control plane deployments. The ECD team — responsible for CI/CD for hundreds of application teams — adopted it as their standard multi-cluster CD mechanism. That's the infrastructure team leverage pattern: build something for your own workload that turns out to generalize. The key wasn't the technology; it was the model. Progressive delivery with automated health gating removes a whole class of "we deployed and broke everything" incidents. Change management becomes controllable, not just visible.</p>
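<p>The model itself is easy to sketch. Here's the shape of a health-gated progressive rollout in a few lines of Python — the pattern, not the controller's actual code:</p>

```python
# Sketch of the progressive-delivery pattern: deploy cluster by cluster,
# gate each step on a health signal, roll everything back on degradation.

def rollout(clusters, deploy, healthy, rollback):
    done = []
    for cluster in clusters:
        deploy(cluster)
        if not healthy(cluster):          # automated go/no-go gate
            for c in reversed(done + [cluster]):
                rollback(c)               # automatic rollback, newest first
            return ("rolled_back", done + [cluster])
        done.append(cluster)
    return ("succeeded", done)

log = []
status, touched = rollout(
    ["c1", "c2", "c3"],
    deploy=lambda c: log.append(("deploy", c)),
    healthy=lambda c: c != "c2",          # pretend c2 degrades
    rollback=lambda c: log.append(("rollback", c)),
)
print(status, touched)
```

In the real controller the <code>healthy</code> callback was the AI-based health detector; swapping in a better detector never required touching the rollout logic.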
<p>Infrastructure-side, the reliability lessons were equally important: dedicated CI/CD node pools (so pipeline spikes can't starve production scheduling), API Priority and Fairness to rate-limit CI/CD traffic against the Kubernetes API server, and a gateway layer to absorb burst job submissions. We learned these the hard way — CI/CD-induced API server overload was a real incident class before we addressed it structurally.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="slosli-turning-reliability-into-a-currency">SLO/SLI: Turning Reliability Into a Currency<a href="https://austinxyz.github.io/blogs/blog/2026/04/26/devops-at-scale#slosli-turning-reliability-into-a-currency" class="hash-link" aria-label="Direct link to SLO/SLI: Turning Reliability Into a Currency" title="Direct link to SLO/SLI: Turning Reliability Into a Currency" translate="no">​</a></h2>
<p>SLOs are often presented as a measurement exercise: define your SLI, set a target, track it. That framing misses the real value.</p>
<p>SLOs turn reliability into a <strong>shared currency</strong> between product and engineering. The error budget is the mechanism. You have N minutes of allowed downtime per month (based on your SLO). As long as you're within budget, ship aggressively. When you're burning through budget, pause feature work and invest in reliability. The debate — "should we slow down?" — stops being a political argument and becomes a math question: "what does the error budget say?"</p>
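<p>The arithmetic is worth spelling out, because it's what makes the currency tradable (a quick Python sketch):</p>

```python
# The error-budget arithmetic that turns "should we slow down?" into math.

def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime, in minutes, for a rolling window at a given SLO."""
    return (1 - slo) * days * 24 * 60

budget = error_budget_minutes(0.999)      # 99.9% over 30 days
print(round(budget, 1))                   # 43.2 minutes

downtime_so_far = 30.0                    # minutes burned this window
print(f"budget remaining: {budget - downtime_so_far:.1f} min")
```

At 99.9% you get about 43 minutes a month; at 99%, about seven hours. That gap is exactly why the initial target matters so much.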
<p>I learned this first-hand implementing SRE practices for eBay's Kubernetes API server fleet. When I took over, the Federated API Server — the primary entry point for all platform clients — was operating below 90% availability. The worst incident: a Dev API Server stayed down for two full days. No SLOs, no runbooks, no on-call rotation. Every outage was handled ad hoc.</p>
<p>The first decision that mattered: I set the initial SLO at <strong>99%, not 99.9%</strong>. This was counterintuitive to leadership. But an SLO you can't sustain is worse than no SLO — it teaches engineers to ignore the signal. Starting at 99% from a sub-90% baseline gave meaningful headroom to improve without constant error budget exhaustion. We graduated to 99.9% as infrastructure stabilized and the team developed confidence in the measurement methodology.</p>
<p>The graduated approach mattered. SLOs also serve as an internal triage signal — when availability starts dipping, you know where to look before it becomes an incident. We used built-in Kubernetes Prometheus metrics as the SLI source (no parallel instrumentation project required), established a burn rate alert policy, and set up regular cadence to review incidents and burn rate.</p>
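<p>Burn rate is the error budget's derivative: how fast you're spending it relative to plan. A small illustration — the 14.4x threshold is the well-known fast-burn paging value from the Google SRE Workbook, used here as an example rather than our exact policy:</p>

```python
# Burn-rate sketch: how fast the error budget is being consumed.
# 1.0 means burning exactly on budget; higher means faster.

def burn_rate(error_ratio: float, slo: float) -> float:
    return error_ratio / (1 - slo)

slo = 0.999
# 1% of requests failing over the last hour at a 99.9% SLO:
rate = burn_rate(0.01, slo)
print(round(rate, 2))      # 10.0 — burning 10x faster than budgeted

# Example fast-burn page: >14.4x over 1h consumes >2% of a 30-day budget.
page = rate > 14.4
print(page)
```

The point of alerting on burn rate rather than raw availability is that a fast burn pages you in minutes, while a slow leak only opens a ticket.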
<p>Results: Federated API Server went from below 90% to 30-day rolling average of 99%+ in dev environments, sustained 99.9% in production. MTTD reduced to ~20 minutes. MTTR dropped from 24+ hours (worst case) to under one hour for defined incident classes.</p>
<p>One operational detail that mattered: we tracked postmortem action items in the <strong>sprint backlog</strong>, not on a separate postmortem board. Teams that park RCA items on a separate tracker find those items never getting resolved. Mainlining them into sprint planning made follow-through the default.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="high-scalability-where-devops-value-compounds">High Scalability: Where DevOps Value Compounds<a href="https://austinxyz.github.io/blogs/blog/2026/04/26/devops-at-scale#high-scalability-where-devops-value-compounds" class="hash-link" aria-label="Direct link to High Scalability: Where DevOps Value Compounds" title="Direct link to High Scalability: Where DevOps Value Compounds" translate="no">​</a></h2>
<p>There's a common framing that gets the causality backwards: "at high scale, you <em>need</em> DevOps." That's true, but the more important insight is the opposite: <strong>DevOps investment is what makes high scale possible in the first place.</strong></p>
<p>At small scale, manual ops works fine. Teams know each other, incidents are rare, context is shared. DevOps is nice-to-have. At 200+ engineer teams running 5,000+ applications across 50,000 nodes, none of that is true. Without IaC, provisioning becomes the bottleneck. Without mature CI/CD, release velocity drops as change risk grows. Without SRE practices and observability, incident response burns out the on-call rotation.</p>
<p>The math is simple: if every team that needs to deploy does so manually, your deployment capacity grows linearly with headcount. If you invest in CI/CD that enables self-service deployment, it grows independently. At eBay, we were handling 35,000+ deployments per week across 20,000+ app pools. That number is only possible because of platform-level automation — no team of humans could manage it manually.</p>
<p>The flywheel is: DevOps investment reduces operational friction → engineers spend more time building → more capacity for building better DevOps tooling → further reduction in friction. Elite-tier teams aren't faster because they work harder. They've removed the friction that scale introduces.</p>
<p>This is also why Platform Engineering emerges naturally at scale. It's not an organizational preference — it's the only architecture where holding reliability steady doesn't require headcount to grow linearly with scale. A platform team of 10 engineers supporting 500 developers is leverage. 500 developers each spending 20% of their time on undifferentiated DevOps work is waste.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-ai-factor-devops-as-the-foundation-for-ai-native-development">The AI Factor: DevOps as the Foundation for AI-Native Development<a href="https://austinxyz.github.io/blogs/blog/2026/04/26/devops-at-scale#the-ai-factor-devops-as-the-foundation-for-ai-native-development" class="hash-link" aria-label="Direct link to The AI Factor: DevOps as the Foundation for AI-Native Development" title="Direct link to The AI Factor: DevOps as the Foundation for AI-Native Development" translate="no">​</a></h2>
<p>The next inflection point is already visible: AI-native applications — LLMs, inference pipelines, model training workflows — introduce a new class of infrastructure challenges on top of everything DevOps already handles.</p>
<p>Model training needs burst GPU provisioning at a scale that breaks traditional IaC assumptions. Inference serving requires rapid scaling with traffic patterns that differ fundamentally from web services. Experiment tracking and model versioning add new dimensions to deployment management. And the failure modes — model degradation, distribution shift, silent accuracy regression — aren't caught by traditional application metrics.</p>
<p>The good news: everything discussed above is the <em>foundation</em> for AI-native infrastructure. If your organization has mature IaC, CI/CD with health-gated progressive delivery, and SLO discipline — you're far better positioned to extend those practices to AI workloads than an organization starting from scratch.</p>
<p>The bad news: many organizations are trying to build AI-native capabilities on top of immature DevOps foundations. You can't reliably run LLM inference at scale if you don't have basic SLOs. You can't safely promote model updates if your deployment pipeline lacks automated rollback. The problems compound.</p>
<p>The organizations that will do AI-native infrastructure well are the ones that treated DevOps as a serious discipline — not theater — before AI made everything more complex.</p>
<hr>
<p>After 20 years, the part I've changed my mind about most is this: I used to think DevOps was primarily a technical problem. Better tools would lead to better outcomes. What I've learned is that the technical problems are the easy part. The hard part is getting engineers to own their services in production, building a culture where postmortems produce real change, and investing in platform capabilities that reduce friction for everyone.</p>
<p>Tools follow from culture. Culture doesn't follow from tools.</p>
<p>The teams I've seen do this best share one characteristic: they didn't wait for a "DevOps transformation" initiative to start. They just started building the habits — shared ownership, measurement, automation, blameless retrospectives — and the tools emerged from those habits as expressions of clear principles.</p>
<p>That's what DevOps is. Everything else is just technology.</p>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="devops" term="devops"/>
        <category label="sre" term="sre"/>
        <category label="platform-engineering" term="platform-engineering"/>
        <category label="kubernetes" term="kubernetes"/>
        <category label="cloud-computing" term="cloud-computing"/>
        <category label="leadership" term="leadership"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Stacking OpenSpec and Superpowers: A Combined SDD Workflow]]></title>
        <id>https://austinxyz.github.io/blogs/blog/openspec-superpowers-combined</id>
        <link href="https://austinxyz.github.io/blogs/blog/openspec-superpowers-combined"/>
        <updated>2026-04-19T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[OpenSpec is great at small-step change management. Superpowers is great at brainstorming, TDD, and code review. Here's how I combined them on a new project — and shipped a mobile UI refactor in 3 hours with 86 new tests.]]></summary>
        <content type="html"><![CDATA[<p><em>This is a follow-up to <a class="" href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development">From Vibe Coding to Spec-Driven Development</a>. That post documented introducing OpenSpec into an existing Finance project. This one covers a new project where I stacked OpenSpec with Superpowers from day one.</em></p>
<p>After three months of running <a href="https://github.com/Fission-AI/OpenSpec" target="_blank" rel="noopener noreferrer" class="">OpenSpec</a> on my Finance project, I'd formed a clear picture of what it's good at and where it struggles. On a personal wiki project I'd also been using <a href="https://github.com/obra/superpowers" target="_blank" rel="noopener noreferrer" class="">Superpowers</a>, and its <code>brainstorming</code>, TDD, and code-review skills were landing real hits.</p>
<p>So I started a new project — a UTR-based tennis team lineup app (<a href="https://github.com/austinxyz/tennis-lineup" target="_blank" rel="noopener noreferrer" class="">tennis-lineup</a>) — specifically to run both tools together and see how they compose. This post is the report.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-openspec-alone-fell-short">Where OpenSpec Alone Fell Short<a href="https://austinxyz.github.io/blogs/blog/openspec-superpowers-combined#where-openspec-alone-fell-short" class="hash-link" aria-label="Direct link to Where OpenSpec Alone Fell Short" title="Direct link to Where OpenSpec Alone Fell Short" translate="no">​</a></h2>
<p>OpenSpec is excellent at <strong>incremental change management</strong>. The propose → apply → archive loop keeps a clean spec library, and the archive step preserves delta specs as an audit trail. For shipping features one at a time on an existing codebase, it's hard to beat.</p>
<p>But over three months on the Finance project, four gaps became obvious:</p>
<ol>
<li class=""><strong>Project bootstrap is weak.</strong> OpenSpec drives <em>changes</em> well. On a brand-new project — where you need architecture, initial tech stack decisions, domain modeling — <code>opsx:propose</code> has nothing to anchor on.</li>
<li class=""><strong>Specs capture intent, not interaction.</strong> The spec tells AI <em>what</em> to build. UI details are often underspecified, and the first implementation rarely matches what I had in my head.</li>
<li class=""><strong>Tasks list what, not how.</strong> <code>tasks.md</code> is a checkbox list. There's no implementation plan per task, so AI fills gaps on its own — sometimes fine, sometimes off-track. Tasks get silently skipped.</li>
<li class=""><strong>No test discipline.</strong> I pushed "tests first" into <code>config.yaml</code>, but code quality was still inconsistent. Manual testing kept finding bugs a proper TDD cycle would have caught.</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-superpowers-fills-the-gaps">Where Superpowers Fills the Gaps<a href="https://austinxyz.github.io/blogs/blog/openspec-superpowers-combined#where-superpowers-fills-the-gaps" class="hash-link" aria-label="Direct link to Where Superpowers Fills the Gaps" title="Direct link to Where Superpowers Fills the Gaps" translate="no">​</a></h2>
<p>Superpowers is a collection of Claude Code skills. The three that matter most for SDD work:</p>
<ul>
<li class=""><code>superpowers:brainstorming</code> — walks you through structured questions before writing any spec. Has a <strong>Visual Companion</strong> that generates HTML mockups you click through in the browser. Outputs a design spec at the end.</li>
<li class=""><code>superpowers:writing-plans</code> + <code>executing-plans</code> — decomposes a spec into tasks, and each task into a red/green/refactor TDD sequence with exact file paths, commands, expected test output, and commit messages.</li>
<li class=""><code>superpowers:requesting-code-review</code> — runs automatically after each task. Flags issues at CRITICAL / HIGH / MEDIUM / LOW levels with specific fixes.</li>
</ul>
<p>OpenSpec covers change management and long-term spec accumulation. Superpowers covers upfront design, execution discipline, and review. They operate at different layers. They don't conflict — they stack.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-combined-sdlc">The Combined SDLC<a href="https://austinxyz.github.io/blogs/blog/openspec-superpowers-combined#the-combined-sdlc" class="hash-link" aria-label="Direct link to The Combined SDLC" title="Direct link to The Combined SDLC" translate="no">​</a></h2>
<p>Here's the workflow I settled on for tennis-lineup:</p>
<p><strong>Project init (once):</strong></p>
<ol>
<li class=""><code>superpowers:brainstorming</code> to pin down requirements and architecture. Outputs a design doc I commit to <code>docs/</code>.</li>
<li class=""><code>openspec init</code>, then populate <code>config.yaml</code> — tech stack, conventions, test strategy (unit + integration + e2e) stated upfront.</li>
<li class="">Slice requirements by priority. Treat <code>docs/log/</code> as a living journal; require every session to append to it via <code>CLAUDE.md</code>.</li>
</ol>
<p><strong>Per feature:</strong></p>
<ol>
<li class="">If the feature is fuzzy, start with <code>superpowers:brainstorming</code> again. Use Visual Companion for anything UI-heavy. The output is a design spec.</li>
<li class="">Run <code>opsx:propose</code> with that design spec as input. Get <code>proposal.md</code>, <code>design.md</code>, and <code>tasks.md</code>.</li>
<li class="">Run Superpowers TDD against those tasks. It writes tests first, implements, runs the suite, and runs code review per batch.</li>
<li class="">Manual sanity check. Rework rate has been low.</li>
<li class=""><code>opsx:apply</code> to verify everything is ticked off — Superpowers usually already closed everything.</li>
<li class="">Deploy.</li>
<li class=""><code>opsx:archive</code> to merge the delta spec back into the main spec library.</li>
<li class="">Scan the day's log — lift any new gotchas into <code>CLAUDE.md</code> and <code>openspec/config.yaml</code>. Update the README.</li>
<li class="">Commit and push. Next feature.</li>
</ol>
<p>The two tools hand off at clean boundaries: <strong>Superpowers designs and builds; OpenSpec tracks and archives.</strong></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="one-feature-end-to-end-mobile-ui-optimization">One Feature End-to-End: Mobile UI Optimization<a href="https://austinxyz.github.io/blogs/blog/openspec-superpowers-combined#one-feature-end-to-end-mobile-ui-optimization" class="hash-link" aria-label="Direct link to One Feature End-to-End: Mobile UI Optimization" title="Direct link to One Feature End-to-End: Mobile UI Optimization" translate="no">​</a></h2>
<p>To show how this plays out in practice, here's a real feature from 2026-04-19.</p>
<p><strong>The prompt:</strong></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">/superpowers:brainstorming I want to use the UI mock capability to optimize</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">the UI for mobile:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - sidebar should be hideable</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - when the right side has two columns (e.g., team management), hide the</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    team list once a team is selected</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - two-column card layouts (e.g., saved lineups) should collapse to a</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    single column on mobile</span><br></span></code></pre></div></div>
<p><strong>Step 1 — Brainstorming + Visual Companion.</strong> Superpowers generated HTML mockups I clicked through in the browser, then wrote a <a href="https://github.com/austinxyz/tennis-lineup/blob/main/docs/superpowers/specs/2026-04-19-mobile-ui-optimization-design.md" target="_blank" rel="noopener noreferrer" class="">design spec</a>.</p>
<p><img decoding="async" loading="lazy" alt="Lineup mockup preview" src="https://austinxyz.github.io/blogs/assets/images/lineup-45120b8c815354ec0cf86ced29ef85ec.png" width="1366" height="1005" class="img_ev3q"></p>
<p><strong>Step 2 — Propose.</strong> <code>opsx:propose</code> took the design spec and produced an OpenSpec <a href="https://github.com/austinxyz/tennis-lineup/blob/main/openspec/changes/archive/2026-04-19-mobile-ui-optimization/proposal.md" target="_blank" rel="noopener noreferrer" class="">proposal</a> and <a href="https://github.com/austinxyz/tennis-lineup/blob/main/openspec/changes/archive/2026-04-19-mobile-ui-optimization/tasks.md" target="_blank" rel="noopener noreferrer" class="">task list</a> — 8 groups, 37 tasks.</p>
<p><strong>Step 3 — TDD execution in batches.</strong> A representative batch result:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">Batch 1 — Global shell (no backend changes)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Commit: 2e336bb</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- New AppHeader.vue: hamburger menu + optional back button + title +</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  actions slot; lg:hidden</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Refactored MainLayout.vue: provide('sidebarOpen', ref(false));</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  Escape closes drawer</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Decision: SidebarDrawer.vue logic folded into MainLayout (no separate</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  component needed)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Code review fixes:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  HIGH    inject fallback used a plain object (not a ref) — silently failed standalone</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          → 
inject('sidebarOpen', ref(false))</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  HIGH    hamburger aria-label was static "open nav" regardless of state</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          → dynamic :aria-label + :aria-expanded</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  HIGH    back button missing aria-label + type="button"</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          → added, safe inside forms</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  HIGH    removing pt-14 from MainLayout caused content to hide behind fixed header</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          → restored pt-14 lg:pt-0</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  MEDIUM  drawer had no Escape-to-close (WCAG 2.1.2)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          → global keydown listener in MainLayout</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  LOW     missing overflow-x-hidden could flash horizontal scroll</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          → added</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Tests: 381 passing (16 new AppHeader tests)</span><br></span></code></pre></div></div>
<p>This is the Superpowers loop working as designed: write test → implement → run → review → fix findings → commit. Every batch ended green.</p>
<p><strong>Step 4 — One honest mistake.</strong> On the Opponent Analysis component, the UI mock only showed the mobile layout and I approved it without re-reading the existing spec. Superpowers implemented against the mock and silently dropped existing desktop functionality.</p>
<p>Fix:</p>
<ol>
<li class="">Re-ran <code>superpowers:brainstorming</code> with the existing spec loaded, re-did the UI design.</li>
<li class="">Continued Superpowers TDD against the revised design.</li>
</ol>
<p>Lesson went into <code>openspec/config.yaml</code> so the next change won't repeat it.</p>
<p><strong>Step 5 — Deploy, archive, update docs.</strong> <code>opsx:archive</code> synced the delta spec back. <code>CLAUDE.md</code> got updates for E2E dual-render pitfalls, SOCKS5 proxy setup, and Windows localhost dual-stack. <code>config.yaml</code> got dual-render <code>data-testid</code>, backend restart rules, TOCTOU, deploy smoke tests, and "sync before archive."</p>
<p>Full detail in the <a href="https://github.com/austinxyz/tennis-lineup/blob/main/docs/log/2026-04-19.md" target="_blank" rel="noopener noreferrer" class="">2026-04-19 log</a>.</p>
<p><strong>Result:</strong></p>
<div style="display:flex;gap:1rem;justify-content:center;flex-wrap:wrap"><img src="https://austinxyz.github.io/blogs/assets/images/team-30cfebf45d9e3d9f2a0fda9ee65e4758.jpg" alt="Team page — mobile" style="max-width:45%;max-height:600px;height:auto;object-fit:contain"><img src="https://austinxyz.github.io/blogs/assets/images/lineup-mobile-0d8d28e1abe62cffcdc42c3dc730e89c.jpg" alt="Lineup view — mobile" style="max-width:45%;max-height:600px;height:auto;object-fit:contain"></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-numbers">The Numbers<a href="https://austinxyz.github.io/blogs/blog/openspec-superpowers-combined#the-numbers" class="hash-link" aria-label="Direct link to The Numbers" title="Direct link to The Numbers" translate="no">​</a></h2>
<p>Window: <code>5c612fe</code> (design spec) to <code>ee4c3bd</code> (archive) — <strong>3 hours 7 minutes</strong>, including brainstorming, 6 implementation batches, one rollback + rework, E2E fixes, deploy, and archive.</p>
<p><strong>Code delta</strong> (<code>git diff 3f75465..ee4c3bd</code>):</p>
<table><thead><tr><th>Category</th><th>Files</th><th>+/−</th><th>Net</th></tr></thead><tbody><tr><td>Vue source</td><td>8</td><td>+741 / −297</td><td>+444</td></tr><tr><td>Tests (unit + E2E)</td><td>11</td><td>+1094 / −16</td><td>+1078</td></tr><tr><td>Docs / specs / config</td><td>22</td><td>+2547 / −10</td><td>+2537</td></tr><tr><td>Brainstorming mockups</td><td>19</td><td>+4348 / −0</td><td>+4348</td></tr><tr><td><strong>Total</strong></td><td><strong>60</strong></td><td><strong>+8730 / −323</strong></td><td><strong>+8407</strong></td></tr></tbody></table>
<p><strong>Tests added:</strong></p>
<ul>
<li class="">Unit: +77 (365 → 442)</li>
<li class="">E2E: +9 (44 → 53)</li>
<li class=""><strong>Total: +86 test cases</strong></li>
</ul>
<p><strong>Timeline:</strong></p>
<table><thead><tr><th>Time</th><th>Event</th></tr></thead><tbody><tr><td>09:00–09:23</td><td>Brainstorming + design spec + implementation plan</td></tr><tr><td>09:23–09:39</td><td>Batch 1: AppHeader + MainLayout (+16 tests)</td></tr><tr><td>09:39–09:54</td><td>Batch 2: TeamManagerView + TeamDetail (+20 tests)</td></tr><tr><td>09:54–10:00</td><td>Batch 3: LineupCard (+12 tests)</td></tr><tr><td>10:00–10:15</td><td>Batch 4: LineupHistoryView (+9 tests)</td></tr><tr><td>10:15–10:36</td><td>Batch 5: LineupGenerator (+3 tests)</td></tr><tr><td>10:36–10:49</td><td>Batch 6: OpponentAnalysis rewrite (+46) — <strong>rejected by user</strong></td></tr><tr><td>10:49–10:50</td><td>Revert + minimal mobile adaptation</td></tr><tr><td>11:46</td><td>Batch 13: OpponentAnalysis redesign TDD (+15 tests) + E2E fixes</td></tr><tr><td>12:07</td><td>Deploy to fly.io + archive</td></tr></tbody></table>
<p><strong>ROI observations:</strong></p>
<ul>
<li class="">444 net source lines produced 86 test cases (test-to-code ratio ≈ 2.4:1 by line count).</li>
<li class="">Biggest time sink was Batch 6 — the over-refactor and revert. Lesson cemented in <code>config.yaml</code>.</li>
<li class="">20 minutes of HTML mockups up front bought 2 hours of zero-rework implementation.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="tasksmd-vs-planmd-what-each-is-good-for"><code>tasks.md</code> vs <code>plan.md</code>: What Each Is Good For<a href="https://austinxyz.github.io/blogs/blog/openspec-superpowers-combined#tasksmd-vs-planmd-what-each-is-good-for" class="hash-link" aria-label="Direct link to tasksmd-vs-planmd-what-each-is-good-for" title="Direct link to tasksmd-vs-planmd-what-each-is-good-for" translate="no">​</a></h2>
<p>OpenSpec's <code>tasks.md</code> and Superpowers' <code>plan.md</code> are not the same artifact. They operate at different granularities and for different readers.</p>
<table><thead><tr><th>Dimension</th><th>OpenSpec <code>tasks.md</code></th><th>Superpowers <code>plan.md</code></th></tr></thead><tbody><tr><td>Length</td><td>~60 lines</td><td>~1300 lines</td></tr><tr><td>Granularity</td><td>8 groups / 37 tasks, one sentence each</td><td>11 Tasks × 4–9 Steps each (write test / run / implement / verify / commit)</td></tr><tr><td>Code blocks</td><td>❌</td><td>✅ Full Vue templates, JS, test cases</td></tr><tr><td>File paths</td><td>Component names only</td><td>✅ Exact paths (<code>frontend/src/components/AppHeader.vue</code>)</td></tr><tr><td>Commands</td><td><code>mvn test</code> / <code>npm test</code></td><td>✅ Precise commands + expected output (<code>Expected: FAIL — ...</code>)</td></tr><tr><td>Testing</td><td>"add/update tests"</td><td>✅ Runnable TDD red-green-refactor</td></tr><tr><td>Commits</td><td>One per group (~8 commits)</td><td>✅ Exact commit message per Task (~11+ commits)</td></tr><tr><td>Self-check</td><td>Spec scenario → task mapping is implicit</td><td>✅ Spec coverage checklist at the end</td></tr><tr><td>Risk notes</td><td>In <code>design.md</code></td><td>✅ Inline (e.g., "Task 8.7 depends on backend")</td></tr><tr><td>Reader assumption</td><td>Developer who knows the repo</td><td>Engineer with zero context can follow it cold</td></tr></tbody></table>
<p><strong>When each wins:</strong></p>
<p><code>tasks.md</code> is good for:</p>
<ul>
<li class="">Fast scope review and checkbox tracking</li>
<li class="">Confirming "is the feature done?" (<code>applyRequires</code> lives here)</li>
<li class="">Cases where you or the AI already know the implementation details</li>
</ul>
<p><code>plan.md</code> is good for:</p>
<ul>
<li class="">Handing work to a fresh engineer or a subagent with no context</li>
<li class="">Strict TDD — red / green / refactor made explicit at every step</li>
<li class="">Small-granularity commits that are easy to <code>git bisect</code></li>
<li class="">Dispatching tasks to <code>superpowers:subagent-driven-development</code></li>
</ul>
<p><strong>How I actually use both:</strong> <code>tasks.md</code> is the scope contract with OpenSpec. <code>plan.md</code> is the execution script for Superpowers. They share the same spec as source of truth — they just serve different phases.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="so-is-openspec-still-needed">So Is OpenSpec Still Needed?<a href="https://austinxyz.github.io/blogs/blog/openspec-superpowers-combined#so-is-openspec-still-needed" class="hash-link" aria-label="Direct link to So Is OpenSpec Still Needed?" title="Direct link to So Is OpenSpec Still Needed?" translate="no">​</a></h2>
<p>With Superpowers this capable, is there still a reason to run OpenSpec?</p>
<p>For me, yes — for three reasons.</p>
<ol>
<li class=""><strong>Small-step iteration discipline.</strong> <code>propose → apply → archive</code> is a hard rhythm. It forces each change to have explicit scope, acceptance, and an archive step.</li>
<li class=""><strong>Long-term spec library.</strong> <code>opsx:archive</code> syncs delta specs into a growing <code>openspec/specs/</code> tree. Over months, this becomes the project's authoritative specification — similar to how I treat my LLM wiki as the core notebook. Superpowers' specs and plans live per-change; they don't accumulate a project-level view.</li>
<li class=""><strong>Cross-check on completeness.</strong> OpenSpec tasks and Superpowers plans can be diffed against each other. If the Superpowers run finished but an OpenSpec task is still open, something was missed.</li>
</ol>
<p>Short version: <strong>OpenSpec owns the spec lifecycle. Superpowers owns the design-and-execute loop inside a change.</strong> The two together give me both long-term structure and per-change rigor.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="token-cost">Token Cost<a href="https://austinxyz.github.io/blogs/blog/openspec-superpowers-combined#token-cost" class="hash-link" aria-label="Direct link to Token Cost" title="Direct link to Token Cost" translate="no">​</a></h2>
<p>This change used ~180M tokens — driven by Opus 4.7 plus Visual Companion (which generates and iterates on HTML mockups). I'm on the Claude Code Max plan, so the actual out-of-pocket cost is fixed. For the delivered output — 444 source lines, 86 tests, design-to-archive in 3 hours with near-zero rework — it's acceptable.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-takeaways">Key Takeaways<a href="https://austinxyz.github.io/blogs/blog/openspec-superpowers-combined#key-takeaways" class="hash-link" aria-label="Direct link to Key Takeaways" title="Direct link to Key Takeaways" translate="no">​</a></h2>
<p><strong>1. Start every new project with brainstorming, not proposing.</strong> OpenSpec can't bootstrap architecture from a one-liner. Superpowers' structured questioning can.</p>
<p><strong>2. Use Visual Companion for any UI change.</strong> 20 minutes of clickable mockups prevents hours of mismatched implementation. This was the single highest-leverage tool in the workflow.</p>
<p><strong>3. Let Superpowers enforce TDD. Let OpenSpec enforce archival.</strong> Don't expect either to do both well.</p>
<p><strong>4. Every mistake goes into <code>config.yaml</code>.</strong> The Batch 6 over-refactor is now a prevention rule. This is the compounding advantage of SDD over Vibe Coding — mistakes turn into structure, not just git history.</p>
<p><strong>5. Keep a per-day log.</strong> The <code>docs/log/YYYY-MM-DD.md</code> habit makes retrospectives cheap and fuels <code>CLAUDE.md</code> / <code>config.yaml</code> updates.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references">References<a href="https://austinxyz.github.io/blogs/blog/openspec-superpowers-combined#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<p><strong>Project</strong></p>
<ul>
<li class=""><a href="https://github.com/austinxyz/tennis-lineup" target="_blank" rel="noopener noreferrer" class="">tennis-lineup on GitHub</a> — full source, including <code>CLAUDE.md</code>, OpenSpec config, and day logs</li>
<li class=""><a href="https://github.com/austinxyz/tennis-lineup/tree/main/openspec/changes/archive/2026-04-19-mobile-ui-optimization" target="_blank" rel="noopener noreferrer" class="">Mobile UI optimization change (archived)</a></li>
<li class=""><a href="https://github.com/austinxyz/tennis-lineup/blob/main/docs/log/2026-04-19.md" target="_blank" rel="noopener noreferrer" class="">2026-04-19 session log</a></li>
</ul>
<p><strong>Tools</strong></p>
<ul>
<li class=""><a href="https://github.com/Fission-AI/OpenSpec" target="_blank" rel="noopener noreferrer" class="">OpenSpec</a> — lightweight SDD CLI</li>
<li class=""><a href="https://github.com/obra/superpowers" target="_blank" rel="noopener noreferrer" class="">Superpowers</a> — Claude Code skills for brainstorming, TDD, and code review</li>
</ul>
<p><strong>Related</strong></p>
<ul>
<li class=""><a class="" href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development">From Vibe Coding to Spec-Driven Development</a> — the prior SDD post this one builds on</li>
</ul>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="claude-code" term="claude-code"/>
        <category label="future-of-development" term="future-of-development"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[IaC and Kubernetes: The Two-Layer Control Plane for AI Native Infrastructure]]></title>
        <id>https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native</id>
        <link href="https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native"/>
        <updated>2026-04-09T10:00:00.000Z</updated>
        <summary type="html"><![CDATA[Why Terraform and Kubernetes aren't alternatives for AI Native infrastructure — and how they divide the work between provisioning-time and runtime control.]]></summary>
        <content type="html"><![CDATA[<p><em>This is Part 3 of a three-part series on AI Native Infrastructure. <a class="" href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra">Part 1</a> covers GPU cluster management. <a class="" href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform">Part 2</a> covers agent platform engineering. This post covers IaC and Kubernetes as the two-layer control plane that makes both work at scale.</em></p>
<hr>
<p>At hyperscale, managing GPU infrastructure without IaC is not a workflow — it's a liability. Companies like Meta operate GPU clusters at a scale where configuration drift, firmware inconsistency, or an undocumented network topology change can silently degrade a week-long training run. IaC is how you make infrastructure state auditable, reviewable, and reproducible.</p>
<p>But IaC alone isn't sufficient. It's worth asking: what exactly is Terraform managing? And what is it <em>not</em> managing?</p>
<p>The answer to that question reveals something important about how AI Native infrastructure actually needs to be governed — and why Kubernetes, despite not being designed for GPU workloads, remains the right runtime control plane for both the infrastructure layer and the application layer above it.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="two-types-of-desired-state">Two Types of Desired State<a href="https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native#two-types-of-desired-state" class="hash-link" aria-label="Direct link to Two Types of Desired State" title="Direct link to Two Types of Desired State" translate="no">​</a></h2>
<p>The confusion between IaC and Kubernetes usually comes from treating them as alternatives. They're not. They operate on fundamentally different types of desired state.</p>
<p><strong>Provisioning-time state</strong> is the infrastructure topology: which nodes exist, how the network is configured, what storage systems are attached, what the base OS image looks like. This state changes infrequently — when you add capacity, refresh hardware, or change the network topology. Terraform excels here. You describe what the infrastructure should look like, run <code>apply</code>, and Terraform reconciles. The process is human-triggered, runs to completion, and then stops.</p>
<p><strong>Runtime state</strong> is what's happening on top of that infrastructure: which training jobs are running on which nodes, how GPU resources are allocated across tenants, whether a job that lost a node has recovered from its last checkpoint, whether an agent's token spend this hour has exceeded its budget. This state changes constantly — second by second, driven by workload events. Terraform has no model for this. It runs once and exits.</p>
<p>Kubernetes was built for runtime state. Its control loop runs continuously, reconciling desired state against observed reality. When a node disappears, a controller notices immediately and reacts. When a new workload arrives, the scheduler places it. When a policy is violated, an admission webhook rejects it before it runs. This is an entirely different operational model from IaC — not better, not worse, just suited to different problems.</p>
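<p>The contrast can be reduced to a toy sketch. Nothing below is either tool's real API; <code>apply_once</code> mimics the IaC model (reconcile once, then exit), while <code>reconcile_step</code> mimics one tick of a controller loop that never stops correcting drift:</p>

```python
# Toy contrast between the two control models, not any real Terraform or
# Kubernetes API. All names here are illustrative.

def apply_once(desired: dict, observed: dict) -> dict:
    """IaC-style: a human-triggered pass that converges observed to desired."""
    observed.update(desired)
    return observed

def reconcile_step(desired: dict, observed: dict) -> list[str]:
    """Controller-style: one tick of a loop that runs continuously.
    Returns the corrective actions taken this tick."""
    actions = []
    for key, want in desired.items():
        if observed.get(key) != want:
            observed[key] = want          # e.g. reschedule a pod, restore a label
            actions.append(f"corrected {key} -> {want}")
    return actions

# Provisioning-time state: applied once, then the tool exits.
infra = apply_once({"node_pool": "gpu-a100", "ib_fabric": "leaf-spine"}, {})

# Runtime state: drift appears between ticks; the loop keeps correcting it.
runtime = {"job-42": "running"}
runtime["job-42"] = "node-lost"           # a node disappears mid-run
actions = reconcile_step({"job-42": "running"}, runtime)
```

<p>The difference is not the reconciliation logic — both converge on desired state — it's that the second one is still running a second later, when the next node disappears.</p>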
<p>Meta's 2 million lines of Terraform are building the factory. Kubernetes is running it.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-iac-owns-in-ai-native-infrastructure">What IaC Owns in AI Native Infrastructure<a href="https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native#what-iac-owns-in-ai-native-infrastructure" class="hash-link" aria-label="Direct link to What IaC Owns in AI Native Infrastructure" title="Direct link to What IaC Owns in AI Native Infrastructure" translate="no">​</a></h2>
<p>In AI Native infra, IaC takes on more responsibility than in Cloud Native, not less. The hardware is more complex, the provisioning is more expensive to get wrong, and the cost of configuration drift is higher when a single misconfigured GPU node can silently degrade a week-long training run.</p>
<p><strong>Cluster provisioning</strong>: GPU node pools, InfiniBand fabric configuration, NVMe storage attachment, base OS images with the right CUDA versions pinned. This is foundational and rarely changes — exactly the right job for Terraform.</p>
<p><strong>Firmware version management</strong>: One of the more subtle failure modes in large GPU clusters is firmware version mismatch between nodes on the same training job. NVIDIA releases monthly GPU firmware updates, and running mixed versions can cause mysterious job instability. IaC solves this by treating firmware version as part of the immutable node image — the desired firmware state is declared in code, and any drift triggers a replacement rather than an in-place patch.</p>
<p><strong>Network topology as code</strong>: InfiniBand fabric topology, ECMP routing policies, and RDMA configuration are critical to training throughput and notoriously easy to misconfigure. Declaring these in version-controlled IaC means topology changes go through code review, topology state is auditable, and accidental drift surfaces as a diff.</p>
<p><strong>Zero-touch provisioning</strong>: At 10,000+ GPU scale, manual node provisioning is not a workflow — it's a bottleneck. IaC automation with BMC integration (Redfish API) means GPU nodes can arrive and self-provision without human touch. The desired state is already declared; the new node reconciles to it on first boot.</p>
<p>The pattern is consistent: IaC owns anything that is slow-changing, expensive to get wrong, and needs an audit trail. It is the source of truth for what the infrastructure <em>is</em>.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-kubernetes-owns--and-what-it-needs-to-learn">What Kubernetes Owns — And What It Needs to Learn<a href="https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native#what-kubernetes-owns--and-what-it-needs-to-learn" class="hash-link" aria-label="Direct link to What Kubernetes Owns — And What It Needs to Learn" title="Direct link to What Kubernetes Owns — And What It Needs to Learn" translate="no">​</a></h2>
<p>Kubernetes owns what IaC cannot: continuous runtime governance. But AI workloads expose significant gaps in its native capabilities. The good news is that its extensibility model — CRD + Operator + Admission Webhook — is the right foundation for filling those gaps. The extensions needed at the infrastructure layer and the application layer follow the same pattern.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="at-the-infrastructure-layer">At the Infrastructure Layer<a href="https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native#at-the-infrastructure-layer" class="hash-link" aria-label="Direct link to At the Infrastructure Layer" title="Direct link to At the Infrastructure Layer" translate="no">​</a></h3>
<p><strong>Gang Scheduling</strong></p>
<p>Kubernetes' default scheduler schedules Pods independently. For distributed training, this creates a deadlock scenario: a partially placed job holds GPU resources while waiting for its remaining workers, which can't be placed because other partially placed jobs are holding theirs. No progress happens anywhere.</p>
<p>The fix requires treating a distributed training job as an atomic unit. <strong>Volcano</strong> and <strong>Kubeflow Training Operator</strong> introduce the PodGroup CRD — a new object that tells the scheduler "schedule all of these or none of them." This is a runtime concept that has no IaC equivalent: it's about what happens when jobs compete for resources in real time.</p>
<p><strong>Topology-Aware GPU Placement</strong></p>
<p>A training job scheduled across nodes in different InfiniBand fault domains, or across GPUs that don't share an NVLink switch, will train measurably slower — with no error, no alert, and no obvious cause. Native Kubernetes has no concept of GPU topology. Custom scheduler plugins, informed by node labels that IaC has set (describing NVLink domains, IB fabric membership, NUMA topology), allow the runtime scheduler to make placement decisions that preserve communication bandwidth.</p>
<p>This shows the layered dependency clearly: IaC declares the topology as node labels; Kubernetes uses those labels at scheduling time.</p>
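<p>A sketch of the scoring side, assuming hypothetical label keys — the real keys depend on what your provisioning code writes:</p>

```python
# Toy topology-aware node scoring. The label keys (topology/ib-domain,
# topology/nvlink) are invented for illustration; in practice IaC sets
# whatever labels your scheduler plugin reads.

def score_node(labels: dict[str, str], job_domain: str) -> int:
    """Higher score = better communication locality for this job."""
    score = 0
    if labels.get("topology/ib-domain") == job_domain:
        score += 10        # same InfiniBand fault domain: full fabric bandwidth
    if labels.get("topology/nvlink") == "true":
        score += 5         # GPUs behind a shared NVLink switch
    return score

nodes = {
    "node-a": {"topology/ib-domain": "fd-1", "topology/nvlink": "true"},
    "node-b": {"topology/ib-domain": "fd-2", "topology/nvlink": "true"},
}
best = max(nodes, key=lambda n: score_node(nodes[n], "fd-1"))
```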
<p><strong>GPU Health and Predictive Recovery</strong></p>
<p>Kubernetes liveness probes were designed for HTTP services. GPU health requires something different: continuously monitoring ECC error accumulation, Xid fault codes, power anomalies, and SM utilization patterns. <strong>DCGM Exporter</strong> runs as a DaemonSet and surfaces these metrics to Prometheus. A custom controller watches for degradation signals and can cordon a node before it causes a job failure — shifting from reactive restart to predictive isolation.</p>
<p>Checkpoint recovery also needs operator-level intelligence. A training job operator that understands checkpoint semantics can restart a failed job from its last saved state rather than from scratch, replacing the failed node with a warm standby. Kubernetes provides the mechanism; the operator provides the domain knowledge about what recovery means for this workload type.</p>
<p><strong>GPU Resource Granularity</strong></p>
<p>The Device Plugin model exposes GPUs as opaque integers. For inference workloads that don't need a full GPU, this wastes capacity. <strong>MIG</strong> (Multi-Instance GPU) extends this model: NVIDIA MIG Manager runs as a DaemonSet and registers partitioned GPU instances (e.g., <code>nvidia.com/mig-1g.10gb</code>) as first-class Kubernetes resources. The partition configuration is declared in IaC (as part of the node image); Kubernetes exposes and schedules those partitions at runtime.</p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="at-the-application-layer">At the Application Layer<a href="https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native#at-the-application-layer" class="hash-link" aria-label="Direct link to At the Application Layer" title="Direct link to At the Application Layer" translate="no">​</a></h3>
<p>The same extension pattern — CRD + Operator + Admission Webhook — applies when the workload is agents rather than training jobs.</p>
<p><strong>Agent as a First-Class Kubernetes Object</strong></p>
<p>Today, most agents run as ordinary Deployments. Kubernetes has no concept of what an agent is: it doesn't know the agent has a prompt version, a tool manifest, a context budget, or a model dependency. A prompt change that alters behavior is invisible to the platform — it looks identical to any other container image update.</p>
<p>An Agent CRD changes this. It allows the platform to:</p>
<ul>
<li class="">Track prompt version alongside container image version</li>
<li class="">Apply canary releases scored on output quality rather than error rate</li>
<li class="">Enforce that agents declare their tool manifest at deploy time, before any tool call can happen</li>
<li class="">Manage agent lifecycle (deprecation, rollback, multi-version coexistence) with the same operator patterns used for database migrations or stateful service upgrades</li>
</ul>
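<p>No standard Agent CRD exists today, so every field name below is an assumption about what the platform would need to track — a minimal sketch, written as a plain object plus an admission-style check:</p>

```python
# Hypothetical shape of an Agent custom resource. All field names, the API
# group, and the budget numbers are assumptions for illustration only.

agent_spec = {
    "apiVersion": "platform.example.com/v1alpha1",
    "kind": "Agent",
    "metadata": {"name": "support-triage", "namespace": "cx"},
    "spec": {
        "image": "registry.example.com/agents/triage:1.8.2",
        "promptVersion": "2026-04-12.3",        # versioned alongside the image
        "model": "claude-sonnet",
        "toolManifest": ["ticket.read", "ticket.comment"],   # declared up front
        "budgets": {"tokensPerHour": 2_000_000, "toolCallsPerHour": 500},
    },
}

def validate(resource: dict) -> list[str]:
    """Admission-style check: list the governance fields the agent omitted."""
    required = ("promptVersion", "toolManifest", "budgets")
    return [f for f in required if f not in resource["spec"]]

errors = validate(agent_spec)   # empty list: this agent may be admitted
```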
<p><strong>MCP Gateway as a Platform Component</strong></p>
<p>MCP (Model Context Protocol) defines how agents express and invoke tools. Without platform governance, it's an intent-plane component: it describes what the agent can do but cannot constrain the consequences. A runaway agent with an MCP tool can make unbounded external API calls, consume unbounded tokens, and generate unbounded cost — with no platform-level circuit breaker.</p>
<p>The right model is an MCP gateway running as a platform-managed component — similar in position to Istio in the Cloud Native world. Admission webhooks validate that an agent's declared tool manifest is within permitted scope before the agent is deployed. The gateway intercepts tool calls at runtime, applies rate limits, logs full call chains for audit, and enforces budget-based circuit breakers. IaC provisions the gateway infrastructure; Kubernetes operators manage its lifecycle and configuration.</p>
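<p>A toy version of the gateway's budget circuit breaker — the cost model and audit-record shape are invented; the point is that every tool call is metered and refused once the budget is exhausted:</p>

```python
# Toy budget circuit breaker for an MCP-style gateway. Budget numbers and
# the call-record shape are invented for illustration.

class ToolCallGateway:
    def __init__(self, budget_usd: float):
        self.budget = budget_usd
        self.spent = 0.0
        self.audit_log = []              # full call chain kept for audit

    def call(self, agent: str, tool: str, est_cost: float) -> bool:
        """Admit or refuse one tool call against the remaining budget."""
        allowed = self.spent + est_cost <= self.budget
        self.audit_log.append((agent, tool, est_cost, allowed))
        if allowed:
            self.spent += est_cost       # in reality, measured after the call
        return allowed

gw = ToolCallGateway(budget_usd=1.00)
first = gw.call("scraper-agent", "http.get", 0.60)    # within budget
second = gw.call("scraper-agent", "http.get", 0.60)   # would exceed: refused
```

<p>Note that refused calls still land in the audit log — a runaway agent leaves a trail even when the breaker holds.</p>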
<p><strong>Token Quota as a ResourceQuota Extension</strong></p>
<p>Kubernetes ResourceQuota enforces CPU and memory limits per namespace. AI Native apps need the same enforcement applied to token consumption, GPU compute time, and tool call volume. A custom admission webhook checks incoming agent requests against per-namespace token budgets at entry. A metering controller attributes consumption continuously — by namespace, by agent, by model, by use case — feeding the governance closed loop described in <a class="" href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform">Part 2</a>.</p>
<p>This is the direct AI Native equivalent of the ResourceQuota + LimitRange pattern used in Cloud Native multi-tenancy. The governance philosophy is identical; the resources being governed are different.</p>
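<p>The admission check itself is small. A sketch with invented per-namespace budgets:</p>

```python
# Sketch of the token-quota admission check. It mirrors ResourceQuota:
# reject at entry when a request would push the namespace past its hourly
# token budget. Namespaces and budget numbers are invented for illustration.

QUOTAS = {"team-search": 10_000_000, "team-ads": 5_000_000}   # tokens/hour
usage: dict[str, int] = {}

def admit(namespace: str, requested_tokens: int) -> bool:
    """True if the request fits the namespace budget; meters it if admitted."""
    used = usage.get(namespace, 0)
    if used + requested_tokens > QUOTAS.get(namespace, 0):
        return False                     # rejected before any tokens burn
    usage[namespace] = used + requested_tokens
    return True

a = admit("team-ads", 4_000_000)   # fits under the 5M budget
b = admit("team-ads", 2_000_000)   # would exceed it: rejected
```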
<p><strong>Ephemeral Validation Environments</strong></p>
<p>When coding agents generate changes at high velocity, the CI pipeline becomes the bottleneck — not because CI is slow in absolute terms, but because it was designed for human-paced development. Kubernetes-native ephemeral sandbox environments (using service mesh request routing to deploy only the changed service and route specific traffic through it) change the economics. When a sandbox takes seconds to provision and costs a fraction of a full staging environment, agents can validate their own changes against live infrastructure as part of their workflow — before opening a PR.</p>
<p>IaC defines the sandbox template. Kubernetes operators provision and tear down sandbox instances on demand, triggered by agent workflow steps rather than human PR submissions.</p>
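<p>The lifecycle is short enough to express as a single scope. A toy sketch in which "provisioning" is just bookkeeping and all names are invented — a real implementation would drive namespaces and mesh routes:</p>

```python
# Toy sandbox lifecycle: provision, validate, tear down as one short-lived
# scope. ACTIVE stands in for the cluster; names are invented for illustration.

from contextlib import contextmanager

ACTIVE: dict[str, str] = {}

@contextmanager
def sandbox(change_id: str, service: str):
    ns = f"sbx-{change_id}"
    ACTIVE[ns] = service            # deploy only the changed service
    try:
        yield ns                    # the agent validates its change here
    finally:
        del ACTIVE[ns]              # teardown is automatic, so sandboxes stay cheap

with sandbox("pr-1423", "lineup-api") as ns:
    in_use = ns in ACTIVE           # True while the validation runs
after = "sbx-pr-1423" in ACTIVE     # False the moment the scope exits
```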
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-converging-pattern">The Converging Pattern<a href="https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native#the-converging-pattern" class="hash-link" aria-label="Direct link to The Converging Pattern" title="Direct link to The Converging Pattern" translate="no">​</a></h2>
<p>Looking across both layers, the same architectural pattern appears every time:</p>
<table><thead><tr><th>Concern</th><th>IaC owns</th><th>Kubernetes extends</th></tr></thead><tbody><tr><td>GPU node configuration</td><td>Firmware version, OS image, CUDA version, node labels for topology</td><td>Device Plugin exposes resources; scheduler plugins use topology labels</td></tr><tr><td>Training job reliability</td><td>Checkpoint storage provisioning</td><td>Training Operator manages gang scheduling and checkpoint recovery</td></tr><tr><td>Multi-tenant isolation</td><td>Network segmentation, namespace provisioning</td><td>ResourceQuota, LimitRange, Admission Webhooks enforce at runtime</td></tr><tr><td>Agent governance</td><td>MCP gateway infrastructure provisioning</td><td>Admission Webhooks validate manifests; gateway controllers enforce at call time</td></tr><tr><td>Cost attribution</td><td>Metering infrastructure (Prometheus, storage)</td><td>Custom controllers attribute and enforce token/compute budgets per tenant</td></tr></tbody></table>
<p>IaC declares what exists. Kubernetes governs what runs. Neither alone is sufficient; together they form a two-layer control plane that handles both the slow-changing provisioning concerns and the fast-changing runtime concerns that AI workloads demand.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-this-means-in-practice">What This Means in Practice<a href="https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native#what-this-means-in-practice" class="hash-link" aria-label="Direct link to What This Means in Practice" title="Direct link to What This Means in Practice" translate="no">​</a></h2>
<p>When I think about building an AI Native platform from scratch today, the question isn't "Terraform or Kubernetes" — it's "what does each layer own, and where are the handoff points?"</p>
<p>IaC handles everything that should be version-controlled, reviewed, and applied infrequently: node topology, firmware pins, network configuration, cluster bootstrapping. The principle is the same as in Cloud Native — immutable infrastructure, no configuration drift, everything auditable.</p>
<p>Kubernetes handles everything that needs continuous reconciliation: workload placement, resource enforcement, health response, agent lifecycle. The extension model — CRD + Operator + Admission Webhook — is how you teach Kubernetes about new resource types without forking the core platform. This is what made Kubernetes the right foundation for Cloud Native, and it's what makes it the right foundation for AI Native.</p>
<p>The IaC layer isn't competing with Kubernetes. It's provisioning the substrate that Kubernetes governs. Understanding that separation is the starting point for building AI infrastructure that stays manageable as it scales.</p>
<hr>
<p><em>Back to <a class="" href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra">Part 1</a>: Cloud Native to AI Native infrastructure. Back to <a class="" href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform">Part 2</a>: the agent platform layer.</em></p>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="cloud-computing" term="cloud-computing"/>
        <category label="kubernetes" term="kubernetes"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[From Cloud Native Apps to AI Native Agent Platforms: The Belts Are the Problem]]></title>
        <id>https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform</id>
        <link href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform"/>
        <updated>2026-04-09T08:00:00.000Z</updated>
        <summary type="html"><![CDATA[Building a platform that treats Agents as first-class runtime objects — what Cloud Native app platform experience transfers, and where the real bottlenecks are.]]></summary>
        <content type="html"><![CDATA[<p><em>This is Part 2 of a three-part series on AI Native Infrastructure. <a class="" href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra">Part 1</a> covers the infrastructure layer — GPU clusters, schedulers, and hardware platform management. This post covers the application platform layer. <a class="" href="https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native">Part 3</a> covers IaC and Kubernetes as a two-layer control plane.</em></p>
<hr>
<p>In the late 1800s, when electric motors arrived in factories, most factory owners did the obvious thing: they removed the steam engine in the basement and dropped an electric motor in its place. Same shafts. Same belts. Same building layout. For thirty years, productivity barely improved.</p>
<p>The motor wasn't the problem. The belts were.</p>
<p>The real breakthrough came when a new generation asked a different question: if every machine can have its own motor, why do we need belts at all? Without belts, factories could reorganize around the flow of work rather than the flow of power. The result was transformative — not because the motor was better than the steam engine, but because removing the constraint unlocked an entirely different architecture.</p>
<p><a href="https://srishivananda.substack.com/p/are-we-plugging-the-future-into-the" target="_blank" rel="noopener noreferrer" class="">Sri Shivananda's recent piece</a> uses this analogy to describe what's happening with AI adoption today. We have the motor. But most organizations are keeping the belts — plugging AI into existing ticketing workflows, existing PR queues, existing stage-gated planning cycles. The AI works. The surrounding system neutralizes it.</p>
<p>I've been thinking about this through the lens of platform engineering. When I led application platform work during the Cloud Native era, the job was to abstract away infrastructure complexity and give application teams a stable, self-service surface. Now, with AI-native applications, I'm asking the same question in a new context: what does a platform need to provide for the next generation of workloads?</p>
<p>The answer, it turns out, requires dismantling some very comfortable belts.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-cloud-native-app-management-actually-was">What Cloud Native App Management Actually Was<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform#what-cloud-native-app-management-actually-was" class="hash-link" aria-label="Direct link to What Cloud Native App Management Actually Was" title="Direct link to What Cloud Native App Management Actually Was" translate="no">​</a></h2>
<p>It's worth being precise about what "application platform" meant in the Cloud Native era, because the contrast with AI Native is sharper when you're specific.</p>
<p>Cloud Native app management was fundamentally about <strong>lifecycle and traffic governance for deterministic services</strong>. The platform provided:</p>
<ul>
<li class=""><strong>Service lifecycle</strong>: deployment pipelines, rolling upgrades, health checks, rollback triggers</li>
<li class=""><strong>Traffic control</strong>: L7 routing, canary releases, circuit breaking, mTLS between services</li>
<li class=""><strong>Multi-tenancy</strong>: namespace isolation, resource quotas, RBAC policies</li>
<li class=""><strong>Observability</strong>: distributed tracing, error rate dashboards, SLO-based alerting</li>
<li class=""><strong>Self-service</strong>: application teams could deploy, scale, and configure without waiting on the platform team</li>
</ul>
<p>The implicit contract was simple: <strong>you give us a container, we give you reliable, observable execution</strong>. The workload was a black box. We didn't need to know what it did — only that it started, served traffic, and could be restarted safely.</p>
<p>That contract is the belt.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-ai-native-apps-actually-look-like">What AI Native Apps Actually Look Like<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform#what-ai-native-apps-actually-look-like" class="hash-link" aria-label="Direct link to What AI Native Apps Actually Look Like" title="Direct link to What AI Native Apps Actually Look Like" translate="no">​</a></h2>
<p>Cloud Native applications are deterministic. Given the same input, they produce the same output. You can write a test, run it in CI, and trust that green means correct.</p>
<p>AI Native applications — specifically anything involving agents, RAG pipelines, or LLM-backed workflows — are fundamentally different across several dimensions that matter for platform design:</p>
<table><thead><tr><th>Dimension</th><th>Cloud Native App</th><th>AI Native App</th></tr></thead><tbody><tr><td><strong>Execution unit</strong></td><td>Service responds to request/response; behavior is predictable</td><td>Agent executes action/decision/side-effect; behavior depends on model, context, tool results</td></tr><tr><td><strong>Failure mode</strong></td><td>Binary: request succeeded or failed</td><td>Spectrum: correct, degraded, hallucinated, infinite loop, tool call runaway</td></tr><tr><td><strong>Resource consumption</strong></td><td>Predictable per-request CPU/memory</td><td>Highly variable token consumption; agent branching and reflection loops create long-tail cost distribution</td></tr><tr><td><strong>State</strong></td><td>Stateless preferred; session is ephemeral</td><td>Long context windows, KV cache, memory stores are first-class infrastructure assets</td></tr><tr><td><strong>Governance object</strong></td><td>Service instance, request count, error rate</td><td>Agent behavior, token spend, tool call chains, output quality</td></tr></tbody></table>
<p>Each of these differences changes what the platform needs to govern. Together, they expose a fundamental problem with the existing platform contract.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-platform-contract-is-the-belt">The Platform Contract Is the Belt<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform#the-platform-contract-is-the-belt" class="hash-link" aria-label="Direct link to The Platform Contract Is the Belt" title="Direct link to The Platform Contract Is the Belt" translate="no">​</a></h2>
<p>Most organizations running agents today do the obvious thing: they package the agent as a container and deploy it as a Deployment. The entire Cloud Native platform governance machinery keeps running — health checks, resource quotas, image versioning, distributed tracing. Everything works. None of it tells you anything meaningful about what the agent is actually doing.</p>
<p>Health checks look for HTTP 200. The agent returns 200 whether it produced a correct answer, hallucinated, or entered an infinite tool-call loop. Resource quotas track CPU and memory. The agent's actual cost driver — token consumption — is invisible to the platform. Image versioning tracks container tags. A prompt change that fundamentally alters agent behavior looks identical to the platform as a no-op config update. Distributed tracing records request latency. It says nothing about which tools were called, in what order, at what token cost, or whether the output was any good.</p>
<p>This is the belt. Not a CI pipeline, not a staging environment. The assumption that <strong>"container + HTTP endpoint + CPU/memory quota" is sufficient to govern any workload</strong> — and therefore that Agents can be dropped into the existing platform without changing the platform.</p>
<p>The Deployment abstraction is the old shafts and belts. You dropped the new motor into the existing runtime, connected it to the existing health checks and resource quotas, and called it deployed. The agent runs. The governance doesn't work.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-the-platform-needs-to-provide">What the Platform Needs to Provide<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform#what-the-platform-needs-to-provide" class="hash-link" aria-label="Direct link to What the Platform Needs to Provide" title="Direct link to What the Platform Needs to Provide" translate="no">​</a></h2>
<p>If the Cloud Native app platform contract was "give us a container, we give you reliable execution," the AI Native equivalent is: <strong>"give us an agent, we give you governed, observable, cost-attributed execution."</strong></p>
<p><a href="https://jimmysong.io/book/ai-native-infra/" target="_blank" rel="noopener noreferrer" class="">Jimmy Song's framing</a> structures this as a three-plane architecture with a governance closed loop:</p>
<p><img decoding="async" loading="lazy" alt="AI Native Infrastructure — Three-Plane Architecture" src="https://austinxyz.github.io/blogs/assets/images/three-planes-detailed-en-809ceefa2447bfcf29fcee002c350bd7.svg" width="1783" height="1485" class="img_ev3q"></p>
<ul>
<li class=""><strong>Intent Plane</strong>: where agents and workflows express what they want — APIs, MCP tool protocols, task decomposition, and policy-as-intent (limits, budgets, compliance constraints baked in at entry)</li>
<li class=""><strong>Execution Plane</strong>: where work actually runs — training, inference serving, agent runtimes, and the state/context layer (KV cache, vector stores, context memory) that increasingly determines cost and throughput</li>
<li class=""><strong>Governance Plane</strong>: where consumption is constrained — compute quotas, topology-aware scheduling, isolation strategies, and the FinOps/SRE/SecOps controls that turn resource scarcity into manageable boundaries</li>
</ul>
<p>The closed loop is the key concept. Each agent request enters with intent (what it wants to do), passes through admission control (is this within budget and policy?), executes with full metering (token spend, tool calls, GPU time), and feeds back into enforcement (is this within the organization's operational boundaries?). Without the loop, you have agents that work — but whose resource consumption is ungovernable.</p>
<p>This maps closely to how we thought about platform engineering in the Cloud Native era. Then, the governance loop was about service correctness: admission webhooks rejected misconfigured deployments, resource quotas prevented runaway consumption, SLO-based alerting triggered before user impact. The objects were different (service instances instead of agent behaviors), but the architecture pattern — encode constraints at entry, meter during execution, enforce at threshold — is identical.</p>
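<p>That pattern, constrain at entry, meter during execution, enforce at threshold, fits in a few lines. A minimal sketch, with invented names and limits:</p>

```python
# Minimal sketch of the governance closed loop: admission against a declared
# budget, metering during execution, enforcement when a threshold is crossed.
# Class and field names are illustrative, not a real platform API.

class GovernanceLoop:
    def __init__(self, token_budget: int):
        self.token_budget = token_budget
        self.tokens_spent = 0  # metered during execution

    def admit(self, intent: dict) -> bool:
        # Admission control: is the declared intent within the remaining budget?
        return intent["estimated_tokens"] <= self.token_budget - self.tokens_spent

    def execute(self, intent: dict, actual_tokens: int) -> str:
        if not self.admit(intent):
            return "rejected"                    # enforced at entry
        self.tokens_spent += actual_tokens       # metered while running
        if self.tokens_spent > self.token_budget:
            return "halted"                      # enforced at threshold
        return "completed"

loop = GovernanceLoop(token_budget=10_000)
first = loop.execute({"estimated_tokens": 4_000}, actual_tokens=6_000)   # completed
second = loop.execute({"estimated_tokens": 5_000}, actual_tokens=5_000)  # rejected: only 4,000 left
```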
<p>Here's what the platform needs to build for AI Native apps:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="agent-lifecycle-management">Agent Lifecycle Management<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform#agent-lifecycle-management" class="hash-link" aria-label="Direct link to Agent Lifecycle Management" title="Direct link to Agent Lifecycle Management" translate="no">​</a></h3>
<p>In Cloud Native, we managed application lifecycles: build, deploy, scale, upgrade, decommission. Agents need the same lifecycle treatment, but the primitives are different.</p>
<p>An agent isn't just a container image with a health endpoint. It has a prompt definition, a tool manifest, a context window budget, a model version dependency, and potentially a persistent memory store. Versioning and rolling upgrades for agents need to account for prompt changes that can alter behavior even without a code change. Canary releases for agents require evaluating output quality, not just error rate.</p>
<p>This is the agent equivalent of the work we did building CI/CD pipelines and deployment automation for Cloud Native services — but the definition of "correct" is fundamentally harder to specify.</p>
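<p>To make the versioning problem concrete, here is a hedged sketch, with invented field names, of treating an agent version as content-addressed over everything that determines behavior, so a prompt edit is as visible to the rollout machinery as a new image tag:</p>

```python
import hashlib
import json

# Illustrative: version an agent over its full behavioral surface — prompt,
# tool manifest, model dependency — not just the container image. The spec
# fields here are invented for the example.

def agent_version(spec: dict) -> str:
    """Content-addressed version: any change to the spec yields a new version."""
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

base = {"image": "agent:1.0", "model": "model-x",
        "prompt": "You are a support agent.", "tools": ["search"]}
edited = dict(base, prompt="You are a terse support agent.")

v1 = agent_version(base)
v2 = agent_version(edited)
# Same image, different prompt -> different version, so a canary can target it.
```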
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="mcp-servers-and-the-gateway-layer">MCP Servers and the Gateway Layer<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform#mcp-servers-and-the-gateway-layer" class="hash-link" aria-label="Direct link to MCP Servers and the Gateway Layer" title="Direct link to MCP Servers and the Gateway Layer" translate="no">​</a></h3>
<p>Model Context Protocol (MCP) is the emerging standard for how agents express their capabilities and call external tools. From a platform engineering perspective, MCP servers are analogous to service mesh sidecars in the Cloud Native world: they sit at the boundary of an agent's execution and mediate its interactions with external systems.</p>
<p>The platform's job is to provide a <strong>governed MCP gateway</strong> — a layer that enforces which tools agents can call, rate-limits tool invocations, logs call chains for audit, and prevents a runaway agent from making unbounded external API calls. Without this layer, MCP is purely an intent-plane component: it expresses what the agent can do, but cannot constrain the consequences.</p>
<p>This is the same lesson we learned with service mesh. Istio gave us L7 routing and observability, but the complex configuration it required — and the gap between intent and actual traffic behavior — created new failure modes we had to instrument our way through. Getting the MCP gateway right requires learning from that experience: start with observability, enforce constraints incrementally, and make the failure modes legible before adding enforcement.</p>
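<p>What "governed" means here can be sketched. A hedged Python illustration follows; this is not the MCP wire protocol, just the control checks a gateway would wrap around it, and all names and limits are invented:</p>

```python
from collections import defaultdict, deque

# Illustrative gateway checks around agent tool calls: allowlist enforcement,
# a per-agent sliding-window rate limit, and an audit log of every verdict.

class ToolGateway:
    def __init__(self, allowlist: set, max_calls: int, window_s: float = 60.0):
        self.allowlist = allowlist
        self.max_calls = max_calls        # per agent, per window
        self.window_s = window_s
        self.recent = defaultdict(deque)  # agent_id -> timestamps of allowed calls
        self.audit_log = []               # (agent_id, tool, verdict)

    def invoke(self, agent_id: str, tool: str, now: float) -> str:
        if tool not in self.allowlist:
            return self._verdict(agent_id, tool, "denied:not-allowlisted")
        window = self.recent[agent_id]
        while window and now - window[0] > self.window_s:
            window.popleft()              # expire calls outside the window
        if len(window) >= self.max_calls:
            return self._verdict(agent_id, tool, "denied:rate-limited")
        window.append(now)
        return self._verdict(agent_id, tool, "allowed")

    def _verdict(self, agent_id, tool, verdict):
        self.audit_log.append((agent_id, tool, verdict))  # every call is auditable
        return verdict

gw = ToolGateway(allowlist={"search", "ticket-lookup"}, max_calls=2)
```

<p>A runaway agent hits the rate limit instead of the external API, and the audit log preserves the call chain either way.</p>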
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="token-economics-and-cost-attribution">Token Economics and Cost Attribution<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform#token-economics-and-cost-attribution" class="hash-link" aria-label="Direct link to Token Economics and Cost Attribution" title="Direct link to Token Economics and Cost Attribution" translate="no">​</a></h3>
<p>In Cloud Native, resource quotas (CPU, memory, storage) per namespace gave teams visibility and accountability for their infrastructure spend. AI Native apps require the same discipline applied to token consumption, GPU time, and tool call volume — the three primary cost drivers.</p>
<p>The critical capability is <strong>end-to-end attribution</strong>: for each agent request, the platform should be able to answer "which team, which project, which model, which use case consumed what?" Without this, cost governance is impossible. You can see the aggregate bill, but you can't act on it.</p>
<p>This is not a FinOps afterthought — it has to be built into the platform architecture from the start. The metering must happen at the execution layer, not reconstructed after the fact from logs.</p>
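<p>A hedged sketch of what that looks like, with invented record fields: each request emits a metering record at execution time, and attribution is just aggregation over those records.</p>

```python
from collections import defaultdict

# Illustrative metering records emitted at the execution layer, covering the
# three cost drivers named above, plus the aggregation that answers
# "which team consumed what?". Field names are invented for the example.

def meter(team, project, model, tokens, gpu_seconds, tool_calls):
    return {"team": team, "project": project, "model": model,
            "tokens": tokens, "gpu_seconds": gpu_seconds, "tool_calls": tool_calls}

def attribute(records, by="team"):
    """Aggregate token, GPU, and tool-call spend per attribution key."""
    totals = defaultdict(lambda: {"tokens": 0, "gpu_seconds": 0.0, "tool_calls": 0})
    for r in records:
        bucket = totals[r[by]]
        bucket["tokens"] += r["tokens"]
        bucket["gpu_seconds"] += r["gpu_seconds"]
        bucket["tool_calls"] += r["tool_calls"]
    return dict(totals)

records = [
    meter("payments", "fraud-agent", "model-x", 12_000, 3.5, 4),
    meter("payments", "refund-agent", "model-x", 5_000, 1.0, 2),
    meter("search", "query-agent", "model-y", 30_000, 8.0, 10),
]
per_team = attribute(records, by="team")       # or by="project", by="model"
```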
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="observability-from-request-tracing-to-behavior-tracing">Observability: From Request Tracing to Behavior Tracing<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform#observability-from-request-tracing-to-behavior-tracing" class="hash-link" aria-label="Direct link to Observability: From Request Tracing to Behavior Tracing" title="Direct link to Observability: From Request Tracing to Behavior Tracing" translate="no">​</a></h3>
<p>Cloud Native observability was about distributed request tracing — following a request as it flowed through service A, called service B, wrote to a database. The questions were: where did this request spend its time? Where did it fail?</p>
<p>AI Native observability needs a different set of questions: what did this agent decide to do? Which tools did it call and in what order? What was the token consumption at each step? Did the output meet quality criteria? Where did the context window inflate beyond expected bounds?</p>
<p>The signals change, but the methodology doesn't: define what "correct" looks like, instrument the execution path, build dashboards that surface deviation from expected behavior, and alert before user impact rather than after.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-transfers-from-cloud-native-app-platforms-what-doesnt">What Transfers from Cloud Native App Platforms, What Doesn't<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform#what-transfers-from-cloud-native-app-platforms-what-doesnt" class="hash-link" aria-label="Direct link to What Transfers from Cloud Native App Platforms, What Doesn't" title="Direct link to What Transfers from Cloud Native App Platforms, What Doesn't" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-transfers">What Transfers<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform#what-transfers" class="hash-link" aria-label="Direct link to What Transfers" title="Direct link to What Transfers" translate="no">​</a></h3>
<p>The platform engineering discipline — building self-service surfaces that abstract complexity, encoding governance as technical constraints rather than runbook discipline, driving adoption by making the right path the easiest path — transfers completely.</p>
<p>Specifically: the SLO-first approach to observability, the pattern of building automated lifecycle management before it's needed (not after scale breaks manual processes), and the hard-won lesson that users will bypass the platform unless the platform is genuinely better than the workaround.</p>
<p>We removed SSH access to force application teams onto the platform. The AI equivalent will be providing agent sandboxes and governed MCP gateways that are so much easier to use than building your own that teams adopt them voluntarily.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-doesnt-transfer">What Doesn't Transfer<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform#what-doesnt-transfer" class="hash-link" aria-label="Direct link to What Doesn't Transfer" title="Direct link to What Doesn't Transfer" translate="no">​</a></h3>
<p>The <strong>stateless-by-default</strong> assumption needs to be discarded. Context windows, KV caches, and agent memory are infrastructure-layer concerns, not application-layer afterthoughts. When a state asset becomes a determinant of system cost and throughput, it rises to the infrastructure layer. Platform teams need to manage context stores the same way they manage databases.</p>
<p>The <strong>deterministic validation model</strong> — green CI means correct — doesn't apply to agent outputs. The platform needs to provide evaluation frameworks, not just test runners: ways to score output quality, detect behavioral regressions across prompt versions, and give developers confidence that a change improved rather than degraded agent behavior.</p>
<p>The <strong>cost predictability assumption</strong> needs to be replaced with cost governance. Cloud Native resource costs were predictable within narrow bounds. Agent token consumption follows a long-tail distribution — a single agent with a reflection loop or a tool-call cascade can consume orders of magnitude more than the median request. The platform must be built to handle this, not just observe it.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-mindset-that-doesnt-change">The Mindset That Doesn't Change<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform#the-mindset-that-doesnt-change" class="hash-link" aria-label="Direct link to The Mindset That Doesn't Change" title="Direct link to The Mindset That Doesn't Change" translate="no">​</a></h2>
<p>In both eras, the Platform Engineer's job is the same at the core: take the complexity that would otherwise land on every application team, absorb it into the platform, and give teams a stable surface that lets them focus on what they're actually building.</p>
<p>Sri's factory analogy points at something real. The teams that will move fastest aren't the ones with the best AI models. They're the ones that redesigned their factory floor — who looked at the agent lifecycle, the cost attribution, the tool governance, and the observability architecture, and rebuilt those for the new workload rather than wrapping Agents around the old platform contract.</p>
<p>The belt is the assumption that an Agent is just another container, and that the platform doesn't need to know the difference.</p>
<p>That assumption is the thing worth replacing.</p>
<hr>
<p><em>Back to <a class="" href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra">Part 1</a>: the infrastructure layer. Continue to <a class="" href="https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native">Part 3</a>: how IaC and Kubernetes divide the work as a two-layer control plane.</em></p>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="cloud-computing" term="cloud-computing"/>
        <category label="kubernetes" term="kubernetes"/>
        <category label="leadership" term="leadership"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[From Cloud Native to AI Native Infrastructure: An Infra Platform Engineer's Perspective]]></title>
        <id>https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra</id>
        <link href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra"/>
        <updated>2026-04-08T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[What Cloud Native infrastructure experience transfers to AI Native GPU clusters and hardware platforms — and what needs to be rebuilt from scratch.]]></summary>
        <content type="html"><![CDATA[<p><em>This is Part 1 of a three-part series on AI Native Infrastructure. Part 1 covers the infrastructure layer — GPU clusters, schedulers, and hardware platform management. <a class="" href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform">Part 2</a> covers the application platform layer. <a class="" href="https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native">Part 3</a> covers IaC and Kubernetes as a two-layer control plane.</em></p>
<hr>
<p>I've spent the past several years running one of the larger Kubernetes deployments I know of — 200+ clusters, 5,000+ applications, 50,000 nodes, 2 million instances. When the AI wave hit and my team started getting serious about GPU infrastructure, I kept asking myself: how much of what we built actually transfers? Where do we have to start over?</p>
<p>This post is my attempt to answer that question honestly. It's not a technology comparison or a vendor evaluation. It's a practitioner's account of what Cloud Native taught me, where it fell short, and what AI Native infrastructure at the hardware and cluster management layer actually demands.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-makes-a-cloud-native-application">What Makes a Cloud Native Application<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#what-makes-a-cloud-native-application" class="hash-link" aria-label="Direct link to What Makes a Cloud Native Application" title="Direct link to What Makes a Cloud Native Application" translate="no">​</a></h2>
<p>Before talking about infrastructure, it helps to be precise about what we were actually supporting.</p>
<p>Cloud Native applications share a few defining characteristics. They are <strong>stateless by design</strong> — a Pod can be killed and restarted without any loss of correctness. They are <strong>horizontally elastic</strong> — add more replicas to scale, remove them to save cost. They have <strong>short startup times</strong> — containers come up in seconds, which makes rolling deployments, canary releases, and blue-green cutover practical. And their core SLA is expressed in terms of <strong>latency and throughput</strong> — P99, error rate, requests per second.</p>
<p>The underlying resource model is CPU and memory, both of which can be overcommitted. If a node has 16 vCPUs, you can schedule 32 vCPUs' worth of workloads on it, betting that they won't all peak simultaneously. This flexibility is what makes Cloud Native infrastructure so cost-efficient.</p>
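<p>The bet is easy to put in numbers. A toy sketch, with all figures invented:</p>

```python
# The overcommit bet in miniature: requested vCPUs exceed the node's capacity,
# but actual concurrent peak usage stays well under it. Figures are invented.

node_vcpus = 16
requested = [2, 2, 4, 4, 4, 4, 4, 4, 4]          # 32 vCPUs requested: 2x overcommit
peak_used = [0.5, 0.4, 1.0, 1.2, 0.8, 1.5, 1.1, 0.9, 1.0]  # observed peaks

overcommit_ratio = sum(requested) / node_vcpus    # 2.0
concurrent_peak = sum(peak_used)                  # ~8.4 vCPUs actually busy
headroom = node_vcpus - concurrent_peak           # the gap the bet pays out on
```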
<p>The implicit assumption running through all of this: <strong>behavior is predictable, execution paths are orchestrable, failures are isolatable.</strong></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-cloud-infra-was-built-to-support-this">How Cloud Infra Was Built to Support This<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#how-cloud-infra-was-built-to-support-this" class="hash-link" aria-label="Direct link to How Cloud Infra Was Built to Support This" title="Direct link to How Cloud Infra Was Built to Support This" translate="no">​</a></h2>
<p>Over several years building Cloud Native infrastructure at scale, we built three distinct layers.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-orchestration-layer">The Orchestration Layer<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#the-orchestration-layer" class="hash-link" aria-label="Direct link to The Orchestration Layer" title="Direct link to The Orchestration Layer" translate="no">​</a></h3>
<p>Kubernetes' declarative desired state model is a near-perfect match for Cloud Native's elasticity requirements. When we migrated 5,000 applications from a VM-based CI/CD stack to Kubernetes, the core design was a <strong>five-phase migration blueprint</strong> per application: preparation, workload creation, traffic switching, baking period, and decommission — with rollback capability at each gate. We decoupled LB configuration conversion from orchestration, letting the two teams move in parallel without constant coordination overhead.</p>
<p>Blue-green rollouts and canary releases are foundational Cloud Native patterns, but they create significant scheduling pressure — large batches of Pod creations and deletions happening simultaneously. Supporting this at 5,000-application scale required the scheduler to be reliable and fast under burst load, which pushed us hard on control plane performance.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-platform-layer">The Platform Layer<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#the-platform-layer" class="hash-link" aria-label="Direct link to The Platform Layer" title="Direct link to The Platform Layer" translate="no">​</a></h3>
<p>The single biggest mental model shift we made was moving from <strong>Ops thinking to Platform thinking</strong>. The difference: Ops thinking asks "how do I fix this problem?" Platform thinking asks "how do I make this class of problem impossible?"</p>
<p>At 200+ clusters, there is no other option. We couldn't afford to write automation for each new upgrade cycle or each new customer context. So we modeled OS and Kubernetes upgrades as declarative desired state using CRDs and custom controllers. We encoded correctness into admission webhooks — guardrails that cannot be bypassed — rather than relying on runbook discipline. We built self-service validation so teams could test their own upgrade compatibility without queuing on a central team.</p>
<p>The result: two engineers could handle the twice-yearly Kubernetes upgrades across the entire fleet with zero incidents. Monthly OS patching and continuous cluster operations became routine non-events.</p>
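<p>The shape of "upgrades as desired state" is a reconcile loop. A hedged miniature, in which the cluster names, versions, and step logic are invented rather than our actual controller:</p>

```python
# Illustrative reconcile step: compare declared state against observed state
# and emit the next actions. A real controller runs this repeatedly until the
# diff is empty — convergence, not a procedure.

def reconcile(desired: dict, observed: dict) -> list:
    actions = []
    for cluster, want in sorted(desired.items()):
        have = observed.get(cluster)
        if have is None:
            actions.append(f"bootstrap {cluster} at {want}")
        elif have != want:
            actions.append(f"upgrade {cluster}: {have} -> {want}")
        # have == want: nothing to do, and nothing to hand-run.
    return actions

desired = {"cluster-a": "1.31", "cluster-b": "1.31", "cluster-c": "1.31"}
observed = {"cluster-a": "1.30", "cluster-b": "1.31"}
actions = reconcile(desired, observed)
# -> ["upgrade cluster-a: 1.30 -> 1.31", "bootstrap cluster-c at 1.31"]
```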
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-operations-layer">The Operations Layer<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#the-operations-layer" class="hash-link" aria-label="Direct link to The Operations Layer" title="Direct link to The Operations Layer" translate="no">​</a></h3>
<p>We automated the full cluster lifecycle — build, tech refresh, decommission — cutting the end-to-end timeline from several weeks to a few days. More recently, we embedded AI directly into operations: an MCP server for Cloud API access, automated triage agents for customer support, and local skills for incident diagnosis. Monthly incidents dropped from 3–4 to 1–2, and triage time for common issues was cut in half.</p>
<p>Auto-scaling and auto-rebalancing are also genuinely easy in Cloud Native because workloads are stateless. If a node disappears, reschedule. If load increases, add Pods. The infrastructure can react without knowing anything about what the workload is doing internally.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="lessons-learned-the-hard-way">Lessons Learned the Hard Way<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#lessons-learned-the-hard-way" class="hash-link" aria-label="Direct link to Lessons Learned the Hard Way" title="Direct link to Lessons Learned the Hard Way" translate="no">​</a></h2>
<p>This is the part most architecture posts skip. Here are the real problems we ran into — and what actually fixed them.</p>
<p><strong>API Server contention under multi-tenancy.</strong> We had APF (API Priority and Fairness) settings that were too coarse. One large client was consuming the majority of API Server request quota during a batch operation, cascading into degraded performance across the entire cluster for everyone else. The fix wasn't simply raising limits — it was configuring per-client APF buckets with appropriate priority tiers, so a misbehaving client couldn't crowd out critical control plane operations.</p>
<p><strong>L7 migration complexity.</strong> Our existing L7 was built on commercial hardware load balancers with years of accumulated configuration — complex routing rules, cross-cluster dependencies, public and private traffic topologies. Migrating to Istio/Service Mesh sounded clean on a whiteboard and was genuinely painful in practice. Istio added latency, made failure diagnosis harder, and the migration itself required understanding hundreds of bespoke LB configurations. We built a custom <strong>Migration Controller</strong> that did automated spec matching, supported dry-run validation, and enabled phased traffic cutover. That tooling was the only reason we got through it without incidents.</p>
<p><strong>Upgrade costs compound.</strong> Two major Kubernetes version upgrades per year, monthly OS security patches, continuous hardware refresh — if any of these are still manual processes at 200+ cluster scale, you have a headcount problem disguised as a technical problem. We built a Patching Platform and an Upgrade Platform to treat these as first-class automated workflows. Before that, we were routinely underestimating upgrade effort by 3–4x.</p>
<p><strong>Observability without signal.</strong> Big clusters generate enormous metric and log volume. The early problem wasn't lack of data — it was lack of signal. We had thousands of alerts and no confidence in which ones mattered. The fix was to stop thinking about alerts as metric thresholds and start from <strong>SLO/SLI definitions</strong>: what does user-visible degradation actually look like, and what is the earliest measurable leading indicator? Once we had that anchor, we could build alert hierarchies that pointed to root cause rather than symptoms, and codify triage steps into runbooks that eventually fed automated recovery.</p>
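<p>One concrete form of that anchor is error-budget burn rate. A hedged sketch, where the SLO target and paging threshold are common illustrative conventions rather than our production values:</p>

```python
# Illustrative SLO-first alerting: instead of per-metric thresholds, compute
# how fast the error budget is burning and page only on fast burn.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being consumed.
    1.0 means the budget lasts exactly the SLO window."""
    budget = 1.0 - slo_target          # e.g. 99.9% SLO -> 0.1% error budget
    return error_ratio / budget

def should_page(error_ratio: float, slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    # 14.4x is a commonly used fast-burn threshold: at that rate, a 30-day
    # budget is exhausted in roughly two days.
    return burn_rate(error_ratio, slo_target) >= threshold

page = should_page(error_ratio=0.02)   # 2% failures vs 99.9% SLO: 20x burn -> page
```

<p>The threshold is anchored to user-visible budget exhaustion, not to any single metric, which is exactly the shift from symptom alerts to leading indicators.</p>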
<p><strong>Users bypass the platform.</strong> Application teams that don't understand the platform abstraction will find workarounds — SSH directly to nodes, manual kubectl edits, configuration changes outside the automation pipeline. These create drift, incidents, and debugging nightmares. The most effective fix wasn't more documentation or more training. It was <strong>removing SSH access entirely</strong>. When the workaround doesn't exist, teams learn to use the platform.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="ai-native-applications-whats-actually-different">AI Native Applications: What's Actually Different<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#ai-native-applications-whats-actually-different" class="hash-link" aria-label="Direct link to AI Native Applications: What's Actually Different" title="Direct link to AI Native Applications: What's Actually Different" translate="no">​</a></h2>
<p>At the infrastructure layer, the question isn't about agents making decisions — it's about what the underlying hardware platform needs to look like to support AI workloads at all. Here's where the differences hit hardest for infra engineers:</p>
<table><thead><tr><th>Dimension</th><th>Cloud Native</th><th>AI Native</th></tr></thead><tbody><tr><td><strong>Compute Resource</strong></td><td>CPU/Memory, elastically overcommittable, linear cost growth</td><td>GPU at core, integer non-overcommittable, $25K–40K per card, hard cost ceiling</td></tr><tr><td><strong>Workload Shape</strong></td><td>Stateless services + short batch, Pods can be killed at any time</td><td>Long-running training (days–weeks) + inference serving</td></tr><tr><td><strong>Failure Tolerance &amp; Reliability</strong></td><td>Fast restart recovery; infrastructure doesn't need to know what the workload was doing</td><td>Checkpoint-based stateful recovery; hours of compute lost per restart</td></tr><tr><td><strong>Communication Pattern</strong></td><td>East-west HTTP/gRPC, L7 traffic governance</td><td>GPU-to-GPU all-reduce via NCCL over InfiniBand; network bandwidth is a direct training throughput constraint</td></tr><tr><td><strong>Scaling Unit</strong></td><td>Single Pod, elastic horizontal scaling</td><td>Gang scheduling — all workers launch together or the job fails entirely</td></tr><tr><td><strong>SLA &amp; Observability</strong></td><td>P99 latency/throughput, error rate, service dependency tracing</td><td>GPU utilization (target 85–95%), MFU, SM utilization, ECC errors, job completion time</td></tr></tbody></table>
<p>The questions of agent governance, token economics, and what it means to treat an Agent as a first-class runtime object belong to the platform layer above this — that's the subject of <a class="" href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform">Part 2</a>. How IaC and Kubernetes divide this work between them is the subject of <a class="" href="https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native">Part 3</a>.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-infra-must-support-ai-native">How Infra Must Support AI Native<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#how-infra-must-support-ai-native" class="hash-link" aria-label="Direct link to How Infra Must Support AI Native" title="Direct link to How Infra Must Support AI Native" translate="no">​</a></h2>
<p>The differences in the table above aren't just conceptual. Each one translates into concrete infrastructure work.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="rethinking-the-scheduler">Rethinking the Scheduler<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#rethinking-the-scheduler" class="hash-link" aria-label="Direct link to Rethinking the Scheduler" title="Direct link to Rethinking the Scheduler" translate="no">​</a></h3>
<p>Kubernetes' default scheduler wasn't built for GPU workloads. The two biggest gaps are <strong>Gang Scheduling</strong> and <strong>topology-aware placement</strong>.</p>
<p>Gang scheduling means all worker Pods in a distributed training job must be scheduled simultaneously. If one worker can't be placed, the rest sit idle consuming GPU resources for no productive work. Native Kubernetes has no concept of this — you need Volcano or Kubeflow Training Operator to get it right.</p>
<p>Topology-aware placement matters because GPU-to-GPU bandwidth varies dramatically depending on where the GPUs are. Within a node, NVLink provides ~600 GB/s. Between nodes on the same InfiniBand fabric, you're looking at ~200 GB/s. Scheduling a tightly-coupled training job across nodes that happen to be on different fabric segments can cut effective bandwidth by 3x with no visible error — the job just trains slower.</p>
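<p>The "slowest link wins" effect can be made concrete with a toy model. The NVLink and same-fabric numbers come from the text above; the cross-segment figure is an illustrative assumption consistent with the roughly 3x cut:</p>

```python
# Approximate per-link bandwidths in GB/s; cross_fabric is an assumed value.
LINK_BW_GBPS = {"nvlink": 600, "same_fabric": 200, "cross_fabric": 66}

def job_bandwidth(links_used: list[str]) -> int:
    """A tightly coupled collective runs at the speed of its slowest link."""
    return min(LINK_BW_GBPS[link] for link in links_used)

print(job_bandwidth(["nvlink", "same_fabric"]))                  # 200
print(job_bandwidth(["nvlink", "same_fabric", "cross_fabric"]))  # 66
```

<p>Note that nothing errors in the second case. The job simply trains at a third of the throughput, which is why topology awareness belongs in the scheduler rather than in a post-hoc dashboard.</p>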
<p>GPU resource exposure adds another layer of complexity: the Device Plugin model exposes GPUs as opaque integer resources. MIG (Multi-Instance GPU) allows hardware-level partitioning of an A100 or H100 into up to 7 isolated instances — the right tool for multi-tenant inference where you need true isolation. Time-slicing is the software alternative: flexible but no hardware isolation boundary between tenants.</p>
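<p>As a hedged sketch, the choice between these sharing modes reduces to a small decision function. The 7-instance ceiling is the MIG hardware limit mentioned above; the policy itself is illustrative, not a recommendation:</p>

```python
def sharing_mode(multi_tenant: bool, needs_hw_isolation: bool) -> str:
    """Illustrative policy for exposing a GPU to workloads."""
    if not multi_tenant:
        return "exclusive"      # whole GPU via the Device Plugin integer model
    if needs_hw_isolation:
        return "mig"            # hardware partitions, up to 7 isolated instances
    return "time-slicing"       # flexible sharing, no isolation boundary

print(sharing_mode(True, True))  # mig
```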
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-network-is-no-longer-about-routing">The Network Is No Longer About Routing<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#the-network-is-no-longer-about-routing" class="hash-link" aria-label="Direct link to The Network Is No Longer About Routing" title="Direct link to The Network Is No Longer About Routing" translate="no">​</a></h3>
<p>In Cloud Native, network work was largely about L7 traffic governance — service mesh, traffic splitting, mTLS, circuit breaking. That skillset still matters for inference serving, but for training it's largely irrelevant.</p>
<p>Training clusters communicate via <strong>RDMA over InfiniBand</strong>. The abstraction layer disappears — you're tuning NCCL collectives, InfiniBand fabric topology, and all-reduce algorithms. At 10,000 GPU scale, this means Fat-tree Clos topology for consistent any-to-any latency, InfiniBand for training clusters, and high-speed Ethernet for inference. A misconfigured ECMP policy or a congested IB switch doesn't show up as a service error — it shows up as training throughput degradation that's hard to attribute.</p>
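<p>Why bandwidth is a direct throughput constraint: in the standard ring all-reduce, each rank moves roughly 2(N-1)/N of the gradient buffer per iteration, so the communication time scales inversely with per-link bandwidth. A back-of-envelope sketch with illustrative sizes:</p>

```python
def ring_allreduce_time(n_gpus: int, bytes_per_gpu: float,
                        bw_bytes_per_s: float) -> float:
    """Bandwidth term of ring all-reduce (ignores latency and overlap):
    each rank transfers 2(N-1)/N of the buffer, divided by link bandwidth."""
    return 2 * (n_gpus - 1) / n_gpus * bytes_per_gpu / bw_bytes_per_s

# 10 GB of gradients across 8 GPUs: the slowdown ratio is just the
# bandwidth ratio, so a 3x bandwidth cut is a ~3x longer all-reduce.
fast = ring_allreduce_time(8, 10e9, 200e9)
slow = ring_allreduce_time(8, 10e9, 66e9)
print(round(slow / fast, 1))  # 3.0
```

<p>This is the sense in which the network stops being plumbing and becomes part of the compute budget.</p>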
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="reliability-means-something-different">Reliability Means Something Different<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#reliability-means-something-different" class="hash-link" aria-label="Direct link to Reliability Means Something Different" title="Direct link to Reliability Means Something Different" translate="no">​</a></h3>
<p>In Cloud Native, reliability is about uptime. A Pod crash is self-healing in seconds. The infrastructure doesn't need to care about what the workload was doing.</p>
<p>In AI Native, a training job that loses its state has to restart from the last checkpoint — potentially hours of compute lost. At 10,000 GPU scale, 1% of GPUs cause approximately 50% of job failures (Google's finding). This means the reliability engineering focus shifts from cluster-level uptime to <strong>GPU-level predictive health management</strong>: tracking ECC error rates, Xid codes, and power anomalies per GPU, and replacing or isolating unhealthy hardware before it kills a running job.</p>
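<p>A minimal sketch of such a health gate. The thresholds are hypothetical, and the Xid codes listed are examples of fatal classes rather than a production policy:</p>

```python
# Assumed fatal Xid classes for illustration (e.g. double-bit ECC,
# row-remap failure, GPU fell off the bus); not an exhaustive list.
FATAL_XIDS = {48, 63, 64, 79}

def should_cordon(dbe_ecc_count: int, recent_xids: set[int],
                  power_anomaly: bool) -> bool:
    """Pull a GPU out of the schedulable pool before it kills a job."""
    return (dbe_ecc_count > 0
            or bool(recent_xids & FATAL_XIDS)
            or power_anomaly)

print(should_cordon(0, {79}, False))  # True: fatal Xid observed
print(should_cordon(0, set(), False)) # False: healthy
```

<p>The payoff is that hardware gets replaced on a maintenance schedule instead of mid-job, when the restart costs hours of checkpointed compute.</p>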
<p>Automatic checkpointing, heartbeat-based hang detection, and warm standby spare nodes are no longer nice-to-haves — they're the baseline.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="observability-at-a-different-scale">Observability at a Different Scale<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#observability-at-a-different-scale" class="hash-link" aria-label="Direct link to Observability at a Different Scale" title="Direct link to Observability at a Different Scale" translate="no">​</a></h3>
<p>Each GPU generates 10,000+ metrics per second. A 10,000-GPU cluster produces roughly 100 million metrics per second — 8.6 trillion data points per day. Traditional monitoring tools (Nagios, Zabbix, even naive Prometheus deployments) collapse under this load.</p>
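<p>The arithmetic behind those numbers:</p>

```python
metrics_per_gpu_per_s = 10_000
gpus = 10_000
per_second = metrics_per_gpu_per_s * gpus        # 100 million metrics/s
per_day = per_second * 86_400                    # seconds in a day
print(f"{per_second:.0e}/s, {per_day:.2e}/day")  # 1e+08/s, 8.64e+12/day
```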
<p>The right stack is DCGM for node-level GPU health data, Prometheus with hierarchical aggregation (rack → row → cluster), and ML-based anomaly detection for pattern recognition at scale. The metrics that matter are SM utilization (compute efficiency), memory bandwidth (data throughput), temperature (thermal throttling risk), and ECC error rate (hardware degradation signal) — not just "is the GPU allocated."</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-transfers-what-needs-to-be-rebuilt">What Transfers, What Needs to Be Rebuilt<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#what-transfers-what-needs-to-be-rebuilt" class="hash-link" aria-label="Direct link to What Transfers, What Needs to Be Rebuilt" title="Direct link to What Transfers, What Needs to Be Rebuilt" translate="no">​</a></h2>
<p>After doing this transition in practice, here's my honest assessment of what Cloud Native experience is actually worth in an AI Native context.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-transfers">What Transfers<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#what-transfers" class="hash-link" aria-label="Direct link to What Transfers" title="Direct link to What Transfers" translate="no">​</a></h3>
<table><thead><tr><th>Transferable Capability</th><th>Table Dimension</th><th>Why It Transfers</th></tr></thead><tbody><tr><td>Declarative IaC, immutable images</td><td>Compute Resource</td><td>GPU cluster management needs IaC more, not less. At hyperscale, every firmware version, network topology, and node image must be version-controlled and auditable. This is not optional.</td></tr><tr><td>Platform thinking: encode correctness as technical constraint</td><td>Governance &amp; Platform Role</td><td>The governance objects change (service → model behavior), but replacing runbook discipline with admission controllers and policy-as-code works the same way.</td></tr><tr><td>Cluster lifecycle automation (build/upgrade/decommission)</td><td>Workload Shape</td><td>GPU training cluster lifecycle management mirrors Kubernetes cluster lifecycle. The automation patterns transfer directly.</td></tr><tr><td>Observability methodology: define SLOs first, then alert signals, then auto-recovery</td><td>SLA &amp; Observability</td><td>The metrics change (SM utilization replaces P99), but the method — working backwards from user-visible SLOs to leading indicators — is identical.</td></tr></tbody></table>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-needs-to-be-rebuilt">What Needs to Be Rebuilt<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#what-needs-to-be-rebuilt" class="hash-link" aria-label="Direct link to What Needs to Be Rebuilt" title="Direct link to What Needs to Be Rebuilt" translate="no">​</a></h3>
<table><thead><tr><th>New Capability Required</th><th>Table Dimension</th><th>What Changes</th></tr></thead><tbody><tr><td>Resource model</td><td>Compute Resource</td><td>From "CPU overcommittable, elastic" to "GPU integer non-overcommittable, hard cost ceiling." Scheduling logic needs to be rewritten.</td></tr><tr><td>Failure philosophy</td><td>Failure Tolerance &amp; Reliability</td><td>From "fast restart, stateless recovery" to "checkpoint-based stateful recovery." Failure cost goes from seconds to hours.</td></tr><tr><td>Network perspective</td><td>Communication Pattern</td><td>From "L7 traffic governance" to "collective communication performance tuning (NCCL/RDMA)." The network is now a compute resource.</td></tr><tr><td>Scaling logic</td><td>Scaling Unit</td><td>From "single Pod elastic scaling" to "Gang scheduling, all-or-nothing." One unschedulable worker blocks the entire job.</td></tr></tbody></table>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-platform-engineers-stay-sharp">How Platform Engineers Stay Sharp<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#how-platform-engineers-stay-sharp" class="hash-link" aria-label="Direct link to How Platform Engineers Stay Sharp" title="Direct link to How Platform Engineers Stay Sharp" translate="no">​</a></h2>
<p>The most common mistake I see is treating AI infrastructure as a specialization that requires starting over. It doesn't — but it does require deliberate bridging.</p>
<p><strong>Operate the tools you build.</strong> The best way I've found to build intuition for AI infrastructure is to use AI-augmented systems in my own workflow. Building the MCP server for Cloud API access, deploying triage agents for incident response, iterating on those systems in production — that gives you a practitioner's understanding of what AI Native infrastructure actually needs to support that you can't get from reading papers.</p>
<p><strong>Approach AI from the operations side, not the algorithms side.</strong> Platform engineers don't need to understand backpropagation. We need to understand GPU utilization curves, job failure patterns, checkpoint strategies, and collective communication topology. The systems perspective is the right entry point, and it's the one we already have.</p>
<p><strong>Build a mapping mental model.</strong> For every Cloud Native concept you know deeply, find the AI Native equivalent. Kubernetes scheduler → Volcano/Gang scheduler. Service mesh → NCCL collective communication. Pod health check → GPU ECC error monitoring. Horizontal Pod Autoscaler → GPU utilization-based job preemption. The concepts aren't identical, but having a structured map makes the new domain learnable rather than overwhelming.</p>
<p><strong>Accept the new entry requirements.</strong> InfiniBand, NCCL, CUDA, DCGM — these aren't optional enrichment. They're the foundation of how AI infrastructure actually works at training scale. The good news: the learning curve is steep but not long. A few weeks of hands-on work with a real GPU cluster is worth more than months of reading.</p>
<p><strong>The platform abstraction discipline remains the core skill.</strong> Whatever the underlying hardware generation, the job of a Platform Engineer is the same: encapsulate the complexity, expose a stable interface, and make the right thing the easy thing. That skill doesn't expire.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="closing-the-medium-changes-the-thinking-doesnt">Closing: The Medium Changes, the Thinking Doesn't<a href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-infra#closing-the-medium-changes-the-thinking-doesnt" class="hash-link" aria-label="Direct link to Closing: The Medium Changes, the Thinking Doesn't" title="Direct link to Closing: The Medium Changes, the Thinking Doesn't" translate="no">​</a></h2>
<p>Cloud Native taught me three things that I carry into AI Native infrastructure work: design for declarative state, build platforms rather than tools, and encode correctness into the system rather than relying on human discipline.</p>
<p>At the hardware and cluster management layer, AI Native infrastructure is harder in specific ways — the hardware is more expensive, the failure modes are more exotic, and the communication patterns require a completely different mental model. It demands new technical knowledge and some genuine unlearning.</p>
<p>But the underlying engineering discipline is the same. The goal is still to take a rapidly evolving hardware landscape and build a platform layer that absorbs the complexity — so the teams building on top of it can focus on what they're actually trying to accomplish.</p>
<p>The medium changes. The thinking doesn't.</p>
<hr>
<p><em>Continue to <a class="" href="https://austinxyz.github.io/blogs/blog/cloud-native-to-ai-native-app-platform">Part 2</a>: how to build a platform that treats Agents as first-class runtime objects. Or jump to <a class="" href="https://austinxyz.github.io/blogs/blog/iac-k8s-ai-native">Part 3</a>: how IaC and Kubernetes divide the work between them as a two-layer control plane.</em></p>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="cloud-computing" term="cloud-computing"/>
        <category label="kubernetes" term="kubernetes"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The AI-Augmented Engineering Manager: How I Run a Team in 2026]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai</id>
        <link href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai"/>
        <updated>2026-03-23T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Everyone's talking about AI replacing individual contributors. Nobody's talking about what it does to engineering managers — and the difference between EMs who use AI to become 2x managers and those who just produce prettier documents.]]></summary>
        <content type="html"><![CDATA[<p>Everyone's talking about AI replacing individual contributors. Nobody's talking about what it does to engineering managers.</p>
<p>That asymmetry is interesting to me, because in my experience, EMs stand to gain <em>more</em> from AI than most ICs — or lose more ground if they ignore it. The difference isn't which tools you use. It's whether you use AI to reclaim the time that actually matters, or just use it to make your status updates look better.</p>
<p>Here's my honest accounting of what changed after a year of deliberately integrating AI into how I manage my team.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-actually-takes-up-an-ems-time">What Actually Takes Up an EM's Time<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#what-actually-takes-up-an-ems-time" class="hash-link" aria-label="Direct link to What Actually Takes Up an EM's Time" title="Direct link to What Actually Takes Up an EM's Time" translate="no">​</a></h2>
<p>Before getting into AI, it's worth being honest about the job. An Engineering Manager's responsibilities look clean on paper: delivery, people growth, alignment, technical direction, productivity. In practice, the calendar looks nothing like that.</p>
<p>The actual day is fragmented across writing status updates, preparing for stakeholder meetings, screening resumes, writing performance reviews, drafting job descriptions, responding to Slack threads, prepping 1-on-1s, and handling whatever operational fire is burning that week. These tasks aren't trivial — they matter — but they're also high-frequency, low-ceiling work. They consume time without compounding.</p>
<p>During hiring season, a single one-hour interview used to cost me 3–4 hours total: preparation, the interview itself, writing feedback. With 3–4 interviews a week, half my time was gone before I touched anything strategic. During performance review cycles, I worked late just to have enough evidence and documentation to do right by my team. On Fridays, writing the weekly highlight report felt like the most dreaded part of the week.</p>
<p>AI didn't change what an EM does. It changed how much time the operational layer takes — and by extension, what I can do with the time I recovered.</p>
<p>The rough number: <strong>the admin layer of my job now takes about a third of the time it used to.</strong></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-ai-actually-helps">Where AI Actually Helps<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#where-ai-actually-helps" class="hash-link" aria-label="Direct link to Where AI Actually Helps" title="Direct link to Where AI Actually Helps" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="staying-technical">Staying Technical<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#staying-technical" class="hash-link" aria-label="Direct link to Staying Technical" title="Direct link to Staying Technical" translate="no">​</a></h3>
<p>There's a well-documented drift that happens to EMs over time: you move further from the code, and you start relying entirely on your ICs to interpret the technical reality of your systems. The risk is subtle — you think you understand what's happening, but you're really understanding someone else's summary of what's happening, filtered through their perspective and priorities.</p>
<p>AI coding tools changed this for me. With Claude Code, I can read an unfamiliar repository and understand what it does in a fraction of the time. I can review a PR and form a real opinion on the approach, not just the description. I've used the Claude Code Skills my team built — code review, bug triage — and actually fixed a bug myself, which gave me a level of system detail that no architecture diagram would have surfaced.</p>
<p>The point isn't that EMs should be writing production code. It's that having your own ground truth in technical conversations makes you a better manager, a better decision-maker, and a better advocate for your team with stakeholders.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="planning-with-data-instead-of-intuition">Planning with Data Instead of Intuition<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#planning-with-data-instead-of-intuition" class="hash-link" aria-label="Direct link to Planning with Data Instead of Intuition" title="Direct link to Planning with Data Instead of Intuition" translate="no">​</a></h3>
<p>Quarterly planning used to take close to a month. Long meetings. Debate about what to cut and what to keep, often driven more by whoever argued most convincingly than by actual data. Decisions felt right in the room but hard to explain afterward.</p>
<p>Now the same process takes about two weeks. I feed in the team's current state, our yearly goals, stakeholder priorities, and capacity constraints, and AI helps structure the breakdown, surface dependencies, and generate the first draft of the plan. The decisions are still mine — and the judgment calls still require knowing the team, the codebase, and the organizational context. But the scaffolding is there faster, and the reasoning is documented.</p>
<p>What changed isn't the quality of the decisions, it's the quality of the conversation around them.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="upward-communication">Upward Communication<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#upward-communication" class="hash-link" aria-label="Direct link to Upward Communication" title="Direct link to Upward Communication" translate="no">​</a></h3>
<p>Fridays used to be my least favorite day because of the weekly highlight report. Not because writing is hard — because translating technical work into stakeholder-relevant language, under time pressure, at the end of a long week, is genuinely draining.</p>
<p>A status update that used to take 1–2 hours now takes about 15 minutes. I put in the raw inputs — what shipped, what's at risk, what decisions were made — and AI produces a structured executive summary framed around value and outcomes, not technical implementation. The emails I send to VPs now are tighter, more direct, less cluttered with technical detail that doesn't serve the reader.</p>
<p>One thing worth flagging: <strong>prettier documents are not the same as better outcomes.</strong> I'll come back to this.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="understanding-your-people-better">Understanding Your People Better<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#understanding-your-people-better" class="hash-link" aria-label="Direct link to Understanding Your People Better" title="Direct link to Understanding Your People Better" translate="no">​</a></h3>
<p>One of the hardest parts of performance reviews is that the evidence is scattered everywhere — Jira, GitHub, Slack, 1-on-1 notes, email threads. It's easy to inadvertently favor the work you remember most vividly, which is usually the work that happened most recently or created the most noise.</p>
<p>I now keep each engineer's 1-on-1 notes, their yearly goals, and the company's career development framework loaded in NotebookLM. At any point I can query across the entire history: what growth areas did we identify six months ago? What commitments did they make? Where have I seen patterns? When promotion time comes, the evidence doesn't require a frantic archaeology project — it's already organized.</p>
<p>The thing I heard most from engineers after implementing this: <em>"You actually remembered that."</em> They mentioned something in a 1-on-1 three months earlier, and I brought it up again with context. That moment of recognition — the feeling that their manager actually pays attention — is not something AI produces. But AI creates the conditions for it to happen consistently, not just when you happen to remember.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="making-better-informed-decisions">Making Better-Informed Decisions<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#making-better-informed-decisions" class="hash-link" aria-label="Direct link to Making Better-Informed Decisions" title="Direct link to Making Better-Informed Decisions" translate="no">​</a></h3>
<p>Build vs. buy decisions, technical stack choices, team structure changes — these used to involve a lot of time gathering information before I could even form an opinion. Now AI compresses that research phase significantly. I can have a structured pros/cons analysis in minutes, not days.</p>
<p>To be clear: AI doesn't make the decisions. It can't weigh the organizational politics, the team morale dynamics, or the unstated priorities that any experienced EM knows matter more than the logical case. But arriving at a decision with better information, faster, is genuinely useful.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="six-workflows-that-changed-my-job">Six Workflows That Changed My Job<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#six-workflows-that-changed-my-job" class="hash-link" aria-label="Direct link to Six Workflows That Changed My Job" title="Direct link to Six Workflows That Changed My Job" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-upward-management--chatgpt--gemini">1. Upward Management — ChatGPT / Gemini<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#1-upward-management--chatgpt--gemini" class="hash-link" aria-label="Direct link to 1. Upward Management — ChatGPT / Gemini" title="Direct link to 1. Upward Management — ChatGPT / Gemini" translate="no">​</a></h3>
<p><strong>What I do:</strong> Weekly highlight reports, escalation emails, quarterly business review materials.</p>
<p><strong>How it works:</strong> I feed in raw notes — project status, key decisions, risks — and AI drafts a structured executive summary. For important emails, I describe the situation and stakeholder context; AI produces a first draft in the right register.</p>
<p><strong>What changed:</strong> The emails that reach my VP are now focused on impact and outcomes, not implementation details. They're tighter, clearer, and require far fewer revision cycles. The preparation time for a quarterly business review dropped from a full day to a couple of hours.</p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-internal-customer-management--agent--mcp--skills">2. Internal Customer Management — Agent / MCP / Skills<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#2-internal-customer-management--agent--mcp--skills" class="hash-link" aria-label="Direct link to 2. Internal Customer Management — Agent / MCP / Skills" title="Direct link to 2. Internal Customer Management — Agent / MCP / Skills" translate="no">​</a></h3>
<p><strong>What I do:</strong> My team serves internal engineering teams. I built agents and custom Skills to help those teams find information, run diagnostics, and handle common requests without requiring my team's direct involvement.</p>
<p><strong>What changed:</strong> Team toil dropped by about 50%. More importantly, incidents dropped from 3–4 per month to roughly 1 — some months, zero. That's time I'm not spending on post-mortems, root cause analyses, and stakeholder communication during an outage. It's also morale: nobody enjoys the reactive work that follows an incident.</p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-people-management-and-performance-reviews--glean--notebooklm">3. People Management and Performance Reviews — Glean / NotebookLM<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#3-people-management-and-performance-reviews--glean--notebooklm" class="hash-link" aria-label="Direct link to 3. People Management and Performance Reviews — Glean / NotebookLM" title="Direct link to 3. People Management and Performance Reviews — Glean / NotebookLM" translate="no">​</a></h3>
<p><strong>What I do:</strong> 1-on-1 notes, career development plans, promotion documents, and (when necessary) performance improvement plans.</p>
<p><strong>How it works:</strong> I use Glean to search across an engineer's Jira activity, Slack contributions, documents, and email history. NotebookLM holds the 1-on-1 notes, their yearly goals, and the career framework. When writing a promotion document, the evidence comes from every signal — not just what I happened to notice.</p>
<p><strong>What changed:</strong> Performance materials are more complete and more credible. More importantly, engineers feel seen. The manager who remembers a specific thing you said three months ago — that's not a memory trick, it's a system. But to the engineer, it feels like genuine attention.</p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-hiring--claude-code--custom-skills-pipeline">4. Hiring — Claude Code + Custom Skills Pipeline<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#4-hiring--claude-code--custom-skills-pipeline" class="hash-link" aria-label="Direct link to 4. Hiring — Claude Code + Custom Skills Pipeline" title="Direct link to 4. Hiring — Claude Code + Custom Skills Pipeline" translate="no">​</a></h3>
<p><strong>What I do:</strong> End-to-end recruiting workflow, from job description to offer.</p>
<p><strong>The pipeline:</strong></p>
<ul>
<li class=""><strong>JD Creation</strong> — Input role requirements, output calibrated job description</li>
<li class=""><strong>Resume-JD Matcher</strong> — Batch screening with match scores and key gaps</li>
<li class=""><strong>Interview Question Generator</strong> — Personalized questions based on the specific candidate's background and the role requirements</li>
<li class=""><strong>Interview Feedback Collector</strong> — Structured template that prompts for the right evidence; auto-aggregates across interviewers</li>
<li class=""><strong>Debrief Tool</strong> — Surfaces score divergence across interviewers, prompts discussion of disagreements</li>
<li class=""><strong>Offer Package Tool</strong> — Generates offer range recommendations based on leveling and market data</li>
</ul>
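<p>To make the Resume-JD Matcher step concrete, here is a toy keyword-overlap sketch. A real matcher would use embeddings and structured resume parsing, and every name and skill below is invented:</p>

```python
def match_score(jd_skills: set[str],
                resume_skills: set[str]) -> tuple[float, set[str]]:
    """Return a 0-1 match score plus the key gaps to probe in interviews."""
    if not jd_skills:
        return 0.0, set()
    hits = jd_skills & resume_skills
    return len(hits) / len(jd_skills), jd_skills - resume_skills

score, gaps = match_score({"kubernetes", "go", "istio", "terraform"},
                          {"kubernetes", "go", "python"})
print(score, sorted(gaps))  # 0.5 ['istio', 'terraform']
```

<p>The gaps output is the useful part: it feeds directly into the Interview Question Generator step, so screening and interviewing stay connected.</p>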
<p><strong>What changed:</strong> Each one-hour interview used to cost me 3–4 hours total. Now it's about 90 minutes. Candidates consistently say the questions feel tailored, not generic. The structured feedback reduces the chance of a strong candidate slipping through on a bad day.</p>
<p>Most visibly: our hiring cycle went from 3+ months down to 4–6 weeks.</p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-project-management--claude-code">5. Project Management — Claude Code<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#5-project-management--claude-code" class="hash-link" aria-label="Direct link to 5. Project Management — Claude Code" title="Direct link to 5. Project Management — Claude Code" translate="no">​</a></h3>
<p><strong>What I do:</strong> Quarterly plans, OKR tracking, weekly team health reports.</p>
<p><strong>How it works:</strong> The planning system integrates yearly goals, breaks them down into sprint-level work, and produces a living quarterly plan. An automated weekly report pulls from Jira and GitHub to flag OKRs at risk before they become problems.</p>
<p><strong>What changed:</strong> Our team OKR completion rate went from around 50% to 80%. Planning meetings went from 2–3 hours of debate down to a 30-minute alignment check. The improvement isn't just efficiency — it's that people know what we're doing and why, which changes how they make day-to-day decisions.</p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="6-remote-team-management--claude-code">6. Remote Team Management — Claude Code<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#6-remote-team-management--claude-code" class="hash-link" aria-label="Direct link to 6. Remote Team Management — Claude Code" title="Direct link to 6. Remote Team Management — Claude Code" translate="no">​</a></h3>
<p><strong>What I do:</strong> Team knowledge base, onboarding system.</p>
<p><strong>How it works:</strong> Every core system has structured documentation: architecture overview, key modules, environment setup, development workflow, incident response runbook. Onboarding is broken down by day, not week.</p>
<p><strong>What changed:</strong> A new team member in Dublin previously took about three months to reach full productivity. That's now around six weeks. The improvement comes from reducing the time spent asking people for context — the knowledge is findable without requiring someone to be available.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-side-effects-being-honest">The Side Effects (Being Honest)<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#the-side-effects-being-honest" class="hash-link" aria-label="Direct link to The Side Effects (Being Honest)" title="Direct link to The Side Effects (Being Honest)" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="beautiful-documents-dont-mean-better-execution">Beautiful Documents Don't Mean Better Execution<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#beautiful-documents-dont-mean-better-execution" class="hash-link" aria-label="Direct link to Beautiful Documents Don't Mean Better Execution" title="Direct link to Beautiful Documents Don't Mean Better Execution" translate="no">​</a></h3>
<p>This is the one I had to learn the hard way. A polished Q2 plan, beautifully structured with clear OKRs and well-defined milestones, is still just a document. If the EM spends time refining the AI's output instead of validating whether the team actually understands and agrees with the plan, you end up with great artifacts and poor execution.</p>
<p>AI makes it easy to produce documentation that looks authoritative. That can create a false sense of clarity. The calibration still has to come from direct observation and honest conversation — no tool replaces that.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="tool-fatigue-is-real">Tool Fatigue Is Real<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#tool-fatigue-is-real" class="hash-link" aria-label="Direct link to Tool Fatigue Is Real" title="Direct link to Tool Fatigue Is Real" translate="no">​</a></h3>
<p>In the past year alone, my workflow has gone through: ChatGPT → Glean + NotebookLM → Cline → Claude Code. First I was learning prompt engineering. Then agent frameworks. Then MCP. Now Skills and spec-driven development. Each shift required real time investment to do properly.</p>
<p>My suggestion: go deep on 2–3 core workflows rather than chasing every new tool. The compounding value comes from depth, not breadth.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="over-reliance-degrades-judgment">Over-Reliance Degrades Judgment<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#over-reliance-degrades-judgment" class="hash-link" aria-label="Direct link to Over-Reliance Degrades Judgment" title="Direct link to Over-Reliance Degrades Judgment" translate="no">​</a></h3>
<p>If you generate every document through AI, your sense of what a <em>good</em> document looks like atrophies. If you consult AI before forming your own opinion on every decision, the muscle for making independent judgments weakens.</p>
<p>I keep a deliberate practice: for important decisions, I think through my own view before I ask AI to help structure or pressure-test it. The AI is most useful as a thinking partner, not as a replacement for thinking.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-ai-cannot-do-for-you">What AI Cannot Do for You<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#what-ai-cannot-do-for-you" class="hash-link" aria-label="Direct link to What AI Cannot Do for You" title="Direct link to What AI Cannot Do for You" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="build-trust">Build Trust<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#build-trust" class="hash-link" aria-label="Direct link to Build Trust" title="Direct link to Build Trust" translate="no">​</a></h3>
<p>Trust is accumulated through consistency, follow-through, and genuine care — over time, in individual conversations, in how you handle the hard moments. AI can help you remember things. It cannot make you actually care about your people. Engineers know the difference.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="make-organizational-judgments">Make Organizational Judgments<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#make-organizational-judgments" class="hash-link" aria-label="Direct link to Make Organizational Judgments" title="Direct link to Make Organizational Judgments" translate="no">​</a></h3>
<p>Stakeholder politics, the unspoken priorities behind a request, the right way to navigate a cross-team conflict — these require context and relationship that no AI has access to. AI gives you information. Judgment comes from knowing the organization, the people, and the history.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="handle-a-crisis">Handle a Crisis<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#handle-a-crisis" class="hash-link" aria-label="Direct link to Handle a Crisis" title="Direct link to Handle a Crisis" translate="no">​</a></h3>
<p>When an incident fires at 2am and you need to make fast calls with incomplete information — who owns what, what to escalate, how to communicate outward while the team is working — you need your own clarity and calm. Pausing to prompt an AI is not a useful move in the middle of active incident response. The systems you've built help prevent incidents. But handling them when they happen is still on you.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="generate-curiosity">Generate Curiosity<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#generate-curiosity" class="hash-link" aria-label="Direct link to Generate Curiosity" title="Direct link to Generate Curiosity" translate="no">​</a></h3>
<p>AI amplifies the curiosity you already have. It doesn't create it. The managers who are growing fastest in this environment are the ones using AI to explore questions they were already interested in — going deeper, moving faster. If you're not naturally curious about your craft, AI won't fix that.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-framework-use-ai-to-manage-information-use-your-time-to-manage-people">The Framework: Use AI to Manage Information, Use Your Time to Manage People<a href="https://austinxyz.github.io/blogs/blog/2026/03/23/engineering-manager-in-ai#the-framework-use-ai-to-manage-information-use-your-time-to-manage-people" class="hash-link" aria-label="Direct link to The Framework: Use AI to Manage Information, Use Your Time to Manage People" title="Direct link to The Framework: Use AI to Manage Information, Use Your Time to Manage People" translate="no">​</a></h2>
<table><thead><tr><th>Work type</th><th>AI can do</th><th>You have to do</th></tr></thead><tbody><tr><td>Documentation &amp; reporting</td><td>Generate, structure, polish</td><td>Decide what's worth writing</td></tr><tr><td>Performance management</td><td>Gather evidence, draft materials</td><td>The actual 1-on-1 conversations</td></tr><tr><td>Hiring</td><td>Screen, generate questions, aggregate feedback</td><td>Final judgment calls, offer conversations</td></tr><tr><td>Project planning</td><td>Break down, track, flag risks</td><td>Priority decisions and tradeoffs</td></tr><tr><td>Technical direction</td><td>Research, compare options, summarize</td><td>Choosing direction and owning the outcome</td></tr></tbody></table>
<p>The EMs who benefit most from AI are the ones who use it to compress the operational layer — so they can invest more deeply in the irreplaceable parts: genuine relationships, organizational judgment, technical credibility, and the hard conversations that no tool can have for you.</p>
<p>The EMs who benefit least are the ones who use AI to polish the surface while the underlying work stays the same.</p>
<p>The gap between those two groups is growing.</p>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="leadership" term="leadership"/>
        <category label="management" term="management"/>
        <category label="career-development" term="career-development"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[How Ops Engineers Can Stay Relevant in the Age of AI: Becoming a Platform Engineer]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2026/03/16/platform-engineer-vs-ops-engineer</id>
        <link href="https://austinxyz.github.io/blogs/blog/2026/03/16/platform-engineer-vs-ops-engineer"/>
        <updated>2026-03-16T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[What actually separates a Cloud Platform Engineer from a DevOps/Ops Engineer — and why the gap matters more than ever in the age of AI.]]></summary>
<content type="html"><![CDATA[<p>Two engineers. Two hundred clusters. Fifty thousand nodes. Two million instances. Every year, two major Kubernetes version upgrades across the entire fleet — with zero incidents.</p>
<p>That's not a team of twenty. That's two people. And the reason it was possible isn't the tooling. It's the way we thought about the problem.</p>
<p>After years building Cloud Platform at a large e-commerce company and interviewing dozens of engineers for Platform roles, I've noticed a pattern. Most candidates who call themselves "DevOps" or "Cloud Operations" engineers are skilled, hardworking, and technically capable. But there's a fundamental difference in how they think — and that difference determines whether you're managing problems forever, or systematically eliminating them.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-interview-pattern">The Interview Pattern<a href="https://austinxyz.github.io/blogs/blog/2026/03/16/platform-engineer-vs-ops-engineer#the-interview-pattern" class="hash-link" aria-label="Direct link to The Interview Pattern" title="Direct link to The Interview Pattern" translate="no">​</a></h2>
<p>When I interview candidates for Platform Engineering roles, I ask questions that are deliberately open-ended. One of my favorites: <em>"How would you design a system to upgrade the OS across an entire Kubernetes fleet?"</em></p>
<p>The Ops answer sounds like this: "I'd write an automation script that SSHs into each node, drains it, upgrades the OS, and brings it back. We could parallelize it with some batching logic."</p>
<p>That's a competent answer. It would work. It would also need to be rewritten for the next fleet, extended for edge cases, debugged when something goes wrong at 2am, and maintained forever by whoever wrote it.</p>
<p>The Platform answer sounds different: "I'd model node upgrades as a desired state in Kubernetes — write a CRD that declares the target OS version and upgrade policy, and a Controller that reconciles the actual state toward that target. The platform enforces the policy; engineers just declare what they want."</p>
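<p>As a sketch of what that declaration might look like (the <code>NodeUpgrade</code> kind, field names, and policy values below are hypothetical illustrations, not a real API):</p>

```yaml
# Hypothetical custom resource: declares the desired OS state for a node pool.
# A controller watches objects of this kind and reconciles each matching node
# toward spec.targetOSVersion, honoring the rollout policy below.
apiVersion: platform.example.com/v1alpha1
kind: NodeUpgrade
metadata:
  name: fleet-os-rollout
spec:
  targetOSVersion: "5.15.0-1052"
  nodeSelector:
    matchLabels:
      pool: general
  policy:
    maxUnavailable: 5      # drain and upgrade at most 5 nodes at a time
    pauseOnFailures: 2     # halt the rollout after 2 failed node upgrades
status:
  upgradedNodes: 312
  failedNodes: 0
  phase: Progressing
```

<p>The exact schema isn't the point. The point is that the upgrade becomes declarative state the platform continuously converges on, rather than a script someone runs and babysits.</p>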
<p>This is the gap I keep seeing. Many DevOps candidates can configure CI/CD pipelines, write Terraform workflows, and automate deployments on cloud platforms. They're genuinely good at it. But when they move to the next project or the next client, they do it again — from scratch. The work doesn't compound.</p>
<p>Another pattern: K8s fluency that stays at the surface. Most candidates can create Services, Deployments, and configure Pod specs. Far fewer have written a CRD or built a Controller. Using Kubernetes and <em>extending</em> Kubernetes are very different skills — and that distinction is a reliable signal of platform thinking.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-core-difference-solving-the-problem-vs-eliminating-the-problem-class">The Core Difference: Solving the Problem vs. Eliminating the Problem Class<a href="https://austinxyz.github.io/blogs/blog/2026/03/16/platform-engineer-vs-ops-engineer#the-core-difference-solving-the-problem-vs-eliminating-the-problem-class" class="hash-link" aria-label="Direct link to The Core Difference: Solving the Problem vs. Eliminating the Problem Class" title="Direct link to The Core Difference: Solving the Problem vs. Eliminating the Problem Class" translate="no">​</a></h2>
<p>The deepest difference between Ops and Platform thinking isn't the tools. It's the question you ask when something breaks.</p>
<p>An Ops engineer asks: <em>"How do I fix this?"</em></p>
<p>A Platform engineer asks: <em>"Why does this keep happening, and how do I make it impossible for this class of problem to recur?"</em></p>
<table><thead><tr><th></th><th>Ops Thinking</th><th>Platform Thinking</th></tr></thead><tbody><tr><td><strong>When an alert fires</strong></td><td>Respond and resolve</td><td>Why does this alert exist? Can it be designed away?</td></tr><tr><td><strong>When teams ask "how do I deploy?"</strong></td><td>Write a runbook</td><td>Build a self-service deployment platform</td></tr><tr><td><strong>When the same bug hits three teams</strong></td><td>Fix each instance</td><td>Abstract the fix into a platform guardrail</td></tr><tr><td><strong>Customer</strong></td><td>The infrastructure itself</td><td>Internal engineering teams</td></tr><tr><td><strong>Goal</strong></td><td>Keep systems running</td><td>Enable teams to move faster and safer</td></tr></tbody></table>
<p>The customer question matters a lot. Ops engineers are often measured by uptime and ticket resolution. Platform engineers should be measured by the productivity and autonomy of the teams they serve. Your platform is a product. Your users are engineers. Their developer experience (DX) is your NPS score.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-real-example-what-platform-thinking-looks-like-at-scale">A Real Example: What Platform Thinking Looks Like at Scale<a href="https://austinxyz.github.io/blogs/blog/2026/03/16/platform-engineer-vs-ops-engineer#a-real-example-what-platform-thinking-looks-like-at-scale" class="hash-link" aria-label="Direct link to A Real Example: What Platform Thinking Looks Like at Scale" title="Direct link to A Real Example: What Platform Thinking Looks Like at Scale" translate="no">​</a></h2>
<p>At my previous company, our Cloud Platform team managed 200+ Kubernetes clusters serving 5,000+ applications, running on 50,000 nodes and 2 million instances. Every year, we needed to upgrade all clusters to the latest Kubernetes version, add roughly a third more clusters, onboard hundreds of new applications, and patch OS images monthly to eliminate critical security vulnerabilities.</p>
<p>There's simply no Ops path to doing this. No amount of runbooks or scripts scales to that complexity without becoming a full-time firefighting operation.</p>
<p>For Kubernetes version upgrades alone, the challenge looked like this: we had dozens of customer patches — custom configurations and modifications for each of the 20+ internal platforms using our clusters (web framework, search, Hadoop, database, AI platform, and more). Each platform needed to validate on the new K8s version. The validated release then needed to roll out to 200+ clusters with zero service disruption.</p>
<p>The Platform approach:</p>
<p><strong>Patch standardization.</strong> Every customer patch was formalized: structured description, core requirements, design rationale, test cases. This structure made patches auditable, reviewable — and eventually, AI-generatable. When a new K8s version drops, we can now automatically generate upgrade PRs for each patch by combining the patch specification with the new version's changelog.</p>
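<p>A structured patch record might look something like this (the schema and field values are a hypothetical illustration, not our exact internal format):</p>

```yaml
# Hypothetical patch specification: enough structure that a new K8s version's
# changelog can be diffed against the patch's requirements automatically.
patch: kubelet-hadoop-local-volume
component: kubelet
owner: hadoop-platform
description: >
  Allows Hadoop DataNode pods to claim pre-provisioned local volumes
  without the default topology checks.
requirements:
  - Local PV binding must skip zone topology validation for labeled nodes
rationale: >
  Hadoop manages data placement itself; the scheduler's topology
  constraints conflict with HDFS replication.
testCases:
  - e2e/hadoop/local-volume-binding.yaml
validatedUpTo: "1.28"
```
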
<p><strong>Validation as a platform.</strong> We stood up clusters running the candidate K8s version and opened them to each internal platform team. Teams owned their test cases and validation metrics. The platform ran validation automatically and produced test reports. Each platform could self-service their validation rather than waiting in a queue.</p>
<p><strong>Deployment as a platform.</strong> The control plane upgrade platform supported configurable strategies — parallel batching, staged rollouts, automatic rollback gates — rather than a one-off script someone had to babysit.</p>
<p>Result: two engineers maintained this platform and executed two major upgrades per year, across 200+ clusters, upgrading nearly 100 components with dozens of patches, with no incidents.</p>
<p>The leverage is the point. Two people doing the work of what would otherwise require a large team — not because they worked harder, but because they built a system that worked for them.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-platform-thinking-actually-requires">What Platform Thinking Actually Requires<a href="https://austinxyz.github.io/blogs/blog/2026/03/16/platform-engineer-vs-ops-engineer#what-platform-thinking-actually-requires" class="hash-link" aria-label="Direct link to What Platform Thinking Actually Requires" title="Direct link to What Platform Thinking Actually Requires" translate="no">​</a></h2>
<p>After watching engineers develop (or not develop) this mindset over the years, I've identified five capabilities that distinguish platform thinkers:</p>
<p><strong>1. Abstraction over automation</strong></p>
<p>The most critical skill. Automation solves <em>a</em> problem. Abstraction solves a <em>class</em> of problems. The difference is whether your solution handles one team's specific deployment, or provides a general deployment capability that every team can use — including teams with slightly different requirements.</p>
<p>Abstraction ability determines whether you can solve one person's problem or a thousand people's similar-but-not-identical problems. It's what separates a script from a platform.</p>
<p><strong>2. Product thinking</strong></p>
<p>Platform engineers build internal products. That means thinking about API design, documentation, migration costs, and adoption. A platform no one uses isn't a platform — it's an expensive unused tool. The best platform engineers I've worked with obsess over their users' experience the same way product managers obsess over customers.</p>
<p><strong>3. Leverage mindset</strong></p>
<p>Ask yourself: does my work this week multiply my impact, or does it only solve today's problem? If one engineer's output can benefit 100 teams, that's leverage. If your work can only scale as fast as you can personally execute it, that's Ops. Both are valuable — but only one compounds.</p>
<p><strong>4. Systems over processes</strong></p>
<p>Ops culture tends toward process: checklists, runbooks, approval gates enforced by humans. Platform culture prefers system constraints: Policy as Code, Admission Webhooks, automated guardrails. A process can be skipped. A system constraint can't. Embed correctness into the platform rather than relying on people to follow rules correctly every time.</p>
<p><strong>5. Avoid being the hero</strong></p>
<p>Ops culture often produces heroes — the person who knows how everything works, who gets called at 2am, whose institutional knowledge is irreplaceable. That person looks valuable. But they're also a single point of failure, and they're not building anything that outlasts them.</p>
<p>Platform culture optimizes for "no heroes needed." Every capability should be self-service. Every critical piece of knowledge should be encoded in the system, not held in someone's head.</p>
<hr>
<p>I've seen both trajectories play out on my team. One engineer spent decades in Ops thinking — always busy, always solving problems, always the person with the answers. But the work never compounded. The problems were always new versions of old problems. When the technology shifted, their knowledge became a liability rather than an asset.</p>
<p>Another engineer joined with limited Kubernetes experience. But he had platform instincts from the start: he asked "why does this keep happening?" before he asked "how do I fix this?", and he thought about his changes in terms of who else could benefit. In three to four years, he went from junior engineer to leading the development of an autonomous upgrade platform.</p>
<p>The technology changes. The mindset is what compounds.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-ai-factor-why-this-matters-more-now">The AI Factor: Why This Matters More Now<a href="https://austinxyz.github.io/blogs/blog/2026/03/16/platform-engineer-vs-ops-engineer#the-ai-factor-why-this-matters-more-now" class="hash-link" aria-label="Direct link to The AI Factor: Why This Matters More Now" title="Direct link to The AI Factor: Why This Matters More Now" translate="no">​</a></h2>
<p>The rise of AI coding tools changes the calculus significantly.</p>
<p>AI is excellent at well-defined, repetitive work. Writing an automation script for a known task. Fixing a specific bug with a clear stack trace. Generating Terraform for a standard deployment pattern. This is precisely the work that fills an Ops engineer's day.</p>
<p>The Ops hero — the person whose value is encyclopedic knowledge of how systems behave — is increasingly replicable by a large language model that has been trained on more Ops knowledge than any individual could accumulate in a career.</p>
<p>Platform engineering is different. Abstracting specific, messy real-world requirements into a clean, general platform capability requires judgment that AI doesn't yet have. Deciding which non-functional requirements matter (reliability vs. cost vs. developer experience), where to draw abstraction boundaries, how to design for the 80% use case while not blocking the 20% edge cases — these are synthesis problems that require context, taste, and experience.</p>
<p>AI can help a platform engineer move faster. It can generate the implementation once the abstraction is clear. But it can't do the abstraction itself — at least not yet.</p>
<p>In other words: AI accelerates Platform engineering, but it replaces Ops work.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="who-should-read-this">Who Should Read This<a href="https://austinxyz.github.io/blogs/blog/2026/03/16/platform-engineer-vs-ops-engineer#who-should-read-this" class="hash-link" aria-label="Direct link to Who Should Read This" title="Direct link to Who Should Read This" translate="no">​</a></h2>
<p>This post is for Ops and DevOps engineers who want to make the transition — who are good at their jobs but sense that something is missing, that the work isn't compounding, that they're solving similar problems on repeat.</p>
<p>The transition isn't about learning new tools. Kubernetes, Terraform, and CI/CD platforms are table stakes. The transition is about changing the question you ask when you encounter a problem.</p>
<p>Not "how do I fix this?" but "how do I make this class of problem go away?"</p>
<p>Not "how do I automate this for this customer?" but "how do I build something that 100 customers can use without my involvement?"</p>
<p>Not "how do I become the person everyone depends on?" but "how do I build something that doesn't need me?"</p>
<p>An Ops engineer who doesn't think about platforms is not only leaving impact on the table — in the age of AI, they're also the most exposed. The work that fills an Ops calendar is exactly the work AI does well.</p>
<p>The engineers who will matter most in the next decade are the ones who build the systems AI assists with, not the ones who do the work AI replaces.</p>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="cloud-computing" term="cloud-computing"/>
        <category label="kubernetes" term="kubernetes"/>
        <category label="leadership" term="leadership"/>
        <category label="career-development" term="career-development"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Twenty Years of Agile, One Year of AI — Here's What Survived]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2026/03/09/agile-development-in-ai-coding</id>
        <link href="https://austinxyz.github.io/blogs/blog/2026/03/09/agile-development-in-ai-coding"/>
        <updated>2026-03-09T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Twenty years of Agile have shaped how I think about software. Here's why I believe Agile principles aren't just still relevant in the AI coding era — they're more important than ever.]]></summary>
        <content type="html"><![CDATA[<p>I grew up as a developer reading Martin Fowler and Kent Beck. The Agile Manifesto, the refactoring patterns, test-driven development — these weren't just methodologies I was handed. They were the lens through which I learned to think about software quality, team dynamics, and sustainable delivery.</p>
<p>Now I'm spending significant time with AI coding tools — Vibe Coding, Claude Code, spec-driven workflows — and a question keeps surfacing: <em>do these principles still apply?</em></p>
<p>My answer, after a hands-on 50K-line project experiment, is yes. Not only do they apply — several of them become load-bearing pillars in an AI-augmented workflow.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-new-design-principles-for-ai-coding">The New Design Principles for AI Coding<a href="https://austinxyz.github.io/blogs/blog/2026/03/09/agile-development-in-ai-coding#the-new-design-principles-for-ai-coding" class="hash-link" aria-label="Direct link to The New Design Principles for AI Coding" title="Direct link to The New Design Principles for AI Coding" translate="no">​</a></h2>
<p>A friend of mine did a deep, systematic study of AI coding tools and frameworks. He synthesized seven core principles for what he calls "AI Native" development — essentially Software 3.0. They're worth understanding before discussing Agile, because they frame the new context we're working in:</p>
<ol>
<li class=""><strong>File as Truth</strong> — Store AI memory, config, and state in human-readable local files. The file system is the universal interface between humans and AI.</li>
<li class=""><strong>Explicit State</strong> — AI must continuously record what it's doing, where it is, and what problems it's encountered. The "three-file pattern": <code>plan.md</code>, <code>status.md</code>, <code>log.md</code>.</li>
<li class=""><strong>Plan Before Execute</strong> — Separate design authority from execution authority. AI generates a plan, humans approve it, then AI builds.</li>
<li class=""><strong>Validation as Closed Loop</strong> — Don't trust AI's self-assessment. Build deterministic, test-based auto-correction loops. Red/green TDD is the mechanism.</li>
<li class=""><strong>Tiered Memory</strong> — Structure AI memory across volatile context (RAM), session logs, and persistent disk files. Actively manage what gets compressed or forgotten.</li>
<li class=""><strong>Skill as Asset</strong> — Encapsulate AI capabilities as modular, reusable skills. AI learns to expand its own toolbox.</li>
<li class=""><strong>Context as Budget</strong> — Tokens are expensive and scarce. Load tools on demand, prune redundant output, cache prompts.</li>
</ol>
<p>These are well-constructed. But reading through them, something struck me: most of these aren't new ideas dressed in AI clothes. They're old ideas that become newly critical at AI scale.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="my-own-reaction-to-file-as-truth">My Own Reaction to "File as Truth"<a href="https://austinxyz.github.io/blogs/blog/2026/03/09/agile-development-in-ai-coding#my-own-reaction-to-file-as-truth" class="hash-link" aria-label="Direct link to My Own Reaction to &quot;File as Truth&quot;" title="Direct link to My Own Reaction to &quot;File as Truth&quot;" translate="no">​</a></h2>
<p>The principle that made me think the hardest was "File as Truth." In traditional Agile, we said <strong>code is truth</strong> — the code is the unambiguous, definitive representation of what the system actually does. Documentation lies. Requirements drift. But the code runs.</p>
<p>AI-native thinking flips this: keep the truth in files (specs, architecture docs, structured state), and let AI generate the implementation.</p>
<p>I see the appeal. Without anchoring AI to structured context, it will hallucinate, drift, and invent. Having a canonical document that survives across sessions — that the AI reads before acting — is clearly better than relying on prompt memory alone.</p>
<p>But I'm not fully convinced of the flip. <strong>The implementation gap is real.</strong> A natural-language document is inherently ambiguous; the same spec can produce very different code depending on how the AI interprets it. The more latitude you give AI on implementation, the more you need to verify the result is actually what you intended.</p>
<p>The answer, I think, isn't to choose between "code is truth" and "file is truth." It's to recognize that both co-exist and must stay in sync. The file defines intent; the code is the executable artifact; the tests verify they match. The problem is that AI can generate documents as easily as code, so without discipline, you end up maintaining two rapidly diverging sources of truth with no reliable bridge.</p>
<p>This is where Agile principles come back in with force.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="agile-principles-mapped-to-ai-coding">Agile Principles, Mapped to AI Coding<a href="https://austinxyz.github.io/blogs/blog/2026/03/09/agile-development-in-ai-coding#agile-principles-mapped-to-ai-coding" class="hash-link" aria-label="Direct link to Agile Principles, Mapped to AI Coding" title="Direct link to Agile Principles, Mapped to AI Coding" translate="no">​</a></h2>
<p>The Agile Manifesto has four values and twelve principles. I won't enumerate all of them — but here are the ones that resonate most directly with AI-augmented development, and what I've observed about how each one applies.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="test-driven-development-the-closing-loop">Test-Driven Development: The Closing Loop<a href="https://austinxyz.github.io/blogs/blog/2026/03/09/agile-development-in-ai-coding#test-driven-development-the-closing-loop" class="hash-link" aria-label="Direct link to Test-Driven Development: The Closing Loop" title="Direct link to Test-Driven Development: The Closing Loop" translate="no">​</a></h3>
<p>TDD was always my favorite practice. Write the failing test first, then write the code to make it pass. It forces you to define success before you start building, and it creates an automated verification mechanism that doesn't require you to remember what "done" looks like.</p>
<p>In AI coding, TDD becomes something more important: <strong>it's how you close the loop.</strong></p>
<p>AI is creative but non-deterministic. Left to its own judgment, it will implement something that plausibly looks correct. With a failing test, it has an unambiguous target. More importantly, it can run the test itself, observe the failure, and self-correct without human intervention. This is what "Validation as Closed Loop" actually means in practice. The principle is elegant; TDD is the mechanism.</p>
<p>Without tests, you're reviewing AI-generated code by reading it — a slow, error-prone process that doesn't scale. With tests, you shift from "is this code right?" to "does this code pass?" That's a much more tractable question.</p>
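<p>As a minimal sketch of that loop, written in Python for brevity: the function and its averaging rule are hypothetical, not from any real project, but they show how the test defines "done" before the implementation exists.</p>

```python
# Red-green loop in miniature: the tests below are (conceptually) written
# first, and the function is only "done" when the runner says both pass.
# monthly_burn and its rule are hypothetical, chosen for illustration.

def monthly_burn(expenses: list[float]) -> float:
    """Average monthly spend; an empty history burns nothing."""
    if not expenses:
        return 0.0
    return sum(expenses) / len(expenses)

def test_empty_history() -> None:
    assert monthly_burn([]) == 0.0

def test_averages_months() -> None:
    assert monthly_burn([3000.0, 3200.0, 2800.0]) == 3000.0

# An AI agent can run these itself (e.g. via pytest), observe the failure,
# and self-correct. No human needs to eyeball the diff to know it's wrong.
```

The point isn't the arithmetic; it's that "does this pass?" is a question the AI can answer for itself.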
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="small-iterations-frequent-delivery">Small Iterations, Frequent Delivery<a href="https://austinxyz.github.io/blogs/blog/2026/03/09/agile-development-in-ai-coding#small-iterations-frequent-delivery" class="hash-link" aria-label="Direct link to Small Iterations, Frequent Delivery" title="Direct link to Small Iterations, Frequent Delivery" translate="no">​</a></h3>
<p>This principle gets stronger in the AI era, but the <em>unit of iteration</em> changes.</p>
<p>In traditional Agile, a sprint might be two weeks. In AI-augmented development, meaningful feedback loops operate at the granularity of minutes. A useful heuristic I've encountered: if a task takes more than five minutes to complete, it should probably be broken into smaller tasks.</p>
<p>The reasoning is that AI context windows are finite. Long sessions accumulate drift — the AI gradually loses track of earlier constraints, architectural decisions, and your stated preferences. Shorter tasks that complete and close maintain higher coherence. You also get faster human review cycles, which is where the real quality control happens.</p>
<p>The tricky part is calibrating granularity. Tasks that are too small generate too much overhead — constant context-loading and handoffs. Tasks that are too large drift. Finding the right unit of work is genuinely hard, and I think it's going to become one of the core skills of experienced AI developers.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="pair-programming-and-individual-interaction">Pair Programming and Individual Interaction<a href="https://austinxyz.github.io/blogs/blog/2026/03/09/agile-development-in-ai-coding#pair-programming-and-individual-interaction" class="hash-link" aria-label="Direct link to Pair Programming and Individual Interaction" title="Direct link to Pair Programming and Individual Interaction" translate="no">​</a></h3>
<p>This is the Agile principle I find most personally relevant, and the one I feel most tension around.</p>
<p>Pair programming at its best is a continuous conversation — two minds colliding on the same problem, catching each other's mistakes, sharing knowledge, and building something better together than either could alone. The value isn't just the code produced; it's the thinking process.</p>
<p>With AI, there's a version of this that feels authentic: I give a rough requirement, the AI designs an approach, breaks it into tasks, builds a prototype, I use it and give feedback, we iterate. This works remarkably well for UI refinement, performance tuning, and cases where I don't fully understand what I want until I see something tangible.</p>
<p>But there's a failure mode I've noticed. Heavily structured frameworks — things like SpecKit or detailed plan-first workflows — can over-formalize the interaction. You end up managing documents and approvals more than you're actually collaborating. The "pair" feeling disappears.</p>
<p>My current view: stay in the conversation longer than the frameworks suggest. Resist the urge to hand off too early. The value of pair programming with AI isn't just speed — it's the back-and-forth that surfaces what you actually want to build.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="refactoring-more-courage-not-less">Refactoring: More Courage, Not Less<a href="https://austinxyz.github.io/blogs/blog/2026/03/09/agile-development-in-ai-coding#refactoring-more-courage-not-less" class="hash-link" aria-label="Direct link to Refactoring: More Courage, Not Less" title="Direct link to Refactoring: More Courage, Not Less" translate="no">​</a></h3>
<p>Refactoring has always been about managing the accumulation of complexity. Small design improvements, continuously applied, prevent the architecture from calcifying into something unmaintainable.</p>
<p>The interesting thing about AI and refactoring is that AI gives you <em>more courage</em> to refactor, not less.</p>
<p>In traditional development, refactoring required careful attention to not break existing behavior. IDE tools helped — but they were limited to mechanical transformations like renaming and extraction. Anything structural required deep understanding of the codebase.</p>
<p>With AI, you can describe a structural change at a higher level of abstraction and have it applied across a large codebase with much lower risk of human error. The AI understands context in ways that text-replacement tools do not. This lowers the cost of refactoring, which means you can do it more frequently, which means the codebase stays in better shape.</p>
<p>The risk, though, is the opposite failure mode: <em>not</em> refactoring. If you're solely focused on delivering feature after feature with AI, you can accumulate complexity faster than ever. Ship ten features in ten sessions without a refactoring pass between them, and you may end up with code that no human can review — and eventually, that even AI can't reason about coherently.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="working-software-as-the-measure-of-progress">Working Software as the Measure of Progress<a href="https://austinxyz.github.io/blogs/blog/2026/03/09/agile-development-in-ai-coding#working-software-as-the-measure-of-progress" class="hash-link" aria-label="Direct link to Working Software as the Measure of Progress" title="Direct link to Working Software as the Measure of Progress" translate="no">​</a></h3>
<p>Agile is explicit: working software, not documentation, is the primary measure of progress. This puts it in direct tension with AI-native thinking's emphasis on files and documentation as truth.</p>
<p>I don't think the tension is unresolvable, but it does require active management. The danger in AI development is documentation inflation. AI can generate documentation easily and enthusiastically. If you're not careful, you end up with a sprawling set of spec files, plan files, and status files that nobody is maintaining — least of all the AI, which has long since moved on to other sessions.</p>
<p>The discipline I've settled on: documentation should be <strong>minimal, structural, and kept in sync by automated means where possible.</strong> A CLAUDE.md that captures architecture decisions and conventions. Tests that document behavior. As little else as possible. The more documentation you have, the more surface area there is for inconsistency.</p>
<p>Working software remains the truth. The rest is scaffolding.</p>
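<p>To make "minimal and structural" concrete, such a file can be very small. The sketch below is hypothetical, not any real project's file; the stack and conventions shown are placeholders.</p>

```markdown
# CLAUDE.md

## Architecture
- REST API backend + SPA frontend; no server-side rendering.

## Conventions
- Every feature ships with tests; the tests are the behavioral documentation.
- New architectural decisions get one bullet here, not a separate document.

## Known mistakes (do not repeat)
- (append lessons from retrospectives here)
```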
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="retrospection-and-embracing-change">Retrospection and Embracing Change<a href="https://austinxyz.github.io/blogs/blog/2026/03/09/agile-development-in-ai-coding#retrospection-and-embracing-change" class="hash-link" aria-label="Direct link to Retrospection and Embracing Change" title="Direct link to Retrospection and Embracing Change" translate="no">​</a></h3>
<p>The last two principles are deeply linked. Agile retrospectives are about systematically capturing what you've learned so that future iterations improve. Embracing change means treating new requirements as a natural part of the process rather than a disruption.</p>
<p>In AI tooling, the equivalents are emerging. Tools like SpecKit and OpenSpec build an archiving step into their workflow — lessons learned from completed work get folded back into your project's guiding principles (CLAUDE.md or similar). This is retrospection, formalized.</p>
<p>Mechanisms like Ralphloop — which allow an AI to iterate on its own approach autonomously — are a form of embracing change, though they require careful oversight to avoid runaway scope drift.</p>
<p>The underlying idea holds: learning should compound across iterations. Whether that learning lives in a retrospective document, an updated CLAUDE.md, or a refined test suite, the principle is the same. Don't repeat the same mistakes.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-bigger-picture">The Bigger Picture<a href="https://austinxyz.github.io/blogs/blog/2026/03/09/agile-development-in-ai-coding#the-bigger-picture" class="hash-link" aria-label="Direct link to The Bigger Picture" title="Direct link to The Bigger Picture" translate="no">​</a></h2>
<p>There's a pattern across all of these: Agile principles were designed to manage complexity and uncertainty in human-scale software development. AI doesn't eliminate complexity and uncertainty — in some ways, it amplifies them. The speed of generation goes up; the need for coherent architecture and verified correctness goes up equally.</p>
<p>A friend who researches AI coding described the ideal as: <strong>"File provides memory, you provide discipline."</strong> I think that's right. The file doesn't discipline itself. The tests don't write themselves. The refactoring doesn't happen automatically.</p>
<p>What AI does is lower the cost of execution so dramatically that the bottleneck shifts. The bottleneck used to be writing the code. Now it's <em>knowing what to build and verifying that what was built is correct</em>. Those are judgment problems. They're exactly what Agile was designed to develop in engineers.</p>
<p>I spent twenty years learning these principles from books and practice. I'm glad I did. They don't just survive the AI transition — they've become more essential.</p>
<p>The engineers who will thrive in this environment aren't the ones who generate the most code. They're the ones who know how to close loops, calibrate iteration size, stay in the conversation, refactor at the right moments, and keep working software at the center.</p>
<p>That sounds a lot like Agile to me.</p>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="cloud-computing" term="cloud-computing"/>
        <category label="leadership" term="leadership"/>
        <category label="career-development" term="career-development"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[[7/6] Claude Code: From Vibe Coding to Spec-Driven Development]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development</id>
        <link href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development"/>
        <updated>2026-03-06T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[After 40,000 lines of Vibe Coding, I hit a wall. Here's how Spec-Driven Development with OpenSpec fixed the drift problem — and cut feature delivery time to under an hour.]]></summary>
        <content type="html"><![CDATA[<p><em>This is an extended chapter to the <a class="" href="https://austinxyz.github.io/blogs/blog/tags/claude-code">6-part Claude Code series</a>. The first six chapters documented building a full-stack Finance app using Vibe Coding. This chapter covers what came next.</em></p>
<p>The first six chapters documented the complete journey of using Claude Code for Vibe Coding — building a full-stack application from scratch and accumulating 40,000 lines of code. Vibe Coding delivered incredible speed, but as the project grew, a structural problem emerged:</p>
<blockquote>
<p><strong>AI writes code fast. AI also goes off-track fast.</strong></p>
</blockquote>
<p>When you describe a requirement in one sentence, AI might understand 70% of it and then sprint full-speed in that direction for two hours — only for you to realize the core logic is wrong and have to start over.</p>
<p>This isn't theoretical. Before adopting SDD, my real pain points in the Finance project were:</p>
<ul>
<li class=""><strong>Unstructured workflow</strong>: I had to remind AI to organize requirements before writing code, otherwise it jumped straight to implementation</li>
<li class=""><strong>Missing design documentation</strong>: architectural issues only surfaced after implementation, making course corrections expensive</li>
<li class=""><strong>Inconsistent code quality</strong>: the same requirement could produce wildly different code quality across sessions</li>
<li class=""><strong>Tests routinely skipped</strong>: Vibe Coding tends toward "get it running first," making tests optional</li>
<li class=""><strong>Slow debugging</strong>: without clear task boundaries, bugs were hard to locate and back-and-forth with AI was inefficient</li>
</ul>
<p>This chapter documents a methodology upgrade experiment: introducing <strong>Spec-Driven Development (SDD)</strong> into the Finance project using <strong>OpenSpec</strong>, completing three new features, and comparing results against prior Vibe Coding work.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-spec-driven-development">What is Spec-Driven Development<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#what-is-spec-driven-development" class="hash-link" aria-label="Direct link to What is Spec-Driven Development" title="Direct link to What is Spec-Driven Development" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="core-idea">Core Idea<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#core-idea" class="hash-link" aria-label="Direct link to Core Idea" title="Direct link to Core Idea" translate="no">​</a></h3>
<p>The core principle of Spec-Driven Development is: <strong>reach consensus before writing code</strong>.</p>
<p>In traditional Vibe Coding, the flow is:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">Idea → One-liner prompt → AI starts coding → Iterate as you go</span><br></span></code></pre></div></div>
<p>In SDD, the flow is:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">Idea → Structured proposal → Task checklist → AI implements by checklist → Archive spec</span><br></span></code></pre></div></div>
<p>The difference isn't the tooling — it's <strong>when decisions are made</strong>. SDD forces all important decisions (feature scope, technical approach, acceptance criteria) to happen before coding begins, locking them into documents that constrain the AI to execute within a well-defined space.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-openspec">What is OpenSpec<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#what-is-openspec" class="hash-link" aria-label="Direct link to What is OpenSpec" title="Direct link to What is OpenSpec" translate="no">​</a></h3>
<p>OpenSpec is a lightweight AI workflow CLI tool designed specifically for SDD. Its core is a standardized project structure and three commands:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">openspec/</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">├── specs/           # Current full spec (source of truth)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│   └── &lt;capability&gt;/</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│       └── spec.md</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">└── changes/         # In-progress changes</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    ├── &lt;change-id&gt;/</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │   ├── proposal.md   # Why, what, and scope of the change</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │   ├── design.md     # Technical approach</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │   ├── tasks.md      # Decomposed implementation checklist</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    │   └── specs/        # Delta spec (additions/modifications only)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    └── archive/          # Completed and archived changes</span><br></span></code></pre></div></div>
<p><strong>Installation and setup</strong>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">npm install -g @fission-ai/openspec@latest</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">openspec --version</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">cd your-project</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">openspec init</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-three-phase-openspec-workflow">The Three-Phase OpenSpec Workflow<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#the-three-phase-openspec-workflow" class="hash-link" aria-label="Direct link to The Three-Phase OpenSpec Workflow" title="Direct link to The Three-Phase OpenSpec Workflow" translate="no">​</a></h2>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">┌─────────────┐     ┌─────────────┐     ┌─────────────┐</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│   PROPOSE   │────▶│    APPLY    │────▶│   ARCHIVE   │</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│             │     │             │     │             │</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│ proposal.md │     │ tasks.md    │     │ specs/ sync │</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│ design.md   │     │ [x] task 1  │     │ change →    │</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│ tasks.md    │     │ [x] task 2  │     │  archive/   │</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│ specs/delta │     │ [ ] task 3  │     │             │</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">└─────────────┘     └─────────────┘     └─────────────┘</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  Align upfront       Implement           Lock in spec</span><br></span></code></pre></div></div>
<p>OpenSpec is used in Claude Code as three Skills:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="phase-1-opsxpropose">Phase 1: <code>/opsx:propose</code><a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#phase-1-opsxpropose" class="hash-link" aria-label="Direct link to phase-1-opsxpropose" title="Direct link to phase-1-opsxpropose" translate="no">​</a></h3>
<p><strong>Input</strong>: A one-sentence description or draft requirement<br>
<strong>Output</strong>: <code>proposal.md</code> (Why/What/Scope), <code>design.md</code> (technical approach), <code>tasks.md</code> (decomposed task checklist), <code>specs/</code> (delta spec)</p>
<p>This phase transforms a vague idea into an <strong>executable contract</strong>. AI plays the role of architect and product manager; you play the reviewer.</p>
<p><strong>Key action</strong>: Review <code>tasks.md</code> — check that task decomposition is reasonable and acceptance criteria are clear. <strong>Fixing issues here is far cheaper than tearing things apart mid-implementation.</strong></p>
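<p>For a sense of what's being reviewed, a generated <code>tasks.md</code> is essentially a markdown checklist along these lines. The task names here are hypothetical illustrations, not the actual generated file:</p>

```markdown
## 1. Backend
- [ ] 1.1 Add runway calculation service (liquid assets vs. projected monthly expenses)
- [ ] 1.2 Expose a runway API endpoint
- [ ] 1.3 Unit tests: empty expense history, multi-currency accounts

## 2. Frontend
- [ ] 2.1 Runway page with months-remaining summary
- [ ] 2.2 Wire the page to the new endpoint

## 3. Manual
- [ ] 3.1 Verify the endpoint via Swagger UI
```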
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="phase-2-opsxapply">Phase 2: <code>/opsx:apply</code><a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#phase-2-opsxapply" class="hash-link" aria-label="Direct link to phase-2-opsxapply" title="Direct link to phase-2-opsxapply" translate="no">​</a></h3>
<p><strong>Input</strong>: Reviewed and approved <code>tasks.md</code><br>
<strong>Output</strong>: Code, tests, and configuration changes implemented item by item</p>
<p>AI executes tasks in order, marking each <code>[x]</code> on completion. You can pause at any point, review progress, correct direction, and resume.</p>
<p><strong>Key action</strong>: Don't insert new requirements during implementation. If requirements change, update the proposal first, then resume apply — that's SDD discipline.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="phase-3-opsxarchive">Phase 3: <code>/opsx:archive</code><a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#phase-3-opsxarchive" class="hash-link" aria-label="Direct link to phase-3-opsxarchive" title="Direct link to phase-3-opsxarchive" translate="no">​</a></h3>
<p><strong>Input</strong>: Completed change directory<br>
<strong>Output</strong>: Delta spec merged back into <code>openspec/specs/</code>, change directory moved to <code>archive/</code></p>
<p>Archiving keeps the spec library (<code>specs/</code>) representing "the current state of the system" at all times, making it the starting point for the next change.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="configyaml-openspecs-project-memory">config.yaml: OpenSpec's "Project Memory"<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#configyaml-openspecs-project-memory" class="hash-link" aria-label="Direct link to config.yaml: OpenSpec's &quot;Project Memory&quot;" title="Direct link to config.yaml: OpenSpec's &quot;Project Memory&quot;" translate="no">​</a></h2>
<p>After completing the first feature, I realized I'd skipped an important setup step: configuring <code>openspec/config.yaml</code>. Ideally this should be done before the first feature, but better late than never — starting from the second feature, it began paying dividends.</p>
<p>This file is OpenSpec's equivalent of <code>CLAUDE.md</code> — it tells AI about the project's tech stack, coding conventions, and historical mistakes to avoid.</p>
<p><strong>Initializing config.yaml</strong>:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">Please update the config.yaml under the openspec directory. Refer to the root CLAUDE.md</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">for tech stack, conventions, and code style guidelines. Refer to README.md for domain</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">knowledge. Use the example format provided in the config.yaml file.</span><br></span></code></pre></div></div>
<p>After AI generated the initial version, I made two targeted additions:</p>
<p><strong>Adding a testing strategy</strong>:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">Please add a testing strategy to config.yaml. This is a full-stack application —</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">testing should cover both backend API testing and frontend UI testing.</span><br></span></code></pre></div></div>
<p><strong>Documenting historical mistakes (to prevent recurrence)</strong>:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">When developing the runway feature with OpenSpec, two mistakes were made:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">1. Currency was ignored — account amounts were summed directly without conversion</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">2. The fix introduced a performance issue — exchange rates were queried from DB</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   per record, when the Controller layer already has a cached ExchangeRateService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Please add these to config.yaml so future changes avoid repeating them.</span><br></span></code></pre></div></div>
<blockquote>
<p><strong>config.yaml isn't a one-time setup — it's an ever-growing error prevention manual.</strong> Each new mistake gets added, and AI proactively avoids it in every subsequent change.</p>
</blockquote>
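<p>Put together, the resulting file might look roughly like this. This is a hypothetical sketch — the tech stack line is an assumption and OpenSpec's actual schema may differ — with the two recorded mistakes drawn from the prompts above:</p>

```yaml
# openspec/config.yaml (hypothetical sketch, not the project's real file)
context:
  tech_stack: "Java/Spring Boot + MySQL backend, SPA frontend"   # assumption
  testing:
    backend: "API tests for every new endpoint"
    frontend: "UI smoke test for every new page"
  known_mistakes:
    - "Never sum account balances across currencies without converting first"
    - "Use the cached ExchangeRateService once per report; never query rates per record"
```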
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="three-features-in-practice">Three Features in Practice<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#three-features-in-practice" class="hash-link" aria-label="Direct link to Three Features in Practice" title="Direct link to Three Features in Practice" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="feature-1-runway-analysis">Feature 1: Runway Analysis<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#feature-1-runway-analysis" class="hash-link" aria-label="Direct link to Feature 1: Runway Analysis" title="Direct link to Feature 1: Runway Analysis" translate="no">​</a></h3>
<p><strong>Requirement</strong>: Based on current liquid assets and projected monthly expenses, calculate how long the family's funds will last.</p>
<p><strong>Propose</strong>:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">I want to add a new function, runway analysis. I have one example at C:\...\runway-calculation,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">please use the same structure. You can get future monthly expenses from the system,</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">and liquid assets from the system. Please create a proposal.</span><br></span></code></pre></div></div>
<p>AI generated a checklist of 27 tasks covering backend API, frontend pages, and tests — with only one manual task (testing via Swagger UI).</p>
<p><strong>Problems discovered during Apply</strong>:</p>
<p><em>Issue 1 (critical bug): Currency not aligned</em></p>
<p>The initial implementation simply summed all account balances, ignoring multi-currency — a USD balance and a CNY balance were added together directly, producing completely wrong results.</p>
<p>Root cause: this was a business understanding problem, not a technical one. AI writes code quickly but doesn't spontaneously "think about" the need for currency conversion.</p>
<p>After fixing that, a second problem appeared:</p>
<p><em>Issue 2 (performance bug): Every exchange rate lookup hit the database</em></p>
<p>When fixing the currency issue, AI queried the database for exchange rates on every single record, making report generation extremely slow. The system already had a cached <code>ExchangeRateService</code> that only needed to be called once.</p>
<p>Both bugs were fixed quickly after pointing them out — but more importantly: <strong>both were written into config.yaml</strong>, protecting all future changes from repeating them.</p>
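<p>The shape of both fixes can be sketched in a few lines of Python. The rates, names, and caching mechanism here are illustrative stand-ins; the project's actual <code>ExchangeRateService</code> isn't shown in this post.</p>

```python
# Fix 1: convert every balance to a base currency before summing.
# Fix 2: resolve each exchange rate at most once per report. lru_cache
# stands in for the project's cached ExchangeRateService.
from functools import lru_cache

RATES_TO_USD = {"USD": 1.0, "CNY": 0.14, "EUR": 1.08}  # illustrative rates

@lru_cache(maxsize=None)
def rate_to_usd(currency: str) -> float:
    # In the real system this is one call to a cached service per currency,
    # never a database query per record.
    return RATES_TO_USD[currency]

def total_liquid_assets_usd(accounts: list[tuple[float, str]]) -> float:
    """Sum (amount, currency) pairs in USD; never add raw balances directly."""
    return sum(amount * rate_to_usd(currency) for amount, currency in accounts)
```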
<p><strong>After Runway Analysis was complete, requirements expanded</strong>: the ability to exclude specific liquid assets and adjust individual expense items. This is normal in SDD: update the proposal, then continue apply.</p>
<p><strong>Stats</strong>:</p>
<ul>
<li class="">Code added: ~1,900 lines, 18 files</li>
<li class="">Tasks completed: 26/27 (1 manual)</li>
<li class="">Development time: ~<strong>2 hours</strong></li>
</ul>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="feature-2-runway-report-persistence-and-pdf-export">Feature 2: Runway Report Persistence and PDF Export<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#feature-2-runway-report-persistence-and-pdf-export" class="hash-link" aria-label="Direct link to Feature 2: Runway Report Persistence and PDF Export" title="Direct link to Feature 2: Runway Report Persistence and PDF Export" translate="no">​</a></h3>
<p><strong>Requirement</strong>: The Runway page recalculates from scratch every time it opens. There's no way to save a snapshot for later review.</p>
<p><strong>How SDD handles requirement changes</strong>:</p>
<p>This feature went through three rounds of requirements — a good test of SDD's flexibility:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">Initial: Export JSON file to local disk</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">↓ User found JSON unfriendly</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Change 1: Export as PDF report instead</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">↓ User changed mind, didn't want local-only storage</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Change 2: Persist to backend database, add report list page</span><br></span></code></pre></div></div>
<p>The final change was the most significant — requirements shifted from "pure frontend" to "full-stack with a new database table."</p>
<p><strong>Key decision</strong>: OpenSpec detected the large scope change and deleted the already-generated proposal and tasks to regenerate from scratch.</p>
<p>This is SDD discipline: <strong>don't patch a half-baked proposal — when requirements change significantly, re-propose.</strong> Experience proved this right — in prior Vibe Coding work, piecemeal modifications to half-formed requirements consistently confused AI and produced worse results.</p>
<p>The regenerated proposal produced 34 tasks across 11 categories (backend entity/Repository/Service/Controller, frontend components, database migration, backend tests, frontend tests).</p>
<p><strong>Problems during Apply</strong>:</p>
<ul>
<li class=""><strong>API routing error</strong>: Save Report didn't work — diagnosis revealed a Controller routing misconfiguration. Added to config.yaml.</li>
<li class=""><strong>PDF Chinese character corruption</strong>: The PDF library AI initially chose didn't support Chinese. Switching to a different implementation resolved it.</li>
<li class=""><strong>Mock test failures</strong>: New mock testing framework had incorrect initial configuration; fixed based on error output.</li>
</ul>
<p><strong>Post-archive follow-up</strong>:</p>
<p>Test coverage was still weak (frontend UI tests were manual only). A dedicated Vitest + Vue Test Utils setup was done afterward:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain"># Install component testing framework</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">npm install -D vitest @vue/test-utils @vitejs/plugin-vue jsdom</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># Update vite.config.js to configure test environment</span><br></span></code></pre></div></div>
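<p>The second step above, updating vite.config.js, is small. A minimal sketch (these are Vitest's standard options; the project's actual file may differ):</p>

```javascript
// vite.config.js: illustrative sketch of the Vitest additions
import { defineConfig } from "vite";
import vue from "@vitejs/plugin-vue";

export default defineConfig({
  plugins: [vue()],
  test: {
    environment: "jsdom", // simulate a browser DOM so components can mount
    globals: true,        // expose describe/it/expect without per-file imports
  },
});
```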
<p><strong>Stats</strong>:</p>
<ul>
<li class="">Code added: ~1,800 lines, 25 files</li>
<li class="">Tasks completed: 33/34</li>
<li class="">Development time: ~<strong>38 minutes</strong> (from second proposal to archive)</li>
</ul>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="feature-3-property-investment-calculator">Feature 3: Property Investment Calculator<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#feature-3-property-investment-calculator" class="hash-link" aria-label="Direct link to Feature 3: Property Investment Calculator" title="Direct link to Feature 3: Property Investment Calculator" translate="no">​</a></h3>
<p><strong>Requirement</strong>: Convert an Excel spreadsheet ("The Brutal Calculator") into a native web calculator for Bay Area high-income earners to evaluate after-tax returns on rental property investments.</p>
<p><strong>Propose</strong>:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">I added an Excel file under the requirement folder (The Brutal Calculator.xlsx).</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Please read the sheet and convert it as a new feature: Property Investment Calculator.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">I may add a new group (投资/Investments) in the sidebar.</span><br></span></code></pre></div></div>
<p>Reading the Excel file took some setup time (an xlsx parsing tool needed to be installed), but AI fully parsed all formula logic and generated:</p>
<ul>
<li class="">8 task groups, 22 tasks</li>
<li class="">Covering: Vue component, formula utilities, route registration, Sidebar changes, Vue component tests</li>
</ul>
<p>This was the only <strong>purely frontend</strong> change among the three features — no backend changes, no database changes.</p>
<p><strong>Problems during Apply</strong>:</p>
<ul>
<li class=""><strong>Formula calculation errors</strong>: PMT (mortgage payment) and CUMPRINC (cumulative principal paydown) were implemented incorrectly. Fixed after being pointed out. This type of error reflects AI's imperfect grasp of financial formulas, not a gap in its coding ability.</li>
</ul>
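<p>For reference, the corrected behavior is standard amortization math. A hedged sketch follows — my own minimal implementation mirroring Excel's PMT and CUMPRINC semantics (simplified to positive payment amounts), not the project's actual code:</p>

```typescript
// Minimal amortization math, mirroring Excel's PMT and CUMPRINC semantics
// (simplified to positive amounts; not the project's actual implementation).

// Payment on principal `p` at periodic rate `r` over `n` periods:
// PMT = p * r / (1 - (1 + r)^-n)
function pmt(r: number, n: number, p: number): number {
  if (r === 0) return p / n; // zero-rate edge case
  return (p * r) / (1 - Math.pow(1 + r, -n));
}

// Cumulative principal repaid from period `start` to `end` (1-based, inclusive).
function cumprinc(r: number, n: number, p: number, start: number, end: number): number {
  const payment = pmt(r, n, p);
  let balance = p;
  let principal = 0;
  for (let k = 1; k <= end; k++) {
    const interest = balance * r;        // interest accrues on the remaining balance
    const paidDown = payment - interest; // the rest of the payment reduces principal
    balance -= paidDown;
    if (k >= start) principal += paidDown;
  }
  return principal;
}

// Example: a $500,000 loan at 6% APR (0.5%/month) over 30 years
// costs roughly $2,997.75/month.
const monthly = pmt(0.06 / 12, 360, 500_000);
```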
<p>Post-apply, some UI adjustments were made (the 2-column layout changed to 3 columns, label display tweaks), all of which went smoothly.</p>
<p><img decoding="async" loading="lazy" alt="Property Investment Calculator Screenshot" src="https://austinxyz.github.io/blogs/assets/images/property-calculator-dc63717373850f4574cb81e33f6bda62.png" width="2605" height="1093" class="img_ev3q"></p>
<p><em>Figure: Property Investment Calculator — 13 editable inputs on the left, five real-time result panels on the right</em></p>
<p><strong>Stats</strong>:</p>
<ul>
<li class="">Code added: ~2,400 lines, 22 files</li>
<li class="">Tasks completed: 19/20 (task 20 was optional manual testing, verified by user)</li>
<li class="">Development time: ~<strong>49 minutes</strong></li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="side-by-side-comparison">Side-by-Side Comparison<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#side-by-side-comparison" class="hash-link" aria-label="Direct link to Side-by-Side Comparison" title="Direct link to Side-by-Side Comparison" translate="no">​</a></h2>
<table><thead><tr><th></th><th>runway-analysis</th><th>runway-report</th><th>property-calculator</th></tr></thead><tbody><tr><td><strong>Code added</strong></td><td>~1,900 lines</td><td>~1,800 lines</td><td>~2,400 lines</td></tr><tr><td><strong>Task count</strong></td><td>27</td><td>34</td><td>20</td></tr><tr><td><strong>Complexity</strong></td><td>Full-stack, no new DB table</td><td>Full-stack + new DB table</td><td>Frontend only</td></tr><tr><td><strong>Test coverage</strong></td><td>Manual backend tests</td><td>Auto backend + manual frontend</td><td>Auto frontend + backend</td></tr><tr><td><strong>Critical errors</strong></td><td>Currency alignment, rate perf</td><td>API routing, PDF encoding</td><td>Financial formula errors</td></tr><tr><td><strong>Dev time</strong></td><td>~2h</td><td>~38m</td><td>~49m</td></tr></tbody></table>
<p><strong>Why were Features 2 and 3 so much faster than Feature 1?</strong></p>
<p>It wasn't AI getting smarter. Three reasons:</p>
<ol>
<li class=""><strong>config.yaml accumulated lessons</strong>: The currency/exchange rate issues were written into config after Feature 1. Features 2 and 3 didn't repeat them.</li>
<li class=""><strong>Test infrastructure was in place</strong>: Vitest was set up after Feature 1; subsequent features built on it directly.</li>
<li class=""><strong>Clearer requirements</strong>: After the first feature, proposal descriptions became more precise, reducing AI's interpretation errors.</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="sdd-vs-vibe-coding-when-to-use-which">SDD vs. Vibe Coding: When to Use Which<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#sdd-vs-vibe-coding-when-to-use-which" class="hash-link" aria-label="Direct link to SDD vs. Vibe Coding: When to Use Which" title="Direct link to SDD vs. Vibe Coding: When to Use Which" translate="no">​</a></h2>
<table><thead><tr><th>Dimension</th><th>Vibe Coding</th><th>Spec-Driven Development</th></tr></thead><tbody><tr><td><strong>Requirement clarity</strong></td><td>Fuzzy is fine, define as you go</td><td>Need to think through feature scope upfront</td></tr><tr><td><strong>Feature complexity</strong></td><td>Small (&lt; 5 files)</td><td>Medium to large (cross-layer, multi-task)</td></tr><tr><td><strong>Drift risk</strong></td><td>High (AI runs fast in wrong direction)</td><td>Low (task checklist constrains direction)</td></tr><tr><td><strong>Flexibility</strong></td><td>High (change direction anytime)</td><td>Requires updating proposal before continuing</td></tr><tr><td><strong>Traceability</strong></td><td>Depends on git history</td><td>Full record in proposal/tasks</td></tr><tr><td><strong>Best for</strong></td><td>Prototyping, exploratory features</td><td>Deliverable features with acceptance criteria</td></tr></tbody></table>
<p><strong>Practical recommendation</strong>:</p>
<blockquote>
<p>Use <strong>Vibe Coding</strong> to validate ideas. Use <strong>SDD</strong> to deliver features.</p>
</blockquote>
<p>Specific decision criteria:</p>
<ul>
<li class="">✅ Change touches 3+ files → use SDD</li>
<li class="">✅ Requires both frontend and backend changes → use SDD</li>
<li class="">✅ Includes database schema changes → use SDD</li>
<li class="">✅ Has explicit acceptance criteria → use SDD</li>
<li class="">⚡ Quick UI tweaks, small bug fixes → Vibe Coding is enough</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="openspec-vs-other-sdd-tools">OpenSpec vs. Other SDD Tools<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#openspec-vs-other-sdd-tools" class="hash-link" aria-label="Direct link to OpenSpec vs. Other SDD Tools" title="Direct link to OpenSpec vs. Other SDD Tools" translate="no">​</a></h2>
<p>Three AI workflow tools with different positioning:</p>
<table><thead><tr><th></th><th><strong>OpenSpec</strong></th><th><strong>SpecKit</strong></th><th><strong>Superpowers</strong></th></tr></thead><tbody><tr><td><strong>Positioning</strong></td><td>Lightweight CLI, focused on change management</td><td>Heavyweight spec framework, full SDD system</td><td>Claude Code Skills extension library</td></tr><tr><td><strong>Spec approach</strong></td><td>Delta spec (write only what changes)</td><td>Full spec (complete specification documents)</td><td>Skill-based workflows</td></tr><tr><td><strong>Learning curve</strong></td><td>Low (up and running in a day)</td><td>High (requires understanding the spec system)</td><td>Low (use Skills directly)</td></tr><tr><td><strong>Best for</strong></td><td>Small-to-medium projects, fast iteration</td><td>Large projects needing strict spec governance</td><td>Enhancing Claude Code capabilities</td></tr><tr><td><strong>Archive mechanism</strong></td><td>Built-in (<code>archive</code> command)</td><td>Built-in</td><td>None</td></tr></tbody></table>
<p>I've used SpecKit-style full-spec SDD at work, OpenSpec on this Finance project, and Superpowers on a personal blog project. My current preference is OpenSpec, and I'm exploring combining it with Superpowers.</p>
<p>OpenSpec handles change management and spec accumulation; Superpowers adds day-to-day workflow enhancements (like <code>brainstorming</code> and <code>verification-before-completion</code>). They don't conflict — they stack.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-takeaways">Key Takeaways<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#key-takeaways" class="hash-link" aria-label="Direct link to Key Takeaways" title="Direct link to Key Takeaways" translate="no">​</a></h2>
<p><strong>Three core recommendations</strong>:</p>
<p><strong>1. config.yaml is the most important investment</strong></p>
<p>Before writing any code, spend 30 minutes putting the project's tech stack, conventions, and known mistakes into config.yaml. This is a one-time investment that pays back on every subsequent feature, with returns that compound over time.</p>
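<p>OpenSpec doesn't mandate a schema for this context, so the shape below is purely illustrative; the keys are hypothetical, populated with the kinds of entries this project accumulated:</p>

```yaml
# Illustrative sketch only: key names are hypothetical, but the entries
# reflect what this article describes going into config.yaml.
tech_stack:
  backend: Spring Boot (Controller / Service / Repository layers)
  frontend: Vue 3 + Vite, tested with Vitest + Vue Test Utils
conventions:
  - Convert all monetary amounts to the base currency before aggregating
  - Fetch exchange rates through the cached ExchangeRateService, never per record
known_mistakes:
  - id: currency-alignment
    rule: Never sum account balances across currencies without conversion
  - id: exchange-rate-performance
    rule: Look up each exchange rate at most once per report run
```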
<p><strong>2. When requirements change significantly, re-propose — don't patch a half-baked proposal</strong></p>
<p>Feature 2's three rounds of requirement changes proved this. When the scope shifts more than ~50% from the original proposal, starting over is faster. AI works more efficiently with clear context; in muddled context, it makes strange decisions.</p>
<p><strong>3. Write every mistake back into config.yaml</strong></p>
<p>This is the biggest behavioral difference between SDD and Vibe Coding. Vibe Coding mistakes leave traces only in git history and tend to recur. SDD mistakes get distilled into structured rules — they become the project's "error prevention DNA."</p>
<p><strong>Quantified results</strong>:</p>
<p>Three features combined:</p>
<ul>
<li class="">Code added: ~<strong>6,100 lines</strong></li>
<li class="">Tasks completed: <strong>78/81</strong></li>
<li class="">Total development time: ~<strong>3.5 hours</strong></li>
<li class="">Average per 100 lines of code: ~<strong>3.5 minutes</strong></li>
</ul>
<p>Feature 1 (~2 hours) included the cost of learning the workflow and building config.yaml. Features 2 and 3 (combined ~87 minutes, ~4,200 lines) represent the actual velocity once SDD is established.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references">References<a href="https://austinxyz.github.io/blogs/blog/2026/03/06/claude-code-spec-driven-development#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<p><strong>OpenSpec</strong></p>
<ul>
<li class=""><a href="https://github.com/Fission-AI/OpenSpec" target="_blank" rel="noopener noreferrer" class="">OpenSpec on GitHub</a> — Official project repository</li>
<li class=""><a href="https://jimmysong.io/zh/book/ai-handbook/sdd/openspec/" target="_blank" rel="noopener noreferrer" class="">OpenSpec Introduction</a> — Full workflow documentation and config.yaml reference</li>
</ul>
<p><strong>Spec-Driven Development — Further Reading</strong></p>
<ul>
<li class=""><a href="https://juejin.cn/post/7605494530017165352" target="_blank" rel="noopener noreferrer" class="">OpenSpec vs SpecKit in Depth</a> — Detailed comparison of design philosophy and use cases <em>(Chinese)</em></li>
<li class=""><a href="https://intent-driven.dev/knowledge/spec-kit-vs-openspec/" target="_blank" rel="noopener noreferrer" class="">SpecKit vs OpenSpec Comparison</a> — Technical comparison from intent-driven.dev</li>
</ul>
<p><strong>Superpowers</strong></p>
<ul>
<li class=""><a href="https://github.com/anthropics/claude-code" target="_blank" rel="noopener noreferrer" class="">Superpowers Claude Code Skills</a> — Workflow enhancement skills for Claude Code, combinable with OpenSpec</li>
</ul>
<p><strong>Other Chapters in This Series</strong></p>
<ul>
<li class=""><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution">Chapter 4: Software Development Methodology in the AI Era</a> — Vibe Coding methodology background</li>
<li class=""><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion">Chapter 6: Conclusion and Future Outlook</a> — Finance project overall data and lessons</li>
</ul>
<p><strong>Finance Project</strong></p>
<ul>
<li class=""><a href="https://github.com/austinxyz/finance" target="_blank" rel="noopener noreferrer" class="">GitHub Repository</a> — Full source code for all examples in this article, including CLAUDE.md, Skills, and openspec configuration</li>
</ul>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="claude-code" term="claude-code"/>
        <category label="future-of-development" term="future-of-development"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[No Junior Engineers? What AI Really Means for Early-Career Developers]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2026/03/04/no-junior-engineers</id>
        <link href="https://austinxyz.github.io/blogs/blog/2026/03/04/no-junior-engineers"/>
        <updated>2026-03-04T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Junior engineering roles are shrinking — but the story is more nuanced than the headlines suggest. A hiring manager's honest take on what AI changes, what it doesn't, and how early-career engineers can still win.]]></summary>
        <content type="html"><![CDATA[<p>There's a narrative spreading through the industry right now: AI is eliminating junior engineering roles, and early-career developers are the first casualties of the automation wave.</p>
<p>After years of interviewing candidates and leading engineering teams, I think this narrative is half right — and dangerously incomplete.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-shrinking-junior-market-is-real">The Shrinking Junior Market Is Real<a href="https://austinxyz.github.io/blogs/blog/2026/03/04/no-junior-engineers#the-shrinking-junior-market-is-real" class="hash-link" aria-label="Direct link to The Shrinking Junior Market Is Real" title="Direct link to The Shrinking Junior Market Is Real" translate="no">​</a></h2>
<p>Let's not soften the reality. The job market for new graduates in software engineering has gotten harder, in both the US and China. Hiring freezes, smaller cohorts, longer job searches — the data backs this up.</p>
<p>The underlying logic is straightforward: a senior engineer with good AI tooling can produce what used to require a senior plus two or three juniors. And from a business perspective, junior engineers are expensive to onboard — they require code reviews, mentorship, and ramp-up time. Many will leave once they've grown. The ROI calculation has shifted.</p>
<p>So yes, the market is contracting at the junior level. Anyone telling you otherwise isn't paying attention.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="but-the-reason-matters">But the Reason Matters<a href="https://austinxyz.github.io/blogs/blog/2026/03/04/no-junior-engineers#but-the-reason-matters" class="hash-link" aria-label="Direct link to But the Reason Matters" title="Direct link to But the Reason Matters" translate="no">​</a></h2>
<p>The reason junior roles are disappearing isn't that junior <em>work</em> has no value — it's that AI has gotten very good at the <em>execution layer</em> of software development.</p>
<p>Here's how I think about it:</p>
<p><strong>What AI handles well (the "Know-How" layer):</strong></p>
<ul>
<li class="">Boilerplate and scaffolding</li>
<li class="">Unit tests and documentation</li>
<li class="">Debugging and error explanation</li>
<li class="">Code translation and refactoring</li>
</ul>
<p>These are exactly the tasks that used to serve as the training ground for junior engineers. That's the uncomfortable truth.</p>
<p><strong>What AI still can't do (the "Know-What" and "Know-Why" layers):</strong></p>
<ul>
<li class="">Deciding what to build and why it matters</li>
<li class="">Understanding business context and stakeholder constraints</li>
<li class="">Making architectural trade-offs under uncertainty</li>
<li class="">Taking accountability when production systems fail</li>
</ul>
<p>The execution layer is compressing. The judgment layer is expanding. This is the core shift.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-ive-actually-seen-in-junior-candidates">What I've Actually Seen in Junior Candidates<a href="https://austinxyz.github.io/blogs/blog/2026/03/04/no-junior-engineers#what-ive-actually-seen-in-junior-candidates" class="hash-link" aria-label="Direct link to What I've Actually Seen in Junior Candidates" title="Direct link to What I've Actually Seen in Junior Candidates" translate="no">​</a></h2>
<p>I've interviewed a lot of junior engineers over the past few years. My honest assessment: this generation is stronger than any cohort I've seen before, in several important ways.</p>
<p><strong>What genuinely impresses me:</strong></p>
<ul>
<li class=""><strong>Technical curiosity.</strong> They adopt new tools fast — AI assistants, new frameworks, emerging practices. They're not waiting for permission to experiment.</li>
<li class=""><strong>Coding fundamentals.</strong> Algorithms, data structures, design patterns — most candidates have solid foundations. The baseline has risen significantly.</li>
<li class=""><strong>Communication.</strong> They can articulate their thinking clearly and hold their own in technical discussions with senior engineers. This used to be a major gap; it's much less so now.</li>
<li class=""><strong>Real-world exposure.</strong> Many have done open source work, side projects, or even early-stage startup experience before they walk into the interview.</li>
</ul>
<p><strong>Where I still see gaps:</strong></p>
<ul>
<li class=""><strong>Depth vs. breadth.</strong> Broad awareness of many technologies, but thin on fundamentals when you dig. They know <em>of</em> distributed systems; they haven't debugged one at 2am.</li>
<li class=""><strong>Large-scale system intuition.</strong> This isn't a criticism — it takes time. But the gap between knowing how a system <em>should</em> work and knowing how it <em>actually</em> behaves under load is enormous, and most junior engineers haven't crossed it yet.</li>
<li class=""><strong>Handling ambiguity.</strong> When requirements are unclear or there's no right answer, many junior engineers get stuck. The reflex to ask "what should I build?" rather than "what problem are we solving?" shows up consistently.</li>
</ul>
<p>The good news: the first two items are learnable. The third is developable. None of them are fatal.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-competency-map-has-shifted">The Competency Map Has Shifted<a href="https://austinxyz.github.io/blogs/blog/2026/03/04/no-junior-engineers#the-competency-map-has-shifted" class="hash-link" aria-label="Direct link to The Competency Map Has Shifted" title="Direct link to The Competency Map Has Shifted" translate="no">​</a></h2>
<p>Here's how I think about the core engineering competencies in the AI era:</p>
<table><thead><tr><th>Competency</th><th>Pre-AI Weight</th><th>AI Era Weight</th></tr></thead><tbody><tr><td>Writing code</td><td>High</td><td>Lower — AI handles much of execution</td></tr><tr><td>System design</td><td>High</td><td>Higher — judgment can't be automated</td></tr><tr><td>Collaboration &amp; influence</td><td>Medium</td><td>Much higher</td></tr><tr><td>Large-scale system experience</td><td>High</td><td>Still the Junior→Senior dividing line</td></tr><tr><td>Engineering methodology</td><td>Medium</td><td>Critical — AI-native workflows are a force multiplier</td></tr></tbody></table>
<p>The last row is where I see the most opportunity for early-career engineers. Everyone is starting from roughly the same place on AI-native development practices — Spec Driven Development, structured prompting, AI-assisted code review. A junior engineer who masters these workflows early isn't just keeping up; they're potentially ahead of senior engineers who haven't adapted.</p>
<p>Paul Graham <a href="https://x.com/paulg/status/1746626025964875965" target="_blank" rel="noopener noreferrer" class="">captured this dynamic well on X</a>, relaying an observation from a tech CEO:</p>
<blockquote>
<p>"Usually 28 year olds are more productive programmers than 22 year olds, because they have more experience. But apparently 22 year olds are now as good as 28 year olds because they're more at ease using AI."</p>
</blockquote>
<p>The experience premium is compressing. The engineers who've spent years building up a productivity advantage through accumulated context are seeing that advantage erode — not because experience stopped mattering, but because AI is leveling the execution baseline faster than experience can differentiate it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-concrete-roadmap-for-early-career-engineers">A Concrete Roadmap for Early-Career Engineers<a href="https://austinxyz.github.io/blogs/blog/2026/03/04/no-junior-engineers#a-concrete-roadmap-for-early-career-engineers" class="hash-link" aria-label="Direct link to A Concrete Roadmap for Early-Career Engineers" title="Direct link to A Concrete Roadmap for Early-Career Engineers" translate="no">​</a></h2>
<p>Given all this, here's how I'd advise a junior engineer to allocate their development energy:</p>
<p><strong>Code faster, think bigger.</strong> Your coding fundamentals are probably fine. Stop optimizing for "writing code faster" and start optimizing for "forming correct technical opinions faster." The ability to quickly learn a new domain and form a defensible point of view is worth more than raw implementation speed.</p>
<p><strong>Invest heavily in system design.</strong> This is the clearest gap between junior and senior engineers, and the one AI is least equipped to bridge. Read real architecture case studies — not textbooks, actual post-mortems and engineering blogs from companies operating at scale. Practice turning vague requirements into concrete architectural proposals.</p>
<p><strong>Take collaboration seriously.</strong> Influence without formal authority, navigating ambiguity, resolving technical disagreements — these skills are undervalued by most junior engineers and highly valued by the market. The engineers who advance fastest are rarely the ones who write the best code in isolation.</p>
<p><strong>Use AI to accelerate your experience curve.</strong> This is the real unlock. The traditional path to large-scale system experience takes years of being in the right place at the right time. AI can compress this by letting you simulate design decisions, explore failure modes, and understand system behavior before you've lived it. Use it that way, not as a code autocomplete.</p>
<p><strong>Go deep on AI-native development practices.</strong> Vibe Coding, Spec Driven Development, structured context management, agentic workflows — these aren't hype. They're a genuine productivity multiplier, and the playing field is relatively level right now. Junior engineers who master them early will have a real advantage.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-profiles-that-actually-worry-me">The Profiles That Actually Worry Me<a href="https://austinxyz.github.io/blogs/blog/2026/03/04/no-junior-engineers#the-profiles-that-actually-worry-me" class="hash-link" aria-label="Direct link to The Profiles That Actually Worry Me" title="Direct link to The Profiles That Actually Worry Me" translate="no">​</a></h2>
<p>Here's what I find more concerning than the junior engineer who's struggling to find their first job: the mid-level engineer who spent five years in a narrow lane, knows just enough to be confident, and hasn't developed genuine depth in anything.</p>
<p>The engineers most at risk from AI aren't the ones just starting out. They're the ones who:</p>
<ul>
<li class="">Stayed comfortable in a specific technology stack without developing transferable judgment</li>
<li class="">Are either afraid of AI tools or entirely dependent on them — neither extreme works</li>
<li class="">Let their computer science fundamentals atrophy without replacing them with something more durable</li>
</ul>
<p>Junior engineers have something these people don't: they haven't formed the habits yet. That's actually an advantage.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-this-moment-actually-is">What This Moment Actually Is<a href="https://austinxyz.github.io/blogs/blog/2026/03/04/no-junior-engineers#what-this-moment-actually-is" class="hash-link" aria-label="Direct link to What This Moment Actually Is" title="Direct link to What This Moment Actually Is" translate="no">​</a></h2>
<p>I don't think this is the end of junior engineering. I think it's a recalibration.</p>
<p>The execution layer of software development is being commoditized. The judgment layer — understanding what to build, why it matters, how to architect it at scale, and how to align a team around it — is becoming more valuable, not less.</p>
<p>Early-career engineers who understand this distinction and invest accordingly will find that the AI era is full of opportunity. You have better tools than any previous generation. The bar for what a single engineer can produce is higher than it's ever been.</p>
<p>The question isn't whether there's room for you. There is. The question is whether you're building the capabilities that will matter, or optimizing for the ones that are being automated away.</p>
<hr>
<p><em>Working through similar questions about AI and career development? Happy to discuss in the comments.</em></p>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="career-development" term="career-development"/>
        <category label="leadership" term="leadership"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Does AI/Vibe Coding Really Deliver 10x Productivity?]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2026/03/01/ai-vibe-coding-10x-productivity</id>
        <link href="https://austinxyz.github.io/blogs/blog/2026/03/01/ai-vibe-coding-10x-productivity"/>
        <updated>2026-03-01T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Honest reflections on whether AI-assisted development actually delivers the 10x productivity promise — what works, what disappoints, and what it means at team scale.]]></summary>
        <content type="html"><![CDATA[<p>In early 2026, Anthropic published a case study: 16 Claude agents, working in parallel Docker containers, wrote 100,000 lines of Rust code in a few weeks — a C compiler that could successfully compile the Linux kernel. The API bill came to roughly $20,000. By almost any measure, it was an extraordinary result.</p>
<p>Then I mentioned it to a friend of mine, a CTO at a small startup. His response: <em>"The best strategy right now is probably to wait."</em></p>
<p>That tension — between a genuine technical milestone and a seasoned engineer's skepticism — is what this post is really about.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-noise-around-10x">The Noise Around 10x<a href="https://austinxyz.github.io/blogs/blog/2026/03/01/ai-vibe-coding-10x-productivity#the-noise-around-10x" class="hash-link" aria-label="Direct link to The Noise Around 10x" title="Direct link to The Noise Around 10x" translate="no">​</a></h2>
<p>The AI productivity conversation has a familiar shape. On one side: breathless claims about 10x engineers, the death of junior developers, and Karpathy's "Software 3.0." On the other: practitioners who've tried the tools and found the reality more complicated than the demos suggest.</p>
<p>My CTO friend cuts through both extremes well: when an entire industry's revenue depends on harvesting people's anxiety — <em>you'll be left behind if you don't adopt this now</em>, or <em>you can get rich with this immediately</em> — that's a warning sign. And he's right that most people claiming "AI works perfectly now" are overstating it.</p>
<p>But the macro trend is real. The people dismissing it entirely are also wrong. The truth is somewhere more interesting than either camp admits, and I've spent the past several months trying to find it — through a personal finance management system, this blog system, and a resume screening agent.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-vibe-coding-disappoints-in-practice">Why Vibe Coding Disappoints in Practice<a href="https://austinxyz.github.io/blogs/blog/2026/03/01/ai-vibe-coding-10x-productivity#why-vibe-coding-disappoints-in-practice" class="hash-link" aria-label="Direct link to Why Vibe Coding Disappoints in Practice" title="Direct link to Why Vibe Coding Disappoints in Practice" translate="no">​</a></h2>
<p>"Vibe coding" — Karpathy's term — describes a mode of development where you describe what you want in natural language and let the AI figure out the rest. It sounds liberating. In demos, it looks effortless. Here's what the demos don't show:</p>
<p><strong>Sessions drift.</strong> AI models have context windows, and long conversations accumulate noise. After an hour of back-and-forth, the model loses track of constraints established early in the session. The code it writes in hour three doesn't quite match the architecture it agreed to in hour one.</p>
<p><strong>Without tests, AI solves the wrong problem efficiently.</strong> This is the killer. If your verification loop is loose — if you're eyeballing the output — the AI will confidently build exactly what you didn't ask for. The Anthropic compiler team spent enormous effort on high-quality test suites precisely because an autonomous AI without a tight feedback loop goes off the rails at speed.</p>
<p><strong>Spec-driven development feels like losing your pairing partner.</strong> I tried breaking features into 40+ granular tasks with full specs. It reduced chaos and the AI stayed on track. But it stopped feeling like pair programming. I was waiting, not collaborating. The overhead of maintaining the spec system started eating the productivity gains.</p>
<p><strong>Multi-agent, multi-window multiplies your context switching, not the AI's.</strong> I experimented with running multiple Claude instances in parallel. The bottleneck turned out to be me — my own attention was the scarce resource.</p>
<p>None of this means the tools don't work. It means they require a different discipline than the demos suggest.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-actually-gets-you-to-10x">What Actually Gets You to 10x<a href="https://austinxyz.github.io/blogs/blog/2026/03/01/ai-vibe-coding-10x-productivity#what-actually-gets-you-to-10x" class="hash-link" aria-label="Direct link to What Actually Gets You to 10x" title="Direct link to What Actually Gets You to 10x" translate="no">​</a></h2>
<p>After months of daily use across multiple projects, these are the practices that actually move the needle.</p>
<p><strong>1. CLAUDE.md is your foundation.</strong>
A well-maintained project context file is the single highest-ROI investment you can make. It prevents the AI from repeating mistakes, re-learning your conventions, or contradicting decisions you've already made. Every time the AI does something wrong that it shouldn't have needed to be told, that goes in CLAUDE.md. The compounding effect is real.</p>
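<p>A minimal sketch of what such a file might contain (the specific entries below are illustrative, not from any real project):</p>

```markdown
# CLAUDE.md (illustrative example)

## Conventions
- TypeScript strict mode; no `any` without a justifying comment.
- All API handlers live in `src/api/`, one file per route.

## Decisions already made (do not revisit)
- PostgreSQL, not MongoDB; dates are stored in UTC and converted
  only at the presentation layer.

## Known mistakes to avoid
- Do not add retry logic in handlers; the gateway already retries.
```

<p>The "known mistakes" section is where the compounding happens: every correction you would otherwise repeat in chat gets written down once.</p>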
<p><strong>2. Plan → Review → Execute, not vibe → fix → vibe.</strong>
The workflow that works: use planning mode to design the approach, review it before touching code, then execute. I've been combining <code>planning-with-files</code> and <code>superpowers</code> skills for this — persistent markdown plans the AI can read across sessions, with structured review checkpoints. The overhead feels like friction until you realize how much rework you're avoiding.</p>
<p><strong>3. Replace vibe with tools wherever you can.</strong>
Every time you ask an AI to "figure out" something that's actually deterministic — reading a config, running a query, executing a known command sequence — you're introducing unnecessary uncertainty. MCPs, scripts, and templates convert vague AI judgment calls into reliable tool invocations. The more you constrain the problem space, the more predictable the output.</p>
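<p>A minimal sketch of the idea (the config schema and function name here are hypothetical): instead of asking the AI to infer a service's port from prose, expose a deterministic lookup it can invoke as a tool.</p>

```python
import json
from pathlib import Path

# Hypothetical sketch: rather than asking the AI to "figure out" which port
# a service uses, expose a deterministic lookup it can call as a tool.
# Same input, same output, every time.
def get_service_port(config_path: str, service: str) -> int:
    """Read the port for one service from a JSON config file."""
    config = json.loads(Path(config_path).read_text())
    return int(config["services"][service]["port"])

# Demo fixture (illustrative):
Path("services.json").write_text(
    json.dumps({"services": {"api": {"port": 8080}}})
)
print(get_service_port("services.json", "api"))  # prints 8080
```

<p>Wrapped as an MCP tool or a script, this removes one judgment call from the AI's plate entirely.</p>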
<p><strong>4. Tests close the loop.</strong>
Not because testing is virtuous in some abstract sense, but because without automated verification, you can't let the AI work autonomously. Tests are what transform AI from an assistant that needs constant supervision into something that can run unsupervised. Contract-driven development — define the interface, let AI generate tests from it, then implement to pass those tests — has worked well for me.</p>
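<p>A small sketch of the contract-driven loop, with hypothetical names (<code>Scorer</code>, <code>KeywordScorer</code>) standing in for whatever interface your feature actually needs: define the contract, write tests against the contract, then implement until they pass.</p>

```python
from typing import Protocol

# 1. The contract: an interface, defined before any implementation exists.
class Scorer(Protocol):
    def score(self, resume: str, keywords: list[str]) -> float:
        """Return a match score in [0.0, 1.0]."""
        ...

# 2. Tests derived from the contract, not from any implementation.
def check_contract(scorer: Scorer) -> None:
    assert scorer.score("", ["python"]) == 0.0            # no text, no match
    assert scorer.score("python dev", ["python"]) == 1.0  # full match
    assert 0.0 <= scorer.score("go and python", ["python", "rust"]) <= 1.0

# 3. An implementation written (or AI-generated) to satisfy the tests.
class KeywordScorer:
    def score(self, resume: str, keywords: list[str]) -> float:
        if not keywords:
            return 0.0
        hits = sum(1 for k in keywords if k.lower() in resume.lower())
        return hits / len(keywords)

check_contract(KeywordScorer())  # passes silently if the contract holds
```

<p>Once step 2 exists, step 3 can run unsupervised: the AI iterates against <code>check_contract</code> instead of against your patience.</p>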
<p><strong>5. Do less, better.</strong>
AI makes it easy to generate a lot of things: documentation, boilerplate, edge case handlers, abstractions. Most of it makes the codebase worse. The discipline is subtraction: <code>.claudeignore</code> to keep the context clean, concise docs that stay synchronized, small iterations that stay reviewable. Complexity is cheap to create and expensive to maintain.</p>
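<p>For instance, assuming gitignore-style patterns, a <code>.claudeignore</code> that keeps generated and bulky files out of the context might look like this (entries are illustrative):</p>

```text
# .claudeignore — keep generated and bulky files out of the AI's context
node_modules/
dist/
coverage/
*.min.js
*.lock
docs/archive/
```
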
<p><strong>6. Agile thinking collapses into one person.</strong>
The AI era has effectively collapsed the design/dev/test/deploy cycle for solo projects. What used to be sequential ceremonies — sprint planning, standup, review, retrospective — can now happen within a single focused session. Small iterations, tests first, immediate feedback. This is where the gains are most tangible and most reliable.</p>
<p><strong>7. Build and maintain a knowledge base.</strong>
Architecture decision records, design best practices, conventions your team has settled on — these matter enormously. When the AI has access to well-structured reference material, it stops guessing and starts reasoning from your actual context. A good knowledge base is the difference between an AI that understands your system and one that hallucinates a plausible-looking alternative.</p>
<p><strong>8. Give it an example to copy from.</strong>
This one is underrated. When I provide a concrete reference — an existing component, a similar API, a prior implementation — the output quality jumps dramatically. AI is exceptionally good at pattern-matching and adaptation. Without an example, it invents its own patterns. With one, it extends yours. The difference in speed and quality is easily 10x.</p>
<p>The honest number? For the right tasks — greenfield features, focused refactors, documentation, test generation — I see something close to 3–5x on a good day. Not every day, not for everything. But consistently, for the tasks where it works, it's genuinely transformative.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="from-solo-to-team-scale">From Solo to Team Scale<a href="https://austinxyz.github.io/blogs/blog/2026/03/01/ai-vibe-coding-10x-productivity#from-solo-to-team-scale" class="hash-link" aria-label="Direct link to From Solo to Team Scale" title="Direct link to From Solo to Team Scale" translate="no">​</a></h2>
<p>Here's where it gets complicated.</p>
<p>Individual productivity gains are real. But organizations are not simply collections of individuals, and the math stops being straightforward when you try to reason about what AI means at team or company scale.</p>
<p>If one engineer with AI can do the work of several, the immediate instinct is: <em>great, we need fewer people, or we can do more with the same team.</em> Both are probably partially true. But the dependencies between teams don't shrink just because individuals move faster. An IC who can build ten things still needs product decisions, design reviews, security sign-off, and infrastructure support. Individual speed tends to surface coordination costs rather than resolve them.</p>
<p>A few specific challenges I see at scale:</p>
<p><strong>AI over-engineers collaboration surfaces.</strong> When AI generates perfect documentation, perfectly structured code, and exhaustive test coverage, it paradoxically becomes harder for colleagues to contribute. The bar for a pull request quietly rises. Disagreements become harder to resolve because the AI-generated baseline looks authoritative. Sometimes a messy codebase is easier to change than a pristine one.</p>
<p><strong>Production risk doesn't disappear.</strong> An incident in production can cause irreversible damage — data loss, customer impact, compliance violations. AI cannot and should not have autonomous authority over production systems. The human-in-the-loop requirement isn't a limitation to work around; it's the right design. AI plans, humans approve and execute.</p>
<p><strong>Skill atrophy is a real risk.</strong> If you rely on AI for every technical decision, what happens when the AI is wrong and you can't tell? The engineers who get the most from AI are the ones who understand what it's doing well enough to catch its mistakes. That requires maintaining real technical depth — which sometimes means doing things the hard way.</p>
<p><strong>Management expectations can be the most dangerous failure mode.</strong> "If AI gives us 10x productivity, we should be doing 10x the work" — applied uncritically, this logic leads to burnout, quality collapse, and the loss of conditions that make AI-assisted development work in the first place.</p>
<p>What helps at scale: shared MCP libraries that standardize how AI interacts with your systems, high cohesion and low coupling at the individual level so AI-assisted work has fewer dangerous blast radii, and honest conversations about what AI can and can't be trusted to do unsupervised.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-honest-answer">The Honest Answer<a href="https://austinxyz.github.io/blogs/blog/2026/03/01/ai-vibe-coding-10x-productivity#the-honest-answer" class="hash-link" aria-label="Direct link to The Honest Answer" title="Direct link to The Honest Answer" translate="no">​</a></h2>
<p>AI/Vibe Coding is developing fast and the productivity gains are real. I've experienced them firsthand. But it's an easy-to-start, hard-to-master discipline. The ceiling is high. So is the learning curve.</p>
<p>My friend's instinct to wait isn't wrong, exactly. The hype warrants skepticism, and the tools are still rough around the edges. But they aren't standing still. The gap between an engineer who has figured out how to work well with AI and one who hasn't is widening every month.</p>
<p>The limiting factor isn't the tools anymore. The tools are good enough. The limiting factor is the engineer — their discipline, their taste, their willingness to invest in the feedback loops and context structures that let AI work well.</p>
<p>The question isn't whether AI can deliver 10x. In the right conditions, it can, and then some. The question is whether you've built the conditions.</p>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="career-development" term="career-development"/>
        <category label="cloud-computing" term="cloud-computing"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[20 Years of Platform Engineering: Lessons from Building Cloud at Scale]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering</id>
        <link href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering"/>
        <updated>2026-02-26T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Reflections on two decades building cloud platforms, from VMware to Kubernetes, and lessons learned along the way.]]></summary>
        <content type="html"><![CDATA[<p>Looking back at 20 years in platform engineering feels both humbling and exhilarating. From building RAD tools for web applications in 2000 to managing Kubernetes clusters with 2 million pods today, the journey has been one of continuous learning, adaptation, and growth. This is my story of building platforms at scale, and the lessons I've learned along the way.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-journey-begins-2000-2007">The Journey Begins: 2000-2007<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#the-journey-begins-2000-2007" class="hash-link" aria-label="Direct link to The Journey Begins: 2000-2007" title="Direct link to The Journey Begins: 2000-2007" translate="no">​</a></h2>
<p>I started my career in 2000 at Mainet System Inc. in Shanghai, fresh out of graduate school from Zhejiang University. The dot-com boom was in full swing, and enterprises were just beginning to understand the power of web-based applications.</p>
<p>As a Project Manager and Architect, I worked on building RAD (Rapid Application Development) tools that would help developers build web applications faster. We used Spring, Hibernate, and Struts—technologies that were cutting-edge at the time. Our team developed web-based ERP products including CRM, HRM, and CMS systems for over 40 enterprise customers.</p>
<p>One of my proudest achievements during this period was building and scaling R&amp;D teams across three cities—Shanghai, Shenzhen, and Qingdao—eventually growing to 50+ developers. I learned early on that great platforms aren't just about technology; they're about building great teams.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="joining-ebay-the-devex-revolution-2007-2012">Joining eBay: The DevEx Revolution (2007-2012)<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#joining-ebay-the-devex-revolution-2007-2012" class="hash-link" aria-label="Direct link to Joining eBay: The DevEx Revolution (2007-2012)" title="Direct link to Joining eBay: The DevEx Revolution (2007-2012)" translate="no">​</a></h2>
<p>In 2007, I made a pivotal career move to eBay China. I was tasked with building the Platform Development Experience Tools team from the ground up. Our mission was simple but ambitious: improve productivity for eBay's 2,000+ developers.</p>
<p>We built Raptor IDE, optimized eBay's development environment, and created wizards for frontend, service, and database frameworks. The impact was immediate and measurable. What used to take developers days could now be done in hours. We also developed eBay APIs and SDKs for third-party developers, expanding the platform's reach beyond internal teams.</p>
<p>This experience taught me a fundamental lesson: <strong>Developer Experience is Everything</strong>. A platform is only as good as the experience it provides to its users. If developers struggle with your tools, they won't adopt them, no matter how technically sophisticated they are.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-cloud-transformation-three-generations-2012-2017">The Cloud Transformation: Three Generations (2012-2017)<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#the-cloud-transformation-three-generations-2012-2017" class="hash-link" aria-label="Direct link to The Cloud Transformation: Three Generations (2012-2017)" title="Direct link to The Cloud Transformation: Three Generations (2012-2017)" translate="no">​</a></h2>
<p>In 2012, I took on a new challenge as Software Development Manager for Cloud at eBay's China Center of Excellence. This marked the beginning of my deep dive into cloud infrastructure.</p>
<p>Leading a team of 20 engineers, we architected and delivered eBay's internal cloud platforms, covering both IaaS and PaaS. This was the era of private cloud, and we went through three distinct generations of infrastructure:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="generation-1-vmware">Generation 1: VMware<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#generation-1-vmware" class="hash-link" aria-label="Direct link to Generation 1: VMware" title="Direct link to Generation 1: VMware" translate="no">​</a></h3>
<p>We started with VMware-based virtualization. It worked, but provisioning a new environment took days, sometimes weeks. The manual processes were a bottleneck.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="generation-2-openstack">Generation 2: OpenStack<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#generation-2-openstack" class="hash-link" aria-label="Direct link to Generation 2: OpenStack" title="Direct link to Generation 2: OpenStack" translate="no">​</a></h3>
<p>We then built our own OpenStack-based cloud. I designed a MongoDB-based Configuration Management System (CMS) that became the backbone of our infrastructure. We also built Zebra, a fully automated provisioning system that reduced provisioning time from days to just 10 minutes.</p>
<p>This was a game-changer. Developers could now get environments on-demand, and our infrastructure team could focus on innovation rather than manual toil.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="generation-3-kubernetes">Generation 3: Kubernetes<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#generation-3-kubernetes" class="hash-link" aria-label="Direct link to Generation 3: Kubernetes" title="Direct link to Generation 3: Kubernetes" translate="no">​</a></h3>
<p>By 2015-2016, we saw the writing on the wall. Containers and Kubernetes were the future. We started planning our migration to what would become eBay's container platform—Tess.</p>
<p>During this period, I also drove eBay China's Innovation Program, and we won multiple Skunkworks awards. The lesson here: <strong>Start Simple, Then Scale</strong>. Each generation of our platform built on the lessons of the previous one.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="kubernetes-at-scale-the-migration-years-2017-2023">Kubernetes at Scale: The Migration Years (2017-2023)<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#kubernetes-at-scale-the-migration-years-2017-2023" class="hash-link" aria-label="Direct link to Kubernetes at Scale: The Migration Years (2017-2023)" title="Direct link to Kubernetes at Scale: The Migration Years (2017-2023)" translate="no">​</a></h2>
<p>In 2017, I returned to San Jose to lead the Cloud App Lifecycle team. Our mission was massive: migrate 5,000+ applications from OpenStack to Kubernetes, with zero downtime.</p>
<p>This was platform engineering at its most challenging. We were running:</p>
<ul>
<li class="">Applications across multiple regions and 25 availability zones</li>
<li class="">100+ Kubernetes clusters</li>
<li class="">2 million+ pods in production</li>
<li class="">35,000+ deployments per week across 20,000+ app pools</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="building-the-platform">Building the Platform<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#building-the-platform" class="hash-link" aria-label="Direct link to Building the Platform" title="Direct link to Building the Platform" translate="no">​</a></h3>
<p>We developed an end-to-end CI/CD platform using Tekton pipelines. We built an auto-deployment platform that could handle the massive scale of eBay's deployments. And critically, we built a self-healing remediation system called LOM (Lifecycle Operations Manager) that used observability signals to detect and fix issues automatically.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-regional-migration">The Regional Migration<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#the-regional-migration" class="hash-link" aria-label="Direct link to The Regional Migration" title="Direct link to The Regional Migration" translate="no">​</a></h3>
<p>One of the most challenging projects was the regional data center migration—exiting PHX and launching RENO—with zero downtime. This required meticulous planning, automation, and coordination across dozens of teams. The fact that we pulled it off without impacting customers remains one of my proudest achievements.</p>
<p>The key lesson: <strong>Automate Everything You Can</strong>. At this scale, manual processes don't work. Automation isn't just a nice-to-have; it's essential for survival.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="leading-cloud-fleet-the-present-2023-present">Leading Cloud Fleet: The Present (2023-Present)<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#leading-cloud-fleet-the-present-2023-present" class="hash-link" aria-label="Direct link to Leading Cloud Fleet: The Present (2023-Present)" title="Direct link to Leading Cloud Fleet: The Present (2023-Present)" translate="no">​</a></h2>
<p>In 2023, I took on my current role as Software Development Manager for Cloud Fleet Management. This role has been about taking everything we learned and scaling it globally.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="global-expansion">Global Expansion<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#global-expansion" class="hash-link" aria-label="Direct link to Global Expansion" title="Direct link to Global Expansion" translate="no">​</a></h3>
<p>We established engineering hubs in Europe and India, hiring 8+ engineers and building truly distributed teams. Managing global teams across time zones requires different skills than managing co-located teams, but the diversity of perspectives and round-the-clock coverage has been invaluable.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="innovation-programs">Innovation Programs<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#innovation-programs" class="hash-link" aria-label="Direct link to Innovation Programs" title="Direct link to Innovation Programs" translate="no">​</a></h3>
<p>We participated in the DoJ and Jade programs, accelerating the stand-up of new availability zones and clusters. We've expanded eBay's Kubernetes infrastructure to 3 regions, 25 AZs, and continue to scale.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="ai-powered-engineering">AI-Powered Engineering<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#ai-powered-engineering" class="hash-link" aria-label="Direct link to AI-Powered Engineering" title="Direct link to AI-Powered Engineering" translate="no">​</a></h3>
<p>Perhaps most exciting has been our exploration of AI in platform engineering. We've built 6+ Claude-based hiring skills and piloted spec-driven development. AI isn't replacing engineers—it's making us more productive and allowing us to focus on higher-level problems.</p>
<p>The lesson: <strong>Embrace Change and Innovation</strong>. The technologies that got us here won't get us where we need to go next.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="five-key-lessons-from-20-years">Five Key Lessons from 20 Years<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#five-key-lessons-from-20-years" class="hash-link" aria-label="Direct link to Five Key Lessons from 20 Years" title="Direct link to Five Key Lessons from 20 Years" translate="no">​</a></h2>
<p>Looking back, here are the most important lessons I've learned:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-developer-experience-is-everything">1. Developer Experience is Everything<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#1-developer-experience-is-everything" class="hash-link" aria-label="Direct link to 1. Developer Experience is Everything" title="Direct link to 1. Developer Experience is Everything" translate="no">​</a></h3>
<p>Your platform's success depends on how well it serves developers. Invest in tools, documentation, and support. Make the easy things easy and the hard things possible.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-start-simple-then-scale">2. Start Simple, Then Scale<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#2-start-simple-then-scale" class="hash-link" aria-label="Direct link to 2. Start Simple, Then Scale" title="Direct link to 2. Start Simple, Then Scale" translate="no">​</a></h3>
<p>Don't over-engineer from day one. Build what you need now, but architect for future scale. Each iteration should be better than the last.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-automate-everything-you-can">3. Automate Everything You Can<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#3-automate-everything-you-can" class="hash-link" aria-label="Direct link to 3. Automate Everything You Can" title="Direct link to 3. Automate Everything You Can" translate="no">​</a></h3>
<p>At scale, manual processes break down. Automation isn't just about efficiency—it's about reliability and consistency.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-embrace-change">4. Embrace Change<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#4-embrace-change" class="hash-link" aria-label="Direct link to 4. Embrace Change" title="Direct link to 4. Embrace Change" translate="no">​</a></h3>
<p>The technology landscape changes rapidly. What's cutting-edge today will be legacy tomorrow. Stay curious, keep learning, and don't be afraid to reinvent your platform when needed.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-build-great-teams">5. Build Great Teams<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#5-build-great-teams" class="hash-link" aria-label="Direct link to 5. Build Great Teams" title="Direct link to 5. Build Great Teams" translate="no">​</a></h3>
<p>Technology is important, but people are more important. Hire well, mentor generously, and create an environment where engineers can do their best work. The best platforms are built by the best teams.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="whats-next">What's Next?<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#whats-next" class="hash-link" aria-label="Direct link to What's Next?" title="Direct link to What's Next?" translate="no">​</a></h2>
<p>After 20 years, I'm more excited about platform engineering than ever. The next frontiers include:</p>
<ul>
<li class=""><strong>AI-Native Platforms</strong>: How do we build platforms that leverage AI to self-optimize, self-heal, and even self-evolve?</li>
<li class=""><strong>Multi-Cloud Orchestration</strong>: As organizations adopt multi-cloud strategies, the complexity of orchestration will increase. How do we abstract this complexity while maintaining flexibility?</li>
<li class=""><strong>Sustainability</strong>: As platforms grow, so does their environmental impact. How do we build efficient, sustainable platforms that minimize waste?</li>
<li class=""><strong>Developer Productivity</strong>: With AI assistance, what will developer workflows look like? How do platforms need to evolve to support AI-augmented development?</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="closing-thoughts">Closing Thoughts<a href="https://austinxyz.github.io/blogs/blog/2026/02/26/20-years-platform-engineering#closing-thoughts" class="hash-link" aria-label="Direct link to Closing Thoughts" title="Direct link to Closing Thoughts" translate="no">​</a></h2>
<p>Twenty years ago, I couldn't have imagined managing Kubernetes clusters with 2 million pods. Twenty years from now, today's cutting-edge will seem quaint. That's what makes this field so exciting.</p>
<p>To those starting their careers in platform engineering: embrace the journey. Learn the fundamentals, stay curious, invest in people, and don't be afraid to reinvent yourself every few years. The platforms you build will enable thousands of developers to create amazing things.</p>
<p>And that's what it's all about—building the foundation that empowers others to build the future.</p>
<hr>
<p><em>What has your experience been with platform engineering? What lessons have you learned? I'd love to hear your thoughts in the comments below.</em></p>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="cloud-computing" term="cloud-computing"/>
        <category label="kubernetes" term="kubernetes"/>
        <category label="career-development" term="career-development"/>
        <category label="leadership" term="leadership"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Taming AI Agent Uncertainty: What Resume Screening Taught Me About Reliability]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty</id>
        <link href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty"/>
        <updated>2025-12-25T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Building reliable AI agents by applying HR best practices to manage uncertainty. Lessons learned from developing a resume screening agent with inconsistent scoring.]]></summary>
        <content type="html"><![CDATA[<p>Same resume. Same job description. Two different scores: 78/100, then 68/100.</p>
<p>I had built a <code>resume-jd-matcher</code> agent to automate candidate screening. On a whim, I ran the same resume through it twice. The inconsistency wasn't just frustrating—it was dangerous. How could I trust hiring decisions based on unpredictable evaluations? How could I ensure fairness to candidates when the same resume might score differently depending on when it was assessed?</p>
<p><strong>The core challenge:</strong> AI agents complete tasks differently than traditional programs. They're probabilistic, not deterministic. The same input can produce different outputs due to sampling and contextual variations. In many ways, AI behaves more like human judgment than code execution.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-familiar-problem">The Familiar Problem<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#the-familiar-problem" class="hash-link" aria-label="Direct link to The Familiar Problem" title="Direct link to The Familiar Problem" translate="no">​</a></h2>
<p>This challenge felt strangely familiar. Human hiring managers face the exact same issues:</p>
<ul>
<li class="">Reviewing the same resume today versus tomorrow might yield different impressions</li>
<li class="">Different interviewers have inconsistent opinions about the same candidate</li>
<li class="">It's hard to quantify what "good enough" actually means</li>
</ul>
<p>But HR has spent decades developing best practices to manage subjectivity. Could we apply these same principles to AI agents?</p>
<blockquote>
<p><strong>What if the solution to AI uncertainty already exists in how we handle human uncertainty?</strong></p>
</blockquote>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-humans-ensure-objectivity">How Humans Ensure Objectivity<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#how-humans-ensure-objectivity" class="hash-link" aria-label="Direct link to How Humans Ensure Objectivity" title="Direct link to How Humans Ensure Objectivity" translate="no">​</a></h2>
<p>When designing fair evaluation systems, HR professionals use several proven strategies:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-process-decomposition">1. <strong>Process Decomposition</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#1-process-decomposition" class="hash-link" aria-label="Direct link to 1-process-decomposition" title="Direct link to 1-process-decomposition" translate="no">​</a></h3>
<p>Break the judgment process into multiple steps. For deterministic steps, use programmatic checks (e.g., "Does the candidate have a bachelor's degree?" → Yes/No). For subjective steps, provide clear criteria.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-detailed-scoring-rubrics">2. <strong>Detailed Scoring Rubrics</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#2-detailed-scoring-rubrics" class="hash-link" aria-label="Direct link to 2-detailed-scoring-rubrics" title="Direct link to 2-detailed-scoring-rubrics" translate="no">​</a></h3>
<p>Instead of vague assessments ("strong candidate"), use quantified scoring with explicit formulas:</p>
<ul>
<li class="">3 years experience = 8 points</li>
<li class="">4 years experience = 9 points</li>
<li class="">5+ years experience = 10 points</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-review-and-consistency-checks">3. <strong>Review and Consistency Checks</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#3-review-and-consistency-checks" class="hash-link" aria-label="Direct link to 3-review-and-consistency-checks" title="Direct link to 3-review-and-consistency-checks" translate="no">​</a></h3>
<p>Compare assessments to identify inconsistencies. Flag outliers for re-review.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-multi-reviewer-validation">4. <strong>Multi-Reviewer Validation</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#4-multi-reviewer-validation" class="hash-link" aria-label="Direct link to 4-multi-reviewer-validation" title="Direct link to 4-multi-reviewer-validation" translate="no">​</a></h3>
<p>Have multiple people evaluate the same candidate. Aggregate their independent judgments.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-reference-benchmarks">5. <strong>Reference Benchmarks</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#5-reference-benchmarks" class="hash-link" aria-label="Direct link to 5-reference-benchmarks" title="Direct link to 5-reference-benchmarks" translate="no">​</a></h3>
<p>Provide calibration examples: "This is what a 90/100 candidate looks like. This is 70/100."</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="6-feedback-loops">6. <strong>Feedback Loops</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#6-feedback-loops" class="hash-link" aria-label="Direct link to 6-feedback-loops" title="Direct link to 6-feedback-loops" translate="no">​</a></h3>
<p>Track false positives and negatives. Update criteria based on actual outcomes.</p>
<p>These aren't theoretical—they're battle-tested practices that make human evaluation systems work.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="translating-hr-principles-to-ai-agents">Translating HR Principles to AI Agents<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#translating-hr-principles-to-ai-agents" class="hash-link" aria-label="Direct link to Translating HR Principles to AI Agents" title="Direct link to Translating HR Principles to AI Agents" translate="no">​</a></h2>
<p>Here's how I adapted each HR best practice for my AI agent system:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-process-decomposition--workflows--scriptmcp">1. <strong>Process Decomposition → Workflows + Script/MCP</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#1-process-decomposition--workflows--scriptmcp" class="hash-link" aria-label="Direct link to 1-process-decomposition--workflows--scriptmcp" title="Direct link to 1-process-decomposition--workflows--scriptmcp" translate="no">​</a></h3>
<p><strong>Agents naturally support workflows:</strong> Break evaluation into stages (information extraction → evidence categorization → score calculation).</p>
<p><strong>Deterministic parts → Tools:</strong> For programmatic checks (e.g., calculating years of experience), use Script or MCP tools instead of LLM reasoning. This removes uncertainty from parts of the process that should be deterministic.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Instead of asking the AI to "estimate years of experience"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Use a deterministic function:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">calculate_years_of_experience</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">resume_data</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    experiences </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> resume_data</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'work_history'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    total_months </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token builtin">sum</span><span 
class="token punctuation" style="color:#393A34">(</span><span class="token plain">exp</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">'duration_months'</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> exp </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> experiences</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> total_months </span><span class="token operator" style="color:#393A34">/</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">12</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-detailed-scoring-rubrics--explicit-formulas-in-prompts">2. <strong>Detailed Scoring Rubrics → Explicit Formulas in Prompts</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#2-detailed-scoring-rubrics--explicit-formulas-in-prompts" class="hash-link" aria-label="Direct link to 2-detailed-scoring-rubrics--explicit-formulas-in-prompts" title="Direct link to 2-detailed-scoring-rubrics--explicit-formulas-in-prompts" translate="no">​</a></h3>
<p>I created a <code>scoring-rubric.md</code> that provides precise formulas for each category:</p>
<div class="language-markdown codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-markdown codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token title important punctuation" style="color:#393A34">##</span><span class="token title important"> Years of Experience (10 points)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">if relevant_years &gt;= 5:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    score = 10 points</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">elif relevant_years &gt;= 3:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    score = 8 + (years - 3) × 1 point</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">elif relevant_years &gt;= 2:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    score = 6 points</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">else:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    score = (relevant_years / 2) × 6 points</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">Evidence required:</span><span class="token bold punctuation" 
style="color:#393A34">**</span><span class="token plain"> Extract specific job titles, dates, and responsibilities</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">Round down when uncertain:</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain"> If dates are ambiguous, use the lower estimate</span><br></span></code></pre></div></div>
<p>This eliminates ambiguity. Instead of "evaluate experience level," the agent now has clear mathematical guidance.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-review-and-consistency--calibration-checks">3. <strong>Review and Consistency → Calibration Checks</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#3-review-and-consistency--calibration-checks" class="hash-link" aria-label="Direct link to 3-review-and-consistency--calibration-checks" title="Direct link to 3-review-and-consistency--calibration-checks" translate="no">​</a></h3>
<p>I added a calibration step to the agent prompt:</p>
<div class="language-markdown codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-markdown codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">Calibration Check</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Most similar to: [Reference Example A/B/C/D]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Score differential: [Higher/Lower than reference by X points]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Justification: [If differential &gt;10 points, explain why]</span><br></span></code></pre></div></div>
<p>The agent must compare every evaluation against benchmark cases and provide reasoning for deviations.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-multi-reviewer--multi-agent-validation">4. <strong>Multi-Reviewer → Multi-Agent Validation</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#4-multi-reviewer--multi-agent-validation" class="hash-link" aria-label="Direct link to 4-multi-reviewer--multi-agent-validation" title="Direct link to 4-multi-reviewer--multi-agent-validation" translate="no">​</a></h3>
<p>Run the same resume through multiple agent configurations:</p>
<ul>
<li class="">Agent A (Conservative): Strict standards</li>
<li class="">Agent B (Balanced): Standard criteria</li>
<li class="">Agent C (Optimistic): Values potential more</li>
</ul>
<p>If scores diverge by &gt;10 points, flag for human review.</p>
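<p>This divergence check can be sketched in a few lines; the configuration names and score values below are illustrative, not output from my actual agents:</p>

```python
# Hypothetical multi-agent validation: run the same resume through
# several agent configurations and flag wide divergence for human review.

DIVERGENCE_THRESHOLD = 10  # points

def needs_human_review(scores, threshold=DIVERGENCE_THRESHOLD):
    """Return True when agent scores diverge by more than the threshold."""
    values = list(scores.values())
    return max(values) - min(values) > threshold

scores = {"conservative": 71, "balanced": 78, "optimistic": 84}
needs_human_review(scores)  # 84 - 71 = 13 > 10, so flag for review
```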
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-reference-benchmarks--example-resume-library">5. <strong>Reference Benchmarks → Example Resume Library</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#5-reference-benchmarks--example-resume-library" class="hash-link" aria-label="Direct link to 5-reference-benchmarks--example-resume-library" title="Direct link to 5-reference-benchmarks--example-resume-library" translate="no">​</a></h3>
<p>I created <code>reference-resumes.md</code> with 4 calibration examples:</p>
<div class="language-markdown codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-markdown codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">Example A: 92/100 (Strong Recommend)</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Profile: Deep kernel expert with 6 years experience</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Key strengths: Led performance optimization team, 15+ merged patches to Linux kernel</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Why 92: Exceeds requirements on technical depth and leadership</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">Example B: 76/100 (Recommend with Reservations)</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" 
style="color:#393A34">-</span><span class="token plain"> Profile: Strong K8s background but weaker kernel depth</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Key strengths: Scaled infrastructure to 10K nodes</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Why 76: Meets most requirements but lacks low-level systems experience</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">Example C: 58/100 (Maybe)</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Profile: Application developer transitioning to infrastructure</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Key strengths: Fast learner, some container experience</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Why 58: Has potential but significant skill gaps</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span 
class="token plain"></span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">Example D: 32/100 (Do Not Recommend)</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Profile: IT support with no programming experience</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Why 32: Fundamental skill mismatch</span><br></span></code></pre></div></div>
<p>These anchors help the agent understand what different score ranges represent.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="6-feedback-loops--miss-tracking">6. <strong>Feedback Loops → Miss Tracking</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#6-feedback-loops--miss-tracking" class="hash-link" aria-label="Direct link to 6-feedback-loops--miss-tracking" title="Direct link to 6-feedback-loops--miss-tracking" translate="no">​</a></h3>
<p>After interviews, I log outcomes:</p>
<div class="language-markdown codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-markdown codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">Interview Outcome Log</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Candidate: John Doe</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Resume Score: 82/100</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Interview Result: Rejected (failed system design)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Root Cause Analysis:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Agent scored Kubernetes experience as 17/20</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Candidate only knew basic kubectl commands</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Discrepancy: Agent 
gave "knows kubectl" 8/10, should be 3-4/10</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Action: Updated rubric with concrete skill level definitions</span><br></span></code></pre></div></div>
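<p>Mechanically, the miss-tracking loop boils down to comparing the agent's per-category scores against post-interview calibrated scores and surfacing the gaps. A minimal sketch, with hypothetical field names and data:</p>

```python
# Hypothetical miss-tracking pass: compare the agent's category scores
# against post-interview calibrated scores to find rubric gaps.

def rubric_gaps(outcomes, threshold=3):
    """Yield (candidate, category, predicted, actual) wherever the agent
    over- or under-scored a category by more than `threshold` points."""
    for o in outcomes:
        for category, predicted in o["agent_scores"].items():
            actual = o["calibrated_scores"].get(category)
            if actual is not None and abs(predicted - actual) > threshold:
                yield (o["candidate"], category, predicted, actual)

outcomes = [{
    "candidate": "John Doe",
    "agent_scores": {"kubernetes": 17, "kernel": 12},
    "calibrated_scores": {"kubernetes": 7, "kernel": 11},
}]
list(rubric_gaps(outcomes))  # flags the 17-vs-7 Kubernetes gap
```

Each flagged gap becomes a candidate rubric update, like the kubectl skill-level definitions above.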
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-results-before-vs-after">The Results: Before vs. After<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#the-results-before-vs-after" class="hash-link" aria-label="Direct link to The Results: Before vs. After" title="Direct link to The Results: Before vs. After" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="original-agent">Original Agent<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#original-agent" class="hash-link" aria-label="Direct link to Original Agent" title="Direct link to Original Agent" translate="no">​</a></h3>
<ul>
<li class=""><strong>Prompt:</strong> Generic instructions ("Evaluate candidate fit for this role")</li>
<li class=""><strong>Scoring:</strong> Subjective match levels (Strong/Partial/Weak)</li>
<li class=""><strong>Output:</strong> Narrative assessment</li>
<li class=""><strong>Consistency:</strong> ±10 point variance on same resume</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="improved-agent">Improved Agent<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#improved-agent" class="hash-link" aria-label="Direct link to Improved Agent" title="Direct link to Improved Agent" translate="no">​</a></h3>
<ul>
<li class=""><strong>Prompt:</strong> Explicit workflow with forced rubric reading</li>
<li class=""><strong>Scoring:</strong> Quantified formulas across 7 categories (Education, Experience, Kernel/Linux, Programming, Kubernetes, Domain Experience, Leadership)</li>
<li class=""><strong>Output:</strong> Tabular breakdown with evidence citations</li>
<li class=""><strong>Consistency:</strong> ±4 point variance (60% improvement)</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="quantified-impact">Quantified Impact<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#quantified-impact" class="hash-link" aria-label="Direct link to Quantified Impact" title="Direct link to Quantified Impact" translate="no">​</a></h3>
<table><thead><tr><th>Metric</th><th>Before</th><th>After</th><th>Improvement</th></tr></thead><tbody><tr><td>Score Variance</td><td>±10 pts</td><td>±4 pts</td><td>60% reduction</td></tr><tr><td>Explainability</td><td>Narrative</td><td>Table + Evidence</td><td>Qualitative leap</td></tr><tr><td>Calibration</td><td>None</td><td>4 reference cases</td><td>N/A</td></tr><tr><td>Conservative Scoring</td><td>No</td><td>Yes (round down when uncertain)</td><td>Fewer false positives</td></tr></tbody></table>
<blockquote>
<p><strong>Key insight:</strong> The variance didn't disappear—it's inherent to AI. But by applying HR principles, I made it predictable and manageable.</p>
</blockquote>
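<p>For completeness, here is how I think about measuring that "±X points" figure: score the same resume repeatedly and take the half-range. The <code>score_resume</code> stub below is a stand-in for the real agent call, not part of my system:</p>

```python
# Hypothetical variance measurement: score the same resume N times and
# report the half-range (the "±X points" figure used in the text).
import random

def score_resume(resume):
    # Placeholder for the real agent call; real scores vary run to run.
    return 76 + random.randint(-4, 4)

def half_range(resume, trials=10):
    scores = [score_resume(resume) for _ in range(trials)]
    return (max(scores) - min(scores)) / 2

half_range("resume.txt")  # e.g. 4.0, reported as "plus/minus 4 points"
```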
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-production-safety-question">The Production Safety Question<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#the-production-safety-question" class="hash-link" aria-label="Direct link to The Production Safety Question" title="Direct link to The Production Safety Question" translate="no">​</a></h2>
<p>This raises a critical question for any business team deploying agents: <strong>If AI agent behavior is unpredictable, is it dangerous to use them in production environments where mistakes could cause irreversible damage?</strong></p>
<p>Wrong changes to production systems can cascade into outages, data loss, or security breaches. We need additional safeguards.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="making-ai-agents-production-safe">Making AI Agents Production-Safe<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#making-ai-agents-production-safe" class="hash-link" aria-label="Direct link to Making AI Agents Production-Safe" title="Direct link to Making AI Agents Production-Safe" translate="no">​</a></h3>
<p>The same core principles apply, plus extra guardrails:</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-same-core-principles">1. <strong>Same Core Principles</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#1-same-core-principles" class="hash-link" aria-label="Direct link to 1-same-core-principles" title="Direct link to 1-same-core-principles" translate="no">​</a></h4>
<ul>
<li class=""><strong>Process decomposition:</strong> Separate deterministic work (via Skills/MCP using programmatic checks) from AI decisions</li>
<li class=""><strong>Validation and rollback steps:</strong> Every change must have a validation phase and rollback plan</li>
<li class=""><strong>Error tracking:</strong> Log past mistakes in agent prompts to prevent repetition</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-additional-production-safeguards">2. <strong>Additional Production Safeguards</strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#2-additional-production-safeguards" class="hash-link" aria-label="Direct link to 2-additional-production-safeguards" title="Direct link to 2-additional-production-safeguards" translate="no">​</a></h4>
<p><strong>a) AI Plans, Humans Execute</strong></p>
<p>The AI agent generates implementation plans and tests them in staging, but humans approve and execute production changes.</p>
<div class="language-markdown codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-markdown codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">Example Workflow:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Agent: "Here's the migration plan for upgrading the system..."</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">[Agent runs plan in test cluster]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Agent: "Tests passed. Here's the production runbook."</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Human: [Reviews, approves, executes production change]</span><br></span></code></pre></div></div>
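<p>The approval gate in that workflow can be sketched as a simple precondition check; everything here (field names, the <code>execute_plan</code> function) is illustrative:</p>

```python
# Hypothetical approval gate: the agent may draft and stage a plan, but
# production execution requires passing staging tests plus explicit
# human sign-off.

def execute_plan(plan, approved_by=None):
    if plan.get("staging_tests") != "passed":
        return "blocked: staging tests have not passed"
    if not approved_by:
        return "blocked: awaiting human approval"
    # ... apply the change to production here ...
    return f"executed (approved by {approved_by})"

plan = {"change": "upgrade system", "staging_tests": "passed"}
execute_plan(plan, approved_by=None)      # blocked: awaiting human approval
execute_plan(plan, approved_by="austin")  # executed (approved by austin)
```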
<p><strong>b) Deterministic Code Generation</strong></p>
<p>For programmatic tasks with clear specs, use AI for "vibe coding":</p>
<ol>
<li class="">AI writes code</li>
<li class="">Tests validate correctness</li>
<li class="">Code review ensures quality</li>
<li class="">Code executes deterministically in production</li>
</ol>
<p>The uncertainty is in the creative phase (writing), not the execution phase (running).</p>
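<p>The test gate in that flow might look like this sketch, where <code>gate</code> and the sample cases are hypothetical:</p>

```python
# Hypothetical gate for AI-generated code: the snippet only ships if its
# tests pass; afterward, the reviewed code executes deterministically.

def gate(generated_fn, test_cases):
    """Return the function for deployment only if all test cases pass."""
    for args, expected in test_cases:
        if generated_fn(*args) != expected:
            raise ValueError(f"test failed for input {args}")
    return generated_fn

# Suppose the AI wrote this months-to-years helper:
generated = lambda months: months / 12
deployable = gate(generated, [((24,), 2.0), ((18,), 1.5)])
deployable(30)  # 2.5, and the same answer on every run
```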
<p><strong>c) Leverage AI's Documentation Advantage</strong></p>
<p>AI excels at documentation. Use it to:</p>
<ul>
<li class="">Generate detailed runbooks for every operation</li>
<li class="">Automatically maintain changelogs</li>
<li class="">Create post-mortems from incident data</li>
</ul>
<p>This preserves institutional knowledge and enables auditability.</p>
<p><strong>d) Keep Humans in the Loop</strong></p>
<p>AI is co-pilot, not autopilot:</p>
<ul>
<li class=""><strong>Human-driven:</strong> Humans initiate and guide workflows</li>
<li class=""><strong>Critical step approvals:</strong> Dangerous operations require human confirmation</li>
<li class=""><strong>Human review:</strong> Humans review documentation and provide feedback</li>
<li class=""><strong>Continuous improvement:</strong> Feedback loops from human oversight improve agent behavior</li>
</ul>
<p><strong>Key safety properties:</strong></p>
<ul>
<li class="">Deterministic checks (compatibility, health) use programmatic tools</li>
<li class="">AI handles planning and documentation (non-destructive)</li>
<li class="">Humans approve each phase transition</li>
<li class="">Automatic rollback on health check failure</li>
</ul>
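<p>The last two properties fit a single wrapper pattern: apply, verify, and revert on failure. A minimal sketch, assuming the apply/check/rollback hooks are supplied by deterministic tooling:</p>

```python
# Hypothetical change wrapper: apply a change, run a programmatic health
# check, and roll back automatically when the check fails.

def apply_with_rollback(apply, health_check, rollback):
    apply()
    if health_check():
        return "change kept"
    rollback()
    return "rolled back: health check failed"

state = {"replicas": 3}
result = apply_with_rollback(
    apply=lambda: state.update(replicas=5),
    health_check=lambda: state["replicas"] <= 4,  # fails after the change
    rollback=lambda: state.update(replicas=3),
)
# result == "rolled back: health check failed"; state is restored to 3 replicas
```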
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="best-practices-for-ai-agent-development">Best Practices for AI Agent Development<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#best-practices-for-ai-agent-development" class="hash-link" aria-label="Direct link to Best Practices for AI Agent Development" title="Direct link to Best Practices for AI Agent Development" translate="no">​</a></h2>
<p>Based on this experience, here are my guidelines for building reliable AI agents:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="for-non-destructive-tasks-eg-resume-screening">For Non-Destructive Tasks (e.g., Resume Screening)<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#for-non-destructive-tasks-eg-resume-screening" class="hash-link" aria-label="Direct link to For Non-Destructive Tasks (e.g., Resume Screening)" title="Direct link to For Non-Destructive Tasks (e.g., Resume Screening)" translate="no">​</a></h3>
<p><strong>1. Quantify Everything</strong></p>
<ul>
<li class="">Replace subjective assessments with numerical scores</li>
<li class="">Provide explicit formulas and thresholds</li>
<li class="">Cite evidence for every claim</li>
</ul>
<p><strong>2. Provide Reference Anchors</strong></p>
<ul>
<li class="">Include calibration examples in prompts</li>
<li class="">Define what "excellent" vs. "poor" looks like</li>
<li class="">Force agents to compare outputs against benchmarks</li>
</ul>
<p><strong>3. Enforce Consistency Checks</strong></p>
<ul>
<li class="">Build calibration steps into workflows</li>
<li class="">Flag large deviations for review</li>
<li class="">Track variance over time</li>
</ul>
<p><strong>4. Default to Conservative</strong></p>
<ul>
<li class="">When uncertain, round scores down</li>
<li class="">Prefer false negatives over false positives</li>
<li class="">Explicitly acknowledge uncertainty</li>
</ul>
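<p>Guidelines 1-4 can be combined into a small scoring harness. The rubric below is hypothetical (the weights, thresholds, and anchor values are invented for illustration), but it shows the pattern: explicit formulas, calibration anchors, and conservative rounding:</p>

```python
import math

# Hypothetical rubric: explicit weights and a pass threshold replace
# subjective judgments like "strong candidate".
WEIGHTS = {"experience": 0.4, "skills": 0.4, "communication": 0.2}
PASS_THRESHOLD = 7.0

# Calibration anchors: reference resumes the agent must score first.
# If its scores drift beyond the tolerance, the whole run is flagged.
ANCHORS = {"excellent_resume": 9.0, "poor_resume": 3.0}

def overall_score(subscores):
    """Weighted sum of 0-10 subscores, rounded DOWN (conservative default)."""
    raw = sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)
    return math.floor(raw * 10) / 10  # when uncertain, round down

def run_is_calibrated(anchor_scores, tolerance=1.0):
    """Flag runs whose anchor scores deviate too far from the references."""
    return all(abs(anchor_scores[k] - v) <= tolerance
               for k, v in ANCHORS.items())
```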
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="for-destructiveproduction-impacting-tasks">For Destructive/Production-Impacting Tasks<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#for-destructiveproduction-impacting-tasks" class="hash-link" aria-label="Direct link to For Destructive/Production-Impacting Tasks" title="Direct link to For Destructive/Production-Impacting Tasks" translate="no">​</a></h3>
<p><strong>5. Separate Planning from Execution</strong></p>
<ul>
<li class="">AI generates plans and tests them</li>
<li class="">Humans approve and execute critical changes</li>
<li class="">Clear separation of concerns</li>
</ul>
<p><strong>6. Automate Validation</strong></p>
<ul>
<li class="">Every change has programmatic health checks</li>
<li class="">Automatic rollback on failure</li>
<li class="">No AI decision is final without verification</li>
</ul>
<p><strong>7. Document Everything</strong></p>
<ul>
<li class="">AI generates runbooks for every operation</li>
<li class="">Changelogs are automatic and detailed</li>
<li class="">Post-mortems include AI decision reasoning</li>
</ul>
<p><strong>8. Human Oversight is Mandatory</strong></p>
<ul>
<li class="">Dangerous operations require explicit approval</li>
<li class="">Humans review all AI-generated documentation</li>
<li class="">Feedback loops: humans correct AI mistakes</li>
</ul>
<p><strong>9. Fail-Safe Defaults</strong></p>
<ul>
<li class="">AI defaults to no-op when uncertain</li>
<li class="">Irreversible operations require explicit confirmation</li>
<li class="">Graceful degradation: fall back to manual processes</li>
</ul>
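<p>Guidelines 5-9 share one structural idea: the AI emits a <em>plan</em> as pure data, and a separate, human-triggered executor decides whether it may run. A hypothetical Python sketch (the names and fields are invented for illustration):</p>

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    description: str
    command: str
    reversible: bool = True   # irreversible steps demand explicit confirmation

@dataclass
class Plan:
    steps: list = field(default_factory=list)
    approved_by: str = ""     # set by a human reviewer, never by the AI

    def validate(self):
        """Programmatic checks the AI cannot talk its way past."""
        problems = []
        if not self.steps:
            problems.append("empty plan (fail-safe: nothing to execute)")
        for step in self.steps:
            if not step.reversible and not self.approved_by:
                problems.append(f"irreversible step not approved: {step.description}")
        return problems

    def executable(self):
        # Only an approved plan with no validation problems reaches the
        # executor; the default for everything else is a no-op.
        return bool(self.approved_by) and not self.validate()
```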
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-call-for-community-input">A Call for Community Input<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#a-call-for-community-input" class="hash-link" aria-label="Direct link to A Call for Community Input" title="Direct link to A Call for Community Input" translate="no">​</a></h2>
<p>These are initial best practices based on one use case (resume screening) and considerations for production systems. I'm sharing them to start a conversation, not to claim they're complete.</p>
<blockquote>
<p><strong>These are starting points, not final answers.</strong></p>
</blockquote>
<p><strong>I invite you to contribute your experience:</strong></p>
<ul>
<li class="">Have you built AI agents with consistency challenges?</li>
<li class="">What techniques have you used to ensure reliability?</li>
<li class="">What additional safeguards should we consider for production environments?</li>
<li class="">Where do you see these principles falling short?</li>
</ul>
<p>The field of AI agent reliability is still emerging. The more we share concrete experiences—both successes and failures—the faster we'll develop robust practices.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="discussion">Discussion<a href="https://austinxyz.github.io/blogs/blog/2025/12/25/taming-ai-agent-uncertainty#discussion" class="hash-link" aria-label="Direct link to Discussion" title="Direct link to Discussion" translate="no">​</a></h2>
<p>What challenges have you faced with AI agent consistency? Share your experiences and recommendations in the comments. Let's build a knowledge base together.</p>
<hr>
<p><em>If you found this useful, consider sharing it with your team. The conversation around AI reliability benefits from diverse perspectives across different domains and use cases.</em></p>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="agents" term="agents"/>
        <category label="machine-learning" term="machine-learning"/>
        <category label="best-practices" term="best-practices"/>
        <category label="reliability" term="reliability"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[[6/6] Claude Code: Conclusion and Future Outlook]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion</id>
        <link href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion"/>
        <updated>2025-12-20T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Final thoughts on Claude Code and the future of AI-assisted software development.]]></summary>
        <content type="html"><![CDATA[<p>This is part 6 of a 6-part series on Claude Code.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="6-conclusion-and-future-outlook">6. Conclusion and Future Outlook<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#6-conclusion-and-future-outlook" class="hash-link" aria-label="Direct link to 6. Conclusion and Future Outlook" title="Direct link to 6. Conclusion and Future Outlook" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="61-finance-project-development-data-summary">6.1 Finance Project Development Data Summary<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#61-finance-project-development-data-summary" class="hash-link" aria-label="Direct link to 6.1 Finance Project Development Data Summary" title="Direct link to 6.1 Finance Project Development Data Summary" translate="no">​</a></h3>
<p><strong>Project Basic Info</strong>:</p>
<ul>
<li class=""><strong>Start Time</strong>: November 2025</li>
<li class=""><strong>Development Mode</strong>: 100% Claude Code assisted</li>
<li class=""><strong>Code Repository</strong>: <a href="https://github.com/austinxyz/finance" target="_blank" rel="noopener noreferrer" class="">https://github.com/austinxyz/finance</a></li>
</ul>
<p><strong>Code Scale</strong>:</p>
<ul>
<li class=""><strong>Backend Code</strong>: 15,748 lines (Java)</li>
<li class=""><strong>Frontend Code</strong>: 25,131 lines (Vue/JavaScript/TypeScript, 33 components)</li>
<li class=""><strong>Database</strong>: 25 tables, 11 migration scripts</li>
<li class=""><strong>Total Code</strong>: Approximately 41,000 lines</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="development-efficiency-improvement">Development Efficiency Improvement<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#development-efficiency-improvement" class="hash-link" aria-label="Direct link to Development Efficiency Improvement" title="Direct link to Development Efficiency Improvement" translate="no">​</a></h4>
<p><strong>Total Time Investment</strong> (estimate): Approximately 20 hours net development time</p>
<p><strong>Time Distribution</strong> (see previous chapters):</p>
<ul>
<li class="">Requirements refinement and architecture design: ~30%</li>
<li class="">Rapid iterative development (15-20 min/small feature): ~40%</li>
<li class="">Testing and experience optimization: ~20%</li>
<li class="">Documentation and deployment: ~10%</li>
</ul>
<p><strong>Key Insights</strong>:</p>
<ul>
<li class="">✅ Repetitive work sees the greatest efficiency improvement (CRUD, testing, doc sync)</li>
<li class="">✅ Cross-layer implementation is highly efficient (database + backend + frontend completed in one pass)</li>
<li class="">⚠️ Complex business logic needs multiple iterations (like the annual summary algorithm)</li>
<li class="">🎯 <strong>Overall efficiency improvement of approximately 4-5x</strong></li>
</ul>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="62-core-advantages-of-claude-code">6.2 Core Advantages of Claude Code<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#62-core-advantages-of-claude-code" class="hash-link" aria-label="Direct link to 6.2 Core Advantages of Claude Code" title="Direct link to 6.2 Core Advantages of Claude Code" translate="no">​</a></h3>
<p>Based on my experience with the Finance project, here are Claude Code's three core advantages:</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-lowering-programming-barriers">1. Lowering Programming Barriers<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#1-lowering-programming-barriers" class="hash-link" aria-label="Direct link to 1. Lowering Programming Barriers" title="Direct link to 1. Lowering Programming Barriers" translate="no">​</a></h4>
<p><strong>Real Experience</strong>:</p>
<ul>
<li class="">Reduced learning curve for unfamiliar technologies</li>
<li class="">Stored procedures: previously unfamiliar territory; Claude helped me quickly implement complex aggregation logic</li>
<li class="">Complex Vue pages: for multi-level component interaction and state management, Claude generated an initial version that I then optimized</li>
<li class="">Mobile optimization: Claude applied Tailwind best practices for responsive design details</li>
</ul>
<p><strong>Conclusion</strong>:</p>
<blockquote>
<p><strong>AI makes tech-stack learning curves gentler, turning "need months to learn" into "learn while doing"</strong></p>
</blockquote>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-accelerating-iteration-speed">2. Accelerating Iteration Speed<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#2-accelerating-iteration-speed" class="hash-link" aria-label="Direct link to 2. Accelerating Iteration Speed" title="Direct link to 2. Accelerating Iteration Speed" translate="no">​</a></h4>
<p><strong>Key Data</strong>: Finance project iteration rhythm</p>
<ul>
<li class="">Average iteration cycle: 1-3 days per major feature module (asset &amp; liability management, expense management)</li>
<li class="">Typical single-feature development: 15-20 minutes (small, rapid steps)</li>
<li class="">Continuous refactoring: no fixed sprint boundaries, just ongoing improvement</li>
</ul>
<p><strong>Value of Rapid Iteration</strong>:</p>
<ul>
<li class="">✅ Quickly validate ideas (avoid wasting time heading in the wrong direction)</li>
<li class="">✅ Maintain development enthusiasm (visible progress makes you less likely to give up)</li>
<li class="">✅ Discover problems early (architectural defects surface early rather than late in the project)</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-knowledge-inheritance--standardization">3. Knowledge Inheritance &amp; Standardization<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#3-knowledge-inheritance--standardization" class="hash-link" aria-label="Direct link to 3. Knowledge Inheritance &amp; Standardization" title="Direct link to 3. Knowledge Inheritance &amp; Standardization" translate="no">​</a></h4>
<p><strong>CLAUDE.md as "Living Documentation"</strong></p>
<p>Problems in traditional development:</p>
<ul>
<li class="">Onboarding: new members rely on veterans to pass on project conventions verbally</li>
<li class="">Pausing and restarting a project: the original design decisions have been forgotten</li>
<li class="">Inconsistent code styles: everyone has their own habits</li>
</ul>
<p><strong>Claude Code Solution</strong>:</p>
<ul>
<li class="">✅ CLAUDE.md forces the AI to follow standards (high consistency)</li>
<li class="">✅ New sessions auto-load the standards (no human memorization needed)</li>
<li class="">✅ Standards as code (version-controlled, reviewable)</li>
</ul>
<p><strong>Conclusion</strong>:</p>
<blockquote>
<p><strong>CLAUDE.md + Claude Code = the project's "permanent memory"</strong>: even if humans forget, the AI can help restore context</p>
</blockquote>
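<p>What does such "permanent memory" look like in practice? A minimal, hypothetical CLAUDE.md excerpt (the conventions below are invented for illustration; the Finance project's actual file lives in its repository):</p>

```markdown
# CLAUDE.md

## Tech Stack
- Backend: Spring Boot (Java), MySQL with versioned migration scripts
- Frontend: Vue 3 + Tailwind CSS

## Conventions
- Service layer returns DTOs, never entities
- Every money column carries a currency code
- Prefer the Stream API over verbose loops

## Workflow
- Plan first for any change touching more than 3 files
- Run unit tests before committing; update docs in the same change
```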
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="63-current-limitations">6.3 Current Limitations<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#63-current-limitations" class="hash-link" aria-label="Direct link to 6.3 Current Limitations" title="Direct link to 6.3 Current Limitations" translate="no">​</a></h3>
<p>Despite Claude Code's power, I encountered some limitations in actual use (detailed scenarios and solutions appear in previous chapters):</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-context-management-challenges">1. Context Management Challenges<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#1-context-management-challenges" class="hash-link" aria-label="Direct link to 1. Context Management Challenges" title="Direct link to 1. Context Management Challenges" translate="no">​</a></h4>
<ul>
<li class=""><strong>Problem</strong>: The 200K-token window is easily exhausted during complex feature development; after 10-15 rounds of dialogue, Claude starts "forgetting" earlier content</li>
<li class=""><strong>Response</strong>: Reset regularly with <code>/clear + /catchup</code>, split large tasks, and keep CLAUDE.md small</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-unpredictable-code-quality">2. Unpredictable Code Quality<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#2-unpredictable-code-quality" class="hash-link" aria-label="Direct link to 2. Unpredictable Code Quality" title="Direct link to 2. Unpredictable Code Quality" translate="no">​</a></h4>
<ul>
<li class=""><strong>Problem</strong>: The same requirements can yield code of vastly different quality (verbose loops vs. the elegant Stream API, for example)</li>
<li class=""><strong>Response</strong>: State code-style preferences explicitly in CLAUDE.md; when quality is poor, simply request regeneration</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-limited-understanding-of-complex-business-logic">3. Limited Understanding of Complex Business Logic<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#3-limited-understanding-of-complex-business-logic" class="hash-link" aria-label="Direct link to 3. Limited Understanding of Complex Business Logic" title="Direct link to 3. Limited Understanding of Complex Business Logic" translate="no">​</a></h4>
<ul>
<li class=""><strong>Problem</strong>: AI excels at technical implementation but doesn't understand business meaning, so key business rules need human review</li>
<li class=""><strong>Conclusion</strong>: AI is an excellent coding assistant but not a domain expert; complex business logic still needs human design</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-impact-of-project-scale">4. Impact of Project Scale<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#4-impact-of-project-scale" class="hash-link" aria-label="Direct link to 4. Impact of Project Scale" title="Direct link to 4. Impact of Project Scale" translate="no">​</a></h4>
<ul>
<li class=""><strong>Observation</strong>: At approximately 40,000 lines of code, the Finance project is well within Claude Code's comfort zone</li>
<li class=""><strong>Recommendation</strong>: Medium-sized projects (100K-1M lines) need a carefully designed CLAUDE.md to maintain global understanding</li>
</ul>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="64-recommendations-for-readers">6.4 Recommendations for Readers<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#64-recommendations-for-readers" class="hash-link" aria-label="Direct link to 6.4 Recommendations for Readers" title="Direct link to 6.4 Recommendations for Readers" translate="no">​</a></h3>
<p>Based on practical experience with the Finance project, I offer the following recommendations for different roles:</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="for-individual-developers">For Individual Developers<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#for-individual-developers" class="hash-link" aria-label="Direct link to For Individual Developers" title="Direct link to For Individual Developers" translate="no">​</a></h4>
<p><strong>Start using Claude Code immediately, but:</strong></p>
<ol>
<li class="">✅ <strong>Start with small projects</strong> - Don't jump straight into work projects; practice in side projects first</li>
<li class="">✅ <strong>Build CLAUDE.md</strong> - Create from day one, record every pitfall</li>
<li class="">✅ <strong>Use Planning Mode</strong> - Large features (&gt;3 files) must plan first</li>
<li class="">⚠️ <strong>Maintain skepticism</strong> - Code review AI output, don't blindly trust</li>
<li class="">🎯 <strong>Regularly clean context</strong> - <code>/clear + /catchup</code> is your friend</li>
</ol>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="for-technical-teams">For Technical Teams<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#for-technical-teams" class="hash-link" aria-label="Direct link to For Technical Teams" title="Direct link to For Technical Teams" translate="no">​</a></h4>
<p><strong>Evaluate first, then introduce cautiously. I recommend:</strong></p>
<ol>
<li class="">✅ <strong>Pilot Projects</strong> - Choose 1-2 non-core projects for 3-month trial</li>
<li class="">✅ <strong>Establish Standards</strong> - Unified CLAUDE.md is essential</li>
<li class="">✅ <strong>Build Skills Library</strong> - Encapsulate team common operations (deployment, testing, code checking)</li>
<li class="">⚠️ <strong>Security Review</strong> - Setup Hooks to enforce code review and testing</li>
<li class="">🎯 <strong>Training &amp; Sharing</strong> - Regularly share AI usage experience and pitfall records</li>
</ol>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="for-learners">For Learners<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#for-learners" class="hash-link" aria-label="Direct link to For Learners" title="Direct link to For Learners" translate="no">​</a></h4>
<p><strong>Use AI as a learning tool, not a replacement:</strong></p>
<ol>
<li class="">✅ <strong>Learning Path</strong> - Manually write basic code (CRUD) first, then let AI assist with advanced features</li>
<li class="">✅ <strong>Comparative Learning</strong> - Look at AI's implementation and compare it with your own (to learn new techniques)</li>
<li class="">✅ <strong>Understanding First</strong> - When you encounter code you don't understand, have AI explain it (rather than copying it directly)</li>
<li class="">⚠️ <strong>Avoid Dependency</strong> - Hand-write at least one algorithm weekly (maintain basic skills)</li>
<li class="">🎯 <strong>Project-Driven</strong> - Use AI for real projects (rather than just tutorial exercises)</li>
</ol>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="for-enterprise-decision-makers">For Enterprise Decision-Makers<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#for-enterprise-decision-makers" class="hash-link" aria-label="Direct link to For Enterprise Decision-Makers" title="Direct link to For Enterprise Decision-Makers" translate="no">​</a></h4>
<p><strong>AI-assisted programming is the trend, but it needs systematic introduction:</strong></p>
<ol>
<li class="">✅ <strong>ROI Assessment</strong> - Measure actual efficiency improvement after the pilot (don't rely on vendor promotion alone)</li>
<li class="">✅ <strong>Security &amp; Compliance</strong> - Consult legal and security teams (code ownership, data privacy)</li>
<li class="">✅ <strong>Training Investment</strong> - AI tools aren't "buy and use"; teams need time to learn</li>
<li class="">⚠️ <strong>Progressive Migration</strong> - Moving from pilot to full rollout takes 6-12 months</li>
<li class="">🎯 <strong>Cultural Transformation</strong> - Encourage experimentation and sharing (rather than penalizing AI-usage mistakes)</li>
</ol>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="closing-remarks">Closing Remarks<a href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion#closing-remarks" class="hash-link" aria-label="Direct link to Closing Remarks" title="Direct link to Closing Remarks" translate="no">​</a></h2>
<p>The Finance project transformed from an idea into a fully featured, deployable full-stack application, with Claude Code playing the role of "super assistant". It isn't magic (it can't replace human thinking and decision-making), but it greatly amplified an individual developer's capabilities.</p>
<p><strong>My greatest gain wasn't completing this project, but experiencing the shift in programming paradigm</strong>:</p>
<p>From "I write code" to "I design, AI implements, I review"; from "solitary coder" to "architect collaborating with AI."</p>
<p>As Martin Fowler said:</p>
<blockquote>
<p>"AI won't replace programmers, but will redefine what makes an 'excellent programmer.' Future excellent programmers won't be those who write code fastest, but those best at leveraging AI, best at designing systems, and best at controlling quality."</p>
</blockquote>
<p><strong>This article is just the beginning.</strong> As Claude Code continues evolving and my understanding of AI-assisted programming deepens, I'll continue updating this practical guide.</p>
<p>If you're also exploring AI-assisted programming, feel free to exchange ideas with me (GitHub Issues or email). Let's witness this era's transformation together.</p>
<hr>
<p><strong>Acknowledgments</strong>:</p>
<ul>
<li class="">Anthropic team (creators of Claude Code)</li>
<li class="">Shrivu Shankar (his in-depth experience articles inspired my CLAUDE.md optimization)</li>
<li class="">Martin Fowler (his interviews helped me understand AI's profound impact on software engineering)</li>
</ul>
<hr>
<p><strong>Appendix</strong>:</p>
<ul>
<li class=""><a href="https://github.com/austinxyz/finance" target="_blank" rel="noopener noreferrer" class="">Project GitHub Repository</a> (includes complete CLAUDE.md and Skills)</li>
<li class=""><a href="https://youtu.be/CQmI4XKTa0U" target="_blank" rel="noopener noreferrer" class="">Martin Fowler Interview: AI's Impact on Software Engineering</a></li>
<li class=""><a href="https://blog.sshh.io/p/how-i-use-every-claude-code-feature" target="_blank" rel="noopener noreferrer" class="">Shrivu Shankar's Claude Code Practical Experience</a></li>
<li class=""><a href="https://docs.claude.ai/docs/claude-code" target="_blank" rel="noopener noreferrer" class="">Claude Code Official Documentation</a></li>
<li class=""><a href="https://github.com/austinxyz/finance/tree/master/requirement/%E9%9C%80%E6%B1%82%E8%AF%B4%E6%98%8E.md" target="_blank" rel="noopener noreferrer" class="">Finance Project Requirements Document</a></li>
</ul>
<hr>
<p><strong>Version History</strong>:</p>
<ul>
<li class="">v1.0 (2025-12-19): Initial draft completed</li>
</ul>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="claude-code" term="claude-code"/>
        <category label="future-of-development" term="future-of-development"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[[5/6] Claude Code: Use Cases and Limitations]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations</id>
        <link href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations"/>
        <updated>2025-12-19T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Real-world use cases for Claude Code and understanding its current limitations.]]></summary>
        <content type="html"><![CDATA[<p>This is part 5 of a 6-part series on Claude Code.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-use-cases-and-limitations-of-claude-code">5. Use Cases and Limitations of Claude Code<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#5-use-cases-and-limitations-of-claude-code" class="hash-link" aria-label="Direct link to 5. Use Cases and Limitations of Claude Code" title="Direct link to 5. Use Cases and Limitations of Claude Code" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="51-best-use-cases">5.1 Best Use Cases<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#51-best-use-cases" class="hash-link" aria-label="Direct link to 5.1 Best Use Cases" title="Direct link to 5.1 Best Use Cases" translate="no">​</a></h3>
<p>Based on the Finance project experience, Claude Code excels in these scenarios:</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-rapid-prototyping-of-new-projects">1. Rapid Prototyping of New Projects<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#1-rapid-prototyping-of-new-projects" class="hash-link" aria-label="Direct link to 1. Rapid Prototyping of New Projects" title="Direct link to 1. Rapid Prototyping of New Projects" translate="no">​</a></h4>
<p><strong>Typical Scenarios</strong>:</p>
<ul>
<li class="">Converting existing workflows (like Excel/Spreadsheet) to Web applications</li>
<li class="">Quickly validating product idea feasibility</li>
<li class="">Exploring new development modes (like Vibe Coding)</li>
</ul>
<p><strong>Finance Project Validation</strong>:</p>
<ul>
<li class=""><strong>Background</strong>: Migrating spreadsheet-based financial management to a web app</li>
<li class=""><strong>Tech Stack</strong>: Reuse of an existing project's tech stack (Spring Boot + Vue 3)</li>
<li class=""><strong>Development Mode</strong>: First attempt at Vibe Coding (a completely new development experience)</li>
<li class="">Zero to runnable MVP: <strong>2 days</strong> (the traditional approach needs 1-2 weeks)</li>
<li class="">Basic feature completeness: 85%</li>
<li class="">Code quality: good enough to enter the iteration phase directly</li>
</ul>
<p><strong>Key Success Factors</strong>:</p>
<ul>
<li class="">The tech stack is mainstream (Spring Boot, Vue 3)</li>
<li class="">Clear requirements (based on the existing spreadsheet workflow)</li>
<li class="">Vibe Coding mode significantly improves development efficiency</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-feature-extension-of-existing-projects">2. Feature Extension of Existing Projects<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#2-feature-extension-of-existing-projects" class="hash-link" aria-label="Direct link to 2. Feature Extension of Existing Projects" title="Direct link to 2. Feature Extension of Existing Projects" translate="no">​</a></h4>
<p><strong>Typical Scenarios</strong>:</p>
<ul>
<li class="">Adding reporting module to existing CRM system</li>
<li class="">Adding comment feature to blog system</li>
<li class="">Extending API to support new data formats</li>
</ul>
<p><strong>Finance Project Validation</strong>:</p>
<p>Adding "expense budget management" feature:</p>
<ul>
<li class="">Involves: 2 new tables, 4 APIs, 3 frontend pages</li>
<li class="">Traditional estimate: 3-5 days</li>
<li class="">Actual time: <strong>8 hours</strong> (Claude Code assisted)</li>
<li class="">Code reuse rate: 70% (referencing the existing expense module)</li>
</ul>
<p><strong>Claude Code Advantages</strong>:</p>
<ul>
<li class="">✅ Understands existing code style (via CLAUDE.md)</li>
<li class="">✅ Reuses existing components (Service layer, DTO pattern)</li>
<li class="">✅ Maintains consistency (naming, comments, tests)</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-systematic-refactoring">3. Systematic Refactoring<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#3-systematic-refactoring" class="hash-link" aria-label="Direct link to 3. Systematic Refactoring" title="Direct link to 3. Systematic Refactoring" translate="no">​</a></h4>
<p><strong>Typical Scenarios</strong>:</p>
<ul>
<li class="">Applying new UI design guidelines to existing pages</li>
<li class="">Cross-module feature enhancement (like multi-currency support)</li>
<li class="">Unified UX optimization (like mobile adaptation)</li>
</ul>
<p><strong>Finance Project Validation</strong>:</p>
<p><strong>Task 1</strong>: Refactor the asset &amp; liability analysis page to follow the expense-analysis UI design guidelines</p>
<ul>
<li class="">Involves: Style and layout refactoring of 6 Vue components</li>
<li class="">Traditional: 2-3 days (requires repeated comparison and adjustment)</li>
<li class="">Claude Code: <strong>6 hours</strong></li>
<li class="">Style consistency: 95%+</li>
</ul>
<p><strong>Task 2</strong>: System-wide multi-currency support</p>
<ul>
<li class="">Involves: Database migration, backend Service layer refactoring, frontend component updates</li>
<li class="">Traditional: 1 week (cross-layer changes are easy to miss)</li>
<li class="">Claude Code: <strong>2 days</strong></li>
<li class="">AI auto-identifies all locations needing modification</li>
</ul>
<p><strong>Task 3</strong>: Optimize all UI for mobile experience</p>
<ul>
<li class="">Involves: Responsive layout adjustment for 33 Vue components</li>
<li class="">Traditional: 3-4 days</li>
<li class="">Claude Code: <strong>1 day</strong></li>
<li class="">Uniformly apply Tailwind responsive classes</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-documentation-and-test-writing">4. Documentation and Test Writing<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#4-documentation-and-test-writing" class="hash-link" aria-label="Direct link to 4. Documentation and Test Writing" title="Direct link to 4. Documentation and Test Writing" translate="no">​</a></h4>
<p><strong>Typical Scenarios</strong>:</p>
<ul>
<li class="">Writing unit tests for complex business logic</li>
<li class="">Generating and syncing project documentation</li>
<li class="">Updating design docs and database docs after iterations</li>
</ul>
<p><strong>Finance Project Validation</strong>:</p>
<table><thead><tr><th>Task</th><th>Traditional Time</th><th>AI Time</th><th>Quality Assessment</th></tr></thead><tbody><tr><td>Unit tests (30 Services)</td><td>8 hours</td><td>1.5 hours</td><td>85%+ coverage</td></tr><tr><td>Requirements doc sync (post-iteration)</td><td>3 hours</td><td>30 minutes</td><td>Needs manual review</td></tr><tr><td>Design doc update</td><td>2 hours</td><td>20 minutes</td><td>Accurately reflects changes</td></tr><tr><td>Database doc sync</td><td>2 hours</td><td>15 minutes</td><td>Complete and accurate</td></tr><tr><td>Architecture diagrams (Mermaid)</td><td>1 hour</td><td>5 minutes</td><td>Clear and accurate</td></tr></tbody></table>
<p><strong>Greatest Value</strong>:</p>
<ul>
<li class=""><strong>Test Writing</strong>: AI-generated test cases cover more edge cases</li>
<li class=""><strong>Doc Sync</strong>: After multiple iterations, docs drift far from the code; AI can quickly realign them</li>
<li class=""><strong>Efficiency Boost</strong>: Doc maintenance went from "most hated task" to "done in 10 minutes"</li>
</ul>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="52-unsuitable-or-need-caution-scenarios">5.2 Scenarios That Are Unsuitable or Require Caution<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#52-unsuitable-or-need-caution-scenarios" class="hash-link" aria-label="Direct link to 5.2 Scenarios That Are Unsuitable or Require Caution" title="Direct link to 5.2 Scenarios That Are Unsuitable or Require Caution" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-complex-business-logic-implementation">1. Complex Business Logic Implementation<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#1-complex-business-logic-implementation" class="hash-link" aria-label="Direct link to 1. Complex Business Logic Implementation" title="Direct link to 1. Complex Business Logic Implementation" translate="no">​</a></h4>
<p><strong>Case</strong>: Annual financial summary algorithm (an actual Finance case)</p>
<p><strong>Requirements</strong>: Complex aggregation calculations across accounts, currencies, and time periods (implemented as a stored procedure)</p>
<p><strong>Implementation Process</strong> (multiple iterations):</p>
<ol>
<li class="">
<p><strong>Version 1</strong>: AI generates basic aggregation logic</p>
<ul>
<li class="">❌ Didn't consider multi-currency conversion</li>
<li class="">❌ Poor performance (15s response time)</li>
</ul>
</li>
<li class="">
<p><strong>Version 2</strong>: Discussed optimization with AI</p>
<ul>
<li class="">✅ Added currency conversion logic</li>
<li class="">⚠️ Still has performance issues</li>
</ul>
</li>
<li class="">
<p><strong>Version 3</strong>: Human intervention in design</p>
<ul>
<li class="">✅ Refactored to batch operations</li>
<li class="">✅ Optimized to 0.8 seconds</li>
</ul>
</li>
</ol>
<p><strong>Conclusion</strong>:</p>
<ul>
<li class="">⚠️ AI can implement complex logic, but needs multiple iterations and human guidance</li>
<li class="">❌ Initial version often misses key edge cases (like multi-currency, performance optimization)</li>
<li class="">🎯 <strong>Strategy</strong>: Step-by-step implementation + continuous discussion + human review of critical logic</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-performance-critical-low-level-code">2. Performance-Critical Low-Level Code<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#2-performance-critical-low-level-code" class="hash-link" aria-label="Direct link to 2. Performance-Critical Low-Level Code" title="Direct link to 2. Performance-Critical Low-Level Code" translate="no">​</a></h4>
<p><strong>Case</strong>: Large-volume financial report generation</p>
<p><strong>Requirements</strong>: Generate a trend report covering 50+ accounts × 12 months × 5 years = 3,000 records, with a response-time requirement of under 1 second</p>
<p><strong>Claude Version 1</strong> (inadequate):</p>
<ul>
<li class="">Looped database queries (300 queries)</li>
<li class="">Response time: 15 seconds</li>
</ul>
<p><strong>Human Optimization</strong> (using stored procedure):</p>
<ul>
<li class="">One-time aggregation of all data</li>
<li class="">Response time: 0.8 seconds</li>
<li class=""><strong>Performance improvement: 18x</strong></li>
</ul>
<p><strong>Conclusion</strong>:</p>
<ul>
<li class="">❌ AI tends toward "good enough to work" implementations that ignore performance</li>
<li class="">⚠️ Performance-critical paths need human design and profiling</li>
<li class="">✅ AI can generate stored procedure code (given clear requirements)</li>
</ul>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="53-applicability-in-team-collaboration-scenarios">5.3 Applicability in Team Collaboration Scenarios<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#53-applicability-in-team-collaboration-scenarios" class="hash-link" aria-label="Direct link to 5.3 Applicability in Team Collaboration Scenarios" title="Direct link to 5.3 Applicability in Team Collaboration Scenarios" translate="no">​</a></h3>
<p>Finance is a personal project, but based on Shrivu's enterprise experience and Claude Code's characteristics, we can infer how this applies to team scenarios:</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="suitable-team-scenarios">Suitable Team Scenarios<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#suitable-team-scenarios" class="hash-link" aria-label="Direct link to Suitable Team Scenarios" title="Direct link to Suitable Team Scenarios" translate="no">​</a></h4>
<p><strong>1. Small Teams (2-5 people)</strong></p>
<ul>
<li class="">✅ Unified CLAUDE.md can serve as "code constitution"</li>
<li class="">✅ Skills can encapsulate team toolchain (deployment, testing, code checking)</li>
<li class="">✅ Hooks can enforce code standards (like pre-commit testing)</li>
</ul>
<p><strong>Example</strong>: A 3-person startup team's practice</p>
<ul>
<li class="">Maintain a 13KB CLAUDE.md (tech stack + coding standards)</li>
<li class="">5 shared Skills (deployment, database, API testing, doc generation, code formatting)</li>
<li class="">2 Hooks (tests must pass to commit, sensitive data check)</li>
</ul>
<p><strong>Results</strong>:</p>
<ul>
<li class="">New member onboarding: From 2 weeks → 3 days</li>
<li class="">Code style consistency: 95%+</li>
<li class="">Repetitive work reduction: 60%</li>
</ul>
<p><strong>2. Open Source Project Contribution</strong></p>
<ul>
<li class="">✅ CLAUDE.md can serve as contributor guide</li>
<li class="">✅ AI helps new contributors understand codebase</li>
<li class="">✅ AI generates PRs conforming to project standards</li>
</ul>
<p><strong>Recommended Practice</strong>: Explain the development workflow, common pitfalls, and code standards in CLAUDE.md so contributors can get started quickly</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="unsuitable-team-scenarios">Unsuitable Team Scenarios<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#unsuitable-team-scenarios" class="hash-link" aria-label="Direct link to Unsuitable Team Scenarios" title="Direct link to Unsuitable Team Scenarios" translate="no">​</a></h4>
<p><strong>1. Large Enterprises (100+ engineers)</strong></p>
<ul>
<li class="">❌ A single CLAUDE.md can't cover every team's specific needs</li>
<li class="">❌ Code review processes are complex (requiring multi-level approval)</li>
<li class="">⚠️ Security and compliance concerns (AI accessing sensitive code)</li>
</ul>
<p><strong>Solution</strong> (Shrivu's enterprise practice):</p>
<ul>
<li class="">Each product line maintains its own CLAUDE.md (inheriting company-level standards)</li>
<li class="">Use Claude Code GHA (GitHub Actions) rather than local CLI (better auditing)</li>
<li class="">Limit AI access scope (can only access specific codebases)</li>
</ul>
<p><strong>2. Highly Regulated Industries (Finance, Healthcare)</strong></p>
<ul>
<li class="">❌ AI-generated code needs strict compliance review</li>
<li class="">❌ Code ownership and liability issues (who's responsible for AI-written bugs?)</li>
<li class="">⚠️ Data privacy issues (code may contain sensitive info)</li>
</ul>
<p><strong>Recommendations</strong>:</p>
<ul>
<li class="">Only use AI in non-core, non-sensitive modules</li>
<li class="">All AI-generated code must have human review + legal/compliance approval</li>
<li class="">Use privately deployed models (like Claude for Enterprise)</li>
</ul>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="54-common-mistakes--efficiency-traps">5.4 Common Mistakes &amp; Efficiency Traps<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#54-common-mistakes--efficiency-traps" class="hash-link" aria-label="Direct link to 5.4 Common Mistakes &amp; Efficiency Traps" title="Direct link to 5.4 Common Mistakes &amp; Efficiency Traps" translate="no">​</a></h3>
<p>During Finance project development, I made some mistakes that significantly reduced Claude Code's efficiency. These lessons are worth sharing:</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-1-key-constraints-not-codified-in-claudemd-or-skills">Mistake 1: Key Constraints Not Codified in CLAUDE.md or Skills<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#mistake-1-key-constraints-not-codified-in-claudemd-or-skills" class="hash-link" aria-label="Direct link to Mistake 1: Key Constraints Not Codified in CLAUDE.md or Skills" title="Direct link to Mistake 1: Key Constraints Not Codified in CLAUDE.md or Skills" translate="no">​</a></h4>
<p><strong>Problem Manifestation</strong>:</p>
<ul>
<li class="">Claude keeps making the same mistakes (like using the wrong database connection method)</li>
<li class="">Every session requires re-emphasizing the same rules</li>
<li class="">Lots of tokens are wasted on correcting errors</li>
</ul>
<p><strong>Real Cases</strong>:</p>
<ul>
<li class="">In the first 3 sessions, Claude kept using <code>mvn</code> instead of <code>./mvnw</code></li>
<li class="">It repeatedly forgot to use <code>TimeService.getCurrentTimestamp()</code>, causing timezone issues</li>
<li class="">For database operations it kept writing raw SQL instead of using the <code>/mysql-exec</code> skill</li>
</ul>
<p><strong>Solution</strong>:</p>
<ul>
<li class="">✅ Write recurring rules into CLAUDE.md "Critical Rules"</li>
<li class="">✅ Encapsulate common operations as Skills (like <code>/setup-java</code>, <code>/mysql-exec</code>)</li>
<li class="">✅ Use Hooks to enforce key constraints</li>
</ul>
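<p>As an illustration, the recurring rules above could be captured in CLAUDE.md roughly like this (a hypothetical excerpt; only the <code>./mvnw</code>, <code>TimeService</code>, <code>/mysql-exec</code>, and <code>backend/.env</code> rules come from the cases in this post):</p>

```markdown
## Critical Rules

- ALWAYS build with `./mvnw`; NEVER call `mvn` directly
- ALWAYS get timestamps via `TimeService.getCurrentTimestamp()` (avoids timezone issues)
- ALWAYS run database operations through the `/mysql-exec` skill; NEVER hand-write raw SQL
- NEVER commit `backend/.env`
```

Short "ALWAYS/NEVER" lines like these are cheap in tokens and get re-read at the start of every session, which is what stops the same mistake from recurring.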
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-2-too-much-claudemd-content-causing-token-exhaustion">Mistake 2: Too Much CLAUDE.md Content Causing Token Exhaustion<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#mistake-2-too-much-claudemd-content-causing-token-exhaustion" class="hash-link" aria-label="Direct link to Mistake 2: Too Much CLAUDE.md Content Causing Token Exhaustion" title="Direct link to Mistake 2: Too Much CLAUDE.md Content Causing Token Exhaustion" translate="no">​</a></h4>
<p><strong>Problem Manifestation</strong>:</p>
<ul>
<li class="">Each session quickly hits the token limit</li>
<li class="">Frequent <code>/clear</code> restarts are needed</li>
<li class="">Summary generation takes too long (1-2 minutes)</li>
</ul>
<p><strong>Real Cases</strong>:</p>
<ul>
<li class="">The initial CLAUDE.md contained lots of detailed example code (20KB+)</li>
<li class="">Each read consumed 15K+ tokens</li>
<li class="">After 5-6 rounds of dialogue, the session needed restarting</li>
</ul>
<p><strong>Solution</strong>:</p>
<ul>
<li class="">✅ Delete example code, keep only rules and principles</li>
<li class="">✅ Move detailed docs to external files (like <code>docs/api-design.md</code>)</li>
<li class="">✅ Keep CLAUDE.md within 5-10KB</li>
<li class="">✅ Write concise rules in "ALWAYS/NEVER" format</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-3-too-large-requirement-steps-causing-frequent-database-changes">Mistake 3: Oversized Requirement Steps Causing Frequent Database Changes<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#mistake-3-too-large-requirement-steps-causing-frequent-database-changes" class="hash-link" aria-label="Direct link to Mistake 3: Oversized Requirement Steps Causing Frequent Database Changes" title="Direct link to Mistake 3: Oversized Requirement Steps Causing Frequent Database Changes" translate="no">​</a></h4>
<p><strong>Problem Manifestation</strong>:</p>
<ul>
<li class="">Frequent late-stage modifications to database table structures</li>
<li class="">Each modification cascades into massive frontend/backend changes</li>
<li class="">Data migration scripts grow increasingly complex</li>
</ul>
<p><strong>Real Cases</strong>:</p>
<ul>
<li class="">Version 1 of expense management didn't consider multi-currency; adding it later meant modifying 5 tables</li>
<li class="">This led to: 10+ APIs needing modification, 8 frontend components needing updates, and 3 stored procedures needing rewrites</li>
<li class="">Data migration script: 300+ lines of complex SQL (including data conversion and validation logic)</li>
<li class="">Another example: Claude initially designed too many tables and fields ("might be useful" redundant design)</li>
<li class="">The later cleanup (deleting 4 unused tables and 20+ redundant fields) took 2 days</li>
</ul>
<p><strong>Solution</strong>:</p>
<ul>
<li class="">✅ Use Planning Mode to fully design the data model up front</li>
<li class="">✅ Implement only one small feature at a time, and test fully before expanding</li>
<li class="">✅ Account for important fields (like currency) in version 1</li>
<li class="">✅ Reference the data-model design of existing modules</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-4-only-testing-functionality-not-reviewing-implementation-causing-technical-debt">Mistake 4: Testing Only Functionality Without Reviewing Implementation, Causing Technical Debt<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#mistake-4-only-testing-functionality-not-reviewing-implementation-causing-technical-debt" class="hash-link" aria-label="Direct link to Mistake 4: Testing Only Functionality Without Reviewing Implementation, Causing Technical Debt" title="Direct link to Mistake 4: Testing Only Functionality Without Reviewing Implementation, Causing Technical Debt" translate="no">​</a></h4>
<p><strong>Problem Manifestation</strong>:</p>
<ul>
<li class="">Features look the same on the surface, but the backend implementations are completely different</li>
<li class="">Late-stage refactoring uncovered lots of inconsistent code</li>
<li class="">The code is hard to maintain and extend</li>
</ul>
<p><strong>Real Cases</strong>:</p>
<ul>
<li class="">The asset analysis and expense analysis pages look similar</li>
<li class="">But on the backend, one uses a stored procedure, the other a Java Service layer</li>
<li class="">For data retrieval, one queries in real time, the other caches results</li>
<li class="">When refactoring, they couldn't be optimized uniformly</li>
</ul>
<p><strong>Solution</strong>:</p>
<ul>
<li class="">✅ After each feature is complete, briefly review the implementation code</li>
<li class="">✅ Check whether it follows existing architecture patterns</li>
<li class="">✅ Similar features should use similar implementation approaches</li>
<li class="">✅ State architecture decisions clearly in CLAUDE.md (like "aggregation uses stored procedures")</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-5-improper-gitignore-configuration-causing-sensitive-info-leaks">Mistake 5: Improper .gitignore Configuration Causing Sensitive Info Leaks<a href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations#mistake-5-improper-gitignore-configuration-causing-sensitive-info-leaks" class="hash-link" aria-label="Direct link to Mistake 5: Improper .gitignore Configuration Causing Sensitive Info Leaks" title="Direct link to Mistake 5: Improper .gitignore Configuration Causing Sensitive Info Leaks" translate="no">​</a></h4>
<p><strong>Problem Manifestation</strong>:</p>
<ul>
<li class="">Sensitive info (database passwords, API keys) was committed to Git</li>
<li class="">Claude exposed sensitive info in commit messages</li>
<li class="">Cleaning the history required <code>git filter-branch</code> (complex and dangerous)</li>
</ul>
<p><strong>Real Cases</strong>:</p>
<ul>
<li class="">Initially there was no <code>.gitignore</code>, so the <code>backend/.env</code> file got committed</li>
<li class="">The file contained the database password and email server credentials</li>
<li class="">A Claude-generated commit message read: "Add database config with password mysql123"</li>
<li class="">After discovery, the fix required deleting historical commits, regenerating keys, and force-pushing</li>
</ul>
<p><strong>Solution</strong>:</p>
<ul>
<li class="">✅ Configure <code>.gitignore</code> immediately at project initialization</li>
<li class="">✅ State clearly in CLAUDE.md: "NEVER commit backend/.env"</li>
<li class="">✅ Use a pre-commit hook to check for sensitive files</li>
<li class="">✅ Regularly review commit messages to avoid exposing sensitive info</li>
<li class="">✅ Use the <code>/git-commit-push</code> skill to auto-generate commit messages (which can be manually reviewed)</li>
</ul>
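<p>A minimal <code>.gitignore</code> along these lines would have prevented the leak (illustrative; entries other than <code>backend/.env</code> are typical assumptions for a Java + Vue project):</p>

```gitignore
# Secrets and local config: never commit these
backend/.env
*.env
*.pem

# Build output and dependencies
target/
node_modules/
dist/
```

Committing this file in the very first commit, before any credentials exist, is what makes the protection reliable.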
<p><strong>Core Lesson</strong>:</p>
<blockquote>
<p><strong>Claude Code's efficiency depends heavily on good project-management habits. Skimping on early preparation will be repaid double in later stages.</strong></p>
</blockquote>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="claude-code" term="claude-code"/>
        <category label="best-practices" term="best-practices"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[[4/6] Claude Code: Evolution of Development Methodology]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution</id>
        <link href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution"/>
        <updated>2025-12-18T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[How Claude Code is evolving software development methodologies and practices.]]></summary>
        <content type="html"><![CDATA[<p>This is part 4 of a 6-part series on Claude Code.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-evolution-of-software-development-methodologies-in-the-ai-era">4. Evolution of Software Development Methodologies in the AI Era<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#4-evolution-of-software-development-methodologies-in-the-ai-era" class="hash-link" aria-label="Direct link to 4. Evolution of Software Development Methodologies in the AI Era" title="Direct link to 4. Evolution of Software Development Methodologies in the AI Era" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="41-agile-development-from-sprints-to-prompts">4.1 Agile Development: From Sprints to Prompts<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#41-agile-development-from-sprints-to-prompts" class="hash-link" aria-label="Direct link to 4.1 Agile Development: From Sprints to Prompts" title="Direct link to 4.1 Agile Development: From Sprints to Prompts" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="challenges-of-traditional-agile">Challenges of Traditional Agile<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#challenges-of-traditional-agile" class="hash-link" aria-label="Direct link to Challenges of Traditional Agile" title="Direct link to Challenges of Traditional Agile" translate="no">​</a></h4>
<p><strong>Typical Scrum Process</strong>: Sprint Planning → Daily Standup → Development (2 weeks) → Sprint Review → Retrospective</p>
<p><strong>Problems for Personal Projects</strong>:</p>
<ul>
<li class="">Formal meeting processes are too heavy for individual developers</li>
<li class="">A 2-week Sprint is too long and unsuited to fragmented personal time</li>
<li class="">Difficult to maintain strict process discipline</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="ai-driven-micro-iteration-mode">AI-Driven "Micro-Iteration" Mode<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#ai-driven-micro-iteration-mode" class="hash-link" aria-label="Direct link to AI-Driven &quot;Micro-Iteration&quot; Mode" title="Direct link to AI-Driven &quot;Micro-Iteration&quot; Mode" translate="no">​</a></h4>
<p><strong>Finance Project's Actual Rhythm</strong>:</p>
<ul>
<li class=""><strong>Iteration Cycle</strong>: 1-3 days (not traditional 2 weeks)</li>
<li class=""><strong>Iteration Granularity</strong>: Single feature module</li>
</ul>
<p><strong>Typical Workflow</strong> (total 5.5 hours):</p>
<ol>
<li class=""><strong>Requirements Refinement</strong> - Define feature boundaries and user experience (30 minutes)</li>
<li class=""><strong>Architecture Design</strong> - Planning Mode design database and interfaces (30 minutes)</li>
<li class=""><strong>Rapid Iteration Loop</strong> (3.5 hours, multiple iterations):<!-- -->
<ul>
<li class="">Single development: Implement one small feature (15-20 minutes)</li>
<li class="">Immediate optimization: Testing + UI adjustment + Code Review (15-20 minutes)</li>
<li class="">Repeat cycle: Complete next small feature, optimize again</li>
<li class="">Continuous refactoring: Adjust architecture when issues found</li>
</ul>
</li>
<li class=""><strong>Integration &amp; Release</strong> - PR submission + Deployment + Regression testing + Documentation update (1 hour)</li>
</ol>
<p><strong>Key Differences</strong>:</p>
<ul>
<li class=""><strong>No Formal Meetings</strong> - Planning Mode and Git history replace traditional processes</li>
<li class=""><strong>Fast Feedback Loop</strong> - Develop today, test tonight</li>
<li class=""><strong>More Time Iterating</strong> - 60% time spent refining experience and code quality</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="shrivus-insight-shoot-and-forget">Shrivu's Insight: "Shoot and Forget"<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#shrivus-insight-shoot-and-forget" class="hash-link" aria-label="Direct link to Shrivu's Insight: &quot;Shoot and Forget&quot;" title="Direct link to Shrivu's Insight: &quot;Shoot and Forget&quot;" translate="no">​</a></h4>
<p>Quoting Shrivu Shankar's experience:</p>
<blockquote>
<p>"My goal is 'shoot and forget'—set the goal and context, let AI work autonomously, only validate the final PR."</p>
</blockquote>
<p><strong>Application in Finance Project</strong>:</p>
<p><strong>Inefficient Approach</strong> (step-by-step guidance):</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">You: Create ExpenseBudget entity</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Claude: [Generates code]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">You: Now add Repository</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Claude: [Generates code]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">...</span><br></span></code></pre></div></div>
<p>→ Every step requires human confirmation, which is inefficient</p>
<p><strong>Efficient Approach</strong> ("Shoot and Forget"):</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">You: Implement monthly budget settings feature for expense budget (reference expense-requirements.md section 3.1), including:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Backend: BudgetSettings related Entity, Repository, Service, Controller</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Frontend: BudgetSettingsForm.vue component</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Database: Budget settings table migration script</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Tests: Service layer unit tests</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Run tests after completion, fix automatically if tests fail.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Don't ask me every step, make technical decisions yourself.</span><br></span></code></pre></div></div>
<blockquote>
<p>Note: expense-requirements.md contains multiple feature modules; implement only one small feature at a time, not all at once</p>
</blockquote>
<p><strong>Key Learnings</strong>:</p>
<ul>
<li class="">✅ <strong>Authorize AI to Make Decisions</strong> - Under architecture design guidance, let AI autonomously implement details</li>
<li class="">✅ <strong>Provide Sufficient Context</strong> - Reference docs + constraints</li>
<li class="">✅ <strong>Small Feature Iterations</strong> - Not delivering all features at once, but completing module by module</li>
<li class="">⚠️ <strong>Set Acceptance Criteria</strong> - "Tests pass" + business logic correct</li>
</ul>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="42-test-driven-development-tdd-new-form">4.2 Test-Driven Development (TDD) New Form<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#42-test-driven-development-tdd-new-form" class="hash-link" aria-label="Direct link to 4.2 Test-Driven Development (TDD) New Form" title="Direct link to 4.2 Test-Driven Development (TDD) New Form" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="traditional-tdd-red-green-refactor">Traditional TDD: Red-Green-Refactor<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#traditional-tdd-red-green-refactor" class="hash-link" aria-label="Direct link to Traditional TDD: Red-Green-Refactor" title="Direct link to Traditional TDD: Red-Green-Refactor" translate="no">​</a></h4>
<p><strong>Classic Process</strong>: Write test (Red) → Write implementation (Green) → Refactor</p>
<p><strong>Challenges</strong>:</p>
<ul>
<li class="">Writing tests is tedious (especially with lots of mocks)</li>
<li class="">Tests also need changing during refactoring (double the work)</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="ai-assisted-contract-driven-development">AI-Assisted "Contract-Driven Development"<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#ai-assisted-contract-driven-development" class="hash-link" aria-label="Direct link to AI-Assisted &quot;Contract-Driven Development&quot;" title="Direct link to AI-Assisted &quot;Contract-Driven Development&quot;" translate="no">​</a></h4>
<p><strong>New Mode Process</strong>:</p>
<ol>
<li class=""><strong>Define Contract</strong> (API interface design) - Manual</li>
<li class=""><strong>Generate Tests</strong> - AI auto-generates test cases from interface</li>
<li class=""><strong>Implement Code</strong> - AI implements code that passes tests</li>
<li class=""><strong>Human Review</strong> - Check coverage and edge cases</li>
</ol>
<p><strong>Advantages</strong>:</p>
<ul>
<li class="">✅ Tests come first, but humans don't write them</li>
<li class="">✅ AI maintains both tests and implementation (updates synchronously during refactoring, more efficient)</li>
<li class="">✅ Clear contracts, reduced communication costs</li>
</ul>
<p><strong>Reality</strong>: Refactoring still requires test changes, but AI can do it quickly:</p>
<ul>
<li class="">Traditional: Manually modify implementation (30 minutes) + manually adjust tests (30 minutes) = 1 hour</li>
<li class="">AI-assisted: AI refactors the implementation + syncs test updates = 10 minutes</li>
</ul>
<p><strong>Real Case</strong>: When implementing <code>ExchangeRateService.batchConvert()</code>, Claude auto-generated 8 test cases (normal conversion, empty input, edge cases, exception handling, etc.), found and fixed a null-handling issue, and all tests passed. Total time: 20 minutes (traditional TDD would need 1-2 hours).</p>
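<p>To make "contract first, AI-generated tests" concrete, here is a minimal sketch; the <code>batchConvert</code> signature and behavior shown are assumptions for illustration, not the Finance project's actual API:</p>

```java
import java.math.BigDecimal;
import java.util.List;

// Sketch of a contract plus the kinds of edge cases AI-generated tests
// exercised: normal conversion, empty input, and a null-handling bug.
class BatchConvertSketch {
    // Hypothetical stand-in for ExchangeRateService.batchConvert
    static List<BigDecimal> batchConvert(List<BigDecimal> amounts, BigDecimal rate) {
        if (amounts == null) return List.of(); // the null case the generated tests flagged
        return amounts.stream().map(a -> a.multiply(rate)).toList();
    }

    public static void main(String[] args) {
        BigDecimal rate = new BigDecimal("7.25");
        // normal conversion: 1 unit at rate 7.25 yields 7.25
        if (batchConvert(List.of(BigDecimal.ONE), rate).get(0).compareTo(rate) != 0)
            throw new AssertionError("normal conversion");
        // empty and null inputs both yield an empty result
        if (!batchConvert(List.of(), rate).isEmpty()) throw new AssertionError("empty input");
        if (!batchConvert(null, rate).isEmpty()) throw new AssertionError("null input");
        System.out.println("contract checks passed"); // prints contract checks passed
    }
}
```

The point of the new mode is that the human only writes the contract (the signature and its expected behavior); the test cases and the implementation that satisfies them are both generated and kept in sync by AI.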
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="43-object-oriented-design-can-ai-understand-design-patterns">4.3 Object-Oriented Design: Can AI Understand Design Patterns?<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#43-object-oriented-design-can-ai-understand-design-patterns" class="hash-link" aria-label="Direct link to 4.3 Object-Oriented Design: Can AI Understand Design Patterns?" title="Direct link to 4.3 Object-Oriented Design: Can AI Understand Design Patterns?" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="application-of-design-patterns">Application of Design Patterns<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#application-of-design-patterns" class="hash-link" aria-label="Direct link to Application of Design Patterns" title="Direct link to Application of Design Patterns" translate="no">​</a></h4>
<p><strong>Practice Proves</strong>:</p>
<ul>
<li class="">✅ AI can correctly apply common design patterns (Strategy, Factory, Observer, etc.)</li>
<li class="">✅ AI can identify obvious anti-patterns (like God Classes and over-coupling)</li>
<li class="">✅ When a specific pattern is explicitly called for in the architecture design, AI implements it well</li>
</ul>
<p><strong>Recommended Practice</strong>:</p>
<ul>
<li class="">During Planning Mode architecture design, clearly state which design patterns to use</li>
<li class="">Provide clear interface definitions and responsibility division</li>
<li class="">AI will correctly apply design patterns and follow SOLID principles</li>
</ul>
<p><strong>Case</strong>: Exchange-rate conversion with the strategy pattern. The architecture design specified the strategy pattern to support multiple exchange-rate sources (fixed rate, API retrieval, manual input). Claude correctly implemented the strategy interface, multiple strategy implementation classes, and priority-based strategy selection logic, and it proactively used <code>Optional</code> and Spring dependency injection, in line with Java best practices.</p>
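<p>A minimal plain-Java sketch of that shape (class names, the priority convention, and the 7.25 fixed rate are illustrative assumptions, not the project's actual code; the real implementation wires strategies through Spring):</p>

```java
import java.math.BigDecimal;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Strategy interface: each exchange-rate source answers if it can
interface ExchangeRateStrategy {
    int priority(); // lower value = preferred source (assumed convention)
    Optional<BigDecimal> rate(String from, String to);
}

class ManualRateStrategy implements ExchangeRateStrategy {
    public int priority() { return 1; } // manual input overrides other sources
    public Optional<BigDecimal> rate(String from, String to) {
        return Optional.empty(); // no manual entry recorded in this sketch
    }
}

class FixedRateStrategy implements ExchangeRateStrategy {
    public int priority() { return 2; }
    public Optional<BigDecimal> rate(String from, String to) {
        return "USD".equals(from) && "CNY".equals(to)
                ? Optional.of(new BigDecimal("7.25")) : Optional.empty();
    }
}

class ExchangeRateResolver {
    private final List<ExchangeRateStrategy> strategies;

    ExchangeRateResolver(List<ExchangeRateStrategy> strategies) {
        // sort once so the first strategy that answers wins by priority
        this.strategies = strategies.stream()
                .sorted(Comparator.comparingInt(ExchangeRateStrategy::priority))
                .toList();
    }

    Optional<BigDecimal> resolve(String from, String to) {
        return strategies.stream()
                .map(s -> s.rate(from, to))
                .flatMap(Optional::stream) // drop sources with no answer
                .findFirst();
    }

    public static void main(String[] args) {
        ExchangeRateResolver resolver = new ExchangeRateResolver(
                List.of(new FixedRateStrategy(), new ManualRateStrategy()));
        System.out.println(resolver.resolve("USD", "CNY")); // prints Optional[7.25]
    }
}
```

Adding an API-backed source is then just one more <code>ExchangeRateStrategy</code> implementation; the resolver never changes, which is the extensibility the pattern was chosen for.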
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="44-code-review-new-balance-of-human-ai-collaboration">4.4 Code Review: New Balance of Human-AI Collaboration<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#44-code-review-new-balance-of-human-ai-collaboration" class="hash-link" aria-label="Direct link to 4.4 Code Review: New Balance of Human-AI Collaboration" title="Direct link to 4.4 Code Review: New Balance of Human-AI Collaboration" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="pain-points-of-traditional-code-review">Pain Points of Traditional Code Review<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#pain-points-of-traditional-code-review" class="hash-link" aria-label="Direct link to Pain Points of Traditional Code Review" title="Direct link to Pain Points of Traditional Code Review" translate="no">​</a></h4>
<p><strong>Typical Scenario</strong>: Submit PR → Wait for Reviewer (1-2 days) → Receive feedback → Modify → Wait again...</p>
<p><strong>Personal Project Dilemma</strong>:</p>
<ul>
<li class="">No reviewer (it's hard to spot issues reviewing your own code)</li>
<li class="">Easy to introduce bugs and technical debt</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="ai-as-first-round-reviewer">AI as First-Round Reviewer<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#ai-as-first-round-reviewer" class="hash-link" aria-label="Direct link to AI as First-Round Reviewer" title="Direct link to AI as First-Round Reviewer" translate="no">​</a></h4>
<p><strong>Workflow</strong>:</p>
<ol>
<li class=""><strong>Claude generates code + self-review</strong> - Check code standards, performance, security, test coverage</li>
<li class=""><strong>Claude submits improvement suggestions</strong> - Auto-fix found issues</li>
<li class=""><strong>Human final review</strong> - Quickly scan the business logic and check whether the AI's fixes are reasonable</li>
</ol>
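<p>As a minimal sketch, the first two steps can be wired into a small shell helper that pipes the current diff to Claude with a fixed checklist. This is illustrative only: the <code>claude -p</code> one-shot invocation and the checklist wording are assumptions, not the exact setup used in the project.</p>

```shell
#!/bin/sh
# Hypothetical first-round review helper (illustrative only).
# Assumes the Claude Code CLI supports `-p` for one-shot prompts.

build_review_prompt() {
  # $1 = base branch; composes the fixed review checklist.
  printf 'Review the diff against %s for: code standards, performance, security, test coverage. Fix what you can and list remaining issues.' "$1"
}

first_round_review() {
  base="${1:-main}"
  git diff "$base" | claude -p "$(build_review_prompt "$base")"
}

# Usage: first_round_review main   (the human final review still follows)
```

<p>The human pass then focuses on what the checklist cannot catch: business logic and design.</p>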
<p><strong>Effect Comparison</strong>:</p>
<table><thead><tr><th>Phase</th><th>Traditional</th><th>AI-Assisted</th></tr></thead><tbody><tr><td>First-round review</td><td>Human (1-2 days)</td><td>AI (5 minutes)</td></tr><tr><td>Common issue detection rate</td><td>70%</td><td>90%</td></tr><tr><td>Fix time</td><td>Human (1-2 hours)</td><td>AI (10 minutes)</td></tr><tr><td>Final quality</td><td>Good</td><td>Good</td></tr></tbody></table>
<p><strong>Key Insight</strong>:</p>
<blockquote>
<p><strong>AI excels at finding technical issues (performance, security, testing), humans excel at finding design issues (maintainability, extensibility, business understanding)</strong></p>
</blockquote>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="45-build--deploy-simplified-automation-processes">4.5 Build &amp; Deploy: Simplified Automation Processes<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#45-build--deploy-simplified-automation-processes" class="hash-link" aria-label="Direct link to 4.5 Build &amp; Deploy: Simplified Automation Processes" title="Direct link to 4.5 Build &amp; Deploy: Simplified Automation Processes" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="complexity-of-traditional-cicd">Complexity of Traditional CI/CD<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#complexity-of-traditional-cicd" class="hash-link" aria-label="Direct link to Complexity of Traditional CI/CD" title="Direct link to Complexity of Traditional CI/CD" translate="no">​</a></h4>
<p><strong>Personal Project Pain Points</strong>:</p>
<ul>
<li class="">Many commands and parameters to remember (Docker builds, Git commits, database migrations, etc.)</li>
<li class="">Too many config files, with high maintenance cost</li>
<li class="">Easy to forget a step (such as forgetting to update the deployment after pushing an image)</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="finance-project-automation-practices">Finance Project Automation Practices<a href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution#finance-project-automation-practices" class="hash-link" aria-label="Direct link to Finance Project Automation Practices" title="Direct link to Finance Project Automation Practices" translate="no">​</a></h4>
<p><strong>Core Philosophy</strong>: Encapsulate complex processes as simple Skills</p>
<p><strong>Common Skills</strong>:</p>
<ul>
<li class=""><code>/docker-build-push</code> - Auto-build multi-arch images (amd64/arm64) and push to Docker Hub</li>
<li class=""><code>/git-commit-push</code> - Auto stage, commit (AI generates message), push</li>
<li class=""><code>/mysql-exec</code> - Auto-load database credentials and execute SQL</li>
<li class=""><code>/setup-java</code> - Configure Java environment and load database credentials</li>
</ul>
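<p>To make this concrete, here is a minimal sketch of what a <code>/docker-build-push</code> style Skill might wrap. The repository name, tag scheme, and helper function are illustrative assumptions; only the <code>docker buildx</code> multi-arch flags follow the pattern described above.</p>

```shell
#!/bin/sh
# Illustrative /docker-build-push Skill body: one invocation replaces the
# manual build/tag/push sequence. Requires a configured buildx builder.
set -eu

image_ref() {
  # Compose "repository:tag" from two arguments.
  printf '%s:%s' "$1" "$2"
}

main() {
  repo="${1:?usage: docker-build-push <repo> [tag]}"   # e.g. austinxyz/finance (assumed name)
  tag="${2:-$(date +%Y%m%d)}"                          # default: date-based tag
  docker buildx build \
    --platform linux/amd64,linux/arm64 \
    --tag "$(image_ref "$repo" "$tag")" \
    --push .
}

# Invoke as: main austinxyz/finance v1.2
```

<p>The point of the wrapper is not the commands themselves but that the flags and defaults are decided once, not recalled every time.</p>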
<p><strong>Efficiency Comparison</strong>:</p>
<table><thead><tr><th>Task</th><th>Traditional</th><th>Using Skills</th></tr></thead><tbody><tr><td>Build+push image</td><td>5 minutes (multiple commands)</td><td>30 seconds (<code>/docker-build-push</code>)</td></tr><tr><td>Commit code</td><td>2 minutes (3-4 commands)</td><td>10 seconds (<code>/git-commit-push</code>)</td></tr><tr><td>Database migration</td><td>1 minute (look up credentials+execute)</td><td>5 seconds (<code>/mysql-exec</code>)</td></tr></tbody></table>
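<p>The <code>/git-commit-push</code> row can be sketched the same way: stage, let the AI draft the message, commit, push. The <code>claude -p</code> call and the 72-character subject helper are assumptions for illustration, not the project's actual Skill code.</p>

```shell
#!/bin/sh
# Illustrative /git-commit-push Skill body (hypothetical helper names).
set -eu

commit_subject() {
  # Keep the subject line within the conventional 72-character limit.
  printf '%s' "$1" | cut -c1-72
}

main() {
  git add -A
  # Let the AI draft a one-line message from the staged diff (assumed CLI usage).
  msg="$(git diff --cached | claude -p 'Write a one-line commit message for this diff')"
  git commit -m "$(commit_subject "$msg")"
  git push
}

# Invoke as: main
```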
<p><strong>Core Value</strong>:</p>
<ul>
<li class="">✅ <strong>Reduced Cognitive Load</strong> - No need to remember complex commands</li>
<li class="">✅ <strong>Fewer Errors</strong> - Credentials and environment config are handled automatically</li>
<li class="">✅ <strong>Improved Efficiency</strong> - Multi-step operations become one-click executions</li>
</ul>
<p><strong>Key Insight</strong>:</p>
<blockquote>
<p><strong>Good automation isn't about writing more scripts; it's about making common operations so simple that they require no thought</strong></p>
</blockquote>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="claude-code" term="claude-code"/>
        <category label="methodology" term="methodology"/>
        <category label="software-engineering" term="software-engineering"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[[3/6] Claude Code: Development Walkthrough]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough</id>
        <link href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough"/>
        <updated>2025-12-17T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A step-by-step walkthrough of using Claude Code for real-world development tasks.]]></summary>
        <content type="html"><![CDATA[<p>This is part 3 of a 6-part series on Claude Code.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-complete-development-walkthrough">3. Complete Development Walkthrough<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#3-complete-development-walkthrough" class="hash-link" aria-label="Direct link to 3. Complete Development Walkthrough" title="Direct link to 3. Complete Development Walkthrough" translate="no">​</a></h2>
<p>Using the Finance project as an example, this section demonstrates Claude Code's practical application across every stage of software development.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="finance-project-introduction">Finance Project Introduction<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#finance-project-introduction" class="hash-link" aria-label="Direct link to Finance Project Introduction" title="Direct link to Finance Project Introduction" translate="no">​</a></h3>
<p><strong>Project Positioning</strong>: Family financial management system for tracking assets, liabilities, expenses, and investments with multi-currency and multi-member support.</p>
<p><strong>Core Features</strong>:</p>
<ol>
<li class="">
<p><strong>Asset &amp; Liability Management</strong>:</p>
<ul>
<li class="">Multiple asset and liability types</li>
<li class="">Time-series data model: Preserves historical changes for each asset/liability</li>
<li class="">Basic CRUD operations and list displays</li>
</ul>
</li>
<li class="">
<p><strong>Expense Management</strong>:</p>
<ul>
<li class="">Hierarchical expense categories (major + subcategories)</li>
<li class="">Monthly batch entry with historical data reference</li>
<li class="">Three-level drill-down analysis (major category → subcategory → monthly trends)</li>
</ul>
</li>
<li class="">
<p><strong>Multi-Currency Support</strong>:</p>
<ul>
<li class="">Multiple mainstream currencies supported</li>
<li class="">Real-time exchange rate retrieval (integrated external API)</li>
<li class="">All amounts stored in original currency + USD equivalent</li>
</ul>
</li>
<li class="">
<p><strong>Data Visualization</strong>:</p>
<ul>
<li class="">Asset &amp; liability trend charts</li>
<li class="">Expense analysis with three-level drill-down</li>
<li class="">Net worth change curves</li>
</ul>
</li>
</ol>
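<p>The "original currency + USD equivalent" storage rule above can be sketched as a tiny conversion helper. The rates below are hard-coded purely for illustration; the real system retrieves live rates from an external API.</p>

```shell
#!/bin/sh
# Illustrative conversion: every amount is stored in its original currency
# plus a USD equivalent computed at entry time (rates hard-coded for the sketch).

to_usd() {
  # $1 = amount, $2 = currency code; prints the USD equivalent to 2 decimals.
  case "$2" in
    USD) rate=1.00 ;;
    EUR) rate=1.10 ;;   # illustrative rate, not a live quote
    CNY) rate=0.14 ;;   # illustrative rate, not a live quote
    *)   echo "unsupported currency: $2" >&2; return 1 ;;
  esac
  awk -v a="$1" -v r="$rate" 'BEGIN { printf "%.2f", a * r }'
}

# A 500 EUR expense row would store (500.00, EUR) alongside its USD equivalent.
to_usd 500 EUR
```

<p>Storing both values means aggregation queries can sum <code>amount_usd</code> directly without re-converting historical rows.</p>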
<p><strong>Technical Complexity</strong>:</p>
<ul>
<li class="">
<p><strong>Backend Complexity</strong>:</p>
<ul>
<li class="">Time-series data modeling (asset/liability historical record management)</li>
<li class="">Multi-currency conversion logic (stored procedures + Java Service layer division)</li>
<li class="">Cross-table relational queries (assets, liabilities, expenses, exchange rates)</li>
<li class="">RESTful API design (30+ endpoints)</li>
</ul>
</li>
<li class="">
<p><strong>Frontend Complexity</strong>:</p>
<ul>
<li class="">33 Vue components, complex form interactions</li>
<li class="">Various data visualizations (Chart.js integration)</li>
<li class="">Responsive design (mobile adaptation)</li>
</ul>
</li>
<li class="">
<p><strong>Database Complexity</strong>:</p>
<ul>
<li class="">25 tables, including time-series data, multi-currency, soft delete</li>
<li class="">Stored procedures for complex aggregation calculations</li>
<li class="">Data migration script management</li>
</ul>
</li>
</ul>
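<p>The time-series modeling mentioned above can be illustrated with a minimal DDL sketch: each asset keeps one row per period instead of being updated in place, so history is preserved. The table and column layout here is a hypothetical reconstruction (the <code>amount_usd</code>, <code>period</code>, and <code>is_deleted</code> fields are named later in this series; everything else is assumed).</p>

```shell
#!/bin/sh
# Emit a minimal, hypothetical DDL sketch of the time-series idea:
# one row per asset per period; soft delete instead of physical delete.

asset_record_ddl() {
  cat <<'SQL'
CREATE TABLE asset_record (
  id         BIGINT PRIMARY KEY AUTO_INCREMENT,
  asset_id   BIGINT        NOT NULL,            -- FK to the asset master row
  period     CHAR(7)       NOT NULL,            -- e.g. '2025-11'
  amount     DECIMAL(15,2) NOT NULL,            -- original-currency value
  amount_usd DECIMAL(15,2) NOT NULL,            -- USD equivalent at entry time
  is_deleted TINYINT(1)    NOT NULL DEFAULT 0,  -- soft delete flag
  UNIQUE KEY uk_asset_period (asset_id, period)
);
SQL
}

asset_record_ddl
```

<p>The unique key per (asset, period) is what turns updates into appends: a new month gets a new row, and trend charts become simple range queries.</p>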
<p><strong>Development Journey</strong>:</p>
<ul>
<li class=""><strong>Phase 1</strong>: Asset &amp; liability management (core features)</li>
<li class=""><strong>Phase 2</strong>: Asset &amp; liability analysis (trend charts, net worth calculation)</li>
<li class=""><strong>Phase 3</strong>: Expense management (categorization, entry, queries)</li>
<li class=""><strong>Phase 4</strong>: Expense analysis (three-level drill-down, budget features)</li>
<li class=""><strong>Phase 5</strong>: Investment management and analysis (completed)</li>
</ul>
<p>This <strong>progressive feature expansion</strong> is a best practice when using Claude Code: complete the core features first, validate that the architecture works, then gradually add new features.</p>
<p><strong>Application Interface</strong>:</p>
<p><img decoding="async" loading="lazy" alt="Finance Dashboard" src="https://austinxyz.github.io/blogs/assets/images/dashboard-screenshot-281f14c1cc89e48358173f6ef697e4cb.png" width="3456" height="1708" class="img_ev3q"></p>
<p><em>Dashboard interface showing asset &amp; liability overview, trend analysis, and multi-dimensional data visualization</em></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="31-initialization-phase-building-project-skeleton">3.1 Initialization Phase: Building Project Skeleton<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#31-initialization-phase-building-project-skeleton" class="hash-link" aria-label="Direct link to 3.1 Initialization Phase: Building Project Skeleton" title="Direct link to 3.1 Initialization Phase: Building Project Skeleton" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="traditional-approach-vs-claude-code">Traditional Approach vs. Claude Code<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#traditional-approach-vs-claude-code" class="hash-link" aria-label="Direct link to Traditional Approach vs. Claude Code" title="Direct link to Traditional Approach vs. Claude Code" translate="no">​</a></h4>
<p><strong>Traditional Approach</strong> (~2-4 hours):</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">1. Create Spring Boot project (Spring Initializr)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">2. Configure Maven/Gradle</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">3. Create package structure (controller/service/repository)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">4. Configure database connection (application.yml)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">5. Create Vue project (vue create)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">6. Configure routing, state management, API client</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">7. Setup Dockerfile and k8s configs</span><br></span></code></pre></div></div>
<p><strong>Claude Code Approach (Recommended)</strong> (20 minutes):</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">$ claude</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">You: Create a full-stack financial management system, referencing my previous match project (tennis match management) tech stack and directory structure.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">**First core feature**: Asset &amp; liability management</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Asset and liability type management</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Time-series data model: Record historical changes for each asset/liability</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Basic CRUD operations and list displays</span><br></span></code></pre></div></div>
<p><strong>Why This Approach</strong>:</p>
<ol>
<li class=""><strong>Reference an Existing Project</strong>: Pins the tech stack, preventing the AI from choosing inconsistent frameworks</li>
<li class=""><strong>Clear Directory Architecture</strong>: Ensures a clean, easily extensible project structure</li>
<li class=""><strong>Focus on Core Features</strong>: Build the most important feature (asset &amp; liability management) first to validate the architecture</li>
<li class=""><strong>Progressive Expansion</strong>: A clear feature iteration sequence (management → analysis → extension)</li>
</ol>
<p><strong>Real Results</strong>:</p>
<ul>
<li class="">✅ Generated complete directory structure</li>
<li class="">✅ Configured Spring Security, JPA, CORS</li>
<li class="">✅ Created basic entities (User, Family) and corresponding CRUD</li>
<li class="">✅ Frontend routing, Axios configuration, Tailwind theme</li>
<li class="">❌ Needed manual adjustment of database connection string (AI used defaults)</li>
</ul>
<p><strong>Key Learnings</strong>:</p>
<ol>
<li class=""><strong>Let AI generate 80%, manually refine 20%</strong> - In the initialization phase, AI can quickly scaffold a standard architecture</li>
<li class=""><strong>Provide a clear tech stack</strong> - Vague requirements lead the AI to choose outdated or mismatched technologies</li>
<li class=""><strong>Generate first, adjust later</strong> - Don't expect perfection on the first try; rapid iteration is more efficient</li>
</ol>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="optimization-practice-the-inception-of-claudemd">Optimization Practice: The Inception of CLAUDE.md<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#optimization-practice-the-inception-of-claudemd" class="hash-link" aria-label="Direct link to Optimization Practice: The Inception of CLAUDE.md" title="Direct link to Optimization Practice: The Inception of CLAUDE.md" translate="no">​</a></h4>
<p>After initialization, I created the first version of CLAUDE.md:</p>
<div class="language-markdown codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-markdown codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token title important punctuation" style="color:#393A34">#</span><span class="token title important"> Finance App - Claude Code Guide</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token title important punctuation" style="color:#393A34">##</span><span class="token title important"> Stack</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Java 17 + Spring Boot 3.2 + Vue 3 + MySQL 8.0</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token title important punctuation" style="color:#393A34">##</span><span class="token title important"> Critical Rules</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">ALWAYS use Maven wrapper</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain"> - </span><span class="token code-snippet code keyword" style="color:#00009f">`./mvnw`</span><span class="token plain"> instead of </span><span class="token code-snippet code keyword" style="color:#00009f">`mvn`</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">NEVER commit backend/.env</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain"> - Contains DB credentials</span><br></span></code></pre></div></div>
<p><strong>Why Write CLAUDE.md So Early?</strong></p>
<ul>
<li class="">✅ Prevent the AI from "forgetting" the tech stack in subsequent sessions</li>
<li class="">✅ Record problems encountered on day one (e.g., mvn vs ./mvnw)</li>
<li class="">✅ Lay the foundation for the later <code>/clear + /catchup</code> workflow</li>
</ul>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="32-requirements-phase-from-vague-ideas-to-precise-specs">3.2 Requirements Phase: From Vague Ideas to Precise Specs<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#32-requirements-phase-from-vague-ideas-to-precise-specs" class="hash-link" aria-label="Direct link to 3.2 Requirements Phase: From Vague Ideas to Precise Specs" title="Direct link to 3.2 Requirements Phase: From Vague Ideas to Precise Specs" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="my-requirements-workflow-prevent-ai-over-design">My Requirements Workflow (Prevent AI Over-Design)<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#my-requirements-workflow-prevent-ai-over-design" class="hash-link" aria-label="Direct link to My Requirements Workflow (Prevent AI Over-Design)" title="Direct link to My Requirements Workflow (Prevent AI Over-Design)" translate="no">​</a></h4>
<p><strong>Step 1: High-Level Requirements</strong></p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">You: I want to implement expense management functionality, don't implement yet, help me break down requirements.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">**Core requirements**:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Expense category management (major + subcategories)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Monthly batch entry</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Multi-currency support</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">**I particularly care about**:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Data entry efficiency (complete all subcategories in one form)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Historical data reference (show last 3 months)</span><br></span></code></pre></div></div>
<p><strong>Step 2: Review AI's Requirements Breakdown</strong></p>
<p>Claude returns a detailed requirements list that usually shows <strong>over-design</strong> tendencies:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">Claude: Here's the requirements breakdown:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">1. Expense category management</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   - Major category CRUD</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   - Subcategory CRUD</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   - Category hierarchy management (major → subcategory)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">2. Monthly entry</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   - Batch entry form</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   - Historical data display (last 3 months)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   - Auto-save draft ❌ (over-design)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   - Template function (copy last month) ❌ (not needed yet)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">3. Multi-currency</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   - 7 currency support</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   - Real-time exchange rate retrieval</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   - Exchange rate history ❌ (phase 2)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">4. Data validation</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   - Uniqueness constraints</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   - Amount range validation ❌ (unnecessary)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">...</span><br></span></code></pre></div></div>
<p><strong>My Response</strong>:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">You: Requirements 3.3 (exchange rate history), 2.2 (auto-save draft), 2.3 (template function), 4.2 (amount range validation) - don't do these yet.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Keep core features: category management, batch entry, multi-currency, basic validation.</span><br></span></code></pre></div></div>
<p><strong>Why This Approach</strong>:</p>
<ul>
<li class="">✅ <strong>Prevent Scope Creep</strong>: AI tends to add "useful but non-essential" features</li>
<li class="">✅ <strong>Rapid Iteration</strong>: Build the minimum viable product (MVP) first, then expand based on actual needs</li>
<li class="">✅ <strong>Reduce Complexity</strong>: Avoid packing the first version with so many features that implementation drags on</li>
</ul>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="33-design-phase-architecture-decisions--ai-boundaries">3.3 Design Phase: Architecture Decisions &amp; AI Boundaries<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#33-design-phase-architecture-decisions--ai-boundaries" class="hash-link" aria-label="Direct link to 3.3 Design Phase: Architecture Decisions &amp; AI Boundaries" title="Direct link to 3.3 Design Phase: Architecture Decisions &amp; AI Boundaries" translate="no">​</a></h3>
<p>Once the requirements are clear, we enter the design phase. This is a <strong>human-led, AI-assisted</strong> phase.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="planning-mode-systematic-design-process">Planning Mode: Systematic Design Process<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#planning-mode-systematic-design-process" class="hash-link" aria-label="Direct link to Planning Mode: Systematic Design Process" title="Direct link to Planning Mode: Systematic Design Process" translate="no">​</a></h4>
<p>For complex features (like expense management's three-level drill-down analysis), I use Planning Mode for systematic design:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">$ claude --plan</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">You: Implement three-level drill-down for expense analysis, provide detailed design including:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">1. **Data Model Design**</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">2. **Backend API Design**</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">3. **Frontend UI Mock**</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">4. **Test Case Design**</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">**Feature requirements**:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Pie chart showing major category proportions</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Click major category → subcategory pie chart for that category</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Click subcategory → monthly bar chart for that subcategory</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">**UI Reference**:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- First feature: Start simple, pure white background + basic charts</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">- Subsequent features: Reference implemented AssetAnalysisView.vue style</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  (ensure consistent user experience)</span><br></span></code></pre></div></div>
<p><strong>Claude's Generated Plan</strong> (plan.md):</p>
<div class="language-markdown codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-markdown codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token title important punctuation" style="color:#393A34">#</span><span class="token title important"> Expense Analysis Three-Level Drill-Down - Design Plan</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token title important punctuation" style="color:#393A34">##</span><span class="token title important"> 1. Data Model Design</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Use existing tables, no new fields needed</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Key field review (amount_usd, period, is_deleted)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token title important punctuation" style="color:#393A34">##</span><span class="token title important"> 2. Backend API Design</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> GET /api/expenses/analysis/category-summary</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> GET /api/expenses/analysis/subcategory/{categoryId}</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> GET /api/expenses/analysis/monthly/{subcategoryId}</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token title important punctuation" style="color:#393A34">##</span><span class="token title important"> 3. Frontend UI Mock</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Reference AssetAnalysisView.vue card layout</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Pie chart showing major category proportions</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Click to drill down to subcategory and monthly trends</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token title important punctuation" style="color:#393A34">##</span><span class="token title important"> 4. Test Cases</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Data accuracy validation</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Cross-currency aggregation tests</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Drill-down state management tests</span><br></span></code></pre></div></div>
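<p>Once implemented, the three endpoints can be walked level by level from the command line. In this sketch the base URL, the <code>period</code> query parameter, and the specific IDs are assumptions for illustration; the endpoint paths come from the plan itself.</p>

```shell
#!/bin/sh
# Exercising the three drill-down levels (base URL and IDs are assumptions).
BASE="${BASE:-http://localhost:8080/api/expenses/analysis}"

drill_url() {
  # Join the base URL with an endpoint path.
  printf '%s/%s' "$BASE" "$1"
}

walk_drilldown() {
  # Level 1: major-category proportions for one month
  curl -s "$(drill_url 'category-summary?period=2025-11')"
  # Level 2: subcategory breakdown for a chosen major category (id 3 here)
  curl -s "$(drill_url 'subcategory/3?period=2025-11')"
  # Level 3: monthly trend for a chosen subcategory (id 17 here)
  curl -s "$(drill_url 'monthly/17')"
}

# Run walk_drilldown once the backend is up.
```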
<p><strong>My Review Focus</strong>:</p>
<ol>
<li class=""><strong>Data Model</strong>: Carefully review each field, confirm necessity<!-- -->
<ul>
<li class="">✅ No new tables or fields needed</li>
<li class="">✅ Use existing amount_usd for aggregation</li>
</ul>
</li>
<li class=""><strong>UI Mock</strong>: Ensure consistency with existing interfaces<!-- -->
<ul>
<li class="">✅ Reference AssetAnalysisView.vue card style</li>
<li class="">✅ Use same color scheme and layout</li>
</ul>
</li>
<li class=""><strong>API Design</strong>: Check naming and parameters are reasonable<!-- -->
<ul>
<li class="">✅ RESTful style consistent</li>
<li class="">✅ Parameters concise and clear</li>
</ul>
</li>
</ol>
<p><strong>After Approval</strong>:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">You: Data model and API design are OK, start implementation.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Reference AssetAnalysisView.vue styles, maintain UI consistency.</span><br></span></code></pre></div></div>
<p><strong>Value of Planning Mode</strong>:</p>
<ul>
<li class="">✅ <strong>Data Model Review</strong>: Avoid later database migration costs</li>
<li class="">✅ <strong>UI Consistency</strong>: Reference existing components to keep the experience consistent</li>
<li class="">✅ <strong>Align Expectations</strong>: Confirm the approach before writing code</li>
<li class="">✅ <strong>Discover Gaps</strong>: AI might think of edge cases you didn't</li>
<li class="">⚠️ <strong>Don't Over-Plan</strong>: For simple features (fewer than 3 files), just do it</li>
</ul>
<hr>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="architecture-trade-offs-human-decisions--ai-provides-options">Architecture Trade-offs: Human Decisions + AI Provides Options<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#architecture-trade-offs-human-decisions--ai-provides-options" class="hash-link" aria-label="Direct link to Architecture Trade-offs: Human Decisions + AI Provides Options" title="Direct link to Architecture Trade-offs: Human Decisions + AI Provides Options" translate="no">​</a></h4>
<p>AI struggles to make good architecture trade-offs on its own. Here are my two practical methods:</p>
<p><strong>Method 1: Distill Architecture Principles from Existing System</strong></p>
<p>After the first feature stabilizes, I summarize architecture design principles and write them into CLAUDE.md:</p>
<div class="language-markdown codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-markdown codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token title important punctuation" style="color:#393A34">##</span><span class="token title important"> Architecture Principles</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">Time-Series Data</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain">:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Asset/Liability: NEVER update existing records, always INSERT new ones</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Reason: Historical tracking requirement</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">Multi-Currency</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain">:</span><br></span><span class="token-line" style="color:#393A34"><span 
class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Store original currency + converted USD amount</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Use ExchangeRateService for all conversions</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Reason: Audit trail + performance</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">...</span><br></span></code></pre></div></div>
<p><strong>Why Important</strong>:</p>
<ul>
<li class="">New features automatically follow established principles</li>
<li class="">Prevent AI from using different architectures for similar features</li>
<li class="">Reduce long-term refactoring costs</li>
</ul>
<p><strong>Method 2: Small Feature Experiments + Multi-Option Comparison</strong></p>
<p>For uncertain architecture decisions (like which layer for complex logic), I ask Claude for multiple options:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">You: Annual financial summary logic is complex (cross-table joins, currency conversion, time range filtering), give me 3 implementation approaches:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">1. MySQL stored procedure</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">2. Java Service layer calculation</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">3. Frontend real-time calculation</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">For each approach, analyze performance, maintainability, testing difficulty.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Don't implement yet, just compare approaches.</span><br></span></code></pre></div></div>
<p>Claude's response:</p>
<table><thead><tr><th>Approach</th><th>Performance</th><th>Maintainability</th><th>Testing Difficulty</th><th>Use Case</th></tr></thead><tbody><tr><td>Stored Procedure</td><td>⭐⭐⭐⭐⭐</td><td>⭐⭐</td><td>⭐⭐</td><td>Large data, frequent queries</td></tr><tr><td>Java Service</td><td>⭐⭐⭐</td><td>⭐⭐⭐⭐</td><td>⭐⭐⭐⭐</td><td>Complex logic, need unit tests</td></tr><tr><td>Frontend Calculation</td><td>⭐</td><td>⭐⭐⭐</td><td>⭐⭐⭐</td><td>Small data, real-time interaction</td></tr></tbody></table>
<p><strong>My Decision Process</strong>:</p>
<ol>
<li class="">Implement one small feature with each of approach 1 (stored procedure) and approach 2 (Java Service)</li>
<li class="">Compare actual results:<!-- -->
<ul>
<li class="">Stored procedure: Query time from 3s down to 0.5s</li>
<li class="">Java Service: High unit test coverage, but slow</li>
</ul>
</li>
<li class=""><strong>Final Approach</strong>: Hybrid<!-- -->
<ul>
<li class="">Cross-table aggregation → Stored procedure</li>
<li class="">Currency conversion + business logic → Java Service</li>
<li class="">Frontend only for display</li>
</ul>
</li>
</ol>
<p><strong>Codify Decision in CLAUDE.md</strong>:</p>
<div class="language-markdown codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-markdown codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">Complex Aggregation</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain">:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Use MySQL stored procedures for cross-table aggregation</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Java Service layer handles business logic + currency conversion</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> Frontend: presentation only</span><br></span></code></pre></div></div>
<p><strong>Why This Approach</strong>:</p>
<ul>
<li class="">✅ Small-feature experiments are cheap (1-2 hours)</li>
<li class="">✅ Real-world comparison is more reliable than theoretical analysis</li>
<li class="">✅ Once established, subsequent features stay consistent</li>
</ul>
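<p>The hybrid split can be sketched in plain Java. This is a minimal illustration, not the project's actual code: <code>CategoryTotal</code> and the rate map are hypothetical stand-ins for what the stored procedure returns and what <code>ExchangeRateService</code> provides.</p>

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;

// Sketch of the hybrid split: the stored procedure returns per-category
// totals in their original currency, and the Java service layer applies
// currency conversion plus business logic. CategoryTotal and the rate
// map are hypothetical stand-ins, not the project's actual types.
public class AnnualSummaryService {

    // One row as returned by the aggregation stored procedure.
    public record CategoryTotal(String category, String currency, BigDecimal amount) {}

    // Conversion stays in Java, where it is easy to unit-test;
    // usdRates maps a currency code to its USD exchange rate.
    public static BigDecimal totalInUsd(List<CategoryTotal> rows, Map<String, BigDecimal> usdRates) {
        return rows.stream()
                .map(r -> r.amount().multiply(usdRates.get(r.currency())))
                .reduce(BigDecimal.ZERO, BigDecimal::add);
    }

    public static void main(String[] args) {
        List<CategoryTotal> rows = List.of(
                new CategoryTotal("Housing", "USD", new BigDecimal("1200")),
                new CategoryTotal("Travel", "EUR", new BigDecimal("100")));
        Map<String, BigDecimal> rates = Map.of(
                "USD", BigDecimal.ONE,
                "EUR", new BigDecimal("1.10"));
        System.out.println(totalInUsd(rows, rates)); // prints 1310.00
    }
}
```

<p>The split keeps the expensive joins in the database while the currency logic remains unit-testable in plain Java.</p>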
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="34-implementation-phase-efficient-collaboration--automation">3.4 Implementation Phase: Efficient Collaboration &amp; Automation<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#34-implementation-phase-efficient-collaboration--automation" class="hash-link" aria-label="Direct link to 3.4 Implementation Phase: Efficient Collaboration &amp; Automation" title="Direct link to 3.4 Implementation Phase: Efficient Collaboration &amp; Automation" translate="no">​</a></h3>
<p>This is where Claude Code delivers the most value.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="plan-based-cross-layer-implementation">Plan-based Cross-Layer Implementation<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#plan-based-cross-layer-implementation" class="hash-link" aria-label="Direct link to Plan-based Cross-Layer Implementation" title="Direct link to Plan-based Cross-Layer Implementation" translate="no">​</a></h4>
<p>Since I've already produced a plan in the design phase (covering the data model, API, and UI mock), implementation necessarily involves cross-layer changes. Claude Code's advantages:</p>
<ul>
<li class="">✅ <strong>Code Standards</strong>: Generated code follows project conventions</li>
<li class="">✅ <strong>Cross-Layer Consistency</strong>: API return format matches frontend expectations</li>
<li class="">✅ <strong>Automatic Integration</strong>: Most cases run without manual adjustments</li>
</ul>
<p><strong>Task</strong>: Add "expense budget" feature (Plan design already complete)</p>
<p><strong>Plan Output</strong> (determined in design phase):</p>
<ul>
<li class="">Database: <code>expense_budgets</code> table</li>
<li class="">Backend: BudgetService, BudgetController</li>
<li class="">Frontend: BudgetManagementView.vue</li>
<li class="">API: GET/POST/PUT/DELETE /api/budgets</li>
</ul>
<p><strong>Implementation Order</strong> (my standard process):</p>
<ol>
<li class=""><strong>Database First</strong>: Create table and migration script</li>
<li class=""><strong>Backend Implementation</strong>: Service → Controller → Unit tests</li>
<li class=""><strong>Frontend Implementation</strong>: Component → API calls → Integration tests</li>
</ol>
<p><strong>Traditional Approach</strong>:</p>
<ol>
<li class="">Write backend (1-2 hours)</li>
<li class="">Test API (30 minutes)</li>
<li class="">Write frontend (1-2 hours)</li>
<li class="">Integration (30 minutes)</li>
<li class="">Write database script (30 minutes)</li>
</ol>
<p><strong>Claude Code Implementation Process</strong> (following my standard order):</p>
<p><strong>Step 1: Database</strong></p>
<ul>
<li class="">Create migration script, execute <code>/mysql-exec</code> to create table</li>
<li class="">✅ Database ready</li>
</ul>
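<p>A migration script for this step might look like the following sketch. The column set is hypothetical, inferred from the plan; the project's real <code>expense_budgets</code> schema may differ:</p>

```sql
-- Hypothetical migration: columns are illustrative, not the actual schema.
CREATE TABLE IF NOT EXISTS expense_budgets (
    id          BIGINT AUTO_INCREMENT PRIMARY KEY,
    category_id BIGINT         NOT NULL,
    year        INT            NOT NULL,
    month       INT            NOT NULL,
    amount_usd  DECIMAL(12, 2) NOT NULL,
    created_at  TIMESTAMP      NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE KEY uk_budget (category_id, year, month)
);
```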
<p><strong>Step 2: Backend Implementation + Tests</strong></p>
<ul>
<li class="">Generate Entity, Repository, Service, Controller</li>
<li class="">Write unit tests and run <code>./mvnw test</code></li>
<li class="">Auto-fix issues found in tests</li>
<li class="">✅ Tests passing</li>
</ul>
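<p>As a rough sketch of what Step 2 produces, here is the service contract reduced to framework-free Java. The real project would use Spring (Entity, Repository, Controller); the class and method names here are illustrative only:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Framework-free sketch of the BudgetService contract generated in this
// step. The real implementation would delegate to a Spring Data JPA
// repository; an in-memory list stands in for it here.
public class BudgetService {

    public record Budget(long id, String category, double amountUsd) {}

    private final List<Budget> store = new ArrayList<>();
    private long nextId = 1;

    public Budget create(String category, double amountUsd) {
        Budget b = new Budget(nextId++, category, amountUsd);
        store.add(b);
        return b;
    }

    public List<Budget> findAll() {
        return List.copyOf(store);          // defensive copy for callers
    }

    public Optional<Budget> findById(long id) {
        return store.stream().filter(b -> b.id() == id).findFirst();
    }

    public boolean delete(long id) {
        return store.removeIf(b -> b.id() == id);
    }
}
```

<p>Because the contract is this small, the unit tests written immediately afterwards can exercise every method before any frontend code exists.</p>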
<p><strong>Step 3: Frontend Implementation</strong></p>
<ul>
<li class="">Create Vue component, reference existing style</li>
<li class="">API calls automatically match backend format</li>
<li class="">✅ Frontend-backend integration successful</li>
</ul>
<p><strong>Why This Order Is Efficient</strong>:</p>
<ol>
<li class=""><strong>Database First</strong>: Avoids repeated table-structure changes in the backend</li>
<li class=""><strong>Backend Fully Tested</strong>: Ensures the logic is correct before frontend work starts</li>
<li class=""><strong>Frontend Direct Integration</strong>: With the API already validated, frontend implementation is fast</li>
</ol>
<p><strong>Key Observation</strong>:</p>
<ul>
<li class="">Claude will <strong>autonomously run tests</strong> and fix errors (prerequisite: project has tests)</li>
<li class="">Good cross-layer consistency: API return format automatically matches frontend expectations</li>
<li class="">Follows existing code style: the frontend references ExpenseManagementView.vue, so styling stays consistent</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-power-of-skills---automated-workflows">The Power of Skills - Automated Workflows<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#the-power-of-skills---automated-workflows" class="hash-link" aria-label="Direct link to The Power of Skills - Automated Workflows" title="Direct link to The Power of Skills - Automated Workflows" translate="no">​</a></h4>
<p><strong>Problem</strong>: Every database change requires manual operations (write the script, copy the password, execute it, check the result)</p>
<p><strong>Solution</strong>: Create a <code>/mysql-exec</code> Skill that automatically loads credentials and executes SQL</p>
<p><strong>Effect</strong>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain"># Before: 5 manual steps</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># Now: 1 command</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">$ /mysql-exec database/add_budget_table.sql</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">✓ Executed successfully</span><br></span></code></pre></div></div>
<p><strong>Skills vs. Manual Commands Comparison</strong>:</p>
<table><thead><tr><th>Scenario</th><th>Manual Commands</th><th>Skill</th></tr></thead><tbody><tr><td>Time Cost</td><td>2-3 minutes</td><td>10 seconds</td></tr><tr><td>Password Leak Risk</td><td>High (left in shell history)</td><td>Low (encapsulated)</td></tr><tr><td>Repeatability</td><td>Need to remember commands</td><td>Consistency guaranteed</td></tr><tr><td>AI Usability</td><td>Need guidance</td><td>Auto-invoked</td></tr></tbody></table>
<p><strong>My Other Skills</strong>:</p>
<ul>
<li class=""><code>/setup-java</code> - Configure Java 17 + load env variables (must be run at the start of every session)</li>
<li class=""><code>/git-commit-push</code> - Atomic stage, commit, push (follows Conventional Commits)</li>
<li class=""><code>/docker-build-push</code> - Multi-architecture image build (amd64/arm64)</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="claudemd-evolution-in-implementation-phase">CLAUDE.md Evolution in Implementation Phase<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#claudemd-evolution-in-implementation-phase" class="hash-link" aria-label="Direct link to CLAUDE.md Evolution in Implementation Phase" title="Direct link to CLAUDE.md Evolution in Implementation Phase" translate="no">​</a></h4>
<p>As development deepened, I added implementation-level constraints to CLAUDE.md:</p>
<div class="language-markdown codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-markdown codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token title important punctuation" style="color:#393A34">##</span><span class="token title important"> Backend Development</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">NEVER modify JPA entities without checking existing records</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">ALWAYS use TimeService.getCurrentTimestamp()</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token title important punctuation" style="color:#393A34">##</span><span class="token title important"> Frontend Development</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token 
bold punctuation" style="color:#393A34">**</span><span class="token bold content">ALWAYS use Composition API</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain"> - No Options API</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token bold content">NEVER hardcode colors</span><span class="token bold punctuation" style="color:#393A34">**</span><span class="token plain"> - Use CSS variables</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">...</span><br></span></code></pre></div></div>
<p><strong>Why These Rules Matter</strong>:</p>
<ul>
<li class=""><code>TimeService</code> rule: Prevented a serious bug (timezone inconsistency was corrupting data)</li>
<li class="">JPA rule: Prevents the AI from modifying entities in ways that cause data loss</li>
<li class="">Composition API: Maintains code-style consistency</li>
</ul>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="35-testing-phase-iterative-quality-assurance">3.5 Testing Phase: Iterative Quality Assurance<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#35-testing-phase-iterative-quality-assurance" class="hash-link" aria-label="Direct link to 3.5 Testing Phase: Iterative Quality Assurance" title="Direct link to 3.5 Testing Phase: Iterative Quality Assurance" translate="no">​</a></h3>
<p><strong>My Testing Workflow</strong>:</p>
<ol>
<li class="">
<p><strong>Unit Tests</strong>: Part of implementation phase</p>
<ul>
<li class="">After backend Service layer code generated, immediately generate unit tests</li>
<li class="">Run <code>./mvnw test</code> to verify logic correctness</li>
</ul>
</li>
<li class="">
<p><strong>Manual Usage Testing</strong>: Discover issues and improve experience</p>
<ul>
<li class="">After completing multi-layer implementation (database + backend + frontend), use it yourself first</li>
<li class="">Find bugs → Fix → Test again</li>
<li class="">No bugs → Try improving user experience → Back to design and implementation</li>
</ul>
</li>
<li class="">
<p><strong>Multiple Iterations</strong>: Until feature solidifies</p>
<ul>
<li class="">Repeat "use → find issues → improve" cycle</li>
<li class="">Until feature is stable, experience satisfactory</li>
</ul>
</li>
<li class="">
<p><strong>Integration Testing</strong>: Final check before commit</p>
<ul>
<li class="">After feature solidifies, generate integration test cases</li>
<li class="">Ensure end-to-end workflow correct</li>
<li class=""><strong>Only commit code when all integration tests pass</strong></li>
</ul>
</li>
</ol>
<p>This approach differs from traditional TDD (tests first, then code); it is better suited to the rapid-iteration style of AI-assisted development.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="auto-generated-unit-tests">Auto-Generated Unit Tests<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#auto-generated-unit-tests" class="hash-link" aria-label="Direct link to Auto-Generated Unit Tests" title="Direct link to Auto-Generated Unit Tests" translate="no">​</a></h4>
<p><strong>Task</strong>: Write unit tests for <code>ExchangeRateService</code></p>
<p>Claude automatically generates test code covering the main scenarios (caching, conversion, batch operations, exception handling).</p>
<p><strong>Results</strong>:</p>
<ul>
<li class="">✅ Coverage 85%+</li>
<li class="">✅ Found a bug: Negative amount not validated</li>
<li class="">❌ External API mock too complex, needed manual simplification</li>
</ul>
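<p>The missing validation can be illustrated with a guard clause like the one below. This is a sketch only; <code>ExchangeRateService</code>'s real signature may differ:</p>

```java
import java.math.BigDecimal;

// Sketch of the validation the generated tests exposed as missing:
// conversion should reject negative amounts up front. The method name
// and signature are illustrative, not the project's actual API.
public class ConversionGuard {

    public static BigDecimal toUsd(BigDecimal amount, BigDecimal usdRate) {
        if (amount.signum() < 0) {
            throw new IllegalArgumentException("amount must be non-negative: " + amount);
        }
        return amount.multiply(usdRate);
    }

    public static void main(String[] args) {
        System.out.println(toUsd(new BigDecimal("100"), new BigDecimal("0.5"))); // prints 50.0
    }
}
```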
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="integration-testing--hook-validation">Integration Testing &amp; Hook Validation<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#integration-testing--hook-validation" class="hash-link" aria-label="Direct link to Integration Testing &amp; Hook Validation" title="Direct link to Integration Testing &amp; Hook Validation" translate="no">​</a></h4>
<p><strong>Challenge</strong>: Prevent the AI from committing code while tests are failing</p>
<p><strong>Solution</strong>: Create a Pre-Commit Hook (<code>.claude/hooks/pre-tool-use.sh</code>) that forces test validation before every commit</p>
<p><strong>Value of Hooks</strong>:</p>
<ul>
<li class="">✅ <strong>Forced Validation</strong> - AI cannot skip tests</li>
<li class="">✅ <strong>Self-Correction</strong> - AI reads hook output and fixes issues</li>
<li class="">⚠️ <strong>Don't Overuse</strong> - Too many hooks confuse AI</li>
</ul>
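<p>A minimal sketch of such a hook, assuming a simplified input format: the real hook receives richer context from Claude Code, and <code>TEST_CMD</code> is a stand-in for <code>./mvnw test</code>.</p>

```shell
#!/usr/bin/env bash
# Simplified sketch of a pre-tool-use hook: refuse `git commit` unless the
# test suite passes. TEST_CMD is a stand-in for the real command
# (e.g. ./mvnw test); the real hook's input format is richer.

check_command() {
  local cmd="$1"                       # the command the AI is about to run
  case "$cmd" in
    *"git commit"*)
      if ! ${TEST_CMD:-./mvnw test} >/dev/null 2>&1; then
        echo "BLOCKED: tests failing, fix them before committing" >&2
        return 1                       # nonzero exit aborts the tool call
      fi
      ;;
  esac
  return 0
}

# Example: with a failing test command, the commit attempt is rejected.
TEST_CMD=false
check_command "git commit -m 'wip'" || echo "commit blocked"   # prints "commit blocked"
```

<p>The AI reads the <code>BLOCKED</code> message on stderr and goes back to fix the failing tests instead of committing.</p>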
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="36-deployment-phase-docker-containerization">3.6 Deployment Phase: Docker Containerization<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#36-deployment-phase-docker-containerization" class="hash-link" aria-label="Direct link to 3.6 Deployment Phase: Docker Containerization" title="Direct link to 3.6 Deployment Phase: Docker Containerization" translate="no">​</a></h3>
<p>My current deployment strategy:</p>
<ul>
<li class="">Docker image build (backend + frontend)</li>
<li class="">Docker Compose local orchestration</li>
<li class="">GitHub Actions auto-build and push to Docker Hub</li>
</ul>
<p>(Kubernetes deployment is not yet implemented; it is planned for the future.)</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="docker-containerization">Docker Containerization<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#docker-containerization" class="hash-link" aria-label="Direct link to Docker Containerization" title="Direct link to Docker Containerization" translate="no">​</a></h4>
<p>Claude generated a complete Docker configuration:</p>
<ul>
<li class=""><strong>Backend Dockerfile</strong>: Multi-stage build (Maven build + JRE runtime)</li>
<li class=""><strong>Frontend Dockerfile</strong>: Vue build + Nginx serving</li>
<li class=""><strong>docker-compose.yml</strong>: Complete orchestration of MySQL + backend + frontend</li>
</ul>
<p><strong>Evaluation</strong>:</p>
<ul>
<li class="">✅ Multi-stage build reduces image size</li>
<li class="">✅ Convenient for local development</li>
<li class="">✅ Clear environment variable management</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="github-actions-auto-build">GitHub Actions Auto-Build<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#github-actions-auto-build" class="hash-link" aria-label="Direct link to GitHub Actions Auto-Build" title="Direct link to GitHub Actions Auto-Build" translate="no">​</a></h4>
<p>Claude generated a complete GitHub Actions workflow that implements:</p>
<ul>
<li class="">Push to master: Test → Build → Push to Docker Hub</li>
<li class="">Pull Request: Run tests only</li>
</ul>
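<p>The workflow's shape can be sketched as follows. Job names, secrets, and image names are illustrative, not the repository's actual configuration:</p>

```yaml
# Hypothetical workflow sketch; names and secrets are illustrative.
name: ci
on:
  push:
    branches: [master]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with: { distribution: temurin, java-version: "17" }
      - run: ./mvnw test

  push-image:
    if: github.ref == 'refs/heads/master'   # PRs stop after the test job
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Tag with both latest and the commit SHA, as described above.
      - run: docker build -t user/app:latest -t user/app:${{ github.sha }} .
      - run: echo "${{ secrets.DOCKERHUB_TOKEN }}" | docker login -u user --password-stdin
      - run: docker push --all-tags user/app
```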
<p><strong>Real Usage Effect</strong>:</p>
<ul>
<li class="">✅ Complete automated CI/CD workflow</li>
<li class="">✅ Push both latest and commit SHA tags simultaneously</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="docker-build-push-skill">/docker-build-push Skill<a href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough#docker-build-push-skill" class="hash-link" aria-label="Direct link to /docker-build-push Skill" title="Direct link to /docker-build-push Skill" translate="no">​</a></h4>
<p>To simplify local builds, I created a <code>/docker-build-push</code> Skill that supports multi-architecture image builds (amd64/arm64).</p>
<p><strong>Effect</strong>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain"># Before: Multiple complex commands</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># Now: One command completes</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">$ /docker-build-push</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">✓ Building backend (amd64, arm64)...</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">✓ Building frontend (amd64, arm64)...</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">✓ Pushed to Docker Hub</span><br></span></code></pre></div></div>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="claude-code" term="claude-code"/>
        <category label="development-tools" term="development-tools"/>
        <category label="tutorial" term="tutorial"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[[2/6] Claude Code: Feature Overview and Capabilities]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2025/12/16/claude-code-overview</id>
        <link href="https://austinxyz.github.io/blogs/blog/2025/12/16/claude-code-overview"/>
        <updated>2025-12-16T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A comprehensive overview of Claude Code's features, from code generation to debugging and testing.]]></summary>
        <content type="html"><![CDATA[<p>This is part 2 of a 6-part series on Claude Code.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="21-background-of-claude-code">2.1 Background of Claude Code<a href="https://austinxyz.github.io/blogs/blog/2025/12/16/claude-code-overview#21-background-of-claude-code" class="hash-link" aria-label="Direct link to 2.1 Background of Claude Code" title="Direct link to 2.1 Background of Claude Code" translate="no">​</a></h2>
<p>Claude Code is the official CLI tool launched by Anthropic in 2024, designed to bring Claude's powerful capabilities directly into developers' command-line environments.</p>
<p><strong>Development Timeline</strong>:</p>
<ul>
<li class=""><strong>Early 2024</strong>: Anthropic released Claude 3 series models (Opus, Sonnet, Haiku)</li>
<li class=""><strong>Mid 2024</strong>: Launched Claude Code CLI, supporting basic code generation and file operations</li>
<li class=""><strong>Late 2024</strong>: Added advanced features like Skills, Hooks, Planning Mode</li>
<li class=""><strong>2025</strong>: Became one of the mainstream AI programming tools, forming a competitive landscape with Cursor and Cline</li>
</ul>
<p><strong>Core Positioning</strong>:</p>
<blockquote>
<p>Claude Code is not a "code completion tool," but an "AI software engineer assistant."</p>
</blockquote>
<p>Essential difference from GitHub Copilot:</p>
<ul>
<li class=""><strong>Copilot</strong>: Provides real-time code suggestions in the editor (passive)</li>
<li class=""><strong>Claude Code</strong>: Accepts high-level task descriptions, autonomously executes complete development workflows (active)</li>
</ul>
<p><strong>Technical Advantages</strong>:</p>
<ol>
<li class=""><strong>Large Context Window</strong>: Sonnet 4 supports 200K tokens, sufficient to understand entire small-to-medium projects</li>
<li class=""><strong>Tool Calling Capabilities</strong>: Can execute bash commands, read/write files, call APIs, manage git</li>
<li class=""><strong>Programmable Extensions</strong>: Deep customization of workflows through Skills and Hooks</li>
<li class=""><strong>MCP Integration</strong>: Model Context Protocol supports connecting external data sources and tools</li>
</ol>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="22-core-concepts-overview">2.2 Core Concepts Overview<a href="https://austinxyz.github.io/blogs/blog/2025/12/16/claude-code-overview#22-core-concepts-overview" class="hash-link" aria-label="Direct link to 2.2 Core Concepts Overview" title="Direct link to 2.2 Core Concepts Overview" translate="no">​</a></h2>
<p>Before diving into practice, understand these core concepts:</p>
<table><thead><tr><th>Concept</th><th>Purpose</th><th>Analogy</th><th>Official Docs</th></tr></thead><tbody><tr><td><strong>CLAUDE.md</strong></td><td>Project constitution for AI, defining rules and constraints</td><td>README + coding standards</td><td><a href="https://docs.claude.ai/docs/claude-code/claude-md" target="_blank" rel="noopener noreferrer" class="">CLAUDE.md Guide</a></td></tr><tr><td><strong>Skills</strong></td><td>Executable scripts encapsulating complex operations</td><td>npm scripts</td><td><a href="https://docs.claude.ai/docs/claude-code/skills" target="_blank" rel="noopener noreferrer" class="">Skills Docs</a></td></tr><tr><td><strong>Slash Commands</strong></td><td>Quick prompt templates</td><td>IDE code snippets</td><td><a href="https://docs.claude.ai/docs/claude-code/slash-commands" target="_blank" rel="noopener noreferrer" class="">Commands Docs</a></td></tr><tr><td><strong>Hooks</strong></td><td>Intercept and validate AI behavior</td><td>Git hooks</td><td><a href="https://docs.claude.ai/docs/claude-code/hooks" target="_blank" rel="noopener noreferrer" class="">Hooks Docs</a></td></tr><tr><td><strong>Planning Mode</strong></td><td>Plan approval workflow for large tasks</td><td>Technical design review</td><td><a href="https://docs.claude.ai/docs/claude-code/planning-mode" target="_blank" rel="noopener noreferrer" class="">Planning Mode</a></td></tr><tr><td><strong>Subagents</strong></td><td>Delegate subtasks to specialized AI</td><td>Microservices architecture</td><td><a href="https://docs.claude.ai/docs/claude-code/agents" target="_blank" rel="noopener noreferrer" class="">Agents Docs</a></td></tr><tr><td><strong>MCP</strong></td><td>Model Context Protocol, connecting external tools</td><td>API gateway</td><td><a href="https://modelcontextprotocol.io/" target="_blank" rel="noopener noreferrer" class="">MCP Spec</a></td></tr></tbody></table>
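<p>To make the CLAUDE.md concept concrete, here is a minimal sketch of what such a file might contain. The stack details mirror the Finance project described later; the specific rules and the referenced file path are hypothetical illustrations, not the project's actual configuration:</p>

```markdown
# CLAUDE.md (project constitution, read by Claude Code at session start)

## Tech Stack
- Backend: Spring Boot 3.2, Java 17
- Frontend: Vue 3
- Database: MySQL 8.0

## Rules
- Run `./mvnw test` before every commit.
- Commit messages follow Conventional Commits (feat, fix, docs, ...).
- Never edit migration scripts that have already been applied.

## Key References
- Architecture notes: @docs/architecture.md
```

<p>Because this file is loaded as context automatically, every rule in it shapes all subsequent AI behavior without being repeated in each prompt.</p>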
<p><strong>Core Workflow</strong>:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">User provides requirement → Claude reads CLAUDE.md → Calls Hooks for validation</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          ↓</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Executes Skills (automated operations)</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          ↓</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Generates code → Runs tests → Commits to git</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          ↓</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    Returns result → User reviews</span><br></span></code></pre></div></div>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="23-tool-comparison-claude-code-vs-competitors">2.3 Tool Comparison: Claude Code vs. Competitors<a href="https://austinxyz.github.io/blogs/blog/2025/12/16/claude-code-overview#23-tool-comparison-claude-code-vs-competitors" class="hash-link" aria-label="Direct link to 2.3 Tool Comparison: Claude Code vs. Competitors" title="Direct link to 2.3 Tool Comparison: Claude Code vs. Competitors" translate="no">​</a></h2>
<table><thead><tr><th>Feature</th><th>Claude Code</th><th>Cursor</th><th>Cline</th><th>GitHub Copilot</th><th>Gemini CLI</th></tr></thead><tbody><tr><td><strong>Code Generation</strong></td><td>⭐⭐⭐⭐⭐</td><td>⭐⭐⭐⭐⭐</td><td>⭐⭐⭐⭐</td><td>⭐⭐⭐⭐</td><td>⭐⭐⭐⭐</td></tr><tr><td><strong>Command Execution</strong></td><td>⭐⭐⭐⭐⭐</td><td>⭐⭐⭐</td><td>⭐⭐⭐⭐⭐</td><td>⭐⭐</td><td>⭐⭐⭐⭐</td></tr><tr><td><strong>Context Understanding</strong></td><td>⭐⭐⭐⭐⭐ (200K)</td><td>⭐⭐⭐⭐</td><td>⭐⭐⭐⭐</td><td>⭐⭐⭐</td><td>⭐⭐⭐⭐</td></tr><tr><td><strong>Programmability</strong></td><td>⭐⭐⭐⭐⭐ (SDK)</td><td>⭐⭐⭐</td><td>⭐⭐</td><td>⭐⭐</td><td>⭐⭐</td></tr><tr><td><strong>Security Control</strong></td><td>⭐⭐⭐⭐⭐ (Hooks)</td><td>⭐⭐⭐</td><td>⭐⭐⭐</td><td>⭐⭐⭐</td><td>⭐⭐⭐</td></tr><tr><td><strong>UI Friendliness</strong></td><td>⭐⭐ (Pure CLI)</td><td>⭐⭐⭐⭐⭐</td><td>⭐⭐⭐⭐⭐</td><td>⭐⭐⭐⭐</td><td>⭐⭐</td></tr><tr><td><strong>Learning Curve</strong></td><td>Medium</td><td>Low</td><td>Medium</td><td>Low</td><td>Medium</td></tr><tr><td><strong>MCP Support</strong></td><td>✅ Native</td><td>✅</td><td>✅ Integrated</td><td>❌</td><td>✅ Native</td></tr></tbody></table>
<p><strong>Real Usage Experience Comparison</strong> (based on Finance project):</p>
<p><strong>Claude Code</strong>:</p>
<ul>
<li class="">✅ <strong>Advantage</strong>: Full CLI autonomy; extremely efficient once you're used to it, and highly customizable</li>
<li class="">✅ <strong>Advantage</strong>: Skills and Hooks let you build powerful workflows (like <code>/mysql-exec</code>, <code>/git-commit-push</code>)</li>
<li class="">❌ <strong>Disadvantage</strong>: The pure command-line interface takes getting used to and lacks visual error hints</li>
<li class=""><strong>Suitable for</strong>: Developers comfortable with the command line; projects requiring heavy automation</li>
</ul>
<p><strong>Cline (VS Code Integration)</strong>:</p>
<ul>
<li class="">✅ <strong>Advantage</strong>: User-friendly interface, intuitive error hints, seamless IDE integration</li>
<li class="">✅ <strong>Advantage</strong>: Supports MCP and can connect to external tools and data sources</li>
<li class="">❌ <strong>Disadvantage</strong>: Less customizable than Claude Code; dependent on VS Code</li>
<li class=""><strong>Suitable for</strong>: Developers who live in an IDE and value visual feedback</li>
</ul>
<p><strong>Cursor</strong>:</p>
<ul>
<li class="">✅ <strong>Advantage</strong>: Works out of the box, lowest learning curve, excellent UI design</li>
<li class="">❌ <strong>Disadvantage</strong>: Weaker programmability; hard to build complex automation workflows</li>
<li class=""><strong>Suitable for</strong>: Personal projects, rapid prototypes, AI-programming beginners</li>
</ul>
<p><strong>Selection Recommendations</strong>:</p>
<ul>
<li class=""><strong>Personal projects/rapid prototypes</strong> → Cursor (UI friendly, out-of-the-box)</li>
<li class=""><strong>Enterprise projects/custom needs</strong> → Claude Code (programmable, auditable, deep automation)</li>
<li class=""><strong>Team collaboration/need visualization</strong> → Cline (VS Code integration, user-friendly interface)</li>
<li class=""><strong>Pure code completion</strong> → GitHub Copilot (lightweight, integrated in editor)</li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="24-why-choose-claude-code">2.4 Why Choose Claude Code<a href="https://austinxyz.github.io/blogs/blog/2025/12/16/claude-code-overview#24-why-choose-claude-code" class="hash-link" aria-label="Direct link to 2.4 Why Choose Claude Code" title="Direct link to 2.4 Why Choose Claude Code" translate="no">​</a></h2>
<p>Among many AI programming tools, I ultimately chose Claude Code as the primary tool for the Finance project, for the following core reasons:</p>
<p><strong>1. Complete CLI Autonomy</strong></p>
<p>Claude Code doesn't just generate code; it can also:</p>
<ul>
<li class="">Execute bash commands (run tests, build projects, start services)</li>
<li class="">Manage git (stage, commit, push, create branches)</li>
<li class="">Operate databases (through Skills encapsulating MySQL operations)</li>
<li class="">Build and deploy (Docker images, Kubernetes configs)</li>
</ul>
<p><strong>Real Case</strong>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">You: Add expense budget feature, including backend API, frontend page, database migration</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Claude:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">1. [Create database migration script]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">2. [Execute /mysql-exec database/add_budget_table.sql]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">3. [Generate ExpenseBudget.java, BudgetService.java, BudgetController.java]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">4. [Run ./mvnw test to verify backend]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">5. [Create BudgetManagementView.vue]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">6. [Run npm run build to verify frontend]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">7. [Execute /git-commit-push "feat: add expense budget management"]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Done! Feature complete and pushed to GitHub.</span><br></span></code></pre></div></div>
<p><strong>2. Powerful Context Understanding</strong></p>
<ul>
<li class=""><strong>200K token window</strong> (Sonnet 4): Can "see" the entire Finance project</li>
<li class=""><strong>Smart Context Selection</strong>: Automatically reads relevant files (via <code>@mentions</code> in CLAUDE.md)</li>
<li class=""><strong>Session Memory</strong>: Maintains context consistency across multiple requests</li>
</ul>
<p><strong>Real Data</strong>:</p>
<ul>
<li class="">Finance project has 127 Java files and 33 Vue components</li>
<li class="">Claude Code can understand cross-layer dependencies (e.g., which backend API frontend calls, database table structure)</li>
<li class="">Generated code maintains naming and architecture consistency (referencing design principles in CLAUDE.md)</li>
</ul>
<p><strong>3. Programmability (Skills &amp; Hooks)</strong></p>
<p>Skills encapsulate complex operations; Hooks intercept and control AI behavior.</p>
<p><strong>My Skills Examples</strong>:</p>
<ul>
<li class=""><code>/setup-java</code>: Configure Java 17 environment + load database credentials</li>
<li class=""><code>/mysql-exec</code>: Execute SQL files/queries, automatically read <code>.env</code> credentials</li>
<li class=""><code>/git-commit-push</code>: Stage → Commit (Conventional Commits) → Push</li>
<li class=""><code>/docker-build-push</code>: Multi-architecture image build (amd64/arm64)</li>
</ul>
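<p>As a concrete illustration of what a skill can encapsulate, here is a minimal sketch of the kind of check a <code>/git-commit-push</code>-style skill could run before committing: validating that the message follows Conventional Commits. This is a hypothetical helper, not the author's actual skill implementation:</p>

```shell
#!/usr/bin/env bash
# Hypothetical pre-commit check for a /git-commit-push style skill:
# accept only messages in Conventional Commits form.

valid_commit_msg() {
  local msg="$1"
  # type, optional (scope), then ": " and a non-empty subject
  local re='^(feat|fix|docs|style|refactor|test|chore)(\([a-z0-9-]+\))?: .+'
  [[ $msg =~ $re ]]
}

if valid_commit_msg "feat: add expense budget API"; then
  echo "commit message accepted"
fi
```

<p>A skill that commits on your behalf would run a check like this first and abort the commit when validation fails, keeping the git history machine-parsable.</p>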
<p><strong>Hooks Example</strong> (prevent AI from committing when tests fail):</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain"># .claude/hooks/pre-tool-use.sh</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">if [[ "$TOOL_NAME" == "Bash" ]] &amp;&amp; [[ "$COMMAND" =~ "git commit" ]]; then</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  if [ ! -f /tmp/tests-passed ]; then</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    echo "❌ Tests have not passed. Run tests first."</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    exit 1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  fi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">fi</span><br></span></code></pre></div></div>
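<p>The hook above only checks for a <code>/tmp/tests-passed</code> marker; something has to create and clear that marker. A companion step after the test run could manage it. The script below is a hypothetical sketch that follows the same marker-file convention, not part of the author's setup:</p>

```shell
#!/usr/bin/env bash
# Hypothetical companion to the pre-tool-use hook: run the test command,
# and only if it succeeds, create the marker the hook looks for.

run_tests_and_mark() {
  rm -f /tmp/tests-passed            # clear any stale marker first
  if "$@"; then                      # e.g. run_tests_and_mark ./mvnw test
    touch /tmp/tests-passed          # commit is now allowed by the hook
  else
    echo "tests failed; commit stays blocked"
    return 1
  fi
}
```

<p>Clearing the marker before each run matters: a leftover marker from an earlier session would otherwise let a commit through on failing tests.</p>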
<p><strong>4. Security and Controllability</strong></p>
<ul>
<li class=""><strong>Permission Management</strong>: Configure which commands AI can execute (e.g., <code>rm -rf</code>)</li>
<li class=""><strong>Hook Interception</strong>: Validate before AI executes operations (e.g., must run tests before commit)</li>
<li class=""><strong>Audit Logs</strong>: All operations recorded in <code>.claude/logs/</code></li>
<li class=""><strong>Sandbox Mode</strong>: Test AI behavior in isolated environment</li>
</ul>
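<p>Permission management is configured declaratively. As a sketch, a <code>.claude/settings.json</code> permissions block uses <code>Tool(pattern)</code> rule strings; treat the exact patterns below as illustrative rather than a verified configuration:</p>

```json
{
  "permissions": {
    "allow": [
      "Bash(./mvnw test:*)",
      "Bash(npm run build)"
    ],
    "deny": [
      "Bash(rm -rf:*)",
      "Read(./.env)"
    ]
  }
}
```

<p>Deny rules are intended to take precedence over allow rules, so destructive commands and secret files stay off-limits even in an otherwise permissive session.</p>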
<p><strong>Real Results</strong>:
Across the 71 commits of the Finance project, not a single one caused code loss or corruption through an AI misstep.</p>
<p><strong>5. Real Data Validation</strong></p>
<p>In the Finance project, <strong>100% of the code was generated by Claude Code</strong>, including:</p>
<ul>
<li class="">127 Java backend files (Controllers, Services, Repositories, Models)</li>
<li class="">33 Vue components (complete frontend interfaces)</li>
<li class="">25 database tables with all migration scripts and stored procedures</li>
<li class="">Docker containerization configs and CI/CD workflows</li>
</ul>
<p><strong>Project Scale</strong>:</p>
<ul>
<li class=""><strong>Backend Code</strong>: 15,748 lines (Java)</li>
<li class=""><strong>Frontend Code</strong>: 25,131 lines (Vue/JavaScript/TypeScript, 33 components)</li>
<li class=""><strong>Database</strong>: 25 tables, 11 migration scripts</li>
<li class=""><strong>Total Code</strong>: Approximately 41,000 lines</li>
<li class=""><strong>Development Cycle</strong>: November 2025 - present</li>
<li class=""><strong>Net Development Time</strong>: Approximately 20 hours (fragmented time on weekends and evenings)</li>
<li class=""><strong>Efficiency Improvement</strong>: Estimated 4-5x compared to traditional development</li>
</ul>
<p><strong>Decision Factors for Choosing Claude Code</strong>:</p>
<ul>
<li class="">✅ Project requires high automation (backend + frontend + database + deployment)</li>
<li class="">✅ I'm comfortable with command-line operations, don't depend on IDE</li>
<li class="">✅ Need programmability (Skills/Hooks) to build standardized workflows</li>
<li class="">✅ Value security and auditability (enterprise-level requirements)</li>
</ul>
<p><strong>I'd recommend Claude Code to you as well if you</strong>:</p>
<ul>
<li class="">Are comfortable with bash/zsh and other command-line tools</li>
<li class="">Work on projects that require cross-layer changes (code + database + deployment)</li>
<li class="">Want to build reusable, automated workflows</li>
<li class="">Value code quality and security controls</li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="25-quick-start">2.5 Quick Start<a href="https://austinxyz.github.io/blogs/blog/2025/12/16/claude-code-overview#25-quick-start" class="hash-link" aria-label="Direct link to 2.5 Quick Start" title="Direct link to 2.5 Quick Start" translate="no">​</a></h2>
<p><strong>Install Claude Code</strong>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain"># macOS/Linux</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">npm install -g @anthropic-ai/claude-code</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># Configure API Key</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">export ANTHROPIC_API_KEY=your-api-key</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># Launch</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">claude</span><br></span></code></pre></div></div>
<p><strong>First Task</strong>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">You: Create a simple Express.js server, listen on port 3000, return Hello World</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Claude: [Generate server.js, package.json, run npm install, start server]</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">You: Visit http://localhost:3000 to verify</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">Claude: [Provide test command curl http://localhost:3000]</span><br></span></code></pre></div></div>
<p><strong>Recommended Learning Path</strong>:</p>
<ol>
<li class="">Read official documentation: <a href="https://docs.claude.ai/docs/claude-code" target="_blank" rel="noopener noreferrer" class="">https://docs.claude.ai/docs/claude-code</a></li>
<li class="">Try simple tasks (generate code, run tests)</li>
<li class="">Learn CLAUDE.md configuration (detailed in Chapter 3)</li>
<li class="">Create first Skill (examples in Chapter 3)</li>
<li class="">Configure Hooks to control AI behavior</li>
</ol>
<p><strong>Next Chapter Preview</strong>:
Chapter 3 will dive into the complete development lifecycle of the Finance project, showing how to build a production-grade full-stack application from scratch using Claude Code.</p>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="claude-code" term="claude-code"/>
        <category label="development-tools" term="development-tools"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[[1/6] Claude Code: Introduction to AI-Assisted Development]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2025/12/15/claude-code-introduction</id>
        <link href="https://austinxyz.github.io/blogs/blog/2025/12/15/claude-code-introduction"/>
        <updated>2025-12-15T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[An introduction to Claude Code, Anthropic's AI-powered development assistant that's changing how we write software.]]></summary>
        <content type="html"><![CDATA[<p>This is part 1 of a 6-part series on Claude Code.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="11-from-copilot-to-vibe-coding">1.1 From Copilot to Vibe Coding<a href="https://austinxyz.github.io/blogs/blog/2025/12/15/claude-code-introduction#11-from-copilot-to-vibe-coding" class="hash-link" aria-label="Direct link to 1.1 From Copilot to Vibe Coding" title="Direct link to 1.1 From Copilot to Vibe Coding" translate="no">​</a></h2>
<p>During 2021-2023, AI-assisted programming was primarily <strong>"completion-based"</strong> tools: GitHub Copilot provided next-line suggestions as you typed code. While revolutionary at the time, this was still the traditional "human writes code, AI assists" model.</p>
<p>In 2024, GitHub Copilot matured and gained widespread adoption. In the first half of 2025, with the emergence of tools like Claude Code, Cursor, and Gemini CLI, a new programming paradigm gradually took shape. By the second half of 2025, the "Vibe Coding" model matured, fundamentally changing the programming paradigm:</p>
<ul>
<li class=""><strong>Traditional Programming</strong>: Developer → Write code → Compile/Test → Debug</li>
<li class=""><strong>Vibe Coding</strong>: Developer → Describe requirements → AI generates code → Human review</li>
</ul>
<p>As Martin Fowler said in his late 2025 interview:</p>
<blockquote>
<p>"AI won't replace programmers, but it will change the nature of programming. Future programmers will be more like 'architects + reviewers' rather than 'code craftsmen'."</p>
</blockquote>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="12-what-is-vibe-coding">1.2 What is Vibe Coding<a href="https://austinxyz.github.io/blogs/blog/2025/12/15/claude-code-introduction#12-what-is-vibe-coding" class="hash-link" aria-label="Direct link to 1.2 What is Vibe Coding" title="Direct link to 1.2 What is Vibe Coding" translate="no">​</a></h2>
<p>"Vibe Coding" is the community's colorful description of a new generation of AI-assisted programming. The core concept is:</p>
<blockquote>
<p><strong>"Fire and Forget"</strong> - Like launching a missile: set the target and context, let the AI autonomously complete the task, and humans only verify the final PR.</p>
</blockquote>
<p><strong>Three Key Characteristics</strong>:</p>
<ol>
<li class=""><strong>High-level Intent Expression</strong> - Use natural language to describe "what to do" rather than "how to do it"</li>
<li class=""><strong>Autonomous Task Execution</strong> - AI plans steps, writes code, runs tests, submits code</li>
<li class=""><strong>Result-focused Validation</strong> - Humans focus on code review and final results, not the process</li>
</ol>
<p>This model is fundamentally different from traditional "code completion":</p>
<ul>
<li class=""><strong>Copilot Mode</strong>: Humans write code, AI suggests the next line (AI assists)</li>
<li class=""><strong>Vibe Coding</strong>: Humans provide requirements, AI completes the entire development workflow (AI leads)</li>
</ul>
<p><strong>Term Origin</strong>:
The term "Vibe Coding" was popularized by Andrej Karpathy in an early-2025 post on X and quickly spread through developer communities. It vividly captures the new paradigm of programming by conveying intent (the "vibe") rather than writing the code yourself.</p>
<p><strong>Advocates</strong>:</p>
<ul>
<li class=""><strong>Anthropic</strong>: Provides complete CLI autonomy through Claude Code, the main driver of Vibe Coding</li>
<li class=""><strong>Cursor</strong>: Deeply integrates AI into IDE, lowering the barrier to Vibe Coding</li>
<li class=""><strong>Replit</strong>: AI Agent mode, users describe requirements to generate complete applications</li>
</ul>
<p><strong>Current Status</strong> (2025):</p>
<ul>
<li class="">Has moved from "tech preview" to "production ready" stage</li>
<li class="">Suitable for personal projects and small-to-medium teams; large enterprises are still in the pilot phase</li>
<li class="">Mainstream tools: Claude Code, Cursor, Cline, Gemini CLI</li>
<li class="">Active community, many best practices and workflows forming</li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="13-the-value-of-vibe-coding">1.3 The Value of Vibe Coding<a href="https://austinxyz.github.io/blogs/blog/2025/12/15/claude-code-introduction#13-the-value-of-vibe-coding" class="hash-link" aria-label="Direct link to 1.3 The Value of Vibe Coding" title="Direct link to 1.3 The Value of Vibe Coding" translate="no">​</a></h2>
<p>Why can Vibe Coding change software development efficiency?</p>
<p><strong>1. Eliminates Repetitive Labor</strong></p>
<ul>
<li class="">Traditional: CRUD interfaces require manually writing Controller, Service, Repository, Entity, DTO...</li>
<li class="">Vibe Coding: Describe requirements in one sentence, AI generates all files with consistent naming</li>
</ul>
<p><strong>2. Reduces Context Switching Costs</strong></p>
<ul>
<li class="">Traditional: Backend → Frontend → Database → Deployment, switching between multiple tech stacks</li>
<li class="">Vibe Coding: AI masters full-stack technology, completes cross-layer changes in one go</li>
</ul>
<p><strong>3. Enables "Think-speed Programming"</strong></p>
<ul>
<li class="">Traditional: From idea to code takes hours or even days</li>
<li class="">Vibe Coding: From idea to runnable code takes only minutes</li>
</ul>
<p><strong>Real Case</strong> (Finance Project):</p>
<ul>
<li class="">Adding expense budget feature (backend API + frontend page + database script)<!-- -->
<ul>
<li class="">Traditional approach: ~4-6 hours</li>
<li class="">Vibe Coding: 45 minutes (including testing and fixes)</li>
<li class=""><strong>Efficiency improvement: 5-8x</strong></li>
</ul>
</li>
</ul>
<p><strong>4. Focus on "What" Rather Than "How"</strong></p>
<p>In the Finance project, my focus shifted from "writing code" to:</p>
<ul>
<li class=""><strong>Requirement iteration and UI refinement</strong> (35%) - New ideas naturally emerge once you see the interface, enabling rapid iteration</li>
<li class=""><strong>Architecture refactoring and pattern application</strong> (30%) - While adjusting one interface, refactoring similar features alongside it</li>
<li class=""><strong>Learning best practices and workflow optimization</strong> (20%) - Exploring the best workflows with Claude Code and reducing wait times</li>
<li class=""><strong>Code review and quality control</strong> (15%) - Ensuring generated code meets standards</li>
</ul>
<p><strong>Real Experience</strong>:</p>
<ul>
<li class="">Early on, I took many detours, wasting time waiting for AI generation and debugging</li>
<li class="">As I became familiar with the tools (Skills, Hooks, CLAUDE.md), efficiency improved significantly</li>
<li class="">The iteration speed of "see the interface → get an idea → implement it immediately" is something traditional development simply can't match</li>
</ul>
<p>This is what Martin Fowler meant by "programmers more like architects + reviewers," but the actual work is more flexible—a hybrid role of <strong>architect + product manager + reviewer</strong>.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="14-the-value-of-this-article">1.4 The Value of This Article<a href="https://austinxyz.github.io/blogs/blog/2025/12/15/claude-code-introduction#14-the-value-of-this-article" class="hash-link" aria-label="Direct link to 1.4 The Value of This Article" title="Direct link to 1.4 The Value of This Article" translate="no">​</a></h2>
<p>While there are many introductions to AI programming tools, this article's unique aspects are:</p>
<ul>
<li class="">✅ <strong>Real Project Validation</strong> - All practices come from the Finance project (GitHub: <a href="https://github.com/austinxyz/finance" target="_blank" rel="noopener noreferrer" class="">austinxyz/finance</a>)</li>
<li class="">✅ <strong>100% AI Generated</strong> - Approximately 41,000 lines of code entirely generated by Claude Code, true Vibe Coding practice</li>
<li class="">✅ <strong>Full Lifecycle Coverage</strong> - Complete software engineering lifecycle from requirements analysis to production deployment</li>
<li class="">✅ <strong>Concrete and Actionable</strong> - Provides actual configuration files, slash commands, skills code examples</li>
<li class="">✅ <strong>Theory Meets Practice</strong> - References Martin Fowler interviews and Shrivu Shankar's in-depth experience</li>
<li class="">✅ <strong>Data-driven</strong> - Net development time approximately 20 hours, all data verifiable</li>
</ul>
<p><strong>Finance Project Overview</strong>:</p>
<ul>
<li class=""><strong>Tech Stack</strong>: Spring Boot 3.2 + Java 17 + Vue 3 + MySQL 8.0 + Docker</li>
<li class=""><strong>Feature Scope</strong>: Asset management, liability tracking, expense analysis, investment management, multi-currency support, data visualization</li>
<li class=""><strong>Code Scale</strong>: 127 Java files, 33 Vue components, 25 database tables, approximately 41,000 lines of code</li>
<li class=""><strong>Development Mode</strong>: Completely using Claude Code, building efficient workflows through Skills, Hooks, CLAUDE.md</li>
</ul>
<p><strong>Target Audience</strong>:</p>
<ul>
<li class="">Individual developers wanting to try Vibe Coding</li>
<li class="">Technical managers considering introducing AI tools to teams</li>
<li class="">Practitioners interested in software engineering methodology evolution</li>
<li class="">Full-stack engineers looking to improve development efficiency</li>
</ul>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="claude-code" term="claude-code"/>
        <category label="development-tools" term="development-tools"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[[0/6] Claude Code in Practice: A Complete Guide to AI-Assisted Development]]></title>
        <id>https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview</id>
        <link href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview"/>
        <updated>2025-12-14T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A comprehensive 6-part series exploring AI-assisted programming through the development of a family finance system. Learn Vibe Coding best practices with real-world examples.]]></summary>
        <content type="html"><![CDATA[<blockquote>
<p>Exploring best practices for AI-assisted programming (Vibe Coding) through the real development journey of the Finance project</p>
</blockquote>
<ul>
<li class=""><strong>Author</strong>: Austin Xu</li>
<li class=""><strong>Project</strong>: Family Finance Management System (Spring Boot + Vue 3 Full-Stack Application)</li>
<li class=""><strong>Development Period</strong>: November 2025 - Present</li>
<li class=""><strong>Code Scale</strong>: ~41,000 lines of code, 127 Java files, 33 Vue components, 25 database tables</li>
<li class=""><strong>Net Development Time</strong>: ~20 hours (weekend and evening spare time)</li>
<li class=""><strong>GitHub</strong>: <a href="https://github.com/austinxyz/finance" target="_blank" rel="noopener noreferrer" class="">https://github.com/austinxyz/finance</a></li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="series-overview">Series Overview<a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#series-overview" class="hash-link" aria-label="Direct link to Series Overview" title="Direct link to Series Overview" translate="no">​</a></h2>
<p>This is a complete guide to Claude Code based on real-world experience building a production application. Through developing a family finance management system entirely with Claude Code, I've documented practical insights, best practices, and lessons learned that go beyond typical AI tool tutorials.</p>
<p><strong>What makes this series unique:</strong></p>
<ul>
<li class="">✅ <strong>Real Project Validation</strong> - All practices come from the Finance project</li>
<li class="">✅ <strong>100% AI Generated</strong> - Approximately 41,000 lines of code entirely generated by Claude Code</li>
<li class="">✅ <strong>Full Lifecycle Coverage</strong> - Complete software engineering lifecycle from requirements to deployment</li>
<li class="">✅ <strong>Concrete and Actionable</strong> - Actual configuration files, commands, and code examples</li>
<li class="">✅ <strong>Theory Meets Practice</strong> - References Martin Fowler interviews and industry best practices</li>
<li class="">✅ <strong>Data-driven</strong> - Quantified productivity metrics and verifiable results</li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="table-of-contents">Table of Contents<a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#table-of-contents" class="hash-link" aria-label="Direct link to Table of Contents" title="Direct link to Table of Contents" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="part-1-introduction---a-new-programming-paradigm"><strong><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/15/claude-code-introduction">Part 1: Introduction - A New Programming Paradigm</a></strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#part-1-introduction---a-new-programming-paradigm" class="hash-link" aria-label="Direct link to part-1-introduction---a-new-programming-paradigm" title="Direct link to part-1-introduction---a-new-programming-paradigm" translate="no">​</a></h3>
<p>Discover how AI-assisted programming evolved from code completion to "Vibe Coding"</p>
<p><strong>Topics Covered:</strong></p>
<ul>
<li class="">From Copilot to Vibe Coding</li>
<li class="">What is Vibe Coding and why it matters</li>
<li class="">The Value of Vibe Coding (with real productivity data)</li>
<li class="">Understanding this new programming paradigm</li>
</ul>
<p><strong>Key Takeaway:</strong> Learn how to shift from writing code line-by-line to describing what you want and letting AI handle implementation.</p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="part-2-claude-code-overview"><strong><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/16/claude-code-overview">Part 2: Claude Code Overview</a></strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#part-2-claude-code-overview" class="hash-link" aria-label="Direct link to part-2-claude-code-overview" title="Direct link to part-2-claude-code-overview" translate="no">​</a></h3>
<p>Comprehensive introduction to Claude Code's capabilities and ecosystem</p>
<p><strong>Topics Covered:</strong></p>
<ul>
<li class="">Background and evolution of Claude Code</li>
<li class="">Core concepts: Skills, Hooks, MCP, CLAUDE.md</li>
<li class="">Tool Comparison: Claude Code vs. Cursor, Copilot, Gemini CLI</li>
<li class="">Why Choose Claude Code</li>
<li class="">Quick Start guide</li>
</ul>
<p><strong>Key Takeaway:</strong> Understand Claude Code's unique strengths and when to use it versus other AI coding tools.</p>
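<p>As a small taste of what Part 2 covers, here is a minimal, hypothetical <code>CLAUDE.md</code> fragment — the project-instructions file Claude Code reads at startup. The section names and commands below are illustrative examples for a project like Finance, not copied from the actual repository:</p>

```markdown
# CLAUDE.md (illustrative sketch)

## Project
Family finance system: Spring Boot 3.2 backend, Vue 3 frontend, MySQL 8.0.

## Conventions
- Backend code lives under `backend/src/main/java`; REST controllers return DTOs, never JPA entities.
- Frontend components go in `frontend/src/components`, one feature per component.

## Commands
- Build backend: `mvn -f backend/pom.xml package`
- Run frontend dev server: `npm --prefix frontend run dev`
```

<p>A file like this gives the AI persistent context (conventions, build commands) so you don't repeat them in every prompt; Part 2 explains how it fits together with Skills, Hooks, and MCP.</p>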
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="part-3-full-development-cycle-walkthrough"><strong><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough">Part 3: Full Development Cycle Walkthrough</a></strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#part-3-full-development-cycle-walkthrough" class="hash-link" aria-label="Direct link to part-3-full-development-cycle-walkthrough" title="Direct link to part-3-full-development-cycle-walkthrough" translate="no">​</a></h3>
<p>Step-by-step walkthrough of building a real application with Claude Code</p>
<p><strong>Topics Covered:</strong></p>
<ul>
<li class=""><strong>Initialization Phase</strong>: Setting up project skeleton</li>
<li class=""><strong>Requirements Phase</strong>: From vague ideas to precise specifications</li>
<li class=""><strong>Design Phase</strong>: Architectural decisions and AI's boundaries</li>
<li class=""><strong>Implementation Phase</strong>: Efficient collaboration and automation</li>
<li class=""><strong>Testing Phase</strong>: AI-driven quality assurance</li>
<li class=""><strong>Deployment Phase</strong>: From testing to production</li>
</ul>
<p><strong>Key Takeaway:</strong> See exactly how Claude Code handles each stage of development with concrete examples and screenshots.</p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="part-4-evolution-of-software-development-methodologies"><strong><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution">Part 4: Evolution of Software Development Methodologies</a></strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#part-4-evolution-of-software-development-methodologies" class="hash-link" aria-label="Direct link to part-4-evolution-of-software-development-methodologies" title="Direct link to part-4-evolution-of-software-development-methodologies" translate="no">​</a></h3>
<p>How AI is transforming established software engineering practices</p>
<p><strong>Topics Covered:</strong></p>
<ul>
<li class=""><strong>Agile Development</strong>: From Sprints to Prompts</li>
<li class=""><strong>Test-Driven Development (TDD)</strong>: New forms in the AI era</li>
<li class=""><strong>Object-Oriented Design</strong>: Can AI understand design patterns?</li>
<li class=""><strong>Code Review</strong>: New balance in human-AI collaboration</li>
</ul>
<p><strong>Key Takeaway:</strong> Adapt traditional methodologies for the AI era while maintaining engineering rigor.</p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="part-5-use-cases-and-limitations"><strong><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations">Part 5: Use Cases and Limitations</a></strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#part-5-use-cases-and-limitations" class="hash-link" aria-label="Direct link to part-5-use-cases-and-limitations" title="Direct link to part-5-use-cases-and-limitations" translate="no">​</a></h3>
<p>Honest assessment of when Claude Code excels and when it struggles</p>
<p><strong>Topics Covered:</strong></p>
<ul>
<li class=""><strong>Most Suitable Scenarios</strong>
<ul>
<li class="">Full-stack web applications</li>
<li class="">CRUD-heavy business logic</li>
<li class="">Infrastructure automation</li>
<li class="">Prototyping and MVPs</li>
</ul>
</li>
<li class=""><strong>Unsuitable Scenarios, or Those Requiring Caution</strong>
<ul>
<li class="">High-performance computing</li>
<li class="">Complex algorithms</li>
<li class="">Novel research problems</li>
</ul>
</li>
<li class=""><strong>Applicability in Team Collaboration</strong>
<ul>
<li class="">Individual vs. team usage</li>
<li class="">Integration with existing workflows</li>
</ul>
</li>
</ul>
<p><strong>Key Takeaway:</strong> Make informed decisions about when and how to adopt Claude Code in your projects.</p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="part-6-conclusion-and-outlook"><strong><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion">Part 6: Conclusion and Outlook</a></strong><a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#part-6-conclusion-and-outlook" class="hash-link" aria-label="Direct link to part-6-conclusion-and-outlook" title="Direct link to part-6-conclusion-and-outlook" translate="no">​</a></h3>
<p>Synthesis of lessons learned and future outlook</p>
<p><strong>Topics Covered:</strong></p>
<ul>
<li class=""><strong>Development Data Summary</strong> of the Finance Project</li>
<li class=""><strong>Core Advantages</strong> of Claude Code</li>
<li class=""><strong>Current Limitations</strong> and workarounds</li>
<li class=""><strong>Recommendations for Readers</strong>
<ul>
<li class="">For individual developers</li>
<li class="">For technical managers</li>
<li class="">For teams considering adoption</li>
</ul>
</li>
</ul>
<p><strong>Key Takeaway:</strong> Practical roadmap for integrating Claude Code into your development workflow.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reading-recommendations">Reading Recommendations<a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#reading-recommendations" class="hash-link" aria-label="Direct link to Reading Recommendations" title="Direct link to Reading Recommendations" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="for-different-audiences">For Different Audiences<a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#for-different-audiences" class="hash-link" aria-label="Direct link to For Different Audiences" title="Direct link to For Different Audiences" translate="no">​</a></h3>
<p><strong>Individual Developers:</strong>
Start with Part 1 to understand the paradigm shift, then jump to Part 3 for the hands-on walkthrough. Circle back to Parts 2, 4, and 5 as you encounter specific challenges.</p>
<p><strong>Technical Managers:</strong>
Read Parts 1, 5, and 6 to understand business value, limitations, and team adoption strategies. Skim Part 3 for technical depth.</p>
<p><strong>Students/Beginners:</strong>
Read in order from Part 1 through Part 6. Try building small projects with Claude Code as you progress through each part.</p>
<p><strong>Experienced Engineers:</strong>
Start with Part 4 (methodology evolution) to see how AI changes established practices, then read Part 5 (limitations) to understand boundaries. Parts 3 and 6 provide concrete implementation details.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="reading-tips">Reading Tips<a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#reading-tips" class="hash-link" aria-label="Direct link to Reading Tips" title="Direct link to Reading Tips" translate="no">​</a></h3>
<ol>
<li class=""><strong>Read in Order (Recommended)</strong>: Chapters build on each other and are best understood sequentially</li>
<li class=""><strong>Practice-Oriented</strong>: Each chapter includes real-world cases; try implementing them in your projects</li>
<li class=""><strong>Critical Thinking</strong>: Opinions are based on personal experience; adapt to your context</li>
<li class=""><strong>Hands-On Practice</strong>: Install Claude Code and experiment with small projects while reading</li>
</ol>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="project-context-finance-system">Project Context: Finance System<a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#project-context-finance-system" class="hash-link" aria-label="Direct link to Project Context: Finance System" title="Direct link to Project Context: Finance System" translate="no">​</a></h2>
<p>To make this guide concrete, all examples come from developing a family finance management system:</p>
<p><strong>Tech Stack:</strong></p>
<ul>
<li class="">Backend: Spring Boot 3.2 + Java 17</li>
<li class="">Frontend: Vue 3 + Vite</li>
<li class="">Database: MySQL 8.0</li>
<li class="">Deployment: Docker + AWS</li>
</ul>
<p><strong>Features Implemented:</strong></p>
<ul>
<li class="">Asset management (bank accounts, investments, real estate)</li>
<li class="">Liability tracking (loans, mortgages, credit cards)</li>
<li class="">Expense analysis with categories</li>
<li class="">Investment portfolio tracking</li>
<li class="">Multi-currency support</li>
<li class="">Data visualization dashboards</li>
<li class="">Budget planning tools</li>
</ul>
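<p>One feature above, multi-currency support, reduces to a small aggregation problem: convert each asset into a reporting currency, then sum. A self-contained Java sketch of that idea — the <code>Asset</code> record, the sample exchange rates, and the <code>toBase</code> helper are illustrative, not taken from the Finance codebase:</p>

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.List;
import java.util.Map;

public class NetWorth {
    // Hypothetical shape; the real project models assets as JPA entities.
    record Asset(String name, BigDecimal amount, String currency) {}

    // Sample rates into the reporting currency (USD), for illustration only.
    static final Map<String, BigDecimal> RATES = Map.of(
            "USD", BigDecimal.ONE,
            "CNY", new BigDecimal("0.14"),
            "EUR", new BigDecimal("1.08"));

    // Convert one asset's amount into the reporting currency.
    static BigDecimal toBase(Asset a) {
        return a.amount().multiply(RATES.get(a.currency()))
                .setScale(2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        List<Asset> assets = List.of(
                new Asset("Checking", new BigDecimal("5000"), "USD"),
                new Asset("Savings (CNY)", new BigDecimal("10000"), "CNY"));
        BigDecimal total = assets.stream()
                .map(NetWorth::toBase)
                .reduce(BigDecimal.ZERO, BigDecimal::add);
        System.out.println("Net worth (USD): " + total); // 5000.00 + 1400.00
    }
}
```

<p>In the real system the rates would come from a database table rather than a hard-coded map, but the convert-then-aggregate shape is the same.</p>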
<p><strong>Development Metrics:</strong></p>
<ul>
<li class=""><strong>Code Volume</strong>: ~41,000 lines</li>
<li class=""><strong>File Count</strong>: 127 Java files, 33 Vue components</li>
<li class=""><strong>Database</strong>: 25 tables with complex relationships</li>
<li class=""><strong>Time Investment</strong>: ~20 hours net development time</li>
<li class=""><strong>AI Contribution</strong>: 100% of code generated by Claude Code</li>
<li class=""><strong>Human Contribution</strong>: Requirements, architecture decisions, testing, refinement</li>
</ul>
<p>This real project provides concrete examples throughout the series rather than theoretical scenarios.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="related-resources">Related Resources<a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#related-resources" class="hash-link" aria-label="Direct link to Related Resources" title="Direct link to Related Resources" translate="no">​</a></h2>
<p><strong>Project Repository:</strong></p>
<ul>
<li class=""><a href="https://github.com/austinxyz/finance" target="_blank" rel="noopener noreferrer" class="">Finance System GitHub</a> - Full source code and documentation</li>
</ul>
<p><strong>Industry Insights:</strong></p>
<ul>
<li class=""><a href="https://youtu.be/CQmI4XKTa0U" target="_blank" rel="noopener noreferrer" class="">Martin Fowler Interview: AI's Impact on Software Engineering</a></li>
<li class=""><a href="https://blog.sshh.io/p/how-i-use-every-claude-code-feature" target="_blank" rel="noopener noreferrer" class="">Shrivu Shankar's Claude Code Practical Experience</a></li>
</ul>
<p><strong>Official Documentation:</strong></p>
<ul>
<li class=""><a href="https://docs.claude.ai/docs/claude-code" target="_blank" rel="noopener noreferrer" class="">Claude Code Documentation</a></li>
<li class=""><a href="https://github.com/austinxyz/finance/tree/master/requirement/%E9%9C%80%E6%B1%82%E8%AF%B4%E6%98%8E.md" target="_blank" rel="noopener noreferrer" class="">Finance Project Requirements</a></li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-youll-learn">What You'll Learn<a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#what-youll-learn" class="hash-link" aria-label="Direct link to What You'll Learn" title="Direct link to What You'll Learn" translate="no">​</a></h2>
<p>By the end of this series, you'll be able to:</p>
<ul>
<li class="">✅ <strong>Understand Vibe Coding</strong> - The new programming paradigm enabled by AI</li>
<li class="">✅ <strong>Use Claude Code Effectively</strong> - Master Skills, Hooks, MCP, and CLAUDE.md</li>
<li class="">✅ <strong>Build Full Applications</strong> - Complete development cycles from idea to deployment</li>
<li class="">✅ <strong>Adapt Methodologies</strong> - Apply Agile, TDD, and OOP in the AI era</li>
<li class="">✅ <strong>Make Informed Decisions</strong> - Know when to use (or not use) AI assistance</li>
<li class="">✅ <strong>Measure Productivity</strong> - Quantify the impact of AI on development speed</li>
<li class="">✅ <strong>Avoid Common Pitfalls</strong> - Learn from real-world challenges and solutions</li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="getting-started">Getting Started<a href="https://austinxyz.github.io/blogs/blog/2025/12/14/claude-code-series-overview#getting-started" class="hash-link" aria-label="Direct link to Getting Started" title="Direct link to Getting Started" translate="no">​</a></h2>
<p>Ready to explore AI-assisted development? Start with <strong><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/15/claude-code-introduction">Part 1: Introduction - A New Programming Paradigm →</a></strong></p>
<p>Or jump to any part that interests you:</p>
<ul>
<li class=""><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/16/claude-code-overview">Part 2: Claude Code Overview →</a></li>
<li class=""><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/17/claude-code-development-walkthrough">Part 3: Development Walkthrough →</a></li>
<li class=""><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/18/claude-code-methodology-evolution">Part 4: Methodology Evolution →</a></li>
<li class=""><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/19/claude-code-use-cases-and-limitations">Part 5: Use Cases and Limitations →</a></li>
<li class=""><a class="" href="https://austinxyz.github.io/blogs/blog/2025/12/20/claude-code-conclusion">Part 6: Conclusion and Outlook →</a></li>
</ul>
<hr>
<p><strong>Version History</strong>:</p>
<ul>
<li class="">v1.0 (2025-12-14): Series overview and table of contents</li>
<li class="">v1.1 (2025-12-18): Initial series completed</li>
<li class="">v1.2 (2025-12-19): Refined with unified project statistics</li>
</ul>
<hr>
<p><em>This guide represents real-world experience with Claude Code over multiple months of development. Your mileage may vary based on project type, programming languages, and individual workflow preferences.</em></p>]]></content>
        <author>
            <name>Austin Xu</name>
            <email>austin.xyz@gmail.com</email>
            <uri>https://github.com/austinxyz</uri>
        </author>
        <category label="ai" term="ai"/>
        <category label="claude-code" term="claude-code"/>
        <category label="development-tools" term="development-tools"/>
        <category label="guide" term="guide"/>
    </entry>
</feed>