17 posts tagged with "ai"

Stacking OpenSpec and Superpowers: A Combined SDD Workflow

· 11 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

This is a follow-up to From Vibe Coding to Spec-Driven Development. That post documented introducing OpenSpec into an existing Finance project. This one covers a new project where I stacked OpenSpec with Superpowers from day one.

After three months of running OpenSpec on my Finance project, I'd formed a clear picture of what it's good at and where it struggles. On a personal wiki project I'd also been using Superpowers, and its brainstorming, TDD, and code-review skills were consistently paying off.

So I started a new project — a UTR-based tennis team lineup app (tennis-lineup) — specifically to run both tools together and see how they compose. This post is the report.

IaC and Kubernetes: The Two-Layer Control Plane for AI Native Infrastructure

· 11 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

This is Part 3 of a three-part series on AI Native Infrastructure. Part 1 covers GPU cluster management. Part 2 covers agent platform engineering. This post covers IaC and Kubernetes as the two-layer control plane that makes both work at scale.


At hyperscale, managing GPU infrastructure without IaC is not a workflow — it's a liability. Companies like Meta operate GPU clusters at a scale where configuration drift, firmware inconsistency, or an undocumented network topology change can silently degrade a week-long training run. IaC is how you make infrastructure state auditable, reviewable, and reproducible.
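
To make "auditable, reviewable, reproducible" concrete, here is a minimal sketch (not from the post, which discusses Terraform and HCL; this shows the same declarative idea in Python via Pulumi, with placeholder resource names and AMI id):

```python
# Minimal Pulumi sketch (Python). Terraform expresses the same idea in HCL;
# either way, the desired state lives in version control, so every change
# is a reviewable diff rather than undocumented drift.
import pulumi
import pulumi_aws as aws

gpu_node = aws.ec2.Instance(
    "gpu-training-node",
    ami="ami-0123456789abcdef0",   # placeholder AMI id
    instance_type="p4d.24xlarge",  # 8x A100 GPU instance type
    tags={"role": "training", "team": "ml-platform"},
)

pulumi.export("gpu_node_id", gpu_node.id)
```

Applied with `pulumi up` (or `terraform apply` in the HCL equivalent), the tool reconciles real infrastructure toward this declared state, and any divergence shows up as a diff instead of a surprise mid-training-run.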

But IaC alone isn't sufficient. It's worth asking: what exactly is Terraform managing? And what is it not managing?

The answer to that question reveals something important about how AI Native infrastructure actually needs to be governed — and why Kubernetes, despite not being designed for GPU workloads, remains the right runtime control plane for both the infrastructure layer and the application layer above it.

From Cloud Native Apps to AI Native Agent Platforms: The Belts Are the Problem

· 12 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

This is Part 2 of a three-part series on AI Native Infrastructure. Part 1 covers the infrastructure layer — GPU clusters, schedulers, and hardware platform management. This post covers the application platform layer. Part 3 covers IaC and Kubernetes as a two-layer control plane.


In the late 1800s, when electric motors arrived in factories, most factory owners did the obvious thing: they removed the steam engine in the basement and dropped an electric motor in its place. Same shafts. Same belts. Same building layout. For thirty years, productivity barely improved.

The motor wasn't the problem. The belts were.

The real breakthrough came when a new generation asked a different question: if every machine can have its own motor, why do we need belts at all? Without belts, factories could reorganize around the flow of work rather than the flow of power. The result was transformative — not because the motor was better than the steam engine, but because removing the constraint unlocked an entirely different architecture.

Sri Shivananda's recent piece uses this analogy to describe what's happening with AI adoption today. We have the motor. But most organizations are keeping the belts — plugging AI into existing ticketing workflows, existing PR queues, existing stage-gated planning cycles. The AI works. The surrounding system neutralizes it.

From Cloud Native to AI Native Infrastructure: An Infra Platform Engineer's Perspective

· 14 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

This is Part 1 of a three-part series on AI Native Infrastructure. Part 1 covers the infrastructure layer — GPU clusters, schedulers, and hardware platform management. Part 2 covers the application platform layer. Part 3 covers IaC and Kubernetes as a two-layer control plane.


I've spent the past several years running one of the larger Kubernetes deployments I know of — 200+ clusters, 5,000+ applications, 50,000 nodes, 2 million instances. When the AI wave hit and my team started getting serious about GPU infrastructure, I kept asking myself: how much of what we built actually transfers? Where do we have to start over?

This post is my attempt to answer that question honestly. It's not a technology comparison or a vendor evaluation. It's a practitioner's account of what Cloud Native taught me, where it fell short, and what AI Native infrastructure at the hardware and cluster management layer actually demands.

The AI-Augmented Engineering Manager: How I Run a Team in 2026

· 12 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

Everyone's talking about AI replacing individual contributors. Nobody's talking about what it does to engineering managers.

That asymmetry is interesting to me, because in my experience, EMs stand to gain more from AI than most ICs — or lose more ground if they ignore it. The difference isn't which tools you use. It's whether you use AI to reclaim the time that actually matters, or just use it to make your status updates look better.

Here's my honest accounting of what changed after a year of deliberately integrating AI into how I manage my team.

Twenty Years of Agile, One Year of AI — Here's What Survived

· 10 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

I grew up as a developer reading Martin Fowler and Kent Beck. The Agile Manifesto, the refactoring patterns, test-driven development — these weren't just methodologies I was handed. They were the lens through which I learned to think about software quality, team dynamics, and sustainable delivery.

Now I'm spending significant time with AI coding tools — Vibe Coding, Claude Code, spec-driven workflows — and a question keeps surfacing: do these principles still apply?

My answer, after a hands-on 50K-line project experiment, is yes. Not only do they apply — several of them become load-bearing pillars in an AI-augmented workflow.

[7/6] Claude Code: From Vibe Coding to Spec-Driven Development

· 13 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

This is an extended chapter of the 6-part Claude Code series. The first six chapters documented building a full-stack Finance app using Vibe Coding. This chapter covers what came next.

Those six chapters told the complete story of using Claude Code for Vibe Coding — building a full-stack application from scratch and accumulating 40,000 lines of code. Vibe Coding delivered incredible speed, but as the project grew, a structural problem emerged:

AI writes code fast. AI also goes off-track fast.

When you describe a requirement in one sentence, AI might understand 70% of it and then sprint full-speed in that direction for two hours — only for you to realize the core logic is wrong and have to start over.

This isn't theoretical. Before adopting SDD, my real pain points in the Finance project were:

  • Unstructured workflow: I had to remind AI to organize requirements before writing code; otherwise it jumped straight to implementation
  • Missing design documentation: architectural issues only surfaced after implementation, making course corrections expensive
  • Inconsistent code quality: the same requirement could produce wildly different code quality across sessions
  • Tests routinely skipped: Vibe Coding tends toward "get it running first," making tests optional
  • Slow debugging: without clear task boundaries, bugs were hard to locate, and the back-and-forth with AI was inefficient

This chapter documents a methodology upgrade experiment: introducing Spec-Driven Development (SDD) into the Finance project using OpenSpec, completing three new features, and comparing results against prior Vibe Coding work.

No Junior Engineers? What AI Really Means for Early-Career Developers

· 7 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

There's a narrative spreading through the industry right now: AI is eliminating junior engineering roles, and early-career developers are the first casualties of the automation wave.

After years of interviewing candidates and leading engineering teams, I think this narrative is half right — and dangerously incomplete.

Does AI/Vibe Coding Really Deliver 10x Productivity?

· 9 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

In early 2026, Anthropic published a case study: 16 Claude agents, working in parallel Docker containers, wrote 100,000 lines of Rust code in a few weeks — a C compiler that could successfully compile the Linux kernel. The API bill came to roughly $20,000. By almost any measure, it was an extraordinary result.

Then I mentioned it to a friend of mine, a CTO at a small startup. His response: "The best strategy right now is probably to wait."

That tension — between a genuine technical milestone and a seasoned engineer's skepticism — is what this post is really about.

Taming AI Agent Uncertainty: What Resume Screening Taught Me About Reliability

· 9 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

Same resume. Same job description. Two different scores: 78/100, then 68/100.

I had built a resume-jd-matcher agent to automate candidate screening. On a whim, I ran the same resume through it twice. The inconsistency wasn't just frustrating — it was dangerous. How could I trust hiring decisions based on unpredictable evaluations? How could I ensure fairness to candidates when the same resume might score differently depending on when it was assessed?

The core challenge: AI agents complete tasks differently than traditional programs. They're probabilistic, not deterministic. The same input can produce different outputs due to sampling and contextual variations. In many ways, AI behaves more like human judgment than code execution.
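
To illustrate the variance problem (a minimal sketch, not the resume-jd-matcher's actual code): one standard mitigation is to treat each evaluation as a sample and aggregate several runs. `score_resume` below is a hypothetical stand-in for the LLM call, with noise simulated to mimic the 78-then-68 behavior:

```python
import random
import statistics

def score_resume(resume: str, jd: str) -> int:
    """Hypothetical stand-in for the LLM scoring call.

    The real agent sends the resume and job description to a model;
    noise is simulated here to mimic the run-to-run variance described
    above (e.g. 78 on one run, 68 on the next)."""
    true_fit = 73  # pretend latent "true" match score
    return true_fit + random.randint(-6, 6)

def stable_score(resume: str, jd: str, runs: int = 5) -> float:
    # Treat each evaluation as a sample: the median over several runs
    # damps sampling noise, trading extra latency and API cost for a
    # more reproducible score.
    return statistics.median(score_resume(resume, jd) for _ in range(runs))

print(stable_score("<resume text>", "<job description>"))
```

Aggregation narrows the spread but doesn't eliminate it, which is exactly why agent reliability has to be engineered rather than assumed.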