
8 posts tagged with "cloud-computing"


DevOps Is a Culture, Not a Team: What I've Learned Building at Scale

13 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

Every organization that has gone through a "DevOps transformation" in the last decade has a story. Most of those stories end the same way: they hired a DevOps team, bought a set of tools, and then wondered why things didn't meaningfully change.

I've been building and running infrastructure at scale for 20 years — from private cloud on OpenStack at eBay to managing 200+ Kubernetes clusters, 50,000 nodes, and 5,000+ applications. If there's one thing I've learned, it's that the most common implementation of DevOps is actually an anti-pattern.

Let me explain what I mean.

IaC and Kubernetes: The Two-Layer Control Plane for AI Native Infrastructure

11 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

This is Part 3 of a three-part series on AI Native Infrastructure. Part 1 covers GPU cluster management. Part 2 covers agent platform engineering. This post covers IaC and Kubernetes as the two-layer control plane that makes both work at scale.


At hyperscale, managing GPU infrastructure without IaC is not a workflow — it's a liability. Companies like Meta operate GPU clusters at a scale where configuration drift, firmware inconsistency, or an undocumented network topology change can silently degrade a week-long training run. IaC is how you make infrastructure state auditable, reviewable, and reproducible.

But IaC alone isn't sufficient. It's worth asking: what exactly is Terraform managing? And what is it not managing?

The answer to that question reveals something important about how AI Native infrastructure actually needs to be governed — and why Kubernetes, despite not being designed for GPU workloads, remains the right runtime control plane for both the infrastructure layer and the application layer above it.
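To make "auditable, reviewable, reproducible" concrete: at its core, any IaC tool holds a declared desired state in version control and surfaces drift against observed reality. Below is a minimal conceptual sketch in Python, not Terraform's actual model; the `NodePool` type and both states are hypothetical illustrations.

```python
from dataclasses import dataclass, asdict

# Hypothetical desired state for a GPU node pool, as it would live in
# version control. In real Terraform this would be HCL; the point is
# that it is declared, reviewable, and diffable.
@dataclass(frozen=True)
class NodePool:
    name: str
    machine_type: str
    gpu_type: str
    gpus_per_node: int
    node_count: int

desired = NodePool("training-a100", "a2-highgpu-8g", "nvidia-a100", 8, 64)

# Observed state, e.g. as reported by the cloud provider's API.
# Hard-coded here to simulate configuration drift.
observed = NodePool("training-a100", "a2-highgpu-8g", "nvidia-a100", 8, 62)

def drift(desired: NodePool, observed: NodePool) -> dict:
    """Return the fields where reality has drifted from the declared state."""
    d, o = asdict(desired), asdict(observed)
    return {k: (o[k], d[k]) for k in d if d[k] != o[k]}

if __name__ == "__main__":
    for field, (actual, wanted) in drift(desired, observed).items():
        print(f"{field}: observed {actual}, declared {wanted}")
    # -> node_count: observed 62, declared 64
```

In this framing, Terraform-style tools answer "what should exist" at the hardware-facing layer; the continuous runtime reconciliation of what actually runs on that hardware is the part the post hands to Kubernetes.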

From Cloud Native Apps to AI Native Agent Platforms: The Belts Are the Problem

12 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

This is Part 2 of a three-part series on AI Native Infrastructure. Part 1 covers the infrastructure layer — GPU clusters, schedulers, and hardware platform management. This post covers the application platform layer. Part 3 covers IaC and Kubernetes as a two-layer control plane.


In the late 1800s, when electric motors arrived in factories, most factory owners did the obvious thing: they removed the steam engine in the basement and dropped an electric motor in its place. Same shafts. Same belts. Same building layout. For thirty years, productivity barely improved.

The motor wasn't the problem. The belts were.

The real breakthrough came when a new generation asked a different question: if every machine can have its own motor, why do we need belts at all? Without belts, factories could reorganize around the flow of work rather than the flow of power. The result was transformative — not because the motor was better than the steam engine, but because removing the constraint unlocked an entirely different architecture.

Sri Shivananda's recent piece uses this analogy to describe what's happening with AI adoption today. We have the motor. But most organizations are keeping the belts — plugging AI into existing ticketing workflows, existing PR queues, existing stage-gated planning cycles. The AI works. The surrounding system neutralizes it.

From Cloud Native to AI Native Infrastructure: An Infra Platform Engineer's Perspective

14 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

This is Part 1 of a three-part series on AI Native Infrastructure. Part 1 covers the infrastructure layer — GPU clusters, schedulers, and hardware platform management. Part 2 covers the application platform layer. Part 3 covers IaC and Kubernetes as a two-layer control plane.


I've spent the past several years running one of the larger Kubernetes deployments I know of — 200+ clusters, 5,000+ applications, 50,000 nodes, 2 million instances. When the AI wave hit and my team started getting serious about GPU infrastructure, I kept asking myself: how much of what we built actually transfers? Where do we have to start over?

This post is my attempt to answer that question honestly. It's not a technology comparison or a vendor evaluation. It's a practitioner's account of what Cloud Native taught me, where it fell short, and what AI Native infrastructure at the hardware and cluster management layer actually demands.

How Ops Engineers Can Stay Relevant in the Age of AI: Becoming a Platform Engineer

9 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

Two engineers. Two hundred clusters. Fifty thousand nodes. Two million instances. Every year, two major Kubernetes version upgrades across the entire fleet — with zero incidents.

That's not a team of twenty. That's two people. And the reason it was possible isn't the tooling. It's the way we thought about the problem.

After years building Cloud Platform at a large e-commerce company and interviewing dozens of engineers for Platform roles, I've noticed a pattern. Most candidates who call themselves "DevOps" or "Cloud Operations" engineers are skilled, hardworking, and technically capable. But there's a fundamental difference in how they think — and that difference determines whether you're managing problems forever, or systematically eliminating them.
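As one concrete illustration of that mindset gap, here is a deliberately simplified Python sketch of a wave-based fleet upgrade. Everything in it is hypothetical (the inventory, the `upgrade` call, the `healthy` gate); the point is that the rollout policy is encoded once as reviewed code, so upgrading two hundred clusters is the same operation as upgrading one.

```python
# Hypothetical fleet; in practice this would come from an inventory service.
FLEET = [f"cluster-{i:03d}" for i in range(200)]

def upgrade(cluster: str, version: str) -> None:
    """Placeholder for the real upgrade call (e.g. a cluster-API client)."""
    print(f"upgrading {cluster} to {version}")

def healthy(cluster: str) -> bool:
    """Placeholder health gate: SLO probes, workload checks, error budgets."""
    return True

def rollout(version: str, waves=(1, 10, 50)) -> None:
    """Upgrade in progressively larger waves, halting on the first failure."""
    remaining = list(FLEET)
    sizes = list(waves) + [len(remaining)]  # final wave covers the rest
    for size in sizes:
        wave, remaining = remaining[:size], remaining[size:]
        for cluster in wave:
            upgrade(cluster, version)
        # A real rollout would soak here before checking the health gate.
        if not all(healthy(c) for c in wave):
            raise RuntimeError(f"halting rollout: unhealthy cluster in wave of {size}")
        if not remaining:
            break

rollout("v1.31.2")
```

An operations mindset runs two hundred upgrades by hand; a platform mindset reviews this policy once and lets it run.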

Twenty Years of Agile, One Year of AI — Here's What Survived

10 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

I grew up as a developer reading Martin Fowler and Kent Beck. The Agile Manifesto, the refactoring patterns, test-driven development — these weren't just methodologies I was handed. They were the lens through which I learned to think about software quality, team dynamics, and sustainable delivery.

Now I'm spending significant time with AI coding tools — Vibe Coding, Claude Code, spec-driven workflows — and a question keeps surfacing: do these principles still apply?

My answer, after a hands-on 50K-line project experiment, is yes. Not only do they apply — several of them become load-bearing pillars in an AI-augmented workflow.

Does AI/Vibe Coding Really Deliver 10x Productivity?

9 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

In early 2026, Anthropic published a case study: 16 Claude agents, working in parallel Docker containers, wrote 100,000 lines of Rust code in a few weeks — a C compiler that could successfully compile the Linux kernel. The API bill came to roughly $20,000. By almost any measure, it was an extraordinary result.

Then I mentioned it to a friend of mine, a CTO at a small startup. His response: "The best strategy right now is probably to wait."

That tension — between a genuine technical milestone and a seasoned engineer's skepticism — is what this post is really about.

20 Years of Platform Engineering: Lessons from Building Cloud at Scale

7 min read
Austin Xu
Cloud Platform Engineering Leader @ eBay

Looking back at 20 years in platform engineering feels both humbling and exhilarating. From building RAD tools for web applications in 2000 to managing Kubernetes clusters with 2 million pods today, the journey has been one of continuous learning, adaptation, and growth. This is my story of building platforms at scale, and the lessons I've learned along the way.