Ship It Broken: Why the Best AI Teams Trust the Model More Than Their Guardrails
Should AI products prioritize speed of iteration or reliability?
Six weeks. That's how long it took Intercom to go from "ChatGPT just launched" to a working AI agent prototype. That prototype became Fin, which is on track to cross $100M ARR in under three quarters. Meanwhile, some of their competitors are still building evaluation frameworks.
This isn't a story about recklessness. It's a story about a fundamental shift in what "quality" means when the underlying technology improves on its own. In traditional software, code you ship broken stays broken until you fix it. In AI, shipping an imperfect model today means shipping a better model tomorrow -- if you've built the infrastructure to swap models in. The teams that understood this distinction first are now winning.
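That model-swapping infrastructure can be as thin as a single interface the rest of the product depends on. A minimal sketch, assuming a provider-agnostic adapter pattern (the names `ModelClient`, `StubModelV1`, and `answer_ticket` are illustrative, not any company's actual architecture):

```python
class ModelClient:
    """Provider-agnostic interface; each model version gets an adapter."""
    def complete(self, prompt: str) -> str:
        raise NotImplementedError

class StubModelV1(ModelClient):
    def complete(self, prompt: str) -> str:
        return f"v1 answer to: {prompt}"

class StubModelV2(ModelClient):
    def complete(self, prompt: str) -> str:
        return f"v2 answer to: {prompt}"

# The product code depends only on ModelClient, so upgrading to a
# better model is a one-line change here, not a rewrite.
ACTIVE_MODEL: ModelClient = StubModelV2()

def answer_ticket(question: str) -> str:
    return ACTIVE_MODEL.complete(question)
```

The point is that the adapters are disposable scaffolding; the interface is the only piece that needs to survive a model upgrade.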
The question is urgent because the AI product landscape is splitting into two camps. One camp is shipping aggressively, accepting imperfection, and iterating in public -- trusting that model improvements will fix today's rough edges. The other camp is building elaborate evaluation frameworks, guardrail systems, and testing infrastructure before letting users anywhere near the product. Both camps have billion-dollar examples on their side. Both camps also have spectacular failures.
Should AI products prioritize shipping speed (accepting imperfection and iterating publicly) or invest in reliability infrastructure first (evals, guardrails, testing) before going to market?
Evidence from the Archive
- Intercom building a working Fin prototype in six weeks after GPT-3.5 launched
- Early Fin economics, where each resolution billed at 99 cents cost $1.20 to deliver, backed by conviction that model costs would drop
- Bolt/StackBlitz working on their product for seven years until Sonnet 3.5 made it suddenly viable
- OpenAI's Deep Research product, where evals were co-designed with the product from day one
- Chip Huyen's viral chart contrasting what people think improves AI apps with what actually does
- Companies rationally choosing to build a new feature rather than spend eval effort pushing an existing feature from 80% to 82% accuracy
- Ramp processing over $10 billion in spending on the platform while maintaining a velocity-first culture
- A five-person team building a Bill.com competitor in three months that moves billions of dollars per year
The Synthesis
The speed-vs-reliability framing is a false dichotomy, but not in the wishy-washy "you need both" sense. The real insight is that speed and reliability operate on different timescales in AI products, and the timescale determines which to prioritize. In the first weeks and months, speed dominates: the model will improve and your scaffolding will become obsolete. Once you have shipped and found users, the equation flips, and you need evals for your core path. The right answer is temporal, not balanced.
Companies get this wrong in both directions: building elaborate eval frameworks before they have users (over-investing in reliability), or scaling to millions of users with no understanding of their failure modes (under-investing). The rule of thumb: pre-launch, optimize for speed; post-launch with real users, invest in evals for the top user journey immediately.
In traditional software, reliability means the same input produces the same output. In AI products, reliability means the output stays within an acceptable quality range for the user's context. Your eval infrastructure therefore needs to measure quality distributions, not binary pass/fail, and "good enough for this use case" is a legitimate reliability standard.
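What a distribution-based eval looks like in practice: score each output on a 0.0-1.0 quality scale, then judge the run by its score distribution against a use-case-specific floor, rather than counting exact-match passes. A toy sketch (the overlap scorer and the `floor`/`min_pass_rate` thresholds are placeholder assumptions; real evals would use an LLM judge or task rubric):

```python
import statistics

def quality_score(output: str, reference: str) -> float:
    """Toy scorer: word overlap with a reference answer, in [0.0, 1.0]."""
    out, ref = set(output.lower().split()), set(reference.lower().split())
    return len(out & ref) / len(ref) if ref else 0.0

def eval_run(outputs, references, floor=0.6, min_pass_rate=0.9):
    """Judge a run by its quality distribution, not binary pass/fail:
    'good enough for this use case' means at least min_pass_rate of
    outputs score above the floor."""
    scores = [quality_score(o, r) for o, r in zip(outputs, references)]
    pass_rate = sum(s >= floor for s in scores) / len(scores)
    return {
        "mean": statistics.mean(scores),
        "worst": min(scores),
        "pass_rate": pass_rate,
        "acceptable": pass_rate >= min_pass_rate,
    }
```

Tightening or loosening `floor` and `min_pass_rate` per use case is exactly how "good enough for this use case" becomes an operational standard instead of a shrug.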
Which Approach Fits You?
Three questions determine which approach fits your situation:
- What triggered your AI product initiative?
- What is your domain's error tolerance?
- How fast are your competitors moving?
The Bottom Line
**Prioritize speed before you have users; invest in reliability for your core user journey the moment real users arrive.** And build that reliability around quality distributions and "good enough for this use case" thresholds, not traditional same-input-same-output guarantees, so the infrastructure evolves with the model rather than just with your code. The right answer is temporal, not balanced.
Sources
- Kevin Weil — "OpenAI’s CPO on how AI changes must-have skills, moats, coding, startup playbooks, more | Kevin Weil (CPO at OpenAI, ex-Instagram, Twitter)" — Lenny's Podcast, April 10, 2025
- Chip Huyen — "AI Engineering 101 with Chip Huyen (Nvidia, Stanford, Netflix)" — Lenny's Podcast, October 23, 2025
- Eoghan McCabe — "How Intercom rose from the ashes by betting everything on AI | Eoghan McCabe (founder and CEO)" — Lenny's Podcast, August 21, 2025
- Geoff Charles — "Velocity over everything: How Ramp became the fastest-growing SaaS startup of all time | Geoff Charles (VP of Product)" — Lenny's Podcast, August 6, 2023