Ship It Broken: Why the Best AI Teams Trust the Model More Than Their Guardrails
Should AI products prioritize speed of iteration or reliability?
Six weeks. That's how long it took Intercom to go from "ChatGPT just launched" to a working AI agent prototype. That prototype became Fin, which is on track to cross $100M ARR in under three quarters. Meanwhile, some of their competitors are still building evaluation frameworks.
This isn't a story about recklessness. It's a story about a fundamental shift in what "quality" means when the underlying technology improves on its own. In traditional software, code you ship broken stays broken until you fix it. In AI, shipping an imperfect model today means shipping a better model tomorrow -- if you've built the infrastructure to swap models in. The teams that understood this distinction first are now winning.
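That model-swapping infrastructure can be as thin as a single interface the rest of the product depends on. A minimal sketch, assuming a provider-agnostic adapter pattern (the names `ModelClient`, `StubModelV1`, and `answer_ticket` are illustrative, not any company's actual architecture):

```python
class ModelClient:
    """Provider-agnostic interface; each model version gets an adapter."""
    def complete(self, prompt: str) -> str:
        raise NotImplementedError

class StubModelV1(ModelClient):
    def complete(self, prompt: str) -> str:
        return f"v1 answer to: {prompt}"

class StubModelV2(ModelClient):
    def complete(self, prompt: str) -> str:
        return f"v2 answer to: {prompt}"

# The product code depends only on ModelClient, so upgrading to a
# better model is a one-line change here, not a rewrite.
ACTIVE_MODEL: ModelClient = StubModelV2()

def answer_ticket(question: str) -> str:
    return ACTIVE_MODEL.complete(question)
```

The point is that the adapters are disposable scaffolding; the interface is the only piece that needs to survive a model upgrade.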
The question is urgent because the AI product landscape is splitting into two camps. One camp is shipping aggressively, accepting imperfection, and iterating in public -- trusting that model improvements will fix today's rough edges. The other camp is building elaborate evaluation frameworks, guardrail systems, and testing infrastructure before letting users anywhere near the product. Both camps have billion-dollar examples on their side. Both camps also have spectacular failures.
Should AI products prioritize shipping speed (accepting imperfection and iterating publicly) or invest in reliability infrastructure first (evals, guardrails, testing) before going to market?
Evidence from the Archive
- Intercom building a working Fin prototype in six weeks after GPT-3.5 launched
- Early Fin economics, where each resolution billed at 99 cents cost $1.20 to deliver, backed by conviction that model costs would drop
- Bolt/StackBlitz working on their product for seven years until Sonnet 3.5 made it suddenly viable
- OpenAI's Deep Research product, where evals were co-designed with the product from day one
- Chip Huyen's viral chart contrasting what people think improves AI apps with what actually does
- Companies rationally choosing to build a new feature rather than spend eval effort pushing an existing feature from 80% to 82% accuracy
- Ramp processing over $10 billion in spending on the platform while maintaining a velocity-first culture
- A five-person team building a Bill.com competitor in three months that moves billions of dollars per year
The Synthesis
The speed-vs-reliability framing is a false dichotomy, but not in the wishy-washy "you need both" sense. The real insight is that speed and reliability operate on different timescales in AI products, and the timescale determines which to prioritize. In the first weeks and months, speed dominates: the model will improve and your scaffolding will become obsolete. Once you have shipped and found users, the equation flips, and you need evals for your core path. The right answer is temporal, not balanced.
Companies get this wrong in both directions: building elaborate eval frameworks before they have users (over-investing in reliability), or scaling to millions of users with no understanding of their failure modes (under-investing). The rule of thumb: pre-launch, optimize for speed; post-launch with real users, invest in evals for the top user journey immediately.
In traditional software, reliability means the same input produces the same output. In AI products, reliability means the output stays within an acceptable quality range for the user's context. Your eval infrastructure therefore needs to measure quality distributions, not binary pass/fail, and "good enough for this use case" is a legitimate reliability standard.
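What a distribution-based eval looks like in practice: score each output on a 0.0-1.0 quality scale, then judge the run by its score distribution against a use-case-specific floor, rather than counting exact-match passes. A toy sketch (the overlap scorer and the `floor`/`min_pass_rate` thresholds are placeholder assumptions; real evals would use an LLM judge or task rubric):

```python
import statistics

def quality_score(output: str, reference: str) -> float:
    """Toy scorer: word overlap with a reference answer, in [0.0, 1.0]."""
    out, ref = set(output.lower().split()), set(reference.lower().split())
    return len(out & ref) / len(ref) if ref else 0.0

def eval_run(outputs, references, floor=0.6, min_pass_rate=0.9):
    """Judge a run by its quality distribution, not binary pass/fail:
    'good enough for this use case' means at least min_pass_rate of
    outputs score above the floor."""
    scores = [quality_score(o, r) for o, r in zip(outputs, references)]
    pass_rate = sum(s >= floor for s in scores) / len(scores)
    return {
        "mean": statistics.mean(scores),
        "worst": min(scores),
        "pass_rate": pass_rate,
        "acceptable": pass_rate >= min_pass_rate,
    }
```

Tightening or loosening `floor` and `min_pass_rate` per use case is exactly how "good enough for this use case" becomes an operational standard instead of a shrug.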
Which Approach Fits You?
Three questions determine which approach fits your situation:
- What triggered your AI product initiative?
- What is your domain's error tolerance?
- How fast are your competitors moving?
The Bottom Line
**Prioritize speed before you have users; invest in reliability for your core user journey the moment real users arrive.** And build that reliability around quality distributions and "good enough for this use case" thresholds, not traditional same-input-same-output guarantees, so the infrastructure evolves with the model rather than just with your code. The right answer is temporal, not balanced.
Sources
- Kevin Weil — "OpenAI’s CPO on how AI changes must-have skills, moats, coding, startup playbooks, more | Kevin Weil (CPO at OpenAI, ex-Instagram, Twitter)" — Lenny's Podcast, April 10, 2025
- Chip Huyen — "AI Engineering 101 with Chip Huyen (Nvidia, Stanford, Netflix)" — Lenny's Podcast, October 23, 2025
- Eoghan McCabe — "How Intercom rose from the ashes by betting everything on AI | Eoghan McCabe (founder and CEO)" — Lenny's Podcast, August 21, 2025
- Geoff Charles — "Velocity over everything: How Ramp became the fastest-growing SaaS startup of all time | Geoff Charles (VP of Product)" — Lenny's Podcast, August 6, 2023