The Question
When does an experimentation culture go too far?
Should your team A/B test everything, or are there important product decisions where experimentation actively hurts? Where does a culture of testing help, and where does it create organizational paralysis?
Evidence from the Archive
- Airbnb search relevance: of 250 experiments, only 8% improved the target metric, yet collectively they drove a 6% revenue improvement
- Bing ad text change: moving the second line of ad text to the first line generated $100M/year in revenue, but the idea languished in the backlog because nobody's intuition flagged it
- Shopify maintains holdout groups for every experiment and automatically revisits results at 3, 6, 9, and 12 months
- Shopify: 30-40% of experiments that showed a short-term metric lift showed no incremental GMV lift a year later
- Duolingo: 600+ experiments on streaks in four years, roughly one every other day
- Duolingo: changing the CTA from 'Continue' to 'Commit to my goal' produced a massive retention win
- Google+ consumed millions of person-hours based on leadership conviction, failed, and was shut down in 2019
- Gmail's tabbed inbox was validated with a Wizard of Oz test (manually sorted emails behind an HTML facade) before any real engineering
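As a rough sanity check on the Airbnb figures above, the sketch below backs out what each winning experiment would have contributed on average. It assumes the winners had equal lifts that compound multiplicatively. Neither assumption comes from the source, and `per_win_lift` is a hypothetical helper name.

```python
def per_win_lift(total_lift: float, n_experiments: int, win_rate: float) -> tuple[int, float]:
    """Return (number of winners, average lift per winner).

    Assumption (not from the source): every winner contributes the same
    relative lift, and the lifts compound multiplicatively.
    """
    winners = round(n_experiments * win_rate)
    # Solve (1 + lift) ** winners == 1 + total_lift for lift.
    lift = (1 + total_lift) ** (1 / winners) - 1
    return winners, lift


winners, lift = per_win_lift(total_lift=0.06, n_experiments=250, win_rate=0.08)
print(f"{winners} winning experiments, about {lift:.2%} lift each")
```

Under these assumptions, 20 winners of roughly 0.3% each are enough to reach a 6% total. That is consistent with the point that individually unremarkable wins add up at scale.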
The Synthesis
The tension between Shopify's "ban KPIs for product teams" and Duolingo's "600 experiments on one feature" dissolves once you see they are solving different problems at different stages of the product lifecycle.
For new product development and bold strategic bets, testing creates a gravitational pull toward incrementalism. For optimization of proven products at scale, testing is unmatched.
The biggest failure mode is measuring experiments too soon. When 30-40% of short-term winners show no incremental impact a year later, the standard two-to-four-week experiment window is actively misleading. A change that looks like a 'loss' at two weeks might be a massive win at six months because it shifted user behavior in ways that compound.
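One way to make the "too soon" problem concrete is to compare an early readout with a later holdout readout for the same change. The sketch below is illustrative only. The `Readout` type, the `classify` helper, the 1% threshold, and all of the numbers are hypothetical. Only the 3, 6, 9, and 12 month revisit schedule comes from the text.

```python
from dataclasses import dataclass

CHECKPOINT_MONTHS = (3, 6, 9, 12)  # revisit schedule described in the text


@dataclass
class Readout:
    month: float           # months since launch (0.5 is roughly a two-week readout)
    treatment_mean: float  # metric per user in the exposed group
    holdout_mean: float    # metric per user in the long-term holdout


def relative_lift(r: Readout) -> float:
    """Relative difference between the exposed group and the holdout."""
    return r.treatment_mean / r.holdout_mean - 1


def classify(readouts: list[Readout], min_lift: float = 0.01) -> str:
    """Compare the earliest readout with the latest one.

    min_lift is an arbitrary illustrative threshold, not a significance test.
    """
    ordered = sorted(readouts, key=lambda r: r.month)
    early, late = relative_lift(ordered[0]), relative_lift(ordered[-1])
    if early >= min_lift and late < min_lift:
        return "short-term winner that faded"
    if early < min_lift and late >= min_lift:
        return "slow burner"
    if late >= min_lift:
        return "durable winner"
    return "no durable effect"


months = (0.5, *CHECKPOINT_MONTHS)

# Made-up data: a lift that looks strong at two weeks and decays.
faded = [Readout(m, t, 10.0) for m, t in zip(months, (10.5, 10.3, 10.1, 10.05, 10.02))]
# Made-up data: a change that looks like a loss early and compounds later.
slow = [Readout(m, t, 10.0) for m, t in zip(months, (9.9, 10.0, 10.2, 10.4, 10.6))]

print(classify(faded))
print(classify(slow))
```

A real pipeline would add confidence intervals at each checkpoint. The sketch only shows why a verdict taken from the first readout alone can point the wrong way in either direction.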
If you A/B test a transformative new feature against the status quo, the status quo almost always wins short-term because users prefer the familiar. This is precisely why Shopify bans KPIs for core product teams. They need freedom to build toward a vision that might look worse in week-two metrics.
The Bottom Line
Lenny's newsletter on fostering experimentation culture at Airbnb reinforces the case for testing from the infrastructure side. Riley Newman, Airbnb's 10th employee and a data scientist, described how aligning around a single north star metric (nights booked) "propelled the company's use of experiments." Rebecca Rosenfelt recalled experiment roundups where the audience voted on which variant won: "it shocked everyone that so often, the intuitive variant was not the winner. Part of a data-driven culture is the humility about not knowing the right answer."
The non-obvious insight comes from Shopify's long-term holdouts: the biggest failure mode is neither too much testing nor too little. It is measuring experiments too soon. A two-to-four-week readout can crown a winner that shows no incremental impact a year later, or write off a change whose benefit compounds over months.
Sources
- Ronny Kohavi — "The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)" — Lenny's Podcast, July 27, 2023
- Archie Abrams — "Breaking the rules of growth: Why Shopify bans KPIs, optimizes for churn, prioritizes intuition, and builds toward a 100-year vision | Archie Abrams (VP Product, Head of Growth at Shopify)" — Lenny's Podcast, November 7, 2024
- Jackson Shuttleworth — "Behind the product: Duolingo streaks | Jackson Shuttleworth (Group PM, Retention Team)" — Lenny's Podcast, December 15, 2024
- Itamar Gilad — "Becoming evidence-guided | Itamar Gilad (Gmail, YouTube, Microsoft)" — Lenny's Podcast, September 21, 2023