Metrics
4 guests, 6 episodes, 2,397 words

The Question

When does an experimentation culture go too far?

Should your team A/B test everything, or are there important product decisions where experimentation actively hurts? Where does a culture of testing help, and where does it create organizational paralysis?

Airbnb, Microsoft, Amazon

Airbnb search relevance: of 250 experiments, only 8% improved the target metric, yet together they drove a 6% revenue improvement

Bing ad text change: moving the second line of ad text up to the first line generated $100M/year in revenue; the idea had languished in the backlog because nobody's intuition flagged it
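A rough back-of-the-envelope reading of the Airbnb numbers, as a sketch only. It assumes the winning experiments contributed equally and that their lifts compound multiplicatively; neither assumption comes from the source, and the function names are hypothetical.

```python
def winning_experiments(total_experiments: int, win_rate: float) -> int:
    """Number of experiments that improved the target metric."""
    return round(total_experiments * win_rate)


def average_lift_per_winner(total_lift: float, winners: int) -> float:
    """Per-winner lift, ASSUMING equal contributions that compound multiplicatively."""
    return (1 + total_lift) ** (1 / winners) - 1


# Figures from the text: 250 experiments, 8% winners, 6% total revenue lift.
winners = winning_experiments(250, 0.08)
per_winner = average_lift_per_winner(0.06, winners)
print(f"{winners} winners, about {per_winner:.2%} lift each")
```

Under those assumptions each winner is worth roughly 0.3%, which is small enough that intuition alone would struggle to tell winners from losers.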

Shopify

Shopify maintains holdout groups for every experiment and automatically revisits results at 3, 6, 9, and 12 months

30-40% of experiments that showed a short-term metric lift showed no incremental GMV lift a year later
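One way a long-term holdout with scheduled revisits could be wired up is sketched below. This is not Shopify's system: the 5% holdout fraction, the hash-based bucketing, and every function name are assumptions for illustration. Only the 3, 6, 9, and 12 month revisit cadence comes from the text.

```python
import calendar
import hashlib
from datetime import date

HOLDOUT_FRACTION = 0.05          # assumed; the source gives no holdout size
REVISIT_MONTHS = (3, 6, 9, 12)   # cadence described in the text


def in_holdout(user_id: str, experiment: str,
               fraction: float = HOLDOUT_FRACTION) -> bool:
    """Deterministically keep a stable slice of users on the old experience."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return bucket < fraction


def add_months(start: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of the month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def revisit_dates(ship_date: date) -> list[date]:
    """Dates on which the holdout comparison should be re-run."""
    return [add_months(ship_date, m) for m in REVISIT_MONTHS]


print(revisit_dates(date(2024, 11, 30)))
```

Hashing the experiment name together with the user ID keeps each user's assignment stable for the full year while giving different experiments independent holdout groups.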

Duolingo

600+ experiments on streaks in four years, roughly one every other day

Changing CTA from 'Continue' to 'Commit to my goal' produced a massive retention win

Gmail, YouTube, Microsoft

Google+ consumed millions of person-hours based on leadership conviction, failed, and was shut down in 2019

Gmail tabbed inbox was validated with a Wizard of Oz test (manually sorted emails behind a facade of HTML) before any real engineering

The Synthesis

The tension between Shopify's "ban KPIs for product teams" and Duolingo's "600 experiments on one feature" dissolves once you see they are solving different problems at different stages of the product lifecycle.

01 Lifecycle Dependency
Why do Shopify's and Duolingo's opposite approaches both work?

For new product development and bold strategic bets, testing creates gravitational pull toward incrementalism -- the status quo almost always wins short-term. For optimization of proven products at scale, testing is unmatched. The tension dissolves once you see they solve different problems at different lifecycle stages.

02 Measurement Timing
What is the biggest failure mode in experimentation?

The biggest failure mode is measuring experiments too soon. When 30-40% of short-term winners show no incremental impact a year later, the standard two-to-four-week experiment window is actively misleading. A change that looks like a 'loss' at two weeks might be a massive win at six months because it shifted user behavior in ways that compound.

03 Incrementalism Gravity
Can A/B testing actually kill transformative ideas?

If you A/B test a transformative new feature against the status quo, the status quo almost always wins short-term because users prefer the familiar. This is precisely why Shopify bans KPIs for core product teams. They need freedom to build toward a vision that might look worse in week-two metrics.

Which Approach Fits You?

Three questions about your situation point to the right approach:

  1. What type of product decision are you making?
  2. How long do you typically run experiments before declaring results?
  3. Do you have enough traffic for statistically significant experiments?
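The traffic question can be made concrete with a standard sample-size estimate for comparing two conversion rates. The sketch below uses the usual normal-approximation formula for a two-sided test. The 5% significance level, 80% power, and the example baseline and lift are conventional or illustrative choices, not figures from the source.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Illustrative inputs: 5% baseline conversion, hoping to detect a 10% relative lift.
print(sample_size_per_arm(0.05, 0.10))
```

With these example inputs the answer is on the order of tens of thousands of users per variant. Halving the detectable lift roughly quadruples the requirement, which is why low-traffic products often cannot test small changes at all.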

Notable Absences

The Bottom Line

Lenny's newsletter on fostering experimentation culture at Airbnb reinforces this from the infrastructure side. Riley Newman, Airbnb's 10th employee and a data scientist, described how aligning around a single north star metric (nights booked) "propelled the company's use of experiments." And Rebecca Rosenfelt recalled experiment roundups where the audience voted on which variant won -- "it shocked everyone that so often, the intuitive variant was not the winner. Part of a data-driven culture is the humility about not knowing the right answer."

The non-obvious insight comes from Shopify's long-term holdouts: the biggest failure mode is not too much testing or too little. It is measuring experiments too soon. When 30-40% of short-term winners show no incremental impact a year later, the standard two-to-four-week experiment window is actively misleading. A change that looks like a "loss" at two weeks might be a massive win at six months because it shifted user behavior in ways that compound over time.
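To show what comparing an early readout with a late one might look like, here is a minimal sketch that labels an experiment by how its lift against the holdout changes over time. The `Readout` type, the 1% threshold, the labels, and the numbers are all hypothetical, and the sketch ignores statistical significance entirely.

```python
from dataclasses import dataclass


@dataclass
class Readout:
    days_since_launch: int
    treatment_metric: float  # e.g. average GMV per user with the change
    holdout_metric: float    # same metric for the holdout group


def relative_lift(readout: Readout) -> float:
    return readout.treatment_metric / readout.holdout_metric - 1


def classify(readouts: list[Readout], min_lift: float = 0.01) -> str:
    """Compare the earliest and latest readouts against an assumed threshold."""
    ordered = sorted(readouts, key=lambda r: r.days_since_launch)
    early = relative_lift(ordered[0])
    late = relative_lift(ordered[-1])
    if early >= min_lift and late < min_lift:
        return "faded winner"    # looked good at first, no lasting impact
    if early < min_lift and late >= min_lift:
        return "late bloomer"    # looked flat or negative at first, compounded later
    if early >= min_lift and late >= min_lift:
        return "durable winner"
    return "no effect"


# Made-up numbers for illustration only.
print(classify([Readout(14, 103.0, 100.0), Readout(365, 100.2, 100.0)]))
print(classify([Readout(14, 99.0, 100.0), Readout(365, 108.0, 100.0)]))
```

A two-to-four-week window only ever sees the first readout, so it cannot distinguish a faded winner from a durable one, or a late bloomer from a genuine loss.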

  1. Ronny Kohavi, "The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)", Lenny's Podcast, July 27, 2023
  2. Archie Abrams, "Breaking the rules of growth: Why Shopify bans KPIs, optimizes for churn, prioritizes intuition, and builds toward a 100-year vision | Archie Abrams (VP Product, Head of Growth at Shopify)", Lenny's Podcast, November 7, 2024
  3. Jackson Shuttleworth, "Behind the product: Duolingo streaks | Jackson Shuttleworth (Group PM, Retention Team)", Lenny's Podcast, December 15, 2024
  4. Itamar Gilad, "Becoming evidence-guided | Itamar Gilad (Gmail, YouTube, Microsoft)", Lenny's Podcast, September 21, 2023