The Question

When does an experimentation culture go too far?

Should your team A/B test everything, or are there important product decisions where experimentation actively hurts? Where does a culture of testing help, and where does it create organizational paralysis?

The 4 Positions

Test Everything -- But Accept That Most Ideas Will Fail

Ronny Kohavi

Ronny Kohavi, widely regarded as the world's leading expert on A/B testing, built experimentation platforms at Microsoft (running 20,000-25,000 experiments per year), Amazon, and Airbnb. He is the author of the definitive textbook Trustworthy Online Controlled Experiments. His position is both the strongest case for experimentation and the most honest about what it reveals: human intuition about product outcomes is reliably terrible.

The failure rates are humbling and consistent across the industry:

"Overall at Microsoft, about 66%, two thirds of ideas fail. At Bing, which is a much more optimized domain after we've been optimizing it for a while, the failure rate was around 85%. And then at Airbnb, this 92% number is the highest failure rate that I've observed."
Ronny Kohavi
▶ 00:13:31

These are not random ideas from junior employees. These are carefully researched, debated, designed, and built ideas from experienced teams at world-class companies. Booking, Google Ads, and others report similar 80-90% failure rates.

Kohavi's canonical example crystallizes why testing matters. Someone at Bing proposed moving the second line of ad text to the first line -- a "meh" idea that languished in the backlog for months. An engineer eventually just implemented it and launched the experiment. It triggered a revenue alarm:

"That simple idea increased revenue by about 12%. And this is something that just doesn't happen. We can talk later about Wyman's law, but that was the first reaction, which is, 'This is too good to be true. Let's find a bug.' And we did. And we looked for several times, and we replicated the experiment several times, and there was nothing wrong with it. This thing was worth $100 million at the time when Bing was a lot smaller."
Ronny Kohavi
▶ 00:06:37

The biggest revenue idea in Bing's entire history, and nobody rated it as a priority. Kohavi is unsparing about the implications:

"We are often humbled by how bad we are at predicting the outcome of experiments."
Ronny Kohavi
▶ 00:08:34

But Kohavi rejects the charge that experimentation leads only to micro-optimizations. He frames it as portfolio management -- you need incremental experiments compounding over time, and you need high-risk bets:

"I don't think it's possible to experiment too much. You have to allocate sometimes to these high risk, high reward ideas. We're going to try something that's most likely to fail. But if it does win, it's going to be a home run."
Ronny Kohavi
▶ 00:00:22

At Airbnb, 250 experiments in search relevance collectively drove a 6% revenue improvement -- despite 92% of them individually failing to move the target metric. The gains came inch by inch.
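
As a rough illustration of the "inch by inch" arithmetic, the sketch below assumes that the 8% of experiments that won (20 of 250) each contributed an equal multiplicative lift and that nothing else was shipped. Neither assumption comes from the episode; they only make the numbers easy to follow.

```python
# Hypothetical back-of-envelope: what average lift per winning experiment
# would compound to the 6% total cited above?
TOTAL_EXPERIMENTS = 250
WIN_RATE = 0.08      # 8% moved the target metric (from the text)
TOTAL_LIFT = 0.06    # 6% collective revenue improvement (from the text)

winners = round(TOTAL_EXPERIMENTS * WIN_RATE)

# Assumption: every winner contributes the same multiplicative lift.
per_winner_lift = (1 + TOTAL_LIFT) ** (1 / winners) - 1


def compound(lift: float, n: int) -> float:
    """Total lift after shipping n changes that each multiply the metric by (1 + lift)."""
    return (1 + lift) ** n - 1


print(f"winners: {winners}")
print(f"average lift per winner: {per_winner_lift:.2%}")
print(f"compounded total: {compound(per_winner_lift, winners):.2%}")
```

Under these assumptions each winner is worth roughly 0.3%, which is small enough to be invisible without a controlled experiment and large enough to matter once twenty of them stack.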

When This Applies

Taste-led decision making applies when you are making a strategic bet about the future of the product, building something genuinely new with no baseline, or deciding on brand identity and long-term vision. Shopify's core product teams operate this way -- backed by a CEO (Tobi Lutke) whose taste is the organizational compass.

Core Product Teams Should NOT Have KPIs -- Taste Trumps Data for Vision Work

Archie Abrams

Archie Abrams, VP of Product and Head of Growth at Shopify, leads a 600+ person org at the company that powers roughly 10% of US e-commerce ($235B GMV in 2023). Shopify's approach is genuinely contrarian -- and backed by data that most companies never bother to collect.

Shopify deliberately sets up different parts of the organization to make decisions differently:

"The core product teams don't have metrics or KPIs. They're essentially banned. And instead, decisions are made based on taste, and intuition, and building towards this long-term vision."
Lenny Rachitsky
▶ 00:01:12

But Shopify is not anti-data. Their growth team experiments aggressively, with a twist that should alarm anyone who ships based on two-week experiment results. They maintain holdout groups for every experiment and automatically revisit results at 3, 6, 9, and 12 months:

"There's a lot of longterm monitoring of experiments over these very long time horizons to both inform what those input metrics are and more importantly hold ourselves accountable to, did we actually move what we cared about, which is that longterm GMV, in the right way?"
Archie Abrams
▶ 00:12:37

What they discovered is remarkable. In roughly 30-40% of cases, experiments showing a short-term metric lift showed no incremental lift on GMV a year later. The short-term wins were pull-forward effects -- revenue moved from next month to this month and labeled growth:

"In quite a few cases, you get a lift on a metric up front, a more short-term metric. Number of people who become a paying shopper, number of people who make their first sale in Shopify. And then you look a year later, and there's actually no incremental lift on GMV from that cohort."
Archie Abrams
▶ 00:14:59
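
One way to picture the pull-forward pattern is to compare cumulative GMV per user between the treated cohort and its holdout at each checkpoint. The sketch below uses invented numbers and hypothetical names (`Checkpoint`, `incremental_lift`, `is_pull_forward`). It is not Shopify's tooling, only an illustration of a lift that appears early and fades to zero by month 12.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    month: int
    treatment_gmv_per_user: float  # cumulative GMV per user, treated cohort
    holdout_gmv_per_user: float    # cumulative GMV per user, holdout cohort


def incremental_lift(cp: Checkpoint) -> float:
    """Relative lift of treatment over holdout at one checkpoint."""
    return cp.treatment_gmv_per_user / cp.holdout_gmv_per_user - 1


def is_pull_forward(checkpoints: list[Checkpoint], tolerance: float = 0.005) -> bool:
    """True when an early lift has decayed to (near) zero by the last checkpoint."""
    ordered = sorted(checkpoints, key=lambda cp: cp.month)
    early = incremental_lift(ordered[0])
    late = incremental_lift(ordered[-1])
    return early > tolerance and abs(late) <= tolerance


# Invented data for illustration only.
checkpoints = [
    Checkpoint(3, 110.0, 100.0),
    Checkpoint(6, 208.0, 200.0),
    Checkpoint(9, 303.0, 300.0),
    Checkpoint(12, 400.0, 400.0),
]
for cp in checkpoints:
    print(f"month {cp.month:>2}: lift {incremental_lift(cp):+.1%}")
print("pull-forward:", is_pull_forward(checkpoints))
```

The `tolerance` threshold is an arbitrary placeholder. A real analysis would replace it with a confidence interval on the holdout comparison.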

Abrams issues a direct challenge to every growth team:

"I would encourage everyone, if you can, look at some of the experiments that you thought were your biggest winners. Look at the downstream metrics for a year, two years on that experiment. And I'll bet you'd be surprised how many times the metric is different than what you thought it would be after a year."
Archie Abrams
▶ 00:14:17

When This Applies

Cheap evidence such as fake-door and Wizard of Oz tests applies to early-stage concepts. Gmail's tabbed inbox was validated with a Wizard of Oz test -- manually sorted emails behind a facade of HTML -- before any real engineering. Google+ skipped this step and spent millions of person-hours on a product nobody needed.

Test Obsessively When You Have Scale -- 600 Experiments Build a Competitive Moat

Jackson Shuttleworth

Jackson Shuttleworth, Group PM on the Retention Team at Duolingo, leads the team behind the single most impactful feature at a $14 billion company that doubled in value in six months. Duolingo has over 9 million users with a year-plus streak, and Shuttleworth describes streaks as their biggest growth lever.

At this scale, experimentation is not just useful -- it produces insights that no amount of intuition could generate:

"I'd say test everything, we've run in the last four years over 600 experiments on the streaks, so every other day. We've actually set up really good infrastructure for copy testing."
Jackson Shuttleworth
▶ 00:00:30

The findings are often genuinely surprising. One of Duolingo's biggest wins came from a copy change that no product sense would have predicted:

"We used to say continue, our standard CTA is continue, and we changed that to commit to my goal, and it was a massive win."
Jackson Shuttleworth
▶ 00:00:30

The compound effect of testing at this velocity is a deep understanding of human psychology -- how identity framing, loss aversion (via streak freezes), commitment language, and social features interact to build habits. The team layered challenges, goal setting, rewards, and notifications on top of the core streak mechanic, each validated through experimentation. As Lenny observed: "There's so much human psychology that you all learn through all these experiments of just how to motivate people, what motivates, what demotivates."

When This Applies

Obsessive testing applies when you have enough traffic for statistical significance and you are optimizing a proven feature or flow. Duolingo has run 600 experiments on streaks because it has hundreds of millions of users and the feature maps to a measurable daily behavior. Bing's relevance team targets 2% annual improvement through many small compounding wins.
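
To make "enough traffic" concrete, here is a minimal sketch of the standard normal-approximation sample-size formula for comparing two conversion rates. The function name `users_per_arm` and the example baseline and lift values are assumptions for illustration. None of the guests described their power calculations.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline_rate: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per arm to detect a relative lift in a conversion rate."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical scenario: 5% baseline conversion, various minimum detectable lifts.
for lift in (0.10, 0.05, 0.02):
    print(f"detect {lift:.0%} relative lift: {users_per_arm(0.05, lift):,} users per arm")
```

Because the required sample grows with the inverse square of the effect size, halving the detectable lift roughly quadruples the traffic needed. That is why small compounding wins are only measurable on high-traffic surfaces.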

Match Evidence Type to Decision Stage

Itamar Gilad

Itamar Gilad, a former PM at Google who worked on both Google+ and Gmail's tabbed inbox, and author of Evidence-Guided, argues that the real skill is not choosing between testing and intuition -- it is calibrating your confidence and using the right evidence type for each decision stage.

His two Google experiences embody the contrast. Google+ consumed enormous resources based on leadership conviction that users needed another social network. It failed and was shut down in 2019 -- while Google missed the actual opportunity (WhatsApp, Snapchat). The Gmail tabbed inbox team took the opposite approach, using scrappy Wizard of Oz tests before committing:

"You fake it, you do a fake door test, you do a smoke test, Wizard of Oz tests. We used a lot of those in the tabbed inbox by the way, one of the first early versions was actually we showed the tabbed inbox working to people. But it wasn't really Gmail, it was just a facade of HTML."
Itamar Gilad
▶ 00:00:00

The lesson was not about tools but about confidence calibration:

"We didn't have that much confidence in our opinions. We had opinions, we had ideas but we didn't just go all in and just let's build it. We actually used an evidence guided system. I think every successful product company out there that you look at Amazon, Airbnb, anyone you will check, at least in their best periods they found a way to balance human judgment with evidence."
Itamar Gilad
▶ 00:12:45

Gilad's metric tree framework connects micro-experiments to macro-strategy by distinguishing between a North Star metric (value delivered to users) and a top KPI (value captured by the business), then decomposing each into hierarchical trees. This prevents the common failure mode of optimizing a sub-metric that does not actually move what matters.
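
A metric tree can be sketched as a simple recursive structure. The metric names below are hypothetical placeholders, not taken from Gilad's book or the episode. The point is only that an experiment's target metric should have a traceable path up to the North Star or the top KPI.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str
    children: list[Metric] = field(default_factory=list)

    def path_to(self, target: str) -> list[str] | None:
        """Return the chain of metric names from this node down to target, or None."""
        if self.name == target:
            return [self.name]
        for child in self.children:
            sub_path = child.path_to(target)
            if sub_path is not None:
                return [self.name] + sub_path
        return None


# Hypothetical trees: one for value delivered to users, one for value captured.
north_star = Metric("lessons_completed", [
    Metric("active_learners", [Metric("streak_retention"), Metric("new_signups")]),
    Metric("lessons_per_learner"),
])
top_kpi = Metric("revenue", [
    Metric("paying_subscribers"),
    Metric("revenue_per_subscriber"),
])

print(north_star.path_to("streak_retention"))
print(top_kpi.path_to("streak_retention"))
```

An experiment whose target metric returns `None` from both trees is a candidate for the failure mode described above: a sub-metric that can move without moving what matters.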

When This Applies

Long-term holdouts apply whenever your scale allows. Shopify's data shows that 30-40% of short-term wins are pull-forward effects, not genuine growth. Most companies evaluate at 2-4 weeks; Shopify evaluates at 1-3 years.

Evidence from the Archive

Airbnb, Microsoft, Amazon

Airbnb search relevance: 250 experiments, only 8% improved the target metric, but collectively they drove a 6% revenue improvement

Bing ad text change: moving the second line to the first line generated $100M/year in revenue, yet it had languished in the backlog because nobody's intuition flagged it

Ronny Kohavi · VP/Technical Fellow (Experimentation)

Shopify

Shopify maintains holdout groups for every experiment and automatically revisits results at 3, 6, 9, and 12 months

30-40% of experiments showing short-term metric lift showed no incremental GMV lift a year later

Archie Abrams · VP Product, Head of Growth ▶ 00:01:12

Duolingo

600+ experiments on streaks in four years, roughly one every other day

Changing CTA from 'Continue' to 'Commit to my goal' produced a massive retention win

Jackson Shuttleworth · Group PM, Retention Team

Gmail, YouTube, Microsoft

Google+ consumed millions of person-hours based on leadership conviction, failed, and was shut down in 2019

Gmail tabbed inbox was validated with a Wizard of Oz test (manually sorted emails behind a facade of HTML) before any real engineering

Itamar Gilad · Product coach and author

The Synthesis

The tension between Shopify's "ban KPIs for product teams" and Duolingo's "600 experiments on one feature" dissolves once you see they are solving different problems at different stages of the product lifecycle.

Lifecycle Dependency

Why do Shopify and Duolingo's opposite approaches both work?

For new product development and bold strategic bets, testing creates gravitational pull toward incrementalism -- the status quo almost always wins short-term. For optimization of proven products at scale, testing is unmatched. The tension dissolves once you see they solve different problems at different lifecycle stages.

Measurement Timing

What is the biggest failure mode in experimentation?

The biggest failure mode is measuring experiments too soon. When 30-40% of short-term winners show no incremental impact a year later, the standard two-to-four-week experiment window is actively misleading. A change that looks like a 'loss' at two weeks might be a massive win at six months because it shifted user behavior in ways that compound.

Incrementalism Gravity

Can A/B testing actually kill transformative ideas?

If you A/B test a transformative new feature against the status quo, the status quo almost always wins short-term because users prefer the familiar. This is precisely why Shopify bans KPIs for core product teams. They need freedom to build toward a vision that might look worse in week-two metrics.

The Bottom Line

Lenny's newsletter on fostering experimentation culture at Airbnb reinforces this from the infrastructure side. Riley Newman, Airbnb's 10th employee and a data scientist, described how aligning around a single north star metric (nights booked) "propelled the company's use of experiments." And Rebecca Rosenfelt recalled experiment roundups where the audience voted on which variant won -- "it shocked everyone that so often, the intuitive variant was not the winner. Part of a data-driven culture is the humility about not knowing the right answer."

The non-obvious insight comes from Shopify's long-term holdouts: the biggest failure mode is not too much testing or too little. It is measuring experiments too soon. When 30-40% of short-term winners show no incremental impact a year later, the standard two-to-four-week experiment window is actively misleading. A change that looks like a "loss" at two weeks might be a massive win at six months because it shifted user behavior in ways that compound over time.

Sources

Ronny Kohavi — "The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)" — Lenny's Podcast, July 27, 2023
Archie Abrams — "Breaking the rules of growth: Why Shopify bans KPIs, optimizes for churn, prioritizes intuition, and builds toward a 100-year vision | Archie Abrams (VP Product, Head of Growth at Shopify)" — Lenny's Podcast, November 7, 2024
Jackson Shuttleworth — "Behind the product: Duolingo streaks | Jackson Shuttleworth (Group PM, Retention Team)" — Lenny's Podcast, December 15, 2024
Itamar Gilad — "Becoming evidence-guided | Itamar Gilad (Gmail, YouTube, Microsoft)" — Lenny's Podcast, September 21, 2023