The Productivity Paradox: Why Every Dashboard Lies and Some Still Tell the Truth
Can you actually measure engineering productivity, or does every attempt ruin what you're trying to measure?
Nicole Forsgren turned developer productivity from vibes into peer-reviewed science. Laura Tacho and David Singleton built working systems on top of her research. Inbal Shani runs product for the platform that hosts most of the world's software. And Will Larson — a CTO who has sat across the table from boards demanding proof his engineers are worth their salaries — calls the entire exercise "probably one of the most common and also maybe the most annoying questions eng leaders get." All five are paid to have an answer. They do not agree on what it is.
On one side: rigorous, framework-based measurement is the only way out of arguing about productivity on gut feel. On the other: the moment you instrument the work, measurement starts eating the work itself. In the middle: a third position from GitHub's CPO, who has more raw developer telemetry than almost anyone alive and has concluded that the only metric that actually matters is whether the developer is happy. The real question is not whether engineering productivity can be measured. It is what happens to a team the moment you try.
So the question sharpens: does every attempt — DORA dashboards, SPACE frameworks, Core 4 baskets, developer surveys — ruin the thing it is trying to describe? And if the numbers are diagnostic but not evaluative, what do you tell the board when it asks whether your engineers are productive enough?
The 3 Positions
- Measurement as rigor: framework-based measurement is the only way out of arguing about productivity on gut feel (Forsgren, Tacho, Singleton).
- Measurement as hazard: the moment you instrument the work, measurement starts eating the work itself (Larson).
- Happiness as the metric: the only number that actually matters is whether the developer is happy (Shani).
Evidence from the Archive
Nicole Forsgren still receives monthly emails from leaders asking how to report lines of code as a productivity metric — the SPACE framework exists precisely because of requests like this
Forsgren's research showed quality and speed co-vary: elite DORA performers ship smaller changes more often AND have lower change failure rates
Will Larson's Stripe incident-management team got so absorbed in analytics they forgot to actually reduce incidents — 'we weren't prioritizing improvements, we were prioritizing measurement'
His broader critique: DORA metrics are diagnostic (where to look), not evaluative (whether you're good), and the dashboard industry has commoditized them without preserving the caveats
GitHub's Copilot is measured by code quality, security catches, and time-to-value — not time saved, because 'you can write really bad code really fast'
Inbal Shani sits on top of more developer telemetry than almost anyone and concludes there is no single metric to rule them all: developer happiness is the downstream outcome
Stripe runs a monthly rolling developer experience survey — every engineer responds once every six months — triangulated against system telemetry to set the internal dev-tools roadmap
David Singleton treats developer productivity as a product team problem: engineers are the users, surveys are qualitative discovery, instrumentation is quantitative
At a 150-engineer org, a 3-point DXI improvement translated to ~100 hours/week recovered — equivalent to two full-time developers gained (see the arithmetic sketch after this list)
Laura Tacho's Core 4 framework holds Speed, Effectiveness, Quality, and Impact in explicit tension so no single number can be gamed
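For concreteness, here is the arithmetic behind the DXI claim above, as a minimal sketch. It assumes DX's published rule of thumb that one DXI point is worth roughly 13 minutes per developer per week; the org size, point gain, and 40-hour week come from the example, and the result lands near the quoted ~100 hours and two full-time developers.

```python
# Back-of-the-envelope check on the DXI claim above.
# Assumption: DX's rule of thumb that one DXI point is worth
# roughly 13 minutes per developer per week.

MINUTES_PER_DXI_POINT = 13   # min/dev/week (rule of thumb, not measured here)
ENGINEERS = 150
DXI_POINT_GAIN = 3
WORK_WEEK_HOURS = 40

minutes_recovered = ENGINEERS * DXI_POINT_GAIN * MINUTES_PER_DXI_POINT
hours_recovered = minutes_recovered / 60            # ~97.5 hours/week
fte_equivalent = hours_recovered / WORK_WEEK_HOURS  # ~2.4 developers

print(f"{hours_recovered:.0f} hours/week recovered, ~{fte_equivalent:.1f} FTEs")
```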
The Synthesis
The debate is not actually about whether engineering productivity is measurable. All five voices concede the same underlying point: any honest answer is a basket, any single number will be gamed, activity counts are always wrong. The real argument is about what you do with the basket once you have it.
Nicole Forsgren's finding: 80% of leaders she works with can't define what they're trying to improve — some mean inner-loop friction, others culture, others toolchains. That ambiguity, not instrumentation, is what ruins measurement. Before picking metrics, force the org to commit to which specific pain it's fixing.
Will Larson's key distinction: DORA metrics are diagnostic, not evaluative. Slow deployments don't make you bad — they tell you where to focus. The moment a board sees a dashboard, however, diagnostics become evaluation regardless of the caveats stapled to the edges. Larson watched it happen to his own team; the dashboard industry stripped Forsgren's warnings out on the way to shrink-wrap.
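To make the diagnostic-versus-evaluative distinction concrete, here is a minimal sketch that computes two of the four DORA keys, deployment frequency and change failure rate, from a hypothetical deploy log. The record shape and field names are invented for illustration; the point is that the output tells you where to dig, not whether the team is good.

```python
from datetime import date

# Hypothetical deploy log; shape and field names are invented for illustration.
deploys = [
    {"day": date(2025, 1, 6),  "caused_incident": False},
    {"day": date(2025, 1, 8),  "caused_incident": True},
    {"day": date(2025, 1, 9),  "caused_incident": False},
    {"day": date(2025, 1, 13), "caused_incident": False},
]

weeks_observed = 2  # window covered by the log above

# Two of the four DORA keys (lead time and MTTR would need timestamps).
deploy_frequency = len(deploys) / weeks_observed
change_failure_rate = sum(d["caused_incident"] for d in deploys) / len(deploys)

# Diagnostic reading: these numbers point at where to look, nothing more.
print(f"deploys/week: {deploy_frequency:.1f}")            # 2.0
print(f"change failure rate: {change_failure_rate:.0%}")  # 25%
```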
SPACE and Core 4 are both designed around mutual tension — Speed against Quality, Effectiveness against Impact — so you can't optimize one dimension without tanking another. A single number gets gamed. A basket of three or four, read as a system, forces balance. Measure one thing and you'll get exactly what you asked for, which is usually not what you wanted.
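What reading a basket as a system might look like in code is sketched below, using Core 4's four dimension names with made-up index scores. The rule that no dimension's gain counts while another slips is our own illustration, not part of either framework.

```python
# Hypothetical Core 4-style basket; scores are made-up index values
# where higher is better. The check refuses to celebrate a gain on one
# dimension if any other dimension slipped, so no single number can be
# optimized in isolation.

BASKET = ("speed", "effectiveness", "quality", "impact")

def read_basket(previous: dict, current: dict) -> str:
    improved = [d for d in BASKET if current[d] > previous[d]]
    degraded = [d for d in BASKET if current[d] < previous[d]]
    if degraded:
        return f"investigate: {', '.join(improved) or 'nothing'} up, {', '.join(degraded)} down"
    return f"healthy: improved {', '.join(improved) or 'nothing'}, nothing degraded"

q1 = {"speed": 62, "effectiveness": 70, "quality": 81, "impact": 55}
q2 = {"speed": 75, "effectiveness": 69, "quality": 72, "impact": 56}  # speed gamed?

print(read_basket(q1, q2))
# investigate: speed, impact up, effectiveness, quality down
```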
Larson's Stripe failure: his incident management team built perfect analytics and forgot to actually reduce incidents. Measure twice, cut once — but don't measure infinitely and never cut. Every measurement regime needs a tripwire for when the instrumentation starts eating the work it's supposed to improve.
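A tripwire could be as blunt as comparing effort spent on instrumentation with effort spent on actual improvements, and alarming when the ratio inverts. The sketch below is a hypothetical illustration inspired by Larson's story; the threshold and argument names are assumptions, not anything he prescribes.

```python
# Hypothetical tripwire: alarm when measurement work starts eating the
# improvement work it exists to serve. Inspired by (not prescribed by)
# Larson's incident-analytics story; the 1.0 threshold is an assumption.

def measurement_tripwire(instrumentation_hours: float,
                         improvement_hours: float,
                         max_ratio: float = 1.0) -> bool:
    """True if more effort went into measuring than improving."""
    if improvement_hours == 0:
        return instrumentation_hours > 0  # all dashboards, no cuts
    return instrumentation_hours / improvement_hours > max_ratio

# Last sprint: 30h on incident analytics, 12h on actually reducing incidents.
if measurement_tripwire(instrumentation_hours=30, improvement_hours=12):
    print("Tripwire: prioritizing measurement over improvements")
```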
Inbal Shani refuses 'time saved' as a Copilot success metric because you can write bad code fast. Pick the downstream outcome first — developer happiness, time to value — then treat everything else as leading indicators. This inverts the usual measurement mistake of collecting what's easy to instrument and hoping it ladders up to something meaningful.
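One way to encode that inversion is to declare the outcome metrics up front and mark everything else as a leading indicator that can never be reported as success on its own. The sketch below is hypothetical; the metric names echo the Copilot example above, the structure is invented.

```python
# Hypothetical outcome-first metric declaration. Structure is invented;
# metric names echo the GitHub Copilot example above. The rule it
# encodes: leading indicators are never reported as success on their own.

COPILOT_METRICS = {
    "outcomes": ["developer_happiness", "time_to_value"],
    "leading_indicators": ["code_quality", "security_issues_caught"],
    "rejected": ["time_saved"],  # easy to instrument, easy to game
}

def reportable(metric: str) -> bool:
    """Only declared outcomes may be reported as success on their own."""
    return metric in COPILOT_METRICS["outcomes"]

assert not reportable("time_saved")
assert reportable("developer_happiness")
```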
Which Approach Fits You?
The right approach depends on three questions:
- What is the size and maturity of your engineering org?
- What is your team's biggest risk from measurement?
- What role should the metrics play in your org?
The Bottom Line
The deeper insight under all five voices is that measurement is a leadership competence, not a tooling problem. Forsgren's 80% statistic — most leaders cannot define what they are trying to improve — is the same failure Larson describes when his team confused "analyzing incidents" with "reducing incidents," and the same failure Shani resists when she refuses to let Copilot be judged on time saved. The frameworks are designed to protect leaders from themselves. Lenny's own Core 4 newsletter — and the Ramp-Forsgren quality-speed correlation he cites in his Geoff Charles interview — point to what measurement done well buys you: the ability to prove that shipping faster and shipping better are the same thing, done by the same teams, for the same reasons.
That is the non-obvious claim hiding under the argument. The question is not whether to measure. It is whether your organization has the discipline to read a basket of four numbers at once — and the judgment to know that when the basket stops telling you anything interesting, it is time to look at the work itself.
Sources
- Nicole Forsgren — "How to measure AI developer productivity in 2025 | Nicole Forsgren" — Lenny's Podcast, October 19, 2025
- Will Larson — "The engineering mindset | Will Larson (Carta, Stripe, Uber, Calm, Digg)" — Lenny's Podcast, January 7, 2024
- Inbal Shani — "The future of AI in software development | Inbal Shani (CPO of GitHub)" — Lenny's Podcast, December 1, 2023
- David Singleton — "Building a culture of excellence | David Singleton (CTO of Stripe)" — Lenny's Podcast, May 4, 2023