The Productivity Paradox: Why Every Dashboard Lies and Some Still Tell the Truth
Can you actually measure engineering productivity, or does every attempt ruin what you're trying to measure?
Nicole Forsgren turned developer productivity from vibes into peer-reviewed science. Laura Tacho and David Singleton built working systems on top of her research. Inbal Shani runs product for the platform that hosts most of the world's software. And Will Larson — a CTO who has sat across the table from boards demanding proof his engineers are worth their salaries — calls the entire exercise "probably one of the most common and also maybe the most annoying questions eng leaders get." All five are paid to have an answer. They do not agree on what it is.
On one side: rigorous, framework-based measurement is the only way out of arguing about productivity on gut feel. On the other: the moment you instrument the work, measurement starts eating the work itself. In the middle: a third position from GitHub's CPO, who has more raw developer telemetry than almost anyone alive and has concluded that the only metric that actually matters is whether the developer is happy. The real question is not whether engineering productivity can be measured. It is what happens to a team the moment you try.
So the question sharpens: does every attempt — DORA dashboards, SPACE frameworks, Core 4 baskets, developer surveys — ruin the thing it is trying to describe? And if the numbers are diagnostic but not evaluative, what do you tell the board when it asks whether your engineers are productive enough?
The 3 Positions
- Measurement as rigor: framework-based measurement is the only way out of arguing about productivity on gut feel (Forsgren, Tacho, Singleton).
- Measurement as hazard: the moment you instrument the work, measurement starts eating the work itself (Larson).
- Happiness as the metric: the only number that actually matters is whether the developer is happy (Shani).
Evidence from the Archive
Nicole Forsgren still receives monthly emails from leaders asking how to report lines of code as a productivity metric — the SPACE framework exists precisely because of requests like this
Forsgren's research showed quality and speed co-vary: elite DORA performers ship smaller changes more often AND have lower change failure rates
Will Larson's Stripe incident-management team got so absorbed in analytics they forgot to actually reduce incidents — 'we weren't prioritizing improvements, we were prioritizing measurement'
His broader critique: DORA metrics are diagnostic (where to look), not evaluative (whether you're good), and the dashboard industry has commoditized them without preserving the caveats
GitHub's Copilot is measured by code quality, security catches, and time-to-value — not time saved, because 'you can write really bad code really fast'
Inbal Shani sits on top of more developer telemetry than almost anyone and concludes there is no single metric to rule them all: developer happiness is the downstream outcome
Stripe runs a monthly rolling developer experience survey — every engineer responds once every six months — triangulated against system telemetry to set the internal dev-tools roadmap
David Singleton treats developer productivity as a product team problem: engineers are the users, surveys are qualitative discovery, instrumentation is quantitative
At a 150-engineer org, a 3-point DXI improvement translated to ~100 hours/week recovered — equivalent to two full-time developers gained (see the arithmetic sketch after this list)
Laura Tacho's Core 4 framework holds Speed, Effectiveness, Quality, and Impact in explicit tension so no single number can be gamed
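For concreteness, here is the arithmetic behind the DXI claim above, as a minimal sketch. It assumes DX's published rule of thumb that one DXI point is worth roughly 13 minutes per developer per week; the org size, point gain, and 40-hour week come from the example, and the result lands near the quoted ~100 hours and two full-time developers.

```python
# Back-of-the-envelope check on the DXI claim above.
# Assumption: DX's rule of thumb that one DXI point is worth
# roughly 13 minutes per developer per week.

MINUTES_PER_DXI_POINT = 13   # min/dev/week (rule of thumb, not measured here)
ENGINEERS = 150
DXI_POINT_GAIN = 3
WORK_WEEK_HOURS = 40

minutes_recovered = ENGINEERS * DXI_POINT_GAIN * MINUTES_PER_DXI_POINT
hours_recovered = minutes_recovered / 60            # ~97.5 hours/week
fte_equivalent = hours_recovered / WORK_WEEK_HOURS  # ~2.4 developers

print(f"{hours_recovered:.0f} hours/week recovered, ~{fte_equivalent:.1f} FTEs")
```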
The Synthesis
The debate is not actually about whether engineering productivity is measurable. All five voices concede the same underlying point: any honest answer is a basket, any single number will be gamed, activity counts are always wrong. The real argument is about what you do with the basket once you have it.
Nicole Forsgren's finding: 80% of leaders she works with can't define what they're trying to improve — some mean inner-loop friction, others culture, others toolchains. That ambiguity, not instrumentation, is what ruins measurement. Before picking metrics, force the org to commit to which specific pain it's fixing.
Will Larson's key distinction: DORA metrics are diagnostic, not evaluative. Slow deployments don't make you bad — they tell you where to focus. The moment a board sees a dashboard, however, diagnostics become evaluation regardless of the caveats stapled to the edges. Larson watched it happen to his own team; the dashboard industry stripped Forsgren's warnings out on the way to shrink-wrap.
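To make the diagnostic-versus-evaluative distinction concrete, here is a minimal sketch that computes two of the four DORA keys, deployment frequency and change failure rate, from a hypothetical deploy log. The record shape and field names are invented for illustration; the point is that the output tells you where to dig, not whether the team is good.

```python
from datetime import date

# Hypothetical deploy log; shape and field names are invented for illustration.
deploys = [
    {"day": date(2025, 1, 6),  "caused_incident": False},
    {"day": date(2025, 1, 8),  "caused_incident": True},
    {"day": date(2025, 1, 9),  "caused_incident": False},
    {"day": date(2025, 1, 13), "caused_incident": False},
]

weeks_observed = 2  # window covered by the log above

# Two of the four DORA keys (lead time and MTTR would need timestamps).
deploy_frequency = len(deploys) / weeks_observed
change_failure_rate = sum(d["caused_incident"] for d in deploys) / len(deploys)

# Diagnostic reading: these numbers point at where to look, nothing more.
print(f"deploys/week: {deploy_frequency:.1f}")            # 2.0
print(f"change failure rate: {change_failure_rate:.0%}")  # 25%
```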
SPACE and Core 4 are both designed around mutual tension — Speed against Quality, Effectiveness against Impact — so you can't optimize one dimension without tanking another. A single number gets gamed. A basket of three or four, read as a system, forces balance. Measure one thing and you'll get exactly what you asked for, which is usually not what you wanted.
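What reading a basket as a system might look like in code is sketched below, using Core 4's four dimension names with made-up index scores. The rule that no dimension's gain counts while another slips is our own illustration, not part of either framework.

```python
# Hypothetical Core 4-style basket; scores are made-up index values
# where higher is better. The check refuses to celebrate a gain on one
# dimension if any other dimension slipped, so no single number can be
# optimized in isolation.

BASKET = ("speed", "effectiveness", "quality", "impact")

def read_basket(previous: dict, current: dict) -> str:
    improved = [d for d in BASKET if current[d] > previous[d]]
    degraded = [d for d in BASKET if current[d] < previous[d]]
    if degraded:
        return f"investigate: {', '.join(improved) or 'nothing'} up, {', '.join(degraded)} down"
    return f"healthy: improved {', '.join(improved) or 'nothing'}, nothing degraded"

q1 = {"speed": 62, "effectiveness": 70, "quality": 81, "impact": 55}
q2 = {"speed": 75, "effectiveness": 69, "quality": 72, "impact": 56}  # speed gamed?

print(read_basket(q1, q2))
# investigate: speed, impact up, effectiveness, quality down
```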
Larson's Stripe failure: his incident management team built perfect analytics and forgot to actually reduce incidents. Measure twice, cut once — but don't measure infinitely and never cut. Every measurement regime needs a tripwire for when the instrumentation starts eating the work it's supposed to improve.
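A tripwire could be as blunt as comparing effort spent on instrumentation with effort spent on actual improvements, and alarming when the ratio inverts. The sketch below is a hypothetical illustration inspired by Larson's story; the threshold and argument names are assumptions, not anything he prescribes.

```python
# Hypothetical tripwire: alarm when measurement work starts eating the
# improvement work it exists to serve. Inspired by (not prescribed by)
# Larson's incident-analytics story; the 1.0 threshold is an assumption.

def measurement_tripwire(instrumentation_hours: float,
                         improvement_hours: float,
                         max_ratio: float = 1.0) -> bool:
    """True if more effort went into measuring than improving."""
    if improvement_hours == 0:
        return instrumentation_hours > 0  # all dashboards, no cuts
    return instrumentation_hours / improvement_hours > max_ratio

# Last sprint: 30h on incident analytics, 12h on actually reducing incidents.
if measurement_tripwire(instrumentation_hours=30, improvement_hours=12):
    print("Tripwire: prioritizing measurement over improvements")
```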
Inbal Shani refuses 'time saved' as a Copilot success metric because you can write bad code fast. Pick the downstream outcome first — developer happiness, time to value — then treat everything else as leading indicators. This inverts the usual measurement mistake of collecting what's easy to instrument and hoping it ladders up to something meaningful.
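One way to encode that inversion is to declare the outcome metrics up front and mark everything else as a leading indicator that can never be reported as success on its own. The sketch below is hypothetical; the metric names echo the Copilot example above, the structure is invented.

```python
# Hypothetical outcome-first metric declaration. Structure is invented;
# metric names echo the GitHub Copilot example above. The rule it
# encodes: leading indicators are never reported as success on their own.

COPILOT_METRICS = {
    "outcomes": ["developer_happiness", "time_to_value"],
    "leading_indicators": ["code_quality", "security_issues_caught"],
    "rejected": ["time_saved"],  # easy to instrument, easy to game
}

def reportable(metric: str) -> bool:
    """Only declared outcomes may be reported as success on their own."""
    return metric in COPILOT_METRICS["outcomes"]

assert not reportable("time_saved")
assert reportable("developer_happiness")
```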
Which Approach Fits You?
The right approach depends on three questions:
- What is the size and maturity of your engineering org?
- What is your team's biggest risk from measurement?
- What role should the metrics play in your org?
The Bottom Line
The deeper insight under all five voices is that measurement is a leadership competence, not a tooling problem. Forsgren's 80% statistic — most leaders cannot define what they are trying to improve — is the same failure Larson describes when his team confused "analyzing incidents" with "reducing incidents," and the same failure Shani resists when she refuses to let Copilot be judged on time saved. The frameworks are designed to protect leaders from themselves. Lenny's own Core 4 newsletter — and the Ramp-Forsgren quality-speed correlation he cites in his Geoff Charles interview — point to what measurement done well buys you: the ability to prove that shipping faster and shipping better are the same thing, done by the same teams, for the same reasons.
That is the non-obvious claim hiding under the argument. The question is not whether to measure. It is whether your organization has the discipline to read a basket of four numbers at once — and the judgment to know that when the basket stops telling you anything interesting, it is time to look at the work itself.
Sources
- Nicole Forsgren — "How to measure AI developer productivity in 2025 | Nicole Forsgren" — Lenny's Podcast, October 19, 2025
- Will Larson — "The engineering mindset | Will Larson (Carta, Stripe, Uber, Calm, Digg)" — Lenny's Podcast, January 7, 2024
- Inbal Shani — "The future of AI in software development | Inbal Shani (CPO of GitHub)" — Lenny's Podcast, December 1, 2023
- David Singleton — "Building a culture of excellence | David Singleton (CTO of Stripe)" — Lenny's Podcast, May 4, 2023