"Invest in evals and reliability for your core use case — speed without measurement is waste"
Evidence from the Archive
Nvidia / Stanford
Counter-example: a company where senior engineers were the most resistant to AI coding tools, citing quality standards
Three-bucket engineering experiment: a company gave half of each performance tier Cursor access and found that top performers gained the most
Core developer on NVIDIA's NeMo platform, AI researcher at Netflix, Stanford ML instructor, two-time founder, and author of the most-read book on the O'Reilly platform since its launch. Works with enterprises on AI strategy and sees contradictory ground-truth data across companies. Her core argument: AI changes the entire engineering org structure, but productivity gains are harder to measure than people think, and talking to users still matters more than chasing AI news.
The evidence is specific: in a three-bucket engineering experiment, a company gave half of each performance tier Cursor access and found that top performers gained the most. A counter-example company saw its senior engineers resist AI coding tools the hardest, citing quality standards. And her viral LinkedIn post contrasted what people think improves AI apps with what actually does: talking to users beats chasing AI news.
In Chip Huyen's own words: "The biggest performance boost was [for] the senior engineers, the highest performing. The highest-performing engineers [follow] normal practice[s]; they also know how to solve problems, so they have some [sense of when a] solved problem [is] better. Whereas the people who [are] the lowest performing, they [just] don't care much about work." (Describing a three-bucket engineering study where top performers gained most from AI tools.)
Nvidia / Stanford
Chip's viral chart contrasting what people think improves AI apps vs. what actually does
Companies rationally choosing to build a new feature rather than improve an existing feature from 80% to 82% accuracy via evals
As a leading AI engineering educator who works with hundreds of companies building AI products, and whose viral chart on what actually improves AI apps became a touchstone for the industry, Huyen brings a breadth of real-world observation that few individual practitioners can match. Her core argument: invest in evals and reliability for your core use case; speed without measurement is waste.
The evidence is specific: companies rationally choose to build a new feature rather than use evals to push an existing feature from 80% to 82% accuracy; some products ship with hundreds of different eval metrics covering verbosity, user-sensitive data, length, and more; and her viral chart contrasts what people think improves AI apps with what actually does.
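To make "hundreds of eval metrics" concrete, here is a minimal sketch of what a few such checks might look like in Python. The function names, thresholds, filler phrases, and regex are illustrative assumptions, not Huyen's actual metrics; real products would have many more checks, each tuned to the use case.

```python
import re

def eval_length(response: str, max_chars: int = 1200) -> bool:
    """Fail responses that exceed a product-defined length budget (hypothetical threshold)."""
    return len(response) <= max_chars

def eval_verbosity(response: str, max_filler_ratio: float = 0.05) -> bool:
    """Flag responses padded with filler phrases (a crude, illustrative verbosity proxy)."""
    fillers = ["as an ai", "it's worth noting", "in conclusion"]
    hits = sum(response.lower().count(f) for f in fillers)
    return hits / max(len(response.split()), 1) <= max_filler_ratio

def eval_no_sensitive_data(response: str) -> bool:
    """Reject responses that echo patterns resembling user-sensitive data (US SSN format here)."""
    ssn_like = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
    return ssn_like.search(response) is None

def run_evals(response: str) -> dict[str, bool]:
    """Run every metric and report per-metric pass/fail, so failures localize the problem."""
    return {
        "length": eval_length(response),
        "verbosity": eval_verbosity(response),
        "sensitive_data": eval_no_sensitive_data(response),
    }

if __name__ == "__main__":
    print(run_evals("Your SSN 123-45-6789 is on file."))
    # -> {'length': True, 'verbosity': True, 'sensitive_data': False}
```

The per-metric dictionary output reflects the point in the quote below: the goal is not a raw count of evals but enough coverage to know, with confidence, where the application is and is not performing well.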
In Chip Huyen's own words: "I don't think it's how many eval[s] should I get, but how many eval[s] do I need to get good coverage, a high confidence in my application's performance, and also to help me understand where it is not performing well so that I can fix it." (On the right level of testing for AI products.)