Insight
October 22, 2025

How should founders measure their AI product's quality?

by Jesse Heikkila

Three Ways to Measure Quality When Building AI Agents

Most agentic AI companies I meet are tempted by one of three extremes:

  • The Tastemaker Urge: “We know what ‘right’ feels like. Ship it.”
  • The Customer Chorus: “Give them what they ask for.”
  • The Leaderboard Fixation: “If we top the benchmarks, users will come.”

Each force is useful, and each, in isolation, is dangerous. Finding the balance between all three is the challenge.

1) Lead with taste, but instrument ruthlessly

Opinionated product design is how agents avoid paralysis. Set your tenets (“ask less, do more,” “one‑shot over chatty,” “cite state by default”), then make the agent act with confidence. But instrument everything: Task Success Rate, First‑Pass Resolution, and Intervention Rate will tell you whether your taste is actually saving users time, not just polishing demos.

North‑star: “Minutes of drudgery removed per user per week.”
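A minimal sketch of how this instrumentation might look, assuming a hypothetical per-task telemetry shape (the TaskRecord fields below are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    # Hypothetical per-task telemetry; all field names are illustrative only.
    succeeded: bool            # did the agent reach the user's goal?
    attempts: int              # 1 means resolved on the first pass
    human_interventions: int   # times a person had to step in
    minutes_saved: float       # estimated manual time avoided
    user_id: str
    week: str                  # e.g. "2025-W43"

def quality_metrics(tasks: list[TaskRecord]) -> dict[str, float]:
    """Task Success Rate, First-Pass Resolution, Intervention Rate (assumes a non-empty list)."""
    n = len(tasks)
    return {
        "task_success_rate": sum(t.succeeded for t in tasks) / n,
        "first_pass_resolution": sum(t.succeeded and t.attempts == 1 for t in tasks) / n,
        "intervention_rate": sum(t.human_interventions > 0 for t in tasks) / n,
    }

def north_star(tasks: list[TaskRecord]) -> float:
    """Minutes of drudgery removed per user per week, averaged over user-weeks."""
    per_user_week: dict[tuple[str, str], float] = {}
    for t in tasks:
        key = (t.user_id, t.week)
        per_user_week[key] = per_user_week.get(key, 0.0) + t.minutes_saved
    return sum(per_user_week.values()) / len(per_user_week)
```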

2) Listen to customers, not just requests

I’ve never seen A/B tests and surveys create a breakthrough agent on their own. Use them to reveal friction and failure modes, not to crowdsource your roadmap. The rule of thumb: requests become experiments, not commitments. When you do ship a request, tie it to the core loop: does it raise First‑Pass Resolution or lower Time‑to‑Resolution?
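A minimal sketch of that readout, assuming hypothetical task dicts with 'first_pass' and 'minutes_to_resolve' keys (neither is a real schema): ship the request behind an experiment and only keep it if the deltas move the right way.

```python
from statistics import mean

def experiment_readout(control: list[dict], treatment: list[dict]) -> dict[str, float]:
    """Did the shipped request raise First-Pass Resolution or lower Time-to-Resolution?
    Each task dict is assumed to carry 'first_pass' (bool) and 'minutes_to_resolve' (float)."""
    def fpr(tasks):  # share of tasks resolved on the first pass
        return mean(1.0 if t["first_pass"] else 0.0 for t in tasks)
    def ttr(tasks):  # average minutes to resolution
        return mean(t["minutes_to_resolve"] for t in tasks)
    return {
        "fpr_delta": fpr(treatment) - fpr(control),  # positive means the request helped
        "ttr_delta": ttr(treatment) - ttr(control),  # negative means the request helped
    }
```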

Customer retention is still the gold standard. How often do your users come back, daily or weekly? How long are the sessions? Understanding what share of their workday or free time your customers spend in your product is a good indicator of PMF.
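A rough sketch of that engagement readout, assuming hypothetical session logs and an eight‑hour (480‑minute) workday:

```python
from collections import defaultdict
from datetime import date

def engagement_summary(sessions: list[dict], workday_minutes: float = 480.0) -> dict[str, float]:
    """Stickiness, session length, and share of workday from session logs.
    Each session dict is assumed to carry 'user_id', 'day' (datetime.date), and 'minutes'."""
    daily_users = defaultdict(set)         # day -> users active that day
    weekly_users = defaultdict(set)        # (iso_year, iso_week) -> users active that week
    user_day_minutes = defaultdict(float)  # (user, day) -> total minutes in the product

    for s in sessions:
        iso = s["day"].isocalendar()
        daily_users[s["day"]].add(s["user_id"])
        weekly_users[(iso[0], iso[1])].add(s["user_id"])
        user_day_minutes[(s["user_id"], s["day"])] += s["minutes"]

    avg_dau = sum(len(u) for u in daily_users.values()) / len(daily_users)
    avg_wau = sum(len(u) for u in weekly_users.values()) / len(weekly_users)
    avg_active_day_minutes = sum(user_day_minutes.values()) / len(user_day_minutes)

    return {
        "dau_over_wau": avg_dau / avg_wau,                             # stickiness ratio
        "avg_session_minutes": sum(s["minutes"] for s in sessions) / len(sessions),
        "share_of_workday": avg_active_day_minutes / workday_minutes,  # on days they show up
    }

# Example: one user with two sessions on the same day
sessions = [
    {"user_id": "u1", "day": date(2025, 10, 22), "minutes": 25.0},
    {"user_id": "u1", "day": date(2025, 10, 22), "minutes": 35.0},
]
print(engagement_summary(sessions))
# {'dau_over_wau': 1.0, 'avg_session_minutes': 30.0, 'share_of_workday': 0.125}
```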

3) Respect benchmarks, don’t worship them

Benchmarks are an interesting milestone to build towards, and evals keep you on track. They are valuable for keeping you honest and for seeing how you compare, but they can also mislead. There are many practical ones:

  • Building dev agents? SWE‑bench tests end‑to‑end issue resolution in real repos.
  • Generalist reasoning? Humanity’s Last Exam pushes on frontier‑level, multi‑modal knowledge and calibration. Useful as a trendline, not a feature list.
  • Generating product UIs? UI‑Bench evaluates design quality with blinded expert comparisons. Helpful to prevent regression in the “wow, this looks right” department.

But remember: leaderboards are a means, not a market. If your users aren’t sticking, a better score won’t fix it. Also, the fastest way to tank product quality is to optimize solely for a single metric a benchmark rewards.

The three common traps

  1. Benchmark Blindness: You soar on an eval your users don’t care about. Symptom: strong rankings, high churn.
  2. Feature Factory: You chase requests that shave seconds while ignoring the step‑function improvements. Symptom: growing menus, growing complexity, shrinking delight.
  3. Aesthetic Absolutism: You protect a beautiful flow that’s slower than a blunt prompt. Symptom: public praise, silent churn.

Finding a balance between these three is tricky but important.