

Three Ways to Measure Progress When Building AI Agents
Most agentic AI companies I meet are tempted by one of three extremes: leading purely with their own taste, building only what customers ask for, or chasing benchmark scores. Each force is useful; each, in isolation, is dangerous. Finding the balance between all three is the challenge.
1) Lead with taste, but instrument ruthlessly
Opinionated product design is how agents avoid paralysis. Set your tenets (“ask less, do more,” “one‑shot over chatty,” “cite state by default”), then make the agent act with confidence. But instrument everything: Task Success Rate, First‑Pass Resolution, and Intervention Rate will tell you whether your taste is actually saving users time, not just polishing demos.
North‑star: “Minutes of drudgery removed per user per week.”
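A minimal sketch of what that instrumentation could look like, assuming each agent run is logged as a record with hypothetical fields (user_id, succeeded, first_pass, interventions, minutes_saved) and that the runs passed in cover one week:

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    user_id: str
    succeeded: bool       # did the agent complete the task?
    first_pass: bool      # resolved without a retry or human correction?
    interventions: int    # times a human had to step in
    minutes_saved: float  # estimated drudgery removed

def agent_metrics(runs: list[TaskRun]) -> dict[str, float]:
    """Aggregate the core agent metrics from one week of logged task runs."""
    n = len(runs)
    if n == 0:
        return {}
    return {
        "task_success_rate": sum(r.succeeded for r in runs) / n,
        "first_pass_resolution": sum(r.first_pass for r in runs) / n,
        "intervention_rate": sum(r.interventions > 0 for r in runs) / n,
        # north-star: minutes of drudgery removed per user per week
        "minutes_saved_per_user": sum(r.minutes_saved for r in runs)
                                  / len({r.user_id for r in runs}),
    }
```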
2) Listen to customers, not just requests
I’ve never seen A/B tests and surveys create a breakthrough agent on their own. Use them to reveal friction and failure modes, not to crowdsource your roadmap. The rule of thumb: requests become experiments, not commitments. When you do ship a request, tie it to the core loop: does it raise FPR or lower Time‑to‑Resolution?
Customer retention is still the gold standard. How often do your users come back, daily or weekly? How long are their sessions? Understanding what share of their workday or free time customers spend in your product is a strong indicator of PMF.
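A rough sketch of how those engagement numbers might be computed, assuming hypothetical session logs with a user id, day, and minutes of use:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Session:
    user_id: str
    day: date
    minutes: float

def engagement(sessions: list[Session], week_start: date,
               workday_minutes: float = 480.0) -> dict[str, float]:
    """DAU/WAU stickiness and rough share of the work week spent in product."""
    week = [s for s in sessions
            if week_start <= s.day < week_start + timedelta(days=7)]
    wau = {s.user_id for s in week}
    if not wau:
        return {}
    daily_actives = [
        len({s.user_id for s in week if s.day == week_start + timedelta(days=d)})
        for d in range(7)
    ]
    minutes_per_user = sum(s.minutes for s in week) / len(wau)
    return {
        "dau_wau_stickiness": (sum(daily_actives) / 7) / len(wau),
        "avg_weekly_minutes_per_user": minutes_per_user,
        # assumes a five-day work week of `workday_minutes` each
        "share_of_workweek": minutes_per_user / (workday_minutes * 5),
    }
```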
3) Respect benchmarks, don’t worship them
Benchmarks are an interesting new milestone to build towards, and evals keep you on track. They can be valuable for keeping you honest and seeing how you compare, but they are often misleading. There are plenty of practical ones to choose from.
But remember: leaderboards are a means, not a market. If your users aren’t sticking, a better score won’t fix it. Also, the fastest way to tank product quality is to optimize solely for a single metric a benchmark rewards.
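A minimal sketch of an internal eval harness in that spirit, where run_agent and the checker are hypothetical stand-ins for your own agent and grading logic:

```python
from typing import Callable

def run_agent(prompt: str) -> str:
    """Hypothetical agent entry point; wire up your real agent here."""
    return f"(stub) {prompt}"

def run_evals(cases: list[dict], check: Callable[[str, dict], bool]) -> float:
    """Run the agent over a small internal eval set and return the pass rate.

    Each case is a dict like {"prompt": ..., "expected": ...}; `check`
    decides whether an output satisfies a case (exact match, a regex,
    or an LLM judge, whatever keeps you honest).
    """
    if not cases:
        return 0.0
    passed = sum(check(run_agent(case["prompt"]), case) for case in cases)
    return passed / len(cases)

# Usage: a trivially simple substring check
pass_rate = run_evals(
    [{"prompt": "Move invoice_2024.pdf into archive/",
      "expected": "archive/invoice_2024.pdf"}],
    check=lambda out, case: case["expected"] in out,
)
print(f"eval pass rate: {pass_rate:.0%}")
```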
The three common traps
Finding a balance between these three is tricky but important.