Frontier finance agents stall near 50% on hard, multi-step tasks. We held the model fixed and toggled an SEC-data MCP server on and off. The efficiency gap is the consistent signal, and you can watch a real run and click every number back to the 10-K.

Anthropic's Opus 4.8 release and the vals.ai Finance-Agent benchmark tell the same story: on the harder, multi-step financial work in Finance-Agent v2, even frontier models top out near 50%. So how much of that shortfall is the model, and how much is the data layer underneath it? We held the model fixed and toggled one variable, an SEC-data MCP (Model Context Protocol) server, on and off. The difference between the two arms is the data layer's contribution.

It's mostly the data layer, and that's now a well-funded consensus, not just our hunch. MetricDuck isn't another model; it's the data layer beneath whatever agent you run. What sets it apart is who it's for: it's free, self-serve, and US-SEC-focused, and you can watch a real run and click every figure back to the filing before you connect. The enterprise data layers proving this category are paid, sales-gated, and built for institutions. MetricDuck is the one an individual builder can wire up in two clicks today.

⚠ Preview — placeholder data, not real results. Final numbers land with the multi-seed publication run.

Audit-grade SEC data for your AI agent — free, self-serve, every fact clickable to the source filing.

Same model, same questions: with MetricDuck the agent reaches the answer in roughly a third of the steps, at about half the cost. On the vals.ai Finance-Agent v2 set (n=50), the MetricDuck arm ran at $0.26 per query and 12 turns, versus $0.55 and 40 turns without it. The accuracy delta is +0.18 (final pending).


The headline isn't the accuracy number (that's noisier, and we finalize it on the full multi-seed run). It's the efficiency: even when both arms reach the same answer, the MetricDuck arm gets there in far fewer steps, at lower cost, with fewer dead-ends — because the data is pre-structured and every fact arrives cited to its source filing.
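The "roughly a third of the steps, at about half the cost" summary follows directly from the preview figures quoted earlier ($0.26 and 12 turns versus $0.55 and 40 turns). A minimal sketch of that arithmetic, using only those placeholder figures; the variable and function names are ours, chosen for illustration:

```python
# Preview (placeholder) figures quoted in the text; not final results.
WITH_MCP = {"cost_usd": 0.26, "turns": 12}
WITHOUT_MCP = {"cost_usd": 0.55, "turns": 40}


def ratio(with_value, without_value):
    """Return the with-arm value as a fraction of the without-arm value."""
    return with_value / without_value


turn_ratio = ratio(WITH_MCP["turns"], WITHOUT_MCP["turns"])
cost_ratio = ratio(WITH_MCP["cost_usd"], WITHOUT_MCP["cost_usd"])

print(f"turns: {turn_ratio:.2f}x of baseline")  # 12 / 40
print(f"cost:  {cost_ratio:.2f}x of baseline")  # 0.26 / 0.55
```

On these preview figures the turn ratio is 0.30 and the cost ratio is about 0.47, which is where "a third" and "half" come from.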

Don't take the number on faith: watch a real run and click any figure back to the exact 10-K. That's the whole point: this is an auditable artifact, not a marketing chart.

Where this fits

The case that a data layer makes finance agents better is settled: well-funded players are proving it, and we agree. The open question for a builder isn't whether to add a data layer but which one. The deep, human-verified, global options are enterprise-priced and sales-gated, which is the right call for an institution. MetricDuck sits at the complementary end: fully automated, US-SEC-focused, free and self-serve, and verifiable in the open. We're not claiming to be deeper or more accurate than the institutional vendors; we're the one you can try yourself, right now, at no cost, checking our work on every number.

What we measured

MetricDuck is not a language model — it's an MCP server over SEC filings, transcripts, and IR disclosures. So a benchmark here measures a composed system, and we name every part: the agent LLM (Claude Sonnet 4.6), the agent harness, the MCP worker (pinned SHA), the MetricDuck API (pinned deploy), and the data snapshot (as-of date). Swap any part and it's a different system that must be re-measured — so our numbers are not directly comparable to a standalone-LLM score on the same benchmark. The honest comparison is the with-vs-without delta inside this stack.
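One way to make "swap any part and it's a different system" concrete is to fingerprint the five named components and only compare runs whose fingerprints match. The sketch below is illustrative, not our harness code: the `StackManifest` class, its field names, and every value in the example are hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class StackManifest:
    """The five components named in the text (field names are hypothetical)."""
    agent_model: str
    harness_sha: str
    mcp_worker_sha: str
    api_deploy: str
    data_snapshot: str  # as-of date of the data

    def fingerprint(self) -> str:
        # Stable hash over all fields; changing any one changes the result.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def comparable(a: StackManifest, b: StackManifest) -> bool:
    """Two runs are comparable only if every component is identical."""
    return a.fingerprint() == b.fingerprint()


# Placeholder values for illustration only.
base = StackManifest(
    agent_model="agent-model-placeholder",
    harness_sha="aaaa111",
    mcp_worker_sha="bbbb222",
    api_deploy="deploy-001",
    data_snapshot="2000-01-01",
)
newer_data = replace(base, data_snapshot="2000-02-01")

print(comparable(base, base))        # same stack
print(comparable(base, newer_data))  # one component swapped
```

The with-vs-without arms share one manifest and differ only in whether the MCP server is attached, which is why the delta between them is the comparison we treat as meaningful.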

We report two sets, never blended: our reproduction of the public vals.ai Finance-Agent v2 set (comparable, neutral) and a set of realistic equity-research workflows we authored (illustrative of where MetricDuck helps — not a neutral benchmark, and labeled as such).

How we keep it credible

Independent, cross-family judge. To rule out a model grading its own family favourably, we re-grade with a different-family judge (Google Gemini) on identical inputs and rubric; agreement was "almost perfect" (Cohen's κ disclosed on the benchmarks page). We don't publish when κ < 0.60.
Mechanical extraction + a setup-blind judge. The answer is pulled from the agent's committed response by rule (not re-interpreted by an LLM), and the judge never sees which arm produced it.
Reproducible. Every run records the harness SHA, MCP version, data snapshot, and seed, and we publish the harness so an outside party can re-derive a result.
Failure-first. We report the failure-category breakdown, not just the wins.
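For readers who want to check the judge-agreement gate themselves, Cohen's κ compares the observed agreement between two judges with the agreement expected by chance from each judge's label frequencies. A minimal sketch follows; the toy labels are invented for illustration, and only the 0.60 threshold comes from the text above.

```python
from collections import Counter

PUBLISH_THRESHOLD = 0.60  # from the text: no publication when kappa < 0.60


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items where both judges gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each judge's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy grades, invented for illustration (not benchmark data).
primary = ["pass", "pass", "pass", "fail", "fail",
           "fail", "pass", "fail", "pass", "pass"]
cross_family = ["pass", "pass", "pass", "fail", "fail",
                "fail", "pass", "fail", "fail", "pass"]

kappa = cohens_kappa(primary, cross_family)
publishable = kappa >= PUBLISH_THRESHOLD
print(f"kappa={kappa:.2f} publishable={publishable}")
```

In this toy case the judges agree on 9 of 10 items, chance agreement is 0.50, and κ works out to 0.80, which clears the gate.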

How to read this (the honest caveats)

Efficiency first, accuracy as support. The efficiency gap is the consistent signal; the accuracy delta is noisier across samples and is finalized only on the full multi-seed run.
Public-benchmark accuracy is a calibration line. Many questions concern facts the model already knows, so a tool can't change the answer. Read the baseline calibration and the efficiency delta, not the absolute accuracy.
The workflow set is ours, by design — illustrative, not a neutral benchmark.
Self-run reproduction of the v2 methodology. We reproduce the vals.ai Finance-Agent v2 methodology (the current, harder version) ourselves rather than running on their hosted leaderboard, so our scores aren't directly comparable to vals.ai's published numbers. The honest comparison is our with-vs-without delta inside one fixed stack.
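To see why the accuracy delta is finalized only on a multi-seed run, it helps to compute the with-vs-without delta per seed and look at its spread. The sketch below assumes each arm's graded results are stored as a mapping from seed to a list of pass/fail booleans; that layout, the function names, and the toy data are assumptions for illustration, not our published format or results.

```python
from statistics import mean, stdev


def accuracy(graded):
    """Fraction of questions graded correct in one run."""
    return sum(graded) / len(graded)


def per_seed_deltas(with_runs, without_runs):
    """Accuracy delta (with minus without) for each seed present in both arms."""
    shared = sorted(with_runs.keys() & without_runs.keys())
    return {s: accuracy(with_runs[s]) - accuracy(without_runs[s]) for s in shared}


# Toy graded runs, invented for illustration (seed -> per-question correct?).
with_mcp = {
    0: [True, True, True, False],
    1: [True, True, False, False],
    2: [True, True, True, False],
}
without_mcp = {
    0: [True, False, True, False],
    1: [True, False, False, False],
    2: [True, True, True, False],
}

deltas = per_seed_deltas(with_mcp, without_mcp)
values = list(deltas.values())
print(deltas)
print(f"mean delta={mean(values):.3f} spread (stdev)={stdev(values):.3f}")
```

A single seed here would report either +0.25 or 0.00 depending on which one you happened to run, which is the kind of noise the multi-seed average is meant to absorb.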

See it for yourself: watch a real agent run · full method & numbers · connect MetricDuck to your agent (free tier, no card).

Finance agents stall near ~50% on hard tasks. How much is the data layer?

MetricDuck Research