Finance agents stall near 50% on hard tasks. How much is the data layer?

Frontier finance agents stall near 50% on hard, multi-step tasks. We held the model fixed and toggled an SEC-data MCP: the efficiency gap is the consistent signal, and you can watch a real run and click every number back to the 10-K.

Anthropic's Opus 4.8 release and the vals.ai Finance-Agent benchmark tell the same story: on the harder, multi-step financial work in Finance-Agent v2, even frontier models top out near 50%. So how much of the remaining gap is the model, and how much is the data layer underneath it? We held the model fixed and toggled one variable, an SEC-data MCP, on and off. The difference between the two arms is the data layer's contribution.

It's mostly the data layer — and that's now a well-funded consensus, not our hunch. MetricDuck isn't another model; it's the data layer beneath whatever agent you run. What's different about MetricDuck is who it's for: it's free, self-serve, US-SEC-focused, and you can watch a real run and click every figure back to the filing before you connect. The enterprise data layers proving this category are paid, sales-gated, and built for institutions — MetricDuck is the one an individual builder can wire up in two clicks today.

Preview — placeholder data, not real results. Final numbers land with the multi-seed publication run.

Audit-grade SEC data for your AI agent — free, self-serve, every fact clickable to the source filing.

Same model, same questions: with MetricDuck the agent reaches the answer in roughly a third of the steps, at about half the cost. On the vals.ai Finance-Agent v2 set (n=50): with MetricDuck, $0.26 per query and 12 turns; without, $0.55 per query and 40 turns (30% of the turns, 47% of the cost). Accuracy delta: +0.18 (final pending).

The headline isn't the accuracy number (that's noisier, and we finalize it on the full multi-seed run). It's the efficiency: even when both arms reach the same answer, the MetricDuck arm gets there in far fewer steps, at lower cost, with fewer dead-ends — because the data is pre-structured and every fact arrives cited to its source filing.
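As a quick check on the "roughly a third" and "about half" wording, the sketch below recomputes both ratios from the preview figures quoted above. The figures are placeholder data, and the variable and function names are ours, chosen for illustration.

```python
# Preview figures quoted above (placeholder data, not final results).
with_mcp = {"cost_usd": 0.26, "turns": 12}
without_mcp = {"cost_usd": 0.55, "turns": 40}


def ratio(key: str) -> float:
    """Return the with-MCP value as a fraction of the without-MCP value."""
    return with_mcp[key] / without_mcp[key]


turn_ratio = ratio("turns")     # 12 / 40
cost_ratio = ratio("cost_usd")  # 0.26 / 0.55

print(f"turns: {turn_ratio:.0%} of baseline")  # turns: 30% of baseline
print(f"cost:  {cost_ratio:.0%} of baseline")  # cost:  47% of baseline
```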

Don't take the number on faith. Watch a real run and click any figure back to the exact 10-K. That's the whole point: this is an auditable artifact, not a marketing chart.

Where this fits

The case that a data layer makes finance agents better is settled: well-funded players are proving it, and we agree. The open question for a builder isn't whether to add a data layer but which one. The deep, human-verified, global options are enterprise-priced and sales-gated, which is the right call for an institution. MetricDuck sits at the complementary other end: fully automated, US-SEC-focused, free and self-serve, and verifiable in the open. We're not claiming to be deeper or more accurate than the institutional vendors; we're the one you can try yourself, right now, at no cost, and you can check our work on every number.

What we measured

MetricDuck is not a language model — it's an MCP server over SEC filings, transcripts, and IR disclosures. So a benchmark here measures a composed system, and we name every part: the agent LLM (Claude Sonnet 4.6), the agent harness, the MCP worker (pinned SHA), the MetricDuck API (pinned deploy), and the data snapshot (as-of date). Swap any part and it's a different system that must be re-measured — so our numbers are not directly comparable to a standalone-LLM score on the same benchmark. The honest comparison is the with-vs-without delta inside this stack.
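One way to make "swap any part and it's a different system" concrete is to fingerprint the stack. The sketch below is a minimal illustration under our own assumptions, not MetricDuck's actual harness: the class name, field names, and every value other than the model name are hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class StackManifest:
    """The five named parts of the measured system (field names are hypothetical)."""
    agent_llm: str
    harness_sha: str
    mcp_worker_sha: str
    api_deploy: str
    data_snapshot_as_of: str

    def fingerprint(self) -> str:
        # Hash a canonical JSON form so any changed part changes the fingerprint.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def same_stack(a: StackManifest, b: StackManifest) -> bool:
    """Two results are comparable only when every part of the stack matches."""
    return a.fingerprint() == b.fingerprint()


# Placeholder values; only the model name comes from the article.
baseline = StackManifest(
    agent_llm="Claude Sonnet 4.6",
    harness_sha="aaaaaaa",
    mcp_worker_sha="bbbbbbb",
    api_deploy="deploy-placeholder",
    data_snapshot_as_of="2000-01-01",
)
rerun = replace(baseline)  # identical stack
new_snapshot = replace(baseline, data_snapshot_as_of="2000-02-01")

print(same_stack(baseline, rerun))         # True
print(same_stack(baseline, new_snapshot))  # False
```

Keeping the seed out of the manifest is a deliberate choice in this sketch: the seed varies per run inside one stack, while the manifest identifies the stack itself.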

We report two sets, never blended: our reproduction of the public vals.ai Finance-Agent v2 set (comparable, neutral) and a set of realistic equity-research workflows we authored (illustrative of where MetricDuck helps — not a neutral benchmark, and labeled as such).

How we keep it credible

  • Independent, cross-family judge. To rule out a model grading its own family favourably, we re-grade with a different-family judge (Google Gemini) on identical inputs and rubric; agreement was "almost perfect", the band usually reserved for κ above 0.80 (the exact Cohen's κ is disclosed on the benchmarks page). We don't publish when κ < 0.60.
  • Mechanical extraction + a setup-blind judge. The answer is pulled from the agent's committed response by rule (not re-interpreted by an LLM), and the judge never sees which arm produced it.
  • Reproducible. Every run records the harness SHA, MCP version, data snapshot, and seed, and we publish the harness so an outside party can re-derive a result.
  • Failure-first. We report the failure-category breakdown, not just the wins.
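For readers who want to see how the cross-family agreement check could work, here is a minimal sketch of Cohen's κ with the 0.60 publication gate from the first bullet. The grade lists are made-up illustration data and the function names are ours; this is not the published harness.

```python
from collections import Counter
from typing import Hashable, Sequence

PUBLISH_THRESHOLD = 0.60  # from the article: no publication when kappa < 0.60


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two judges grading the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both judges must grade the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where the two judges gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each judge's marginal rate, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def publishable(kappa: float) -> bool:
    return kappa >= PUBLISH_THRESHOLD


# Made-up grades (1 = correct, 0 = incorrect) for ten answers.
primary_judge = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
second_judge = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

kappa = cohens_kappa(primary_judge, second_judge)
print(f"kappa = {kappa:.3f}, publishable = {publishable(kappa)}")
```

In this toy example the judges disagree on one item out of ten, so observed agreement is 0.90 against a chance agreement of 0.54, which gives κ of about 0.78.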

How to read this (the honest caveats)

  • Efficiency first, accuracy as support. The efficiency gap is the consistent signal; the accuracy delta is noisier across samples and is finalized only on the full multi-seed run.
  • Public-benchmark accuracy is a calibration line. Many questions concern facts the model already knows, so a tool can't change the answer. Read the baseline calibration and the efficiency delta, not the absolute accuracy.
  • The workflow set is ours, by design — illustrative, not a neutral benchmark.
  • Self-run reproduction of the v2 methodology. This reproduces the vals.ai Finance-Agent v2 methodology (the current, harder version) — not their hosted leaderboard, so our scores aren't directly comparable to vals.ai's published numbers. The honest comparison is our with-vs-without delta inside one fixed stack.
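To illustrate why the accuracy delta waits for the multi-seed run, the sketch below computes a with-vs-without delta per seed and then its spread. All grades are invented for illustration, and the function names and data layout are assumptions of ours rather than the published method.

```python
from statistics import mean
from typing import Sequence


def accuracy(grades: Sequence[int]) -> float:
    """Share of answers graded correct (1 = correct, 0 = incorrect)."""
    return sum(grades) / len(grades)


def paired_delta(with_arm: Sequence[int], without_arm: Sequence[int]) -> float:
    """Accuracy difference between arms on the same questions in the same order."""
    if len(with_arm) != len(without_arm):
        raise ValueError("both arms must answer the same questions")
    return accuracy(with_arm) - accuracy(without_arm)


# Invented grades: seed -> (with-MCP arm, without-MCP arm), five questions each.
runs = {
    0: ([1, 1, 0, 1, 1], [1, 0, 0, 1, 1]),
    1: ([1, 1, 1, 1, 0], [1, 1, 0, 1, 0]),
    2: ([1, 0, 1, 1, 1], [1, 0, 1, 1, 1]),
}

deltas = [paired_delta(w, wo) for w, wo in runs.values()]
mean_delta = mean(deltas)
spread = max(deltas) - min(deltas)

print(f"per-seed deltas: {[round(d, 2) for d in deltas]}")
print(f"mean delta = {mean_delta:.2f}, range across seeds = {spread:.2f}")
```

With only a handful of questions per seed, a single flipped grade moves the delta by a large step, which is the noise the caveats describe.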

See it for yourself: watch a real agent run · full method & numbers · connect MetricDuck to your agent (free tier, no card).

MetricDuck Research

SEC filing analysis and structured context for AI agents in finance