We test whether an AI agent does financial research better with MetricDuck than without it — same model, same questions, with and without the MCP — and publish the methodology, the credibility checks, and the honest caveats alongside the numbers.
MetricDuck is the data layer beneath whatever agent you run, not another model. The numbers below come from the same agent with the MetricDuck MCP toggled on and off.
⚠ Preview — placeholder data, not real results. Final numbers land with the multi-seed publication run.
On the harder Finance-Agent v2, even frontier models top out near 50%: multi-step financial analysis is genuinely hard. See the vals.ai Finance-Agent benchmark, the independent benchmark we reproduce. Below are results for the same agent, with and without MetricDuck.
Comparable, neutral set. Read the delta and the baseline calibration, not the absolute accuracy (see caveats).
| Arm | Accuracy | Cost / query* | Turns (median) |
|---|---|---|---|
| Without MetricDuck (web + EDGAR) | 54% | $0.55 | 40 |
| With MetricDuck | 72% | $0.26 | 12 |
Accuracy delta (with − without): +0.18, 95% CI [-0.009, +0.352]. The interval spans 0 at this sample size, so the difference is within noise.
MetricDuck's own curated cross-corpus tasks (transcripts, 8-K, IR, FPI) — illustrative of value, not a neutral benchmark.
| Arm | Accuracy | Cost / query* | Turns (median) |
|---|---|---|---|
| Without MetricDuck (web + EDGAR) | 71% | $0.54 | 31 |
| With MetricDuck | 100% | $0.25 | 12 |
Accuracy delta (with − without): +0.29, 95% CI [+0.049, +0.531].
*Cost is list-price-equivalent (token cost at published rates) — a relative efficiency indicator. Numbers are reported per set and never blended.
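For readers who want to sanity-check an accuracy delta and its interval, one common approach for two arms run on identical questions is a paired percentile bootstrap. The sketch below is illustrative only: it is not necessarily the procedure behind the intervals above, and the function name, the resample count, and the sample data are all hypothetical.

```python
import random


def paired_bootstrap_delta_ci(with_correct, without_correct,
                              n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(with) - accuracy(without).

    Both inputs are per-question correctness flags (True/False or 1/0),
    aligned so that index i refers to the same question in both arms.
    """
    if len(with_correct) != len(without_correct) or not with_correct:
        raise ValueError("both arms must cover the same non-empty question set")
    n = len(with_correct)
    # Per-question difference: +1 only 'with' correct, -1 only 'without', 0 tie.
    diffs = [int(w) - int(wo) for w, wo in zip(with_correct, without_correct)]
    point = sum(diffs) / n

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    resampled = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo = resampled[int((alpha / 2) * n_resamples)]
    hi = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lo, hi


if __name__ == "__main__":
    # Hypothetical flags for 10 questions; not real benchmark data.
    with_arm = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
    without_arm = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    print(paired_bootstrap_delta_ci(with_arm, without_arm))
```

Resampling per-question differences, rather than each arm separately, keeps the pairing between arms intact, which is the reason to run both arms on the same questions in the first place.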
Want to see it, not just the numbers? Watch a real agent run through MetricDuck (every figure clicks back to the source filing).
The data layer makes finance agents better — a category now validated by well-funded players. The enterprise options are paid, sales-gated, and built for institutions. MetricDuck is the free, self-serve, US-SEC-focused one you can verify yourself before connecting.
We reproduced the methodology of the vals.ai Finance-Agent v2 benchmark (the harder, multi-step version the current finance-agent discourse runs on) and ran the same agent two ways on identical questions: without MetricDuck (free web + SEC EDGAR tools, the vals-comparable baseline) and with the MetricDuck MCP. The difference between the two arms isolates MetricDuck's contribution.
To rule out a judge grading its own model family favourably, we re-graded with a judge from a different family (Google Gemini) on identical inputs and rubric. Agreement was "almost perfect" on the Landis and Koch scale (Cohen's κ = 0.918), which indicates no meaningful self-preference bias.
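Cohen's κ corrects raw agreement for the agreement two judges would reach by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e the chance agreement implied by each judge's label frequencies. A minimal sketch of that calculation follows; the function name and the example verdicts are hypothetical, not taken from our grading data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both judges must label the same non-empty item set")
    n = len(labels_a)
    # p_o: share of items on which the two judges gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from each judge's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts (1 = correct, 0 = incorrect) from two judges.
primary_judge = [1, 1, 1, 0]
second_judge = [1, 1, 0, 0]
print(cohens_kappa(primary_judge, second_judge))  # 0.5
```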
The answer is extracted from the agent's committed response by rule (not re-interpreted by an LLM), and the judge never sees which arm produced an answer.
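The two safeguards above can be sketched as follows. This is an illustration under stated assumptions, not our harness code: the `FINAL ANSWER:` marker, the record fields (`arm`, `question`, `response`), and both function names are hypothetical.

```python
import random
import re

# Assumed convention: the agent commits its answer on a line that starts
# with "FINAL ANSWER:". The real marker may differ.
FINAL_ANSWER = re.compile(r"^FINAL ANSWER:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def extract_committed_answer(response):
    """Return the last committed answer by rule, or None if none was committed."""
    matches = FINAL_ANSWER.findall(response)
    return matches[-1] if matches else None


def blind_for_judge(records, seed=0):
    """Drop the arm label and shuffle before anything reaches the judge.

    Returns (judge_items, key): judge_items carry no arm information,
    and key maps each item id back to its arm for scoring afterwards.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    judge_items, key = [], {}
    for i, rec in enumerate(shuffled):
        item_id = f"item-{i}"
        judge_items.append({
            "id": item_id,
            "question": rec["question"],
            "answer": extract_committed_answer(rec["response"]),
        })
        key[item_id] = rec["arm"]
    return judge_items, key
```

Keeping the id-to-arm key outside the judge's input means the arm is rejoined only after grading is complete.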
Every run records the harness version, the MCP version, the data snapshot, and the seed, so a result can be re-derived.
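A run record of this kind can be as small as four fields plus a stable fingerprint. The sketch below is a hypothetical shape for such a record, not our actual schema; the class name, field names, and example values are assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """The four values a run needs in order to be re-derived (names assumed)."""
    harness_version: str
    mcp_version: str
    data_snapshot: str
    seed: int

    def fingerprint(self):
        """Stable SHA-256 over the sorted fields, usable as a run identifier."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Placeholder values for illustration only.
manifest = RunManifest(
    harness_version="0.0.0-example",
    mcp_version="0.0.0-example",
    data_snapshot="snapshot-example",
    seed=1,
)
print(manifest.fingerprint())
```

Two runs with the same fingerprint share harness, MCP version, data snapshot, and seed, so a difference between their results points to nondeterminism outside those four inputs.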
The most consistent signal is efficiency — fewer turns, lower cost, fewer dead-ends. The accuracy delta is noisier across samples; the headline accuracy figure is finalized only on the full multi-seed run.
Many public-benchmark questions concern well-known facts the model already knows, so a tool can't change the answer. There, read the baseline calibration and the efficiency delta, not the absolute accuracy.
The realistic-workflow results come from tasks we authored to represent real analyst work. They're illustrative of where MetricDuck helps — not an independent, neutral benchmark, and we label them as such.
This is our own reproduction of the vals.ai Finance-Agent v2 methodology (the current, harder version), not the official vals.ai leaderboard. Our scores aren't directly comparable to their published numbers — the apples-to-apples comparison is our with-MetricDuck arm vs our without arm, inside one fixed stack.