How MetricDuck performs, measured

We test whether an AI agent does financial research better with MetricDuck than without it — same model, same questions, with and without the MCP — and publish the methodology, the credibility checks, and the honest caveats alongside the numbers.

MetricDuck is the data layer beneath whatever agent you run — not another model. The numbers below are the same agent, with the MCP toggled on and off.

Results

Preview — placeholder data, not real results. Final numbers land with the multi-seed publication run.

On the harder Finance-Agent v2, even frontier models top out near 50%: multi-step financial analysis is genuinely hard. See the vals.ai Finance-Agent benchmark, the independent benchmark whose methodology we reproduce. Below is the same agent, with and without MetricDuck.

Public benchmark — vals.ai Finance-Agent v2 (50 questions)

Comparable, neutral set. Read the delta and the baseline calibration, not the absolute accuracy (see caveats).

Arm | Accuracy | Cost / query* | Turns (median)
Without MetricDuck (web + EDGAR) | 54% | $0.55 | 40
With MetricDuck | 72% | $0.26 | 12

Accuracy delta (with − without): +0.18, 95% CI [−0.009, +0.352]. The interval includes 0, so at this sample size the difference is not distinguishable from noise.
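One way to put an interval around an accuracy delta like this is a paired percentile bootstrap over questions. The sketch below is illustrative only: this page does not state how the interval was computed, so the bootstrap, the function name `paired_bootstrap_ci`, and the ten made-up outcomes are all assumptions, not the harness's actual method or data.

```python
import random


def paired_bootstrap_ci(without, with_mcp, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy delta (with - without).

    Both inputs are per-question 0/1 outcomes in the same question order.
    Resampling whole questions keeps the two arms paired.
    """
    if len(without) != len(with_mcp) or not without:
        raise ValueError("both arms must cover the same non-empty question set")
    n = len(without)
    diffs = [w - wo for w, wo in zip(with_mcp, without)]
    point = sum(diffs) / n

    rng = random.Random(seed)  # fixed seed so the interval is repeatable
    resampled = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()

    low = resampled[int((alpha / 2) * n_resamples)]
    high = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return point, low, high


# Hypothetical outcomes for ten questions (1 = correct), not real results.
without = [1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
with_mcp = [1, 1, 1, 0, 1, 1, 1, 0, 1, 0]
delta, low, high = paired_bootstrap_ci(without, with_mcp)
```

If `low` is at or below 0, the delta should be read the same way as the public-benchmark figure above: as within noise for that sample size.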

Realistic equity-research workflows (17 workflows)

MetricDuck's own curated cross-corpus tasks (transcripts, 8-K, IR, FPI) — illustrative of value, not a neutral benchmark.

Arm | Accuracy | Cost / query* | Turns (median)
Without MetricDuck (web + EDGAR) | 71% | $0.54 | 31
With MetricDuck | 100% | $0.25 | 12

Accuracy delta (with − without): +0.29, 95% CI [+0.049, +0.531].

*Cost is list-price-equivalent (token cost at published rates) — a relative efficiency indicator. Numbers are reported per set and never blended.
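As a rough illustration of what "list-price-equivalent" means, the sketch below prices each query's token counts at a fixed rate card and averages over queries. The rates are placeholders rather than any provider's published prices, the names `TokenRates` and `cost_per_query` are hypothetical, and averaging by mean is an assumption about how a per-query figure would be aggregated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRates:
    """USD per million tokens. Placeholder values, not real published prices."""
    input_per_mtok: float
    output_per_mtok: float


def list_price_cost(input_tokens, output_tokens, rates):
    """Token cost of one query at list price, in USD."""
    total = input_tokens * rates.input_per_mtok + output_tokens * rates.output_per_mtok
    return total / 1_000_000


def cost_per_query(runs, rates):
    """Mean list-price cost over (input_tokens, output_tokens) pairs, one per query."""
    costs = [list_price_cost(i, o, rates) for i, o in runs]
    return sum(costs) / len(costs)


# Hypothetical rate card and token counts for two queries.
rates = TokenRates(input_per_mtok=2.0, output_per_mtok=10.0)
runs = [(100_000, 5_000), (50_000, 2_000)]
mean_cost = cost_per_query(runs, rates)
```

Because both arms are priced with the same rate card, the ratio between arms is meaningful even if the absolute dollar figures differ from what a given account is billed.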

Want to see it, not just the numbers? Watch a real agent run through MetricDuck (every figure links back to the source filing).

How to read this vs other data vendors

A good data layer makes finance agents better, and well-funded players now sell into that category. Those enterprise options are paid, sales-gated, and built for institutions. MetricDuck is the free, self-serve, US-SEC-focused option you can verify yourself before connecting.

What we measured

We reproduced the methodology of the vals.ai Finance-Agent v2 benchmark (the harder, multi-step version that current finance-agent discussion centers on) and ran the same agent two ways on identical questions: without MetricDuck (free web + SEC EDGAR tools, the vals-comparable baseline) and with the MetricDuck MCP. Because the model, prompts, and questions are held fixed, the difference between the two arms isolates MetricDuck's contribution.

Agent: Claude Sonnet 4.6 (same model, same prompts, both arms)
Seeds: 3 planned (1 complete); the verdict is the mode across seeds, not a single noisy run
Grading: Binary correct/incorrect on a committed final answer (vals-faithful)
Two sets: A public benchmark set (comparable) + realistic cross-corpus workflows (MetricDuck's value)
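The modal verdict can be sketched as a majority vote over the binary grades a question received across seeds. This is a minimal illustration under assumptions: the page does not say whether the mode is taken per question or how ties are handled, so the per-question vote, the tie rule, and the names `modal_verdict` and `modal_accuracy` are hypothetical.

```python
from collections import Counter


def modal_verdict(verdicts):
    """Most common binary verdict across seeds.

    A tie counts as incorrect (an assumption; with 3 seeds no tie can occur).
    """
    counts = Counter(verdicts)
    return counts[True] > counts[False]


def modal_accuracy(per_question_verdicts):
    """Share of questions whose modal verdict is 'correct'."""
    modal = [modal_verdict(v) for v in per_question_verdicts.values()]
    return sum(modal) / len(modal)


# Hypothetical grades for three questions over three seeds.
grades = {
    "q1": [True, True, False],
    "q2": [False, False, True],
    "q3": [True, True, True],
}
accuracy = modal_accuracy(grades)
```

Voting per question before averaging means one unlucky seed cannot flip a question that the agent answers correctly most of the time.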

How we keep it credible

Independent, cross-family judge (κ = 0.918)

To rule out a model grading its own family favourably, we re-graded with a different-family judge (Google Gemini) on identical inputs and rubric. Agreement between the two judges was "almost perfect" (Cohen's κ = 0.918), which indicates no meaningful self-preference bias.
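For readers who want to check an agreement figure themselves, Cohen's κ compares the observed agreement between two judges with the agreement expected by chance from each judge's label frequencies: κ = (p_o − p_e) / (1 − p_e). On the commonly used Landis and Koch scale, values above 0.80 are described as "almost perfect". The sketch below uses made-up grades, not the judges' actual outputs.

```python
def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both judges must grade the same non-empty item set")
    n = len(judge_a)

    # Observed agreement: share of items where the two judges give the same label.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n

    # Chance agreement: product of each judge's marginal rate, summed over labels.
    labels = set(judge_a) | set(judge_b)
    expected = sum((judge_a.count(l) / n) * (judge_b.count(l) / n) for l in labels)

    if expected == 1.0:
        return 1.0  # both judges used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical binary grades (1 = correct) from two judges on ten answers.
primary = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
second = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(primary, second)
```

In this toy example the judges disagree on one answer out of ten, so raw agreement is 0.9 but κ is lower, because half of that agreement would be expected by chance.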

Mechanical extraction & a setup-blind judge

The answer is extracted from the agent's committed response by rule (not re-interpreted by an LLM), and the judge never sees which arm produced an answer.
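A minimal sketch of those two safeguards, under stated assumptions: the `FINAL ANSWER:` marker, the record fields, and the `JudgeInput` type are hypothetical and only illustrate the idea of rule-based extraction plus hiding the arm from the judge. They are not the harness's actual format.

```python
import re
from dataclasses import dataclass, fields
from typing import Optional

# Hypothetical commitment marker; the real harness may use a different rule.
FINAL_ANSWER_RE = re.compile(r"^FINAL ANSWER:\s*(.+)$", re.MULTILINE)


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'FINAL ANSWER:' line, or None if absent."""
    matches = FINAL_ANSWER_RE.findall(response)
    return matches[-1].strip() if matches else None


@dataclass(frozen=True)
class JudgeInput:
    """Everything the judge sees. There is deliberately no 'arm' field."""
    question: str
    reference: str
    answer: str


def blind(record: dict) -> JudgeInput:
    """Build the judge's input from a run record, dropping the arm label."""
    answer = extract_final_answer(record["response"])
    # A missing commitment is passed as an empty answer (an assumption).
    return JudgeInput(record["question"], record["reference"], answer or "")


# Hypothetical run record.
record = {
    "arm": "with_mcp",
    "question": "What was FY2023 revenue?",
    "reference": "$4.2B",
    "response": "Checked the 10-K.\nFINAL ANSWER: $4.2B\n",
}
judge_input = blind(record)
```

Keeping the arm out of the judge's input type, rather than relying on a prompt instruction to ignore it, makes the blinding a property of the data structure.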

Reproducible

Every run records the harness version, the MCP version, the data snapshot, and the seed, so a result can be re-derived.
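The four recorded fields could be captured in a small manifest like the one below. This is a sketch, not the harness's actual schema: the class name `RunManifest`, the field names, the example values, and the short fingerprint are all assumptions for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """The inputs that must match for a result to be re-derived."""
    harness_version: str
    mcp_version: str
    data_snapshot: str
    seed: int

    def fingerprint(self) -> str:
        """Short stable hash of the manifest, usable as a run identifier."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical values.
run_a = RunManifest("harness-1.0.0", "mcp-0.9.0", "snapshot-2025-01-01", seed=1)
run_b = RunManifest("harness-1.0.0", "mcp-0.9.0", "snapshot-2025-01-01", seed=2)
same_as_a = RunManifest("harness-1.0.0", "mcp-0.9.0", "snapshot-2025-01-01", seed=1)
```

Two runs with the same fingerprint were configured identically, so any difference in their results points to nondeterminism in the model or tools rather than to a changed setup.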

How to read this (the honest caveats)

Read efficiency first; accuracy is supporting evidence

The most consistent signal is efficiency — fewer turns, lower cost, fewer dead-ends. The accuracy delta is noisier across samples; the headline accuracy figure is finalized only on the full multi-seed run.

Public-benchmark accuracy is a calibration line, not a lift claim

Many public-benchmark questions concern well-known facts the model already knows, so a tool can't change the answer. For this set, read the baseline calibration and the efficiency delta, not the absolute accuracy.

The workflow set is ours, by design

The realistic-workflow results come from tasks we authored to represent real analyst work. They're illustrative of where MetricDuck helps — not an independent, neutral benchmark, and we label them as such.

Self-run reproduction of the v2 methodology

This is our own reproduction of the vals.ai Finance-Agent v2 methodology (the current, harder version), not the official vals.ai leaderboard. Our scores aren't directly comparable to their published numbers — the apples-to-apples comparison is our with-MetricDuck arm vs our without arm, inside one fixed stack.