Every drafting system needs a number that tells you whether the prose is grounded in the sources. We had one. It said zero. It was lying — not about the drafts, but about itself.
The metric that looked broken
Our first grounding signal was a citation count: for each source, count it as "cited" when the draft reproduces two or more of its specific numbers. It's cheap, deterministic, and on a results table it works fine — the numbers either reappear or they don't.
On a narrative section it reads ≈ 0 almost by construction. A well-written Introduction or Benefit-Risk discussion paraphrases, synthesizes, and contextualizes. It rarely parrots two literal figures from any single source. So our dashboard showed "citations ≈ 0" on exactly the sections that were, in fact, carefully grounded.
We spent real effort chasing that zero — a staircase of single-variable experiments across retrieval and the drafting contract. The number didn't move off ≈0 in any of them. Worse, one "fix" actively backfired: a strongly-worded "cite everything" contract pushed the model into refusing to write sentences it couldn't pin to a literal number — one section collapsed from ~1,000 words to under 300 and filled with [NOT FOUND] stubs. We reverted it. We were tuning the writing to satisfy a broken gauge.
The lesson arrived late and clearly: the metric was a vanity proxy. It measured numeric overlap, not grounding. We were optimizing the thermometer.
What grounding actually means
Grounding is an entailment question: is each factual claim in the draft supported by the retrieved sources? That's not something you count — it's something you judge, one claim at a time, against the evidence the agent actually saw.
So we built that. For each drafted section we re-fetch the exact sources the agent read, extract every distinct factual claim, and have a rubric-driven judge rate each one supported / partial / unsupported. The score is the supported fraction, and the unsupported claims become a concrete list of things to look at — candidate hallucinations, not a mystery number.
The first time we ran it, the "ungrounded" narrative modules scored 0.83 to 0.97 — a Discussion section at 0.97, a Synopsis at 0.83. They had been grounded all along; the proxy simply couldn't see it. (The honest number, document-wide, was around 0.69 on that early build — a deliberately conservative floor, since the judge scored a claim "partial" the moment any sub-fact wasn't fully supported — and it climbed to 0.81 once we also let the judge see source tables, a separate story. The point of the number is the direction it moves under scrutiny, not the decimal.)
Why this matters beyond one metric
Two things generalized from this.
First: pick the metric that matches the artifact. A numeric-overlap check is a fine secondary signal for results tables, where the deliverable really is the numbers. It is the wrong primary KPI for narrative, where the deliverable is faithful synthesis. We now treat the numeric count as a quantitative flag, not the grounding measure.
Second: a bad metric is worse than no metric. A missing number invites investigation. A confident, wrong number invites action — and we took it, twice, before realizing the alarm itself was miscalibrated. In a regulated-writing system the cost of optimizing a vanity metric isn't just wasted effort; it's the risk of "improving" the product in the wrong direction with full conviction.
The discipline we kept
The fix wasn't a cleverer count. It was changing the question from "how many numbers reappear?" to "is this claim entailed by a source?" — and accepting that the honest version of grounding requires a judge, with all the care that implies (stable prompts, repeat-checks, and eventually multiple judges).
When a quality number surprises you, the first hypothesis should be that the number is wrong about itself. Confirm the metric measures what its name claims before you trust it enough to steer by it. Ours didn't, and the most useful thing we built that week wasn't a better draft — it was a grounding measure we could actually believe.