ANSWER HUB

RunLedger regression gates

Baselines capture known-good behavior so regressions become hard CI failures.

Tags: baselines, regressions, ci. Updated 2026-01-26.

Direct Answer

RunLedger compares a replay run against a baseline summary and fails CI when success rate, cost, or latency regresses beyond the configured thresholds.

Quick Decision

Use RunLedger when                    | Consider alternatives when
You want automated regression gates.  | You only need manual inspection.
You can maintain baselines.           | You cannot define stable expectations.
You need PR-blocking failures.        | You only want soft metrics.

Diff command

```bash
runledger diff --baseline baselines/<suite>.json --run runledger_out/<suite>/<run_id>
```
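
In a CI job, the diff invocation is typically assembled from the suite name and the run id. The following is a minimal Python sketch of such a wrapper, not part of RunLedger itself. It assumes the command exits with a non-zero code when a regression is found (the page says CI fails but does not state the exit code), and `build_diff_command` and `run_gate` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_diff_command(suite: str, run_id: str) -> list[str]:
    """Assemble the diff invocation using the path layout shown above."""
    baseline = Path("baselines") / f"{suite}.json"
    run_dir = Path("runledger_out") / suite / run_id
    return [
        "runledger", "diff",
        "--baseline", baseline.as_posix(),
        "--run", run_dir.as_posix(),
    ]


def run_gate(suite: str, run_id: str) -> int:
    """Run the diff and return its exit code so the CI step can fail on it.

    Assumption: a non-zero exit code signals a regression.
    """
    completed = subprocess.run(build_diff_command(suite, run_id), check=False)
    return completed.returncode
```

A CI step could then call `sys.exit(run_gate(suite, run_id))` so that the job status mirrors the result of the diff.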

Typical regression signals

  • Success rate drops below threshold.
  • Latency p95 exceeds allowed delta.
  • Cost or token usage spikes.
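
To make these signals concrete, here is a hedged Python sketch of how a gate might evaluate them. It illustrates the idea only and is not RunLedger's implementation: the `Summary` and `Thresholds` field names, the default values, and the `find_regressions` function are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    success_rate: float    # fraction of cases that passed, 0.0 to 1.0
    latency_p95_ms: float  # 95th percentile latency in milliseconds
    cost_usd: float        # total cost of the run


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical defaults; real values need tuning per suite.
    min_success_rate: float = 0.95       # absolute floor for success rate
    max_latency_delta_ms: float = 200.0  # allowed p95 increase over baseline
    max_cost_ratio: float = 1.20         # allowed cost relative to baseline


def find_regressions(baseline: Summary, run: Summary, t: Thresholds) -> list[str]:
    """Return the names of the signals that regressed (empty list means pass)."""
    failures = []
    if run.success_rate < t.min_success_rate:
        failures.append("success_rate")
    if run.latency_p95_ms - baseline.latency_p95_ms > t.max_latency_delta_ms:
        failures.append("latency_p95")
    # Multiplying avoids a division by zero when the baseline cost is 0.
    if run.cost_usd > baseline.cost_usd * t.max_cost_ratio:
        failures.append("cost")
    return failures


baseline = Summary(success_rate=0.98, latency_p95_ms=1200.0, cost_usd=4.00)
run = Summary(success_rate=0.93, latency_p95_ms=1500.0, cost_usd=4.10)
print(find_regressions(baseline, run, Thresholds()))
# prints ['success_rate', 'latency_p95']
```

In this example the success rate falls below the floor and p95 latency grows by 300 ms against an allowed 200 ms, while cost stays within the allowed ratio, so two of the three signals fail.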

Tradeoffs

  • Baselines require intentional promotion.
  • Thresholds need tuning to avoid noise.
  • Large intentional changes can trigger expected failures until a new baseline is promoted.
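
One way to approach threshold tuning is to measure run-to-run noise on unchanged code and set the allowed delta above it. The sketch below is an assumption about how a team might do this, not a RunLedger feature. `suggest_latency_delta` is a hypothetical helper, and the multiplier `k` is a judgment call.

```python
import statistics


def suggest_latency_delta(p95_samples_ms: list[float], k: float = 3.0) -> float:
    """Suggest an allowed p95 delta as k standard deviations of observed noise.

    p95_samples_ms holds p95 latencies from repeated runs of the same,
    unchanged code, so their spread reflects noise rather than regressions.
    """
    if len(p95_samples_ms) < 2:
        raise ValueError("need at least two samples to estimate noise")
    return k * statistics.stdev(p95_samples_ms)


# Five repeated runs of an unchanged suite (illustrative numbers).
samples = [1000.0, 1010.0, 990.0, 1005.0, 995.0]
print(round(suggest_latency_delta(samples), 1))
# prints 23.7
```

A delta set this way tolerates ordinary jitter while still flagging larger shifts. A smaller `k` gives a stricter gate at the cost of more false alarms.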

When NOT to use RunLedger

Skip baseline gates if outputs are too exploratory or unstable to baseline.

Next steps