RunLedger baseline promote
Baselines are summary JSON files from a known-good run that CI compares against to catch regressions.
Direct Answer
A RunLedger baseline is a summary JSON from a known-good run. CI compares each replay run to the baseline and fails when success rate, cost, or latency regresses.
Quick Decision
| Use RunLedger when | Consider alternatives when |
|---|---|
| You want merge gates for regressions. | You only need manual review of results. |
| You can promote a known-good run as a reference. | Outputs are too volatile to baseline. |
| You want automated diffs and threshold checks. | You only want to log metrics. |
Create and use a baseline
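```bash
# Promote a known-good run's summary to a baseline file.
runledger baseline promote --from runledger_out/<suite>/<run_id> --to baselines/<suite>.json
# Replay the suite and compare the result against that baseline.
runledger run ./evals/<suite> --mode replay --baseline baselines/<suite>.json
```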
What baselines gate
- Success rate drops below the configured threshold.
- Costs spike beyond the allowed delta.
- Latency p95 increases beyond the allowed delta.
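The three checks above can be sketched as a comparison between two summary dictionaries. This is only an illustration of the shape of the logic, not RunLedger's implementation: the field names (`success_rate`, `cost_usd`, `latency_p95_ms`), the `Thresholds` type, its default values, and the choice of relative deltas are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical knobs; RunLedger's real configuration keys may differ.
    min_success_rate: float = 0.95   # absolute floor for success rate
    max_cost_increase: float = 0.10  # allowed relative cost increase (10%)
    max_p95_increase: float = 0.20   # allowed relative p95 latency increase (20%)


def gate(baseline: dict, run: dict, t: Thresholds = Thresholds()) -> list[str]:
    """Return regression messages; an empty list means the gate passes."""
    failures = []

    # Success rate is checked against an absolute floor.
    if run["success_rate"] < t.min_success_rate:
        failures.append(
            f"success rate {run['success_rate']:.2f} is below {t.min_success_rate:.2f}"
        )

    # Cost and p95 latency are checked relative to the baseline (assumed here).
    cost_limit = baseline["cost_usd"] * (1 + t.max_cost_increase)
    if run["cost_usd"] > cost_limit:
        failures.append(f"cost {run['cost_usd']:.2f} exceeds limit {cost_limit:.2f}")

    p95_limit = baseline["latency_p95_ms"] * (1 + t.max_p95_increase)
    if run["latency_p95_ms"] > p95_limit:
        failures.append(
            f"latency p95 {run['latency_p95_ms']:.0f} ms exceeds limit {p95_limit:.0f} ms"
        )

    return failures


# Example with made-up numbers: only the cost check fails.
baseline = {"success_rate": 0.98, "cost_usd": 1.00, "latency_p95_ms": 800}
run = {"success_rate": 0.97, "cost_usd": 1.25, "latency_p95_ms": 850}
print(gate(baseline, run))
```

In a CI job, a non-empty list would typically translate to a non-zero exit code so the merge is blocked.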
Tradeoffs
- Baselines require periodic promotion as behavior changes.
- Thresholds need tuning to avoid noisy failures.
- Large changes can require deliberate baseline updates.
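One way to approach the threshold-tuning tradeoff is to measure how much a metric varies across repeated runs of unchanged code, then set the allowed delta above that noise. The sketch below assumes you have collected p95 latencies from several known-good runs. The `suggest_delta` helper and its `margin` multiplier are hypothetical and not part of RunLedger.

```python
from statistics import mean


def suggest_delta(samples: list[float], margin: float = 1.5) -> float:
    """Suggest a relative allowed delta from repeated known-good measurements.

    Takes the largest observed deviation above the mean, expresses it as a
    fraction of the mean, and scales it by a safety margin so that ordinary
    run-to-run noise stays under the threshold.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate noise")
    center = mean(samples)
    if center <= 0:
        raise ValueError("samples must have a positive mean")
    return margin * (max(samples) - center) / center


# Made-up p95 latencies (ms) from five runs of the same known-good commit.
p95_ms = [800.0, 820.0, 790.0, 830.0, 810.0]
print(f"suggested allowed p95 delta: {suggest_delta(p95_ms):.1%}")
```

A delta derived this way is only a starting point. If the gate still fails on unchanged code, the margin is too tight, and if real regressions slip through, it is too loose.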
When NOT to use RunLedger
Avoid baseline gating when outputs are exploratory or when you cannot define stable success criteria.