What is RunLedger?
RunLedger is a deterministic CI harness for tool-using agents. It records tool calls once, replays them in CI, and blocks regressions with hard contracts and budgets.
Direct Answer
Use RunLedger when you need deterministic, merge-gated CI for agents that call tools or external APIs.
Quick Decision
| Use RunLedger when | Consider alternatives when |
|---|---|
| Tool calls and external APIs make your CI flaky or slow. | You only need offline unit tests with no external tools. |
| You want hard pass/fail contracts and regression gates. | You want only quality scoring without deterministic replay. |
Why it exists
Agent behavior changes as prompts, tools, and models evolve. Live tool calls make CI flaky, expensive, and slow. RunLedger isolates those dependencies by recording tool results once and replaying them deterministically.
Key idea
Treat external tools as cassettes so CI is stable, fast, and reproducible.
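To make the cassette idea concrete, here is a minimal Python sketch of record and replay. It is an illustration only, not RunLedger's implementation or file format: the `Cassette` class, its `call` and `save` methods, and the JSON layout are all hypothetical names chosen for this example.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Callable


class Cassette:
    """Hypothetical record/replay store keyed by tool name and arguments."""

    def __init__(self, path: Path, mode: str) -> None:
        if mode not in ("record", "replay"):
            raise ValueError(f"unknown mode: {mode}")
        self.path = path
        self.mode = mode
        self.entries: dict[str, Any] = {}
        if mode == "replay":
            # Replay never touches the live tool; it only reads the file.
            self.entries = json.loads(path.read_text())

    @staticmethod
    def _key(tool: str, args: dict[str, Any]) -> str:
        # Sort keys so the same arguments always map to the same entry.
        return json.dumps({"tool": tool, "args": args}, sort_keys=True)

    def call(self, tool: str, args: dict[str, Any], live: Callable[..., Any]) -> Any:
        key = self._key(tool, args)
        if self.mode == "replay":
            if key not in self.entries:
                # An unrecorded call fails loudly instead of hitting the network.
                raise KeyError(f"no recorded result for {key}")
            return self.entries[key]
        # Record mode: call the live tool once and remember its result.
        result = live(**args)
        self.entries[key] = result
        return result

    def save(self) -> None:
        # Write the recorded results so a later replay run can load them.
        self.path.write_text(json.dumps(self.entries, indent=2, sort_keys=True))
```

In this sketch, a recording run would wrap each tool call in `cassette.call(...)` and finish with `cassette.save()`; a CI run would open the same file in `"replay"` mode, so results come from disk and any call that was never recorded raises an error rather than silently going live.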
How it works
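```bash
# One-time setup
runledger init
# Record tool calls once
runledger run ./evals/demo --mode record
# Promote the recorded run to a baseline (RUN_DIR is a placeholder for the run's output directory)
runledger baseline promote --from RUN_DIR --to baselines/demo.json
# Replay in CI against the baseline
runledger run ./evals/demo --mode replay --baseline baselines/demo.json
```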
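The replay step compares a run against a promoted baseline. As a rough sketch of what a budget-style regression gate could look like, the Python below checks a run's metrics against baseline values with a per-metric allowance. Everything here is an assumption made for illustration: the `Budget` type, the `check_regressions` function, the metric names, and the dictionary shapes do not come from RunLedger or its baseline file format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Hypothetical limit: how far a metric may rise above its baseline."""

    metric: str
    max_increase_pct: float


def check_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    budgets: list[Budget],
) -> list[str]:
    """Return one message per metric that exceeds its budget."""
    failures: list[str] = []
    for budget in budgets:
        base = baseline[budget.metric]
        now = current[budget.metric]
        # The limit is the baseline value plus the allowed percentage.
        limit = base * (1 + budget.max_increase_pct / 100)
        if now > limit:
            failures.append(
                f"{budget.metric}: {now} exceeds limit {limit:.2f} (baseline {base})"
            )
    return failures
```

A CI job built on this sketch would exit non-zero whenever the returned list is non-empty, which is what turns a budget into a hard merge gate rather than a score to eyeball.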
Tradeoffs
- Requires initial setup for suites, cases, and cassettes.
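- CI replays recorded responses instead of calling live tools, so it will not catch changes in a live tool's behavior until the cassette is re-recorded.
- RunLedger does not score output quality; that is left to separate eval tools.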
When NOT to use RunLedger
If your agent does not call tools or if you only need statistical scoring, a simpler test harness may be enough.