What is RunLedger?

RunLedger is a deterministic CI harness for tool-using agents. It records tool calls once, replays them in CI, and blocks regressions with hard contracts and budgets.

Tags: overview, ci, record-replay. Updated 2026-01-23

Direct Answer

Use RunLedger when you need deterministic, merge-gated CI for agents that call tools or external APIs.

Quick Decision

Use RunLedger when	Consider alternatives when
Tool calls and external APIs make your CI flaky or slow.	You only need offline unit tests with no external tools.
You want hard pass/fail contracts and regression gates.	You want only quality scoring without deterministic replay.

Why it exists

Agent behavior changes as prompts, tools, and models evolve. Live tool calls make CI flaky, expensive, and slow. RunLedger isolates those dependencies by recording tool results once and replaying them deterministically.

Key idea: treat external tools as cassettes so that CI is stable, fast, and reproducible.
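
The cassette idea can be illustrated with a minimal Python sketch. It is not RunLedger's implementation or file format: the `Cassette` class, its `call` and `save` methods, and the JSON layout are hypothetical names chosen for illustration. It assumes tool arguments and results are JSON-serializable.

```python
import json
from pathlib import Path
from typing import Any, Callable


class Cassette:
    """Hypothetical record/replay store keyed by tool name and arguments."""

    def __init__(self, path: Path, mode: str) -> None:
        if mode not in ("record", "replay"):
            raise ValueError(f"unknown mode: {mode}")
        self.path = path
        self.mode = mode
        # In replay mode the cassette file must already exist on disk.
        self.entries: dict[str, Any] = (
            json.loads(path.read_text()) if mode == "replay" else {}
        )

    @staticmethod
    def _key(tool: str, args: dict[str, Any]) -> str:
        # Sort keys so the same arguments always map to the same entry.
        return f"{tool}:{json.dumps(args, sort_keys=True)}"

    def call(self, tool: str, args: dict[str, Any], live: Callable[..., Any]) -> Any:
        key = self._key(tool, args)
        if self.mode == "record":
            # Call the real tool and keep its result for later replays.
            self.entries[key] = live(**args)
            return self.entries[key]
        if key not in self.entries:
            # An unrecorded call in replay mode fails; it never goes live.
            raise KeyError(f"no recorded result for {key}")
        return self.entries[key]

    def save(self) -> None:
        # Stable key order keeps the cassette diff-friendly in version control.
        self.path.write_text(json.dumps(self.entries, indent=2, sort_keys=True))
```

In this sketch, a recording session wraps each tool call in `cassette.call(...)` and ends with `cassette.save()`; a CI session opens the same file in `"replay"` mode, so no network access is needed and an unexpected tool call surfaces as an error rather than a silent live request.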

How it works

bash

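# Initialize RunLedger in the project
runledger init
# Record tool calls for the demo suite once
runledger run ./evals/demo --mode record
# Promote that run to a baseline (RUN_DIR is a placeholder for the run's output directory)
runledger baseline promote --from RUN_DIR --to baselines/demo.json
# In CI, replay the recorded calls and check the run against the baseline
runledger run ./evals/demo --mode replay --baseline baselines/demo.json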
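
To make "hard contracts" and "budgets" concrete, here is a minimal Python sketch of the kind of check a replay run might apply against a baseline. It is an illustration under stated assumptions, not RunLedger's schema or API: `RunSummary`, `gate`, `required_tools`, and `max_slowdown` are hypothetical names, and the actual checks and baseline format are defined by RunLedger itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical per-case summary; not RunLedger's actual schema."""

    tool_calls: tuple[str, ...]  # ordered names of the tools the agent called
    wall_ms: int                 # total wall-clock time in milliseconds


def gate(
    run: RunSummary,
    baseline: RunSummary,
    required_tools: frozenset[str],
    max_slowdown: float = 1.2,
) -> list[str]:
    """Return failure messages; an empty list means the gate passes."""
    failures: list[str] = []

    # Hard contract: every required tool must appear in the run.
    missing = required_tools - set(run.tool_calls)
    if missing:
        failures.append(f"contract: missing tool calls {sorted(missing)}")

    # Budget: the run may not make more tool calls than the baseline did.
    if len(run.tool_calls) > len(baseline.tool_calls):
        failures.append(
            f"budget: {len(run.tool_calls)} tool calls, "
            f"baseline allows {len(baseline.tool_calls)}"
        )

    # Budget: wall time may exceed the baseline by at most max_slowdown.
    if run.wall_ms > baseline.wall_ms * max_slowdown:
        failures.append(
            f"budget: {run.wall_ms} ms exceeds "
            f"{max_slowdown:.1f}x baseline ({baseline.wall_ms} ms)"
        )

    return failures
```

A CI step built on this sketch would exit with a non-zero status whenever `gate(...)` returns a non-empty list, which is what turns a regression into a blocked merge. Returning every failure rather than stopping at the first gives the reviewer the whole picture in one run.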

Tradeoffs

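You need to set up suites, cases, and cassettes before the first run.
CI runs against recorded responses instead of live tools, so drift in a real tool's behavior is not caught until the cassettes are re-recorded.
RunLedger does not score output quality; that is left to separate eval tools.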

When NOT to use RunLedger

If your agent does not call tools or if you only need statistical scoring, a simpler test harness may be enough.

Last updated: 2026-01-23