# Record -> Replay -> Gate in CI
The canonical RunLedger workflow: record tool calls once, replay them in CI, and block regressions with hard contracts and budgets.
## Direct Answer
Record tool calls once in development, replay them deterministically in CI, and gate merges when outputs, contracts, or budgets regress.
## Quick Decision
| Use this Golden Path when | Consider alternatives when |
|---|---|
| Your agent calls tools and you need deterministic CI gates. | You only need qualitative scoring without merge gates. |
| You want stable, fast CI that never hits external APIs. | You need live tool calls in CI every run. |
## The problem this solves
Tool-using agents regress silently. Live tool calls make CI slow and flaky. RunLedger replaces live calls with recorded cassettes so CI is deterministic and fast.
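Conceptually, a cassette maps each tool call (name plus arguments) to its recorded response, and replay answers from that map instead of the live tool. RunLedger's actual on-disk format is not shown in this guide, so the structure below is a hypothetical Python sketch of the idea, not the real schema:

```python
import hashlib
import json

def call_key(tool_name, args):
    """Derive a stable lookup key from the tool name and its arguments."""
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Hypothetical cassette: recorded once during a live development run.
cassette = {
    call_key("get_weather", {"city": "Oslo"}): {"temp_c": 12, "sky": "overcast"},
}

def replay(tool_name, args):
    """Replay mode: answer from the cassette, or fail loudly on a mismatch."""
    key = call_key(tool_name, args)
    if key not in cassette:
        raise KeyError(f"Cassette mismatch: no recording for {tool_name}({args})")
    return cassette[key]

print(replay("get_weather", {"city": "Oslo"}))
```

Keying on the exact arguments is what makes replay deterministic: any change to how the agent calls a tool surfaces as a hard mismatch rather than a silently different answer.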
## Step 1: Record
Record a live run locally to capture tool calls and responses into a cassette.
```bash
pipx install runledger
runledger init
runledger run ./evals/demo --mode record
```
## Step 2: Replay
Promote a baseline from a known-good run, then replay deterministically in CI.
```bash
runledger baseline promote --from runledger_out/demo/RUN_ID --to baselines/demo.json
runledger run ./evals/demo --mode replay --baseline baselines/demo.json
```
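A baseline gate boils down to comparing the current replay run's metrics against the promoted baseline and failing on any regression. The sketch below illustrates that comparison with assumed field names (`success_rate`, `p95_latency_ms`); it is not RunLedger's actual baseline schema:

```python
# Hypothetical baseline comparison: flag any metric that moved in the
# wrong direction relative to the promoted baseline.
baseline = {"success_rate": 0.95, "p95_latency_ms": 800}
current = {"success_rate": 0.90, "p95_latency_ms": 950}

def regressions(baseline, current):
    """Return a list of human-readable regressions, empty if the run is clean."""
    problems = []
    if current["success_rate"] < baseline["success_rate"]:
        problems.append("success rate dropped")
    if current["p95_latency_ms"] > baseline["p95_latency_ms"]:
        problems.append("p95 latency increased")
    return problems

print(regressions(baseline, current))
```

An empty list means the merge can proceed; anything else becomes a non-zero exit code in CI.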
## Step 3: Gate in CI
Add a CI job that replays cassettes and fails when outputs or budgets regress. Because replay never hits external APIs, the job typically completes in a few minutes.
```yaml
name: agent-evals
on:
  pull_request:
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run deterministic evals (replay)
        run: runledger run ./evals/demo --mode replay --baseline baselines/demo.json
```
## Expected outputs
- `runledger_out/…/report.html` for a shareable summary.
- `summary.json`, `junit.xml`, and `run.jsonl` for CI artifacts.
- Non-zero exit codes when assertions or budgets fail.
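In CI, the non-zero exit code is what actually blocks the merge; a wrapper script can also read the machine-readable summary to decide. The helper below is a hypothetical sketch: the `passed` key is an assumed field name, not necessarily RunLedger's actual `summary.json` schema:

```python
import json

# Hypothetical CI gate helper: read the run summary and translate a failed
# run into a non-zero exit code. The "passed" key is an assumed field name.
def gate(summary_path):
    with open(summary_path) as f:
        summary = json.load(f)
    return 0 if summary.get("passed", False) else 1

# Example: a failing summary should produce exit code 1.
with open("summary.json", "w") as f:
    json.dump({"passed": False, "failures": ["budget: latency"]}, f)

print(gate("summary.json"))  # 1 -> CI would block the merge
```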
## Failure modes to watch
- Cassette mismatch: tool call args changed or new calls added.
- Schema assertion failures: missing required fields.
- Budget regressions: latency, tool errors, or tool calls exceed caps.
- Baseline regressions: success rate drops or p95 latency increases.
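A budget regression is the simplest of these to reason about: each metric has a hard cap, and any breach fails the run. The sketch below uses illustrative cap names and values, not RunLedger's actual configuration:

```python
# Hypothetical budget gate: each metric has a hard cap, and any breach
# fails the run. Cap names and values are illustrative only.
budgets = {"latency_ms": 2000, "tool_errors": 0, "tool_calls": 10}
observed = {"latency_ms": 2400, "tool_errors": 0, "tool_calls": 7}

breaches = [name for name, cap in budgets.items() if observed[name] > cap]
print(breaches)  # a non-empty list means the gate fails
```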
## Demo: Local run
```bash
pipx install runledger
runledger init
runledger run ./evals/demo --mode record
runledger baseline promote --from runledger_out/demo/RUN_ID --to baselines/demo.json
runledger run ./evals/demo --mode replay --baseline baselines/demo.json
```
To force a failure, change the agent output or modify a cassette entry and re-run replay.
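Why does editing a cassette entry fail the replay? One common mechanism is content digests: replay compares a digest of the entry it reads against the digest captured at record time. The entry layout and digest scheme below are a hypothetical illustration, not RunLedger's internals:

```python
import hashlib
import json

# Hypothetical tamper demo: a hand-edited cassette entry no longer matches
# the digest captured at record time. Entry layout is illustrative only.
entry = {"tool": "get_weather", "args": {"city": "Oslo"}, "response": {"temp_c": 12}}

def digest(entry):
    """Stable content digest over a canonical JSON encoding of the entry."""
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

recorded = digest(entry)
entry["response"]["temp_c"] = 30  # simulate hand-editing the cassette
print(digest(entry) != recorded)  # True -> the mismatch fails the replay
```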
## Demo: GitHub Actions run
```yaml
name: agent-evals
on:
  pull_request:
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e ".[dev]"
      - run: runledger run ./evals/demo --mode replay --baseline baselines/demo.json
```
## Tradeoffs
- Requires maintaining cassettes and baselines alongside code.
- Replayed outputs can drift from live systems if not refreshed.
- Pair replay with periodic live checks to maintain production parity.
## When NOT to use RunLedger
If your agent does not call external tools or you only need qualitative scoring, simpler unit tests or eval frameworks may be sufficient.