GOLDEN PATH

Record -> Replay -> Gate in CI

The canonical RunLedger workflow: record tool calls once, replay them in CI, and block regressions with hard contracts and budgets.

Tags: golden-path, record-replay, ci · Updated 2026-01-23

Direct Answer

Record tool calls once in development, replay them deterministically in CI, and gate merges when outputs, contracts, or budgets regress.

Quick Decision

Use this Golden Path when:

  • Your agent calls tools and you need deterministic CI gates.
  • You want stable, fast CI that never hits external APIs.

Consider alternatives when:

  • You only need qualitative scoring without merge gates.
  • You need live tool calls in CI every run.

The problem this solves

Tool-using agents regress silently. Live tool calls make CI slow and flaky. RunLedger replaces live calls with recorded cassettes so CI is deterministic and fast.

Core idea: Treat tool outputs as deterministic fixtures, then gate PRs on contract and budget regressions.
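The record/replay idea can be sketched in a few lines of Python. This is a toy stand-in, not RunLedger's actual implementation: tool calls are keyed by name plus canonicalized arguments, and a replayed run must hit the cassette exactly or fail deterministically.

```python
import hashlib
import json

def call_key(tool_name, args):
    """Key a tool call by its name and canonicalized (sorted-key) arguments."""
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

class Cassette:
    """Toy record/replay store: record once, then serve fixtures deterministically."""

    def __init__(self):
        self.entries = {}

    def record(self, tool_name, args, response):
        self.entries[call_key(tool_name, args)] = response

    def replay(self, tool_name, args):
        key = call_key(tool_name, args)
        if key not in self.entries:
            # Changed args or a brand-new call -> deterministic failure in CI
            raise KeyError(f"cassette mismatch for {tool_name}({args})")
        return self.entries[key]

cassette = Cassette()
cassette.record("search", {"q": "weather"}, {"hits": 3})
print(cassette.replay("search", {"q": "weather"}))  # served from the fixture, no live call
```

Note how a replay miss raises instead of falling back to a live call; that is what makes the CI gate strict rather than flaky.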

Step 1: Record

Record a live run locally to capture tool calls and responses into a cassette.

```bash
pipx install runledger
runledger init
runledger run ./evals/demo --mode record
```

Tip: Record in a stable environment and redact secrets before committing cassettes.
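Redaction can be automated before cassettes are committed. The sketch below assumes a hypothetical JSON cassette entry (RunLedger's real on-disk format may differ) and masks values whose key names look secret-bearing:

```python
import json

SENSITIVE = {"authorization", "api_key", "token", "secret", "password"}

def redact(value):
    """Recursively mask values whose key names look like secrets."""
    if isinstance(value, dict):
        return {k: "***REDACTED***" if k.lower() in SENSITIVE else redact(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value

# Hypothetical cassette entry -- field names are assumptions for illustration.
entry = {
    "tool": "http_get",
    "args": {"url": "https://api.example.com",
             "headers": {"Authorization": "Bearer abc123"}},
    "response": {"status": 200},
}
print(json.dumps(redact(entry), indent=2))
```

Running a pass like this in a pre-commit hook keeps bearer tokens out of version control.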

Step 2: Replay

Promote a baseline from a known-good run, then replay deterministically in CI.

```bash
# RUN_ID is a placeholder: substitute the run directory created by the record step
runledger baseline promote --from runledger_out/demo/RUN_ID --to baselines/demo.json
runledger run ./evals/demo --mode replay --baseline baselines/demo.json
```
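Conceptually, a promoted baseline is a snapshot of known-good metrics that later replays are compared against. A hedged sketch of building one from per-call run records (the field names `ok` and `latency_ms`, and the output shape, are assumptions, not RunLedger's actual `baselines/demo.json` schema):

```python
import json
import math

def build_baseline(records):
    """Summarize a known-good run into metrics later replays are compared against.
    Record fields ("ok", "latency_ms") are assumed, not RunLedger's real schema."""
    latencies = sorted(r["latency_ms"] for r in records)
    idx = max(0, math.ceil(0.95 * len(latencies)) - 1)  # nearest-rank p95
    return {
        "success_rate": sum(r["ok"] for r in records) / len(records),
        "p95_latency_ms": latencies[idx],
    }

records = [{"ok": True, "latency_ms": ms} for ms in (120, 140, 150, 900)]
print(json.dumps(build_baseline(records)))
```

Promoting only from a known-good run matters: a baseline built on a flaky run locks the flakiness in as the standard.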

Step 3: Gate in CI

Add a CI job that replays cassettes and fails when outputs or budgets regress. Because replay never touches external APIs, the job typically finishes in a few minutes.

```yaml
name: agent-evals
on:
  pull_request:

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install RunLedger
        run: pipx install runledger
      - name: Run deterministic evals (replay)
        run: runledger run ./evals/demo --mode replay --baseline baselines/demo.json
```

Expected outputs

  • `runledger_out/<eval>/<RUN_ID>/report.html` for a shareable summary.
  • `summary.json`, `junit.xml`, and `run.jsonl` for CI artifacts.
  • Non-zero exit codes when assertions or budgets fail.
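A wrapper script can turn those artifacts into a hard gate. The sketch below assumes a hypothetical `summary.json` shape (check your actual RunLedger output before relying on these field names):

```python
# Hypothetical summary.json shape -- field names are assumptions for illustration.
summary = {
    "passed": 11,
    "failed": 1,
    "budgets": {"p95_latency_ms": {"limit": 800, "actual": 950}},
}

def gate(summary):
    """Return a non-zero exit code when assertions or budgets fail."""
    failures = summary.get("failed", 0)
    over_budget = [name for name, b in summary.get("budgets", {}).items()
                   if b["actual"] > b["limit"]]
    if failures or over_budget:
        print(f"FAIL: {failures} assertion failure(s), over budget: {over_budget}")
        return 1
    return 0

exit_code = gate(summary)  # pass this to sys.exit() in a real script
print("exit code:", exit_code)
```

CI treats any non-zero exit code as a blocked merge, so no extra plumbing is needed beyond running the script as a step.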

Failure modes to watch

  • Cassette mismatch: tool call args changed or new calls added.
  • Schema assertion failures: missing required fields.
  • Budget regressions: latency, tool errors, or tool calls exceed caps.
  • Baseline regressions: success rate drops or p95 latency increases.
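The baseline-regression check in the last bullet amounts to two comparisons. A minimal sketch, with an assumed 10% latency slack (the threshold is illustrative, not a RunLedger default):

```python
def regressed(baseline, current, latency_slack=1.10):
    """Flag a regression when success rate drops or p95 latency grows >10%."""
    if current["success_rate"] < baseline["success_rate"]:
        return True
    if current["p95_ms"] > baseline["p95_ms"] * latency_slack:
        return True
    return False

baseline = {"success_rate": 0.95, "p95_ms": 700}
print(regressed(baseline, {"success_rate": 0.95, "p95_ms": 900}))  # True: p95 up ~28%
print(regressed(baseline, {"success_rate": 0.96, "p95_ms": 720}))  # False: within slack
```

A small slack on latency avoids failing PRs on runner noise while still catching real p95 drift.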

Demo: Local run

```bash
pipx install runledger
runledger init
runledger run ./evals/demo --mode record
# RUN_ID is a placeholder: substitute the run directory created by the record step
runledger baseline promote --from runledger_out/demo/RUN_ID --to baselines/demo.json
runledger run ./evals/demo --mode replay --baseline baselines/demo.json
```

To force a failure, change the agent output or modify a cassette entry and re-run replay.

Demo: GitHub Actions run

```yaml
name: agent-evals
on:
  pull_request:

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e ".[dev]"
      - run: runledger run ./evals/demo --mode replay --baseline baselines/demo.json
```

Tradeoffs

  • Requires maintaining cassettes and baselines alongside code.
  • Replayed outputs can drift from live systems if not refreshed.
  • Replay alone cannot confirm production parity: pair it with periodic live checks.

When NOT to use RunLedger

If your agent does not call external tools or you only need qualitative scoring, simpler unit tests or eval frameworks may be sufficient.