Ship deterministic agent CI in hours, not weeks.
This guide covers the CI harness workflow: define suites, record tool calls, promote baselines, replay in CI, and gate merges with hard contracts.
RunLedger is a CI harness, not an eval metrics framework. Keep DeepEval for scoring and use RunLedger for deterministic merge gates.
pipx install runledger
runledger init
runledger run ./evals --mode record
runledger baseline promote --from RUN_DIR --to baselines/suite.json
runledger run ./evals --mode replay
open runledger_out/**/report.html
Quickstart
Install
Use pipx for a clean global install and isolated environments.
pipx install runledger
Record
Scaffold a demo suite with runledger init, then record a live run.
runledger run ./evals --mode record
Promote baseline
Promote a known-good run (use the RUN_DIR printed after record) so CI can gate regressions.
runledger baseline promote --from RUN_DIR --to baselines/suite.json
Replay in CI
Run deterministic evals and gate merges in CI.
runledger run ./evals --mode replay
CI Templates
name: runledger-ci
on:
  pull_request:
  workflow_dispatch:
env:
  RUNLEDGER_PATH: ./evals/demo
  RUNLEDGER_MODE: replay
  RUNLEDGER_BASELINE: baselines/demo.json
jobs:
  runledger:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: python -m pip install runledger
      - run: runledger run $RUNLEDGER_PATH --mode $RUNLEDGER_MODE --baseline $RUNLEDGER_BASELINE
RunLedger vs DeepEval
DeepEval (Confident AI)
Evaluation metrics and model-graded scoring for quality and benchmarking.
- Faithfulness, relevancy, and other scoring suites
- Benchmark workflows and dataset management
- Optional hosted platform for evaluations
RunLedger
Deterministic CI for tool-using agents with replay, contracts, and PR gates.
- Record tool calls once and replay in CI
- Hard contracts for schema, tool order, and budgets
- Baselines and regression gates with JUnit and HTML
Use both: keep DeepEval scoring and run it inside RunLedger's deterministic harness.
Core Concepts
Suites
A suite bundles cases, tool registry, contracts, and budgets into a single CI unit.
Cases
Each case defines a task input and a cassette for deterministic replay.
Cassettes
Record tool inputs and outputs once, then reuse them in CI.
Assertions
Hard contracts for JSON schema, required fields, and tool order.
Budgets
Enforce hard caps on latency, tool calls, and error rates.
Baselines
Promote a known-good run and gate PRs on regressions.
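The cassette concept above is a generic record/replay pattern: capture each tool call's output once, then serve the recorded output on replay so CI never makes a live call. A minimal sketch in Python, assuming nothing about RunLedger's internal cassette format (the `Cassette` class and key scheme here are illustrative, not RunLedger's API):

```python
import json


class Cassette:
    """Record tool calls once, then replay recorded outputs deterministically."""

    def __init__(self):
        # Maps (tool name, serialized arguments) -> recorded output.
        self.tape = {}

    def _key(self, tool, kwargs):
        return (tool, json.dumps(kwargs, sort_keys=True))

    def record(self, tool, fn, **kwargs):
        """Run the live tool and store its output on the tape."""
        out = fn(**kwargs)
        self.tape[self._key(tool, kwargs)] = out
        return out

    def replay(self, tool, **kwargs):
        """Return the recorded output; never touches the live tool."""
        key = self._key(tool, kwargs)
        if key not in self.tape:
            raise KeyError(f"no recording for {tool} with {kwargs}")
        return self.tape[key]


cassette = Cassette()
cassette.record("search", lambda query: ["doc-1", "doc-2"], query="refund policy")
print(cassette.replay("search", query="refund policy"))  # ['doc-1', 'doc-2']
```

Because the replay path depends only on the tape, reruns are byte-for-byte deterministic, which is what makes hard merge gates possible.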
Assertions and budgets are hard gates.
Use JSON Schema, required fields, and tool-order checks for deterministic contracts, then layer budgets for latency and tool usage.
- JSON schema validation for final output
- Required fields, regex, and tool-order checks
- Budget caps for wall time and tool calls
assertions:
  - type: json_schema
    schema_path: schema.json
  - type: required_fields
    fields: ["category", "reply"]
budgets:
  max_wall_ms: 20000
  max_tool_calls: 10
  max_tool_errors: 0
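The budget caps above are hard gates: exceeding any cap fails the run outright, with no model-graded judgment involved. A minimal sketch of how such a check works (illustrative only, not RunLedger's implementation; the metric names mirror the caps by stripping the `max_` prefix):

```python
def check_budgets(metrics, budgets):
    """Return the list of violated caps; a non-empty list fails the run."""
    violations = []
    for cap, limit in budgets.items():
        observed = metrics.get(cap.replace("max_", ""), 0)
        if observed > limit:
            violations.append(f"{cap}: {observed} > {limit}")
    return violations


budgets = {"max_wall_ms": 20000, "max_tool_calls": 10, "max_tool_errors": 0}
metrics = {"wall_ms": 23150, "tool_calls": 7, "tool_errors": 0}
print(check_budgets(metrics, budgets))  # ['max_wall_ms: 23150 > 20000']
```

A run that blows any cap produces a non-empty violation list, which is exactly the signal a CI gate needs to fail the build.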
Artifacts and reporting
Run logs
Every event is captured to JSONL for auditing and diffs.
run.jsonl
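JSONL keeps one JSON event per line, so logs can be streamed, grepped, and diffed line by line. A sketch of reading such a log (the event fields here are hypothetical, not RunLedger's actual schema):

```python
import json

# Hypothetical run.jsonl contents; RunLedger's real event schema may differ.
raw = """\
{"event": "tool_call", "tool": "search", "ms": 120}
{"event": "tool_call", "tool": "fetch", "ms": 340}
{"event": "final_output", "ms": 15}
"""

# One json.loads per line is all it takes to reconstruct the event stream.
events = [json.loads(line) for line in raw.splitlines() if line.strip()]
tool_calls = [e for e in events if e["event"] == "tool_call"]
print(len(tool_calls), sum(e["ms"] for e in events))  # 2 475
```

Because each event is a standalone line, two runs can be compared with an ordinary text diff, no special tooling required.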
CI output
JUnit and summary JSON integrate directly with CI dashboards.
junit.xml
Shareable report
A static HTML report that opens anywhere; no server needed.
report.html
Summary metrics
Use summary.json for baseline diffs and regression gates.
summary.json
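A regression gate over summary.json can be as simple as diffing two flat metric dicts, failing when a metric worsens beyond an allowed tolerance. A sketch under that assumption (the field names and tolerance scheme are illustrative, not RunLedger's format):

```python
def diff_summaries(baseline, current, tolerances=None):
    """Flag metrics that regressed beyond their allowed tolerance.

    Assumes higher values are worse (failures, errors, latency).
    """
    tolerances = tolerances or {}
    regressions = {}
    for metric, base in baseline.items():
        cur = current.get(metric, base)
        if cur - base > tolerances.get(metric, 0):
            regressions[metric] = (base, cur)
    return regressions


baseline = {"failures": 0, "tool_errors": 0, "wall_ms": 18000}
current = {"failures": 1, "tool_errors": 0, "wall_ms": 18500}
print(diff_summaries(baseline, current, tolerances={"wall_ms": 2000}))
# {'failures': (0, 1)}
```

Here the 500 ms latency increase stays inside its 2000 ms tolerance, but the new failure exceeds its zero tolerance and would fail the gate.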
Ready for the CLI deep dive?
Explore every command, config field, and protocol message.
Want a faster path to CI hardening?
Book a fixed-scope Hardening Sprint or ongoing Assurance.