DOCUMENTATION

Ship deterministic agent CI in hours, not weeks.

This guide covers the CI harness workflow: define suites, record tool calls, promote baselines, replay in CI, and gate merges with hard contracts.

RunLedger is a CI harness, not an eval metrics framework. Keep DeepEval for scoring and use RunLedger for deterministic merge gates.

  • Replay-first CI
  • Tool call cassettes
  • Hard assertions

quickstart.sh
pipx install runledger
runledger init
runledger run ./evals --mode record
runledger baseline promote --from RUN_DIR --to baselines/suite.json
runledger run ./evals --mode replay
open runledger_out/**/report.html

Record once, promote a baseline, replay in CI.

Quickstart

See all commands

Install

Use pipx for a clean global install and isolated environments.

pipx install runledger

Record

Scaffold a demo suite with runledger init, then record a live run.

runledger run ./evals --mode record

Promote baseline

Promote a known-good run (use the RUN_DIR printed after record) so CI can gate regressions.

runledger baseline promote --from RUN_DIR --to baselines/suite.json
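The baseline file's exact schema isn't documented on this page. As an illustration only, a promoted baselines/suite.json could carry per-case results plus the budget numbers CI diffs against; every field name below is an assumption, not a documented format:

```json
{
  "suite": "demo",
  "cases": {
    "classify_ticket": { "status": "pass", "wall_ms": 4200, "tool_calls": 3 }
  },
  "budgets": { "max_wall_ms": 20000, "max_tool_calls": 10 }
}
```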

Replay in CI

Run deterministic evals and gate merges in CI.

runledger run ./evals --mode replay

CI Templates

Template docs
github-actions-runledger.yml
name: runledger-ci
on:
  pull_request:
  workflow_dispatch:

env:
  RUNLEDGER_PATH: ./evals/demo
  RUNLEDGER_MODE: replay
  RUNLEDGER_BASELINE: baselines/demo.json

jobs:
  runledger:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: python -m pip install runledger
      - run: runledger run "$RUNLEDGER_PATH" --mode "$RUNLEDGER_MODE" --baseline "$RUNLEDGER_BASELINE"
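The same three inputs (suite path, mode, baseline) port to other CI systems. A hypothetical GitLab CI equivalent, assuming the CLI flags shown above work unchanged:

```yaml
# .gitlab-ci.yml: illustrative port of the GitHub Actions job above
runledger:
  image: python:3.11
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - pip install runledger
    - runledger run ./evals/demo --mode replay --baseline baselines/demo.json
```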

RunLedger vs DeepEval

EVAL METRICS

DeepEval (Confident AI)

Evaluation metrics and model-graded scoring for quality and benchmarking.

  • Faithfulness, relevancy, and other scoring suites
  • Benchmark workflows and dataset management
  • Optional hosted platform for evaluations
CI HARNESS

RunLedger

Deterministic CI for tool-using agents with replay, contracts, and PR gates.

  • Record tool calls once and replay in CI
  • Hard contracts for schema, tool order, and budgets
  • Baselines and regression gates with JUnit and HTML

Use both: keep DeepEval scoring and run it inside RunLedger's deterministic harness.

Core Concepts

Suites

A suite bundles cases, tool registry, contracts, and budgets into a single CI unit.

Cases

Each case defines a task input and a cassette for deterministic replay.
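The case schema isn't shown on this page; a minimal sketch, with all field names (id, input, cassette) assumed for illustration:

```yaml
# Hypothetical case entry; field names are illustrative, not a documented schema.
cases:
  - id: classify_ticket
    input: "My order arrived damaged, what should I do?"
    cassette: cassettes/classify_ticket.jsonl
```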

Cassettes

Record tool inputs and outputs once, then reuse them in CI.
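The on-disk cassette format isn't specified here; conceptually it is one recorded tool exchange that replay serves back instead of calling the live tool. An illustrative entry, with every key assumed:

```json
{"tool": "search_orders", "input": {"order_id": "A-1001"}, "output": {"status": "delivered"}}
```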

Assertions

Hard contracts for JSON schema, required fields, and tool order.

Budgets

Enforce hard caps on latency, tool calls, and error rates.

Baselines

Promote a known-good run and gate PRs on regressions.

Assertions and budgets are hard gates.

Use JSON Schema, required fields, and tool-order checks for deterministic contracts, then layer budgets for latency and tool usage.

  • JSON schema validation for final output
  • Required fields, regex, and tool-order checks
  • Budget caps for wall time and tool calls
suite.yaml
assertions:
  - type: json_schema
    schema_path: schema.json
  - type: required_fields
    fields: ["category", "reply"]

budgets:
  max_wall_ms: 20000
  max_tool_calls: 10
  max_tool_errors: 0

Artifacts and reporting

Protocol details

Run logs

Every event is captured to JSONL for auditing and diffs.

run.jsonl
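The event schema isn't documented on this page. As a sketch, each JSONL line might record one timestamped event per case, which is what makes run logs diffable; all field names below are assumptions:

```json
{"ts": "2025-01-01T12:00:00Z", "case": "classify_ticket", "event": "tool_call", "tool": "search_orders", "ms": 120}
{"ts": "2025-01-01T12:00:01Z", "case": "classify_ticket", "event": "assertion", "type": "json_schema", "result": "pass"}
```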

CI output

JUnit and summary JSON integrate directly with CI dashboards.

junit.xml

Shareable report

A static HTML report that opens anywhere, no server needed.

report.html

Summary metrics

Use summary.json for baseline diffs and regression gates.

summary.json
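summary.json's schema isn't shown here; a hypothetical shape carrying the counters a baseline diff would compare, with all keys assumed:

```json
{
  "suite": "demo",
  "passed": 11,
  "failed": 1,
  "total_wall_ms": 18400,
  "total_tool_calls": 9
}
```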

Ready for the CLI deep dive?

Explore every command, config field, and protocol message.

Open Reference

Want a faster path to CI hardening?

Book a fixed-scope Hardening Sprint or ongoing Assurance.