Integrate the CI harness for tool-using agents.
RunLedger gives you deterministic replay, hard contracts, baselines, and PR gates without rewiring your agent stack.
Not an eval metrics framework. Keep DeepEval for scoring and use RunLedger for deterministic merge gates.
Compact CI loop: record once, promote a baseline, replay on every PR.
```bash
pipx install runledger
runledger run ./evals --mode record
runledger baseline promote --from RUN_DIR --to baselines/suite.json
runledger run ./evals --mode replay
```
Use the RUN_DIR printed after record to promote a baseline.
GitHub Actions
ci.yml

```yaml
name: agent-evals
on:
  pull_request:
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install RunLedger
        run: pip install runledger
      - name: Run deterministic evals (replay)
        run: runledger run ./evals --mode replay --baseline baselines/suite.json
      - name: Upload artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: runledger-artifacts
          path: runledger_out/**
```
Replay mode makes CI fast and deterministic. Artifacts include JSONL logs, summary JSON, JUnit XML, and a static HTML report.
- Exit non-zero on regressions and contract failures
- Use baseline_path in suite.yaml or pass --baseline
- Upload artifacts for PR review
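The JUnit XML in the artifacts follows the standard schema, so it is easy to post-process for PR review. A minimal sketch (the `summarize_junit` helper is hypothetical, not part of RunLedger):

```python
import xml.etree.ElementTree as ET

def summarize_junit(xml_text):
    """Count tests and failures in a JUnit XML report.

    Handles both a bare <testsuite> root and a <testsuites> wrapper,
    which covers the common JUnit report layouts.
    """
    root = ET.fromstring(xml_text)
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    tests = sum(int(s.get("tests", 0)) for s in suites)
    failures = sum(int(s.get("failures", 0)) for s in suites)
    return {"tests": tests, "failures": failures}

sample = '<testsuite tests="3" failures="1"></testsuite>'
print(summarize_junit(sample))  # {'tests': 3, 'failures': 1}
```

A CI step could feed this the report from `runledger_out/` and fail or comment on the PR based on the counts.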
Other CI providers
GitLab CI
Pin a Python image and upload artifacts on every run.
```yaml
image: python:3.11
stages: [test]

agent_evals:
  stage: test
  script:
    - pip install runledger
    - runledger run ./evals --mode replay
  artifacts:
    when: always
    paths:
      - runledger_out/
```
CircleCI
Use the Python image, run replay mode, and store artifacts.
```yaml
version: 2.1
jobs:
  evals:
    docker:
      - image: cimg/python:3.11
    steps:
      - checkout
      - run: pip install runledger
      - run: runledger run ./evals --mode replay
      - store_artifacts:
          path: runledger_out
workflows:
  evals:
    jobs:
      - evals
```
Jenkins
Publish JUnit results and archive artifacts for diffs.
```groovy
pipeline {
  agent any
  stages {
    stage("RunLedger") {
      steps {
        sh "pip install runledger"
        sh "runledger run ./evals --mode replay"
      }
    }
  }
  post {
    always {
      junit "runledger_out/**/junit.xml"
      archiveArtifacts artifacts: "runledger_out/**"
    }
  }
}
```
Framework adapters
LangChain / LangGraph
Bridge tool calls to the JSONL protocol and replay in CI.
LlamaIndex
Route tool calls through RunLedger and enforce hard contracts.
AutoGen / CrewAI
Keep multi-agent workflows deterministic with replayed tools.
Raw Python or Node
Use the JSONL protocol directly to stay framework-agnostic.
Protocol adapter pattern
RunLedger launches your agent as a subprocess and speaks JSONL over stdio. This keeps your agent stack unchanged while CI stays deterministic.
- Stdout is protocol JSON only; logs go to stderr
- Tool registry and contracts are declared in suite.yaml
- Cassettes replay tool results for stable CI runs
```yaml
suite_name: support-triage
agent_command: ["python", "agent.py"]
mode: replay
cases_path: cases
tool_registry:
  - search_docs
  - create_issue
baseline_path: baselines/support-triage.json
```
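The `agent.py` named in `agent_command` just needs to speak one JSON object per line over stdio. A minimal sketch, assuming illustrative field names (`type`, `id`, `input`) rather than RunLedger's exact message schema:

```python
import io
import json

def handle_message(msg):
    """Answer one protocol message. Field names here are illustrative."""
    if msg.get("type") == "case":
        return {"type": "result", "id": msg.get("id"),
                "output": f"triaged: {msg.get('input', '')}"}
    return {"type": "error", "id": msg.get("id"), "message": "unknown message type"}

def serve(stdin, stdout):
    """Read one JSON object per line, write one JSON reply per line.

    Only protocol JSON goes to stdout; diagnostics belong on stderr.
    A real agent would call serve(sys.stdin, sys.stdout).
    """
    for line in stdin:
        line = line.strip()
        if line:
            print(json.dumps(handle_message(json.loads(line))), file=stdout, flush=True)

# Demo with in-memory streams instead of real stdio.
demo_out = io.StringIO()
serve(io.StringIO('{"type": "case", "id": "1", "input": "reset password"}\n'), demo_out)
print(demo_out.getvalue().strip())
```

Keeping the protocol loop in a small `serve` function like this makes the agent easy to exercise in unit tests without a subprocess.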
Local authoring loop
Record once with live tools, then replay locally and in CI.