Integrate the CI harness for tool-using agents.
RunLedger gives you deterministic replay, hard contracts, baselines, and PR gates without rewiring your agent stack.
Not an eval metrics framework. Keep DeepEval for scoring and use RunLedger for deterministic merge gates.
Compact CI loop: record once, promote a baseline, replay on every PR.
```bash
pipx install runledger
runledger run ./evals --mode record
runledger baseline promote --from RUN_DIR --to baselines/suite.json
runledger run ./evals --mode replay
```
Use the RUN_DIR printed after record to promote a baseline.
GitHub Actions
ci.yml

```yaml
name: agent-evals
on:
  pull_request:
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install RunLedger
        run: pip install runledger
      - name: Run deterministic evals (replay)
        run: runledger run ./evals --mode replay --baseline baselines/suite.json
      - name: Upload artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: runledger-artifacts
          path: runledger_out/**
```
Replay mode makes CI fast and deterministic. Artifacts include JSONL logs, summary JSON, JUnit XML, and a static HTML report.
- Exit non-zero on regressions and contract failures
- Use baseline_path in suite.yaml or pass --baseline
- Upload artifacts for PR review
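The JUnit XML in the artifacts follows the standard schema, so it is easy to post-process for PR review. A minimal sketch (the `summarize_junit` helper is hypothetical, not part of RunLedger):

```python
import xml.etree.ElementTree as ET

def summarize_junit(xml_text):
    """Count tests and failures in a JUnit XML report.

    Handles both a bare <testsuite> root and a <testsuites> wrapper,
    which covers the common JUnit report layouts.
    """
    root = ET.fromstring(xml_text)
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    tests = sum(int(s.get("tests", 0)) for s in suites)
    failures = sum(int(s.get("failures", 0)) for s in suites)
    return {"tests": tests, "failures": failures}

sample = '<testsuite tests="3" failures="1"></testsuite>'
print(summarize_junit(sample))  # {'tests': 3, 'failures': 1}
```

A CI step could feed this the report from `runledger_out/` and fail or comment on the PR based on the counts.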
Other CI providers
GitLab CI
Pin a Python image and upload artifacts on every run.
```yaml
image: python:3.11
stages: [test]

agent_evals:
  stage: test
  script:
    - pip install runledger
    - runledger run ./evals --mode replay
  artifacts:
    when: always
    paths:
      - runledger_out/
```
CircleCI
Use the Python image, run replay mode, and store artifacts.
```yaml
version: 2.1
jobs:
  evals:
    docker:
      - image: cimg/python:3.11
    steps:
      - checkout
      - run: pip install runledger
      - run: runledger run ./evals --mode replay
      - store_artifacts:
          path: runledger_out
workflows:
  evals:
    jobs:
      - evals
```
Jenkins
Publish JUnit results and archive artifacts for diffs.
```groovy
pipeline {
  agent any
  stages {
    stage("RunLedger") {
      steps {
        sh "pip install runledger"
        sh "runledger run ./evals --mode replay"
      }
    }
  }
  post {
    always {
      junit "runledger_out/**/junit.xml"
      archiveArtifacts artifacts: "runledger_out/**"
    }
  }
}
```
Framework adapters
LangChain / LangGraph
Bridge tool calls to the JSONL protocol and replay in CI.
LlamaIndex
Route tool calls through RunLedger and enforce hard contracts.
AutoGen / CrewAI
Keep multi-agent workflows deterministic with replayed tools.
Raw Python or Node
Use the JSONL protocol directly to stay framework-agnostic.
Protocol adapter pattern
RunLedger launches your agent as a subprocess and speaks JSONL over stdio. This keeps your agent stack unchanged while CI stays deterministic.
- Stdout is protocol JSON only; logs go to stderr
- Tool registry and contracts are declared in suite.yaml
- Cassettes replay tool results for stable CI runs
```yaml
suite_name: support-triage
agent_command: ["python", "agent.py"]
mode: replay
cases_path: cases
tool_registry:
  - search_docs
  - create_issue
baseline_path: baselines/support-triage.json
```
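The `agent.py` named in `agent_command` just needs to speak one JSON object per line over stdio. A minimal sketch, assuming illustrative field names (`type`, `id`, `input`) rather than RunLedger's exact message schema:

```python
import io
import json

def handle_message(msg):
    """Answer one protocol message. Field names here are illustrative."""
    if msg.get("type") == "case":
        return {"type": "result", "id": msg.get("id"),
                "output": f"triaged: {msg.get('input', '')}"}
    return {"type": "error", "id": msg.get("id"), "message": "unknown message type"}

def serve(stdin, stdout):
    """Read one JSON object per line, write one JSON reply per line.

    Only protocol JSON goes to stdout; diagnostics belong on stderr.
    A real agent would call serve(sys.stdin, sys.stdout).
    """
    for line in stdin:
        line = line.strip()
        if line:
            print(json.dumps(handle_message(json.loads(line))), file=stdout, flush=True)

# Demo with in-memory streams instead of real stdio.
demo_out = io.StringIO()
serve(io.StringIO('{"type": "case", "id": "1", "input": "reset password"}\n'), demo_out)
print(demo_out.getvalue().strip())
```

Keeping the protocol loop in a small `serve` function like this makes the agent easy to exercise in unit tests without a subprocess.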
Local authoring loop
Record once with live tools, then replay locally and in CI.