Attest: Test Your AI Agents Like You Test Your Code
AI agents are going to production without real testing.
57% of organizations now run agents in production. Yet most evaluation relies on LLM-as-judge — a probabilistic system evaluating a probabilistic system. The result: eval suites that cost 10x more than the agent itself, produce non-deterministic results, and still miss the failures that matter most — wrong tool calls, budget overruns, broken output schemas, violated state machines.
The industry treats agent evaluation as an AI problem. It’s a testing problem.
Today we’re open-sourcing Attest, a testing framework for AI agents that applies a simple discipline: reach for the cheapest valid assertion first, and only escalate when necessary.
The Core Idea: Graduated Assertions
Roughly 60-70% of an agent’s testable surface is fully deterministic: tool call schemas, execution order, cost budgets, content presence/absence, structured output format, state transitions. Routing all of this through an LLM judge is a category error — expensive, slow, and unnecessarily non-deterministic.
Attest implements an 8-layer assertion pipeline that exhausts deterministic checks before touching probabilistic ones:
| Layer | What It Tests | Cost | Speed |
|---|---|---|---|
| 1. Schema Validation | Output structure, required fields, types | Free | <1ms |
| 2. Cost & Performance | Token budgets, latency limits, cost caps | Free | <1ms |
| 3. Trace Structure | Tool ordering, required/forbidden tools, loop detection | Free | <1ms |
| 4. Content Validation | Contains, not-contains, regex patterns | Free | <5ms |
| 5. Semantic Similarity | Meaning-level comparison via local ONNX embeddings | Free (local) | ~100ms |
| 6. LLM-as-Judge | Rubric-based scoring for subjective quality | ~$0.01 | ~1-3s |
| 7. Simulation | Persona-driven users, mock tools, fault injection | Free (mocked) | Variable |
| 8. Multi-Agent | Delegation chains, cross-agent assertions, aggregate metrics | Free | ~5ms |
Layers 1-4 cover that deterministic 60-70% at zero cost. Layer 5 runs semantic similarity locally with ONNX — no API key, no network call. Layer 6, the expensive LLM judge, is reserved for the genuinely subjective remainder. Layers 7-8 add simulation and multi-agent testing that no other open-source tool provides.
What It Looks Like
```python
from attest import agent, expect
from attest.trace import TraceBuilder


@agent("support-agent")
def support_agent(builder: TraceBuilder, user_message: str):
    builder.add_llm_call(name="gpt-4.1", args={"model": "gpt-4.1"}, result={...})
    builder.add_tool_call(name="lookup_user", args={"query": user_message}, result={...})
    builder.add_tool_call(name="reset_password", args={"user_id": "U-123"}, result={...})
    builder.set_metadata(total_tokens=150, cost_usd=0.005, latency_ms=1200)
    return {"message": "Your temporary password is abc123.", "structured": {...}}


def test_support_agent(attest):
    result = support_agent(user_message="Reset my password")

    chain = (
        expect(result)
        .output_matches_schema({"type": "object", "required": ["message"]})  # L1
        .cost_under(0.05)                                                    # L2
        .tools_called_in_order(["lookup_user", "reset_password"])            # L3
        .output_contains("temporary password")                               # L4
        .output_similar_to("password has been reset", threshold=0.8)         # L5
        .passes_judge("Was the password reset handled correctly?")           # L6
    )
    attest.evaluate(chain)
```

No dashboard to configure. No YAML to author. Tests live next to code, run in CI, fail the build when agents break.
TypeScript works the same way:
```typescript
import { Agent, TraceBuilder, attestExpect } from "@attest-ai/core";

const supportAgent = new Agent("support-agent", (builder: TraceBuilder, args) => {
  builder.addToolCall("lookup_user", { args: { query: args.user_message }, result: {...} });
  builder.addToolCall("reset_password", { args: { user_id: "U-123" }, result: {...} });
  builder.setMetadata({ total_tokens: 150, cost_usd: 0.005, latency_ms: 1200 });
  return { message: "Your temporary password is abc123." };
});

const result = supportAgent.run({ user_message: "Reset my password" });

attestExpect(result)
  .outputContains("temporary password")
  .costUnder(0.05)
  .toolsCalledInOrder(["lookup_user", "reset_password"])
  .passesJudge("Was the password reset handled correctly?");
```

Architecture
```
SDK (Python / TypeScript) --stdio--> Engine (Go binary)
                                       |-- 8-Layer Assertion Pipeline
                                       |-- Result History (SQLite)
                                       |-- Drift Detection (sigma-based)
                                       |-- Simulation Runtime
                                       +-- Plugin System
```

The engine is a single Go binary with zero runtime dependencies. Six platform targets (macOS, Linux, Windows × amd64/arm64). Cold start: 1.7-3.2ms. 100-step trace evaluation: <2ms.
SDKs are thin idiomatic wrappers — all evaluation logic lives in the engine. This means the Python SDK, TypeScript SDK, and any future SDK produce identical assertion results. No reimplementation divergence.
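To make that concrete, here is an illustrative-only sketch of the pattern: a thin client serializes a trace and assertion specs as JSON, pipes them to the engine process over stdin, and reads a verdict back from stdout. The envelope fields below are hypothetical; the real wire format is an internal detail of the official SDKs.

```python
import json
import subprocess

# Hypothetical request envelope -- the actual protocol is internal to the SDKs.
request = {
    "command": "evaluate",
    "trace": {"steps": [], "metadata": {"cost_usd": 0.005}},
    "assertions": [{"type": "content", "spec": {"check": "non_empty"}}],
}

# Pipe the request to the engine binary over stdio and read the verdict back.
proc = subprocess.run(
    ["./attest-engine"],
    input=json.dumps(request).encode(),
    capture_output=True,
    check=True,
)
verdict = json.loads(proc.stdout)  # same verdict no matter which SDK sent the request
```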
11 Adapters, Zero Lock-In
Attest works with whatever you’re building on:
- Provider adapters — OpenAI, Anthropic, Gemini, Ollama
- Framework adapters — LangChain, Google ADK, LlamaIndex, CrewAI, OpenTelemetry
- Manual adapter — for custom agent architectures
- Simulation adapter — for testing without real API calls
Each adapter translates framework-specific events into Attest’s universal trace format. Write your assertions once; swap providers without changing tests.
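The Manual adapter route is essentially this translation done by hand. A minimal sketch, assuming a provider response shaped like OpenAI’s chat-completion payload; the record_completion helper is illustrative, not part of Attest’s API:

```python
from attest.trace import TraceBuilder


def record_completion(builder: TraceBuilder, model: str, response: dict) -> None:
    """Illustrative manual translation of one provider response into the universal trace."""
    builder.add_llm_call(
        name=model,
        args={"model": model},
        result={"text": response["choices"][0]["message"]["content"]},
    )
    usage = response.get("usage", {})
    builder.set_metadata(
        total_tokens=usage.get("total_tokens", 0),
        cost_usd=0.0,   # plug in your own pricing model here
        latency_ms=0,   # and your own timing measurement
    )
```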
Continuous Evaluation & Drift Detection
Testing in CI is table stakes. Attest also supports continuous evaluation in production:
```python
from attest import ContinuousEvalRunner, Sampler, AlertDispatcher, AttestClient, EngineManager
from attest.assertions import Assertion

engine = EngineManager(engine_path="./attest-engine")
client = AttestClient(engine)

runner = ContinuousEvalRunner(
    client=client,
    assertions=[
        Assertion(type="constraint", spec={"field": "metadata.cost_usd", "operator": "lte", "value": 0.10}),
        Assertion(type="content", spec={"check": "non_empty"}),
    ],
    sample_rate=0.1,  # Evaluate 10% of production traces
)

await runner.start()

# Submit traces from your production pipeline
await runner.submit(trace)
```

The engine stores result history in SQLite and computes sigma-based drift detection. When assertion scores deviate beyond configured thresholds, Attest fires drift_alert notifications to webhooks or Slack. No external infrastructure required.
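The sigma rule itself is ordinary statistics. As an illustration of the idea (not Attest’s internal code), a score drifts when it lands more than k standard deviations from the historical mean:

```python
from statistics import mean, stdev


def has_drifted(history: list[float], new_score: float, k: float = 3.0) -> bool:
    """Flag a new assertion score that falls outside k sigma of the historical scores."""
    if len(history) < 2:
        return False  # not enough history to estimate a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_score != mu
    return abs(new_score - mu) > k * sigma


# Example: scores hovered near 0.90, then dropped sharply.
print(has_drifted([0.91, 0.89, 0.92, 0.90, 0.88], 0.45))  # True
```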
Simulation: Test What Hasn’t Happened Yet
Layer 7 is a simulation runtime with mock tools and persona-driven assertions:
```python
from attest import agent, expect
from attest.trace import TraceBuilder
from attest.simulation import MockToolRegistry, mock_tool
from attest.simulation.personas import ADVERSARIAL_USER


@mock_tool("search")
def mock_search(query: str) -> dict:
    return {"results": []}  # Simulate empty results


@agent("search-agent")
def search_agent(builder: TraceBuilder, query: str):
    builder.add_tool_call(name="search", args={"query": query}, result={"results": []})
    builder.set_metadata(total_tokens=100, cost_usd=0.002, latency_ms=500)
    return {"answer": "No results found for your query."}


def test_agent_handles_empty_results(attest):
    with MockToolRegistry() as registry:
        registry.register_decorated(mock_search)
        result = search_agent(query="nonexistent topic")

        chain = (
            expect(result)
            .output_contains("No results")
            .forbidden_tools(["admin_override", "delete_data"])
            .cost_under(0.05)
        )
        attest.evaluate(chain)
```

Test edge cases, adversarial inputs, and failure modes without hitting real APIs. Built-in personas include FRIENDLY_USER, COOPERATIVE_USER, CONFUSED_USER, and ADVERSARIAL_USER.
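Fault injection follows the same pattern: register a mock tool that fails, then assert the agent degrades gracefully. A sketch under the same MockToolRegistry conventions shown above; the checkout scenario and tool names are hypothetical:

```python
from attest import agent, expect
from attest.trace import TraceBuilder
from attest.simulation import MockToolRegistry, mock_tool


@mock_tool("charge_card")
def flaky_charge_card(amount: float) -> dict:
    # Injected fault: the mocked payment API times out.
    raise TimeoutError("upstream payment gateway timed out")


@agent("checkout-agent")
def checkout_agent(builder: TraceBuilder, amount: float):
    # Sketch: record the failed call and a graceful fallback in the trace.
    builder.add_tool_call(name="charge_card", args={"amount": amount}, result={"error": "timeout"})
    builder.set_metadata(total_tokens=80, cost_usd=0.001, latency_ms=900)
    return {"message": "Payment is temporarily unavailable. Please retry shortly."}


def test_checkout_survives_tool_timeout(attest):
    with MockToolRegistry() as registry:
        registry.register_decorated(flaky_charge_card)
        result = checkout_agent(amount=19.99)

        chain = (
            expect(result)
            .output_contains("temporarily unavailable")
            .forbidden_tools(["refund", "delete_data"])
            .cost_under(0.05)
        )
        attest.evaluate(chain)
```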
Multi-Agent Testing
Layer 8 handles the growing reality of multi-agent systems:
```python
from attest import agent, expect, delegate, TraceTree
from attest.trace import TraceBuilder


@agent("orchestrator")
def orchestrator(builder: TraceBuilder, task: str):
    with delegate("researcher") as sub:
        sub.add_tool_call(name="web_search", args={"q": task}, result={"findings": "..."})
        sub.set_output({"summary": "Research complete"})
        sub.set_metadata(total_tokens=200, cost_usd=0.004)
    builder.set_metadata(total_tokens=300, cost_usd=0.008, latency_ms=2000)
    return {"report": "Final report based on research."}


def test_multi_agent(attest):
    result = orchestrator(task="Analyze market trends")

    chain = (
        expect(result)
        .agent_called("researcher")
        .delegation_depth(2)
        .follows_transitions([["orchestrator", "researcher"]])
        .aggregate_cost_under(0.50)
        .aggregate_tokens_under(10000)
    )
    attest.evaluate(chain)

    # Direct trace tree inspection
    tree = TraceTree(root=result.trace)
    assert tree.aggregate_tokens > 0
    assert tree.aggregate_cost > 0
```

Trace trees reconstruct the full delegation chain — which agent called which, with what inputs, producing what outputs. Assert on the graph structure, not just the final answer.
Plugin System
Attest supports third-party plugins via Python entry points:
```toml
# In your plugin's pyproject.toml
[project.entry-points."attest.plugins"]
my-custom-eval = "my_package:MyEvalPlugin"
```

Plugins implement the AttestPlugin protocol — define a name, plugin_type, and execute(trace, spec) method returning a PluginResult. The PluginRegistry discovers and loads them automatically at runtime via the attest.plugins entry point group.
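A minimal sketch of such a plugin, assuming the import path and PluginResult constructor arguments shown here (check the plugin docs for the exact signatures):

```python
from attest.plugins import PluginResult  # import path assumed for this sketch


class MyEvalPlugin:
    """Sketch of the AttestPlugin protocol: a name, a plugin_type, and execute(trace, spec)."""

    name = "my-custom-eval"
    plugin_type = "assertion"  # hypothetical type label

    def execute(self, trace: dict, spec: dict) -> PluginResult:
        # Example check: require the agent's final message to meet a minimum length.
        message = trace.get("output", {}).get("message", "")
        passed = len(message) >= spec.get("min_length", 1)
        return PluginResult(passed=passed, score=1.0 if passed else 0.0)  # fields assumed
```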
How Attest Compares
vs. LangWatch — LangWatch is an observability platform (BSL 1.1 licensed — effectively proprietary) that also does evaluation and simulation. Its evaluator architecture structurally limits assertions to input, output, contexts, and expected_output — trace-level data (span hierarchy, tool call parameters, cost, latency) cannot reach evaluators without a breaking API change. Attest operates directly on full execution traces with assertions over tool ordering, cost budgets, delegation chains, and multi-agent graphs — all in CI, with zero platform infrastructure, under Apache 2.0.
vs. DeepEval / Ragas — These default to LLM-as-judge for everything. Attest exhausts deterministic assertions first, uses LLM judges only as a last resort, and provides soft failure budgets so probabilistic layers don’t create CI noise.
vs. Langfuse / Arize / LangSmith — Observability platforms that track what happened. Attest is a testing framework that asserts what should happen. Different tools for different jobs — and Attest integrates with all of them via OpenTelemetry.
vs. Promptfoo — Config-driven prompt evaluation. Attest is code-first agent testing with full trace-level assertions, simulation, and multi-agent support.
No other open-source tool combines deterministic assertion layers, local embeddings, simulation with fault injection, and multi-agent trace testing in a single framework. Attest is Apache 2.0 with zero infrastructure requirements — no platform to host, no vendor lock-in, no BSL license gotchas.
Get Started
```bash
pip install attest-ai        # Python
npm install @attest-ai/core  # TypeScript
```

Or scaffold a project:

```bash
python -m attest init
```

This generates an attest.toml config and an example test file.
Links:
- GitHub: github.com/attest-framework/attest
- PyPI: pypi.org/project/attest-ai
- npm: @attest-ai/core
Apache 2.0 licensed. Issues and PRs welcome.
Attest is v0.4.0 today — 8 assertion layers, 11 adapters, continuous eval, drift detection, plugin system, Python + TypeScript SDKs, pytest + vitest integrations. The roadmap includes a Go SDK and cloud dashboard.