# 8-Layer Assertion Pipeline
The assertion pipeline is the core evaluation engine in Attest. It processes assertions through eight layers ordered by cost, evaluating the cheapest deterministic checks first and only escalating to expensive probabilistic evaluation when necessary.
## Pipeline Overview

```mermaid
flowchart TD
    INPUT["Trace + Assertions"] --> L1

    subgraph FREE["Free & Deterministic"]
        L1["Layer 1: Schema<br/>JSON Schema validation"]
        L2["Layer 2: Constraint<br/>Value ranges, bounds"]
        L3["Layer 3: Trace<br/>Step ordering, tool calls"]
        L4["Layer 4: Content<br/>Regex, keywords, patterns"]
    end

    subgraph PAID["Paid & Probabilistic"]
        L5["Layer 5: Embedding<br/>Semantic similarity<br/>~$0.001/call"]
        L6["Layer 6: LLM Judge<br/>Natural language eval<br/>~$0.01+/call"]
    end

    subgraph STRUCTURAL["Free & Structural"]
        L7["Layer 7: Trace Tree<br/>Multi-agent chains"]
    end

    subgraph CUSTOM["Custom"]
        L8["Layer 8: Plugin<br/>User-defined logic"]
    end

    L1 --> L2 --> L3 --> L4 --> L5 --> L6 --> L7 --> L8
    L8 --> RESULT["Results: pass / soft_fail / hard_fail"]
```

## Evaluation Order
The engine evaluates layers sequentially within a batch. Deterministic layers (L1-L4) run first. If any deterministic assertion hard-fails and fail_fast is enabled, the engine skips expensive L5/L6 evaluation entirely.
```mermaid
flowchart LR
    A["Batch Request"] --> B["Phase 1: L1-L4<br/>(sequential, <5ms each)"]
    B --> C{Any hard fail?}
    C -->|"Yes + fail_fast"| D["Skip L5/L6<br/>Return results"]
    C -->|"No"| E["Phase 2: L5/L6<br/>(parallel, 100ms-3s each)"]
    E --> F["Phase 3: L7/L8<br/>Return all results"]
```

## Layer 1: Schema Validation
Cost: Free | Speed: <1ms | Deterministic: Yes
Validates structured data against JSON Schema definitions. Use for tool call arguments, structured output fields, and API response shapes.
### What It Checks

- Tool call arguments match their declared schema
- Agent output conforms to a required structure
- Specific fields exist with correct types
### SDK API

```python
# Output matches a JSON Schema
expect(result).output_matches_schema({
    "type": "object",
    "required": ["response", "confidence"],
    "properties": {
        "response": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1}
    }
})

# Tool arguments match schema
expect(result).tool_args_match_schema("process_refund", {
    "type": "object",
    "required": ["order_id", "amount"],
    "properties": {
        "order_id": {"type": "string", "pattern": "^ORD-\\d+$"},
        "amount": {"type": "number", "minimum": 0}
    }
})
```

### Engine Behavior
The engine uses JSON Schema Draft 2020-12 validation. Schema errors produce hard_fail with the validation error path and message in the explanation field.
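The engine itself is written in Go and not shown here, but its pass/hard_fail contract can be illustrated in Python. The `validate_schema` helper below is a simplified sketch (covering only `type`, `required`, `minimum`, and `maximum`), not the SDK API; the error-path format is an assumption.

```python
# Minimal sketch of Layer 1 behavior: returns ("pass", None) or
# ("hard_fail", explanation-with-path). Illustrative only.
TYPE_MAP = {"object": dict, "string": str, "number": (int, float), "array": list}

def validate_schema(instance, schema, path="$"):
    expected = schema.get("type")
    if expected and not isinstance(instance, TYPE_MAP[expected]):
        return "hard_fail", f"{path}: expected {expected}"
    if expected == "object":
        for key in schema.get("required", []):
            if key not in instance:
                return "hard_fail", f"{path}.{key}: required property missing"
        for key, sub in schema.get("properties", {}).items():
            if key in instance:
                status, why = validate_schema(instance[key], sub, f"{path}.{key}")
                if status != "pass":
                    return status, why
    if isinstance(instance, (int, float)) and not isinstance(instance, bool):
        if "minimum" in schema and instance < schema["minimum"]:
            return "hard_fail", f"{path}: {instance} < minimum {schema['minimum']}"
        if "maximum" in schema and instance > schema["maximum"]:
            return "hard_fail", f"{path}: {instance} > maximum {schema['maximum']}"
    return "pass", None

schema = {
    "type": "object",
    "required": ["response", "confidence"],
    "properties": {
        "response": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
}
print(validate_schema({"response": "ok", "confidence": 0.9}, schema))
print(validate_schema({"response": "ok", "confidence": 1.4}, schema))
```

A failing check reports the path to the offending field (here, `$.confidence`) so the explanation pinpoints the violation.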
## Layer 2: Constraint Checks

Cost: Free | Speed: <1ms | Deterministic: Yes
Evaluates numeric bounds, value ranges, and quantitative constraints on trace metadata and output fields.
### What It Checks

- Token count within budget
- API cost under threshold
- Latency under limit
- Numeric output fields within range
- Step count bounds
### SDK API

```python
# Cost and performance constraints
expect(result).cost_under(0.05)
expect(result).total_tokens_under(3000)
expect(result).latency_under(ms=2000)

# Output field constraints
expect(result).output_field_between("confidence", 0.0, 1.0)
expect(result).step_count_under(15)
```

### Engine Behavior
Constraint violations produce hard_fail. The engine extracts the target value from the trace and compares against the specified bounds. Missing fields produce hard_fail with an explanation indicating the field was not found.
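The extract-then-compare behavior can be sketched as follows. The dotted-path resolution and the trace field names (`usage.cost_usd`, `latency_ms`) are illustrative assumptions, not the engine's actual trace layout.

```python
# Illustrative sketch of Layer 2: resolve a dotted field path against a
# trace dict, then compare the value to the given bounds.
def check_constraint(trace, field_path, minimum=None, maximum=None):
    value = trace
    for part in field_path.split("."):
        if not isinstance(value, dict) or part not in value:
            return "hard_fail", f"field {field_path!r} not found in trace"
        value = value[part]
    if minimum is not None and value < minimum:
        return "hard_fail", f"{field_path} = {value} is below minimum {minimum}"
    if maximum is not None and value > maximum:
        return "hard_fail", f"{field_path} = {value} exceeds maximum {maximum}"
    return "pass", None

trace = {"usage": {"total_tokens": 2450, "cost_usd": 0.031}, "latency_ms": 1800}
print(check_constraint(trace, "usage.cost_usd", maximum=0.05))
print(check_constraint(trace, "usage.total_tokens", maximum=3000))
print(check_constraint(trace, "steps.count", maximum=15))  # missing field
```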
## Layer 3: Trace Inspection

Cost: Free | Speed: <1ms | Deterministic: Yes
Analyzes the execution trace for structural properties: tool call presence, ordering, loops, state machine transitions, and temporal relationships.
### What It Checks

| Check Type | Description |
|---|---|
| contains | Tool was called at least once |
| not_contains | Tool was never called |
| contains_in_order | Tools called in specified order (other tools may appear between) |
| exact_order | Tools called in exact sequence |
| no_duplicates | No tool called more than once |
| loop_detection | No cyclic tool call patterns |
| state_transitions | Execution follows valid state machine transitions |
| max_steps | Total step count under limit |
| max_llm_calls | LLM call count under limit |
| agent_ordered_before | Agent A completed before Agent B started |
| agents_overlap | Two agents executed with overlapping time windows |
| agent_wall_time_under | Agent total wall time under threshold |
| ordered_agents | Agent groups executed in sequential order |
### SDK API

```python
# Tool presence and ordering
expect(result).to_call_tool("lookup_order")
expect(result).to_not_call_tool("delete_account")
expect(result).tool_called_before("check_eligibility", "process_refund")

# Loop and duplicate detection
expect(result).no_duplicate_tool_calls()
expect(result).step_count_under(15)

# State machine transitions
expect(result).follows_transitions({
    "IDENTIFY": {"VERIFY"},
    "VERIFY": {"CHECK_ELIGIBILITY"},
    "CHECK_ELIGIBILITY": {"PROCESS", "DENY"},
})

# Temporal assertions (multi-agent)
expect(result).trace_tree().agent("Critic").started_after("FixProposer")
expect(result).trace_tree().agents("LogAnalyst", "RunbookResearcher").overlap()
expect(result).trace_tree().agent("LogAnalyst").wall_time_under(ms=5000)

# Ordered agent sequences
expect(result).to_follow_agent_order([
    ["LogAnalyst", "RunbookResearcher"],  # parallel group
    "FixProposer",                        # must follow research
    "Critic",                             # must follow proposal
])
```

### Engine Behavior
Trace assertions operate on the steps array and optional temporal metadata (started_at_ms, ended_at_ms). Temporal assertions require adapters to populate timestamp fields — missing temporal data produces hard_fail with a diagnostic message.
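As one example of a check over the steps array, contains_in_order reduces to a subsequence test on the trace's tool calls. The step shape below (`type` and `tool` keys) is an assumption for illustration:

```python
# Sketch of contains_in_order: expected tool names must appear as a
# subsequence of the trace's tool calls; other steps may sit in between.
def contains_in_order(steps, expected_tools):
    tool_calls = iter(s["tool"] for s in steps if s.get("type") == "tool_call")
    # `name in tool_calls` advances the iterator, so order is enforced.
    return all(name in tool_calls for name in expected_tools)

steps = [
    {"type": "tool_call", "tool": "lookup_order"},
    {"type": "llm_call"},
    {"type": "tool_call", "tool": "check_eligibility"},
    {"type": "tool_call", "tool": "process_refund"},
]
print(contains_in_order(steps, ["lookup_order", "process_refund"]))   # True
print(contains_in_order(steps, ["process_refund", "lookup_order"]))   # False
```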
## Layer 4: Content Pattern Matching

Cost: Free | Speed: <5ms | Deterministic: Near (regex-based)
Matches output text against patterns, keywords, and content rules. Use for verifying the agent mentioned required information, avoided prohibited content, or followed formatting rules.
### What It Checks

- Substring presence or absence
- Regex pattern matching
- PII detection patterns (SSN, credit card, email)
- Keyword lists (must-include or must-exclude)
### SDK API

```python
# Content presence
expect(result).output_contains("ORD-123456")
expect(result).output_contains_any(["refund processed", "refund approved"])

# Content absence
expect(result).output_not_contains("internal_api_key")
expect(result).output_not_contains_any(["jira", "salesforce", "zendesk"])

# Regex patterns
expect(result).output_matches_pattern(r"\$\d+\.\d{2}")  # dollar amount
expect(result).output_not_matches_pattern(r"\b\d{3}-\d{2}-\d{4}\b")  # SSN
```

### Engine Behavior
Content assertions use Go’s regexp package. Pattern matching is case-sensitive by default. The engine evaluates all content assertions against the output.message field of the trace.
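The evaluation is a straightforward match loop, sketched here in Python. Since Go's regexp package implements RE2 syntax, patterns that rely on backreferences or lookaround would not port to the engine; the sketch sticks to RE2-compatible patterns. The `check_content` helper and its rule names are illustrative, not SDK API.

```python
import re

# Sketch of Layer 4: run each content rule against the output message.
# Matching is case-sensitive, mirroring the engine's default.
def check_content(message, must_contain=(), forbidden_patterns=()):
    for needle in must_contain:
        if needle not in message:
            return "hard_fail", f"expected substring {needle!r} not found"
    for pattern in forbidden_patterns:
        if re.search(pattern, message):
            return "hard_fail", f"forbidden pattern {pattern!r} matched"
    return "pass", None

msg = "Refund of $42.50 approved for ORD-123456."
print(check_content(msg,
                    must_contain=["ORD-123456"],
                    forbidden_patterns=[r"\b\d{3}-\d{2}-\d{4}\b"]))
```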
## Layer 5: Embedding Similarity

Cost: ~$0.001/call | Speed: ~100ms | Deterministic: Near
Computes semantic similarity between the agent’s output and a reference text using embedding vectors. Use when you need to verify meaning without matching exact wording.
### What It Checks

- Output meaning is close to an expected answer
- Response covers required topics
- Denial messages convey the right reason
### SDK API

```python
expect(result).semantically_similar_to(
    "Your order is outside the 30-day return window and is not eligible for a refund.",
    threshold=0.7
)
```

### Engine Behavior
The engine computes cosine similarity between embedding vectors of the output text and the reference text. Embedding provider is configurable via ATTEST_EMBEDDING_PROVIDER (options: auto, openai, onnx).
Scores map to the soft failure system:
- >= threshold: pass, with score = similarity value
- < threshold and >= 0.5: soft_fail
- < 0.5: hard_fail
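The similarity computation and the score mapping above can be sketched together; the embedding vectors here are stand-ins for real provider output.

```python
import math

# Cosine similarity between two embedding vectors, then the documented
# mapping onto pass / soft_fail / hard_fail.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(similarity, threshold):
    if similarity >= threshold:
        return "pass"
    return "soft_fail" if similarity >= 0.5 else "hard_fail"

print(classify(cosine([1.0, 0.0], [1.0, 0.1]), threshold=0.7))  # pass
print(classify(0.62, threshold=0.7))  # soft_fail
print(classify(0.31, threshold=0.7))  # hard_fail
```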
## Layer 6: LLM-as-Judge

Cost: ~$0.01+/call | Speed: 1-3s | Deterministic: No
Uses an LLM to evaluate subjective qualities: empathy, helpfulness, coherence, faithfulness, tone. This is the most powerful but most expensive and least deterministic layer.
### What It Checks

- Subjective quality dimensions (empathy, professionalism, coherence)
- Faithfulness to source material
- Tone and style compliance
- Complex reasoning evaluation
### SDK API

```python
expect(result).judge_score("empathy", above=0.7)
expect(result).judge_score("helpfulness", above=0.8)
expect(result).judge_score("professionalism", above=0.8)
```

### Engine Behavior
The engine constructs a judge prompt with the agent’s output, the evaluation dimension, and a scoring rubric. The judge LLM returns a score from 0.0 to 1.0 with an explanation.
Configuration:
- ATTEST_JUDGE_PROVIDER — LLM provider (openai, anthropic, gemini, ollama)
- ATTEST_JUDGE_MODEL — Model name (e.g., gpt-4.1)
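The prompt-then-parse round trip might look like the sketch below. The engine's actual rubric wording and reply format are not documented here, so the template and the `parse_score` helper are assumptions for illustration only.

```python
import re

# Hypothetical judge prompt template; not the engine's real rubric.
JUDGE_TEMPLATE = """You are evaluating an AI agent's response.
Dimension: {dimension}
Rubric: score from 0.0 (worst) to 1.0 (best) on {dimension}.
Agent output:
{output}
Reply as: SCORE: <number> REASON: <explanation>"""

def build_prompt(dimension, output):
    return JUDGE_TEMPLATE.format(dimension=dimension, output=output)

def parse_score(reply):
    """Pull the 0.0-1.0 score out of a judge reply; None if malformed."""
    m = re.search(r"SCORE:\s*([01](?:\.\d+)?)", reply)
    return float(m.group(1)) if m else None

reply = "SCORE: 0.85 REASON: Acknowledges frustration and offers help."
print(parse_score(reply))  # 0.85
```

Because the judge is non-deterministic, the parsed score feeds the same soft-failure mapping as Layer 5 rather than a hard boolean.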
## Layer 7: Trace Tree

Cost: Free | Speed: <5ms | Deterministic: Yes
Analyzes the structure of multi-agent execution trees. Use for verifying agent delegation patterns, tree depth, and cross-agent relationships.
### What It Checks

- Agent delegation structure matches expected topology
- Trace tree depth within bounds
- Specific agents present in the tree
- Cross-agent data flow
### SDK API

```python
tree = TraceTree(root=result.trace)

# Verify agents participated
assert "LogAnalyst" in tree.agents
assert "Critic" in tree.agents

# Verify tree depth
assert tree.depth <= 3

# Find specific agent sub-traces
analyst_trace = tree.find_agent("LogAnalyst")
assert analyst_trace is not None
```

### Engine Behavior
Trace tree assertions operate on the nested sub_trace fields within agent_call steps. The engine traverses the tree to locate agents, compute depth, and verify structural properties.
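The traversal can be sketched as a recursive walk over nested sub_trace fields. The dict shapes below are illustrative assumptions based on the field names mentioned above, not the engine's exact trace schema.

```python
# Sketch of trace-tree traversal: yield every agent with its depth by
# recursing into sub_trace fields of agent_call steps.
def iter_agents(trace, depth=1):
    yield trace["agent"], depth
    for step in trace.get("steps", []):
        if step.get("type") == "agent_call" and "sub_trace" in step:
            yield from iter_agents(step["sub_trace"], depth + 1)

trace = {
    "agent": "Orchestrator",
    "steps": [
        {"type": "agent_call", "sub_trace": {"agent": "LogAnalyst", "steps": []}},
        {"type": "agent_call", "sub_trace": {"agent": "Critic", "steps": []}},
    ],
}
agents = dict(iter_agents(trace))
print(agents)                     # {'Orchestrator': 1, 'LogAnalyst': 2, 'Critic': 2}
print(max(agents.values()) <= 3)  # True: depth within bounds
```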
## Layer 8: Plugin

Cost: Varies | Speed: Varies | Deterministic: Varies
Runs user-defined evaluation logic. Plugins execute outside the engine and submit results back via the submit_plugin_result JSON-RPC method.
### What It Checks

- Custom business logic evaluation
- External API validation
- Domain-specific scoring
- Integration with third-party evaluation tools
### SDK API

```python
# Plugin assertions are submitted asynchronously
await client.submit_plugin_result(
    trace_id=trace.trace_id,
    plugin_name="compliance_checker",
    assertion_id="assert_compliance_01",
    status="pass",
    score=0.95,
    explanation="All compliance rules satisfied",
)
```

### Engine Behavior
The engine registers plugin assertions during evaluate_batch but does not evaluate them immediately. Instead, it waits for the SDK to submit plugin results via submit_plugin_result. The engine correlates results by assertion_id.
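The register-then-correlate flow can be sketched as a small registry keyed by assertion_id. The `PluginRegistry` class and its data shapes are illustrative, not part of the engine's actual implementation.

```python
# Sketch of plugin-result correlation: pending assertions are registered
# during evaluate_batch and filled in as submit_plugin_result calls arrive.
class PluginRegistry:
    def __init__(self):
        self.pending = {}  # assertion_id -> result record

    def register(self, assertion_id, plugin_name):
        self.pending[assertion_id] = {"plugin": plugin_name, "status": "pending"}

    def submit(self, assertion_id, status, score=None, explanation=None):
        if assertion_id not in self.pending:
            raise KeyError(f"unknown assertion_id: {assertion_id}")
        self.pending[assertion_id].update(
            status=status, score=score, explanation=explanation
        )

registry = PluginRegistry()
registry.register("assert_compliance_01", "compliance_checker")
registry.submit("assert_compliance_01", "pass", score=0.95,
                explanation="All compliance rules satisfied")
print(registry.pending["assert_compliance_01"]["status"])  # pass
```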
## Soft Failure Budget

Layers 5 and 6 produce continuous scores (0.0 to 1.0) rather than binary pass/fail. The soft failure system classifies these scores:
```mermaid
flowchart LR
    SCORE["Score: 0.0 - 1.0"] --> CHECK{Score value}
    CHECK -->|"< 0.5"| HF["hard_fail<br/>Block merge"]
    CHECK -->|"0.5 - 0.8"| SF["soft_fail<br/>Warn, count against budget"]
    CHECK -->|"> 0.8"| PASS["pass"]

    SF --> BUDGET{Within budget?}
    BUDGET -->|Yes| ALLOW["Allow merge"]
    BUDGET -->|No| BLOCK["Block merge"]
```

Configure the budget per test suite:

```python
# Allow up to 2 soft failures in a test suite before blocking
attest.config.soft_failure_budget = 2
```

## Tier Mapping
The tier system maps assertion layers to cost categories for test scheduling:
| Tier | Layers | Cost per Assertion | Typical Use |
|---|---|---|---|
| TIER_1 | L1, L2, L3, L4, L7 | Free | Development, every commit |
| TIER_2 | L5 | ~$0.001 | PR validation |
| TIER_3 | L6 | ~$0.01+ | Pre-merge, nightly |
```python
from attest import tier, TIER_1

@tier(TIER_1)
async def test_tool_schema(result):
    expect(result).output_matches_schema(schema)
    expect(result).to_call_tool("lookup_order")
    expect(result).output_contains("ORD-123")
```

Run tier-filtered tests:
```shell
# Free tests only (development)
ATTEST_MAX_TIER=1 pytest -m attest

# Include embeddings (PR check)
ATTEST_MAX_TIER=2 pytest -m attest

# Full suite (nightly CI)
pytest -m attest
```

## Batch Evaluation Parallelization
The engine splits batch evaluation into two phases:
- Phase 1 (sequential): Deterministic layers L1-L4 evaluate in sequence. Each takes <5ms.
- Phase 2 (parallel): Probabilistic layers L5-L6 evaluate concurrently with a configurable concurrency limit (default: 4).
This prevents a batch with 5 LLM judge assertions at 2s each from taking 10s sequentially; at the default concurrency of 4, parallel evaluation brings it to ~4s (two waves of calls).
| Configuration | Default | Description |
|---|---|---|
| ATTEST_EVAL_CONCURRENCY | 4 | Max concurrent L5/L6 evaluations per batch |
| ATTEST_EVAL_FAIL_FAST | true | Skip L5/L6 if any L1-L4 assertion hard-fails |
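Phase 2 scheduling amounts to bounded fan-out, which can be sketched with an asyncio semaphore. The `run_phase2` coroutine and its short sleep are stand-ins for real L5/L6 provider calls, not the engine's implementation.

```python
import asyncio

# Sketch of Phase 2: evaluate probabilistic assertions concurrently under
# a semaphore sized by the ATTEST_EVAL_CONCURRENCY-style limit (default 4).
async def run_phase2(assertions, concurrency=4):
    sem = asyncio.Semaphore(concurrency)

    async def evaluate(assertion):
        async with sem:
            await asyncio.sleep(0.01)  # stand-in for a 100ms-3s provider call
            return {"assertion": assertion, "status": "pass"}

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(evaluate(a) for a in assertions))

results = asyncio.run(run_phase2([f"judge_{i}" for i in range(5)]))
print([r["assertion"] for r in results])
```

With 5 assertions and a concurrency of 4, the semaphore admits the first four immediately and the fifth as soon as a slot frees, matching the two-wave timing described above.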