Python Expect DSL

Fluent assertion API for testing AI agent outputs across 8 validation layers.

The `ExpectChain` class provides a fluent interface for chaining assertions. Each assertion method returns the chain itself, so checks can be composed in a single expression.

```python
from attest import expect

result = agent.run("question")

# Assertions can be chained
(expect(result)
    .output_contains("expected")
    .cost_under(0.05)
    .latency_under(2000)
    .passes_judge("Is correct?"))
```
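Conceptually, each assertion method runs its check and then returns the chain object. A minimal sketch of that pattern in plain Python (`SketchChain` is illustrative only, not attest's actual implementation):

```python
# Minimal fluent-chain sketch: each method asserts, then returns self
# so further assertions can be appended.
class SketchChain:
    def __init__(self, result):
        self.result = result

    def output_contains(self, text):
        assert text in self.result["output"], f"missing {text!r}"
        return self  # returning self is what enables chaining

    def cost_under(self, limit):
        assert self.result["cost"] < limit, f"cost {self.result['cost']} >= {limit}"
        return self

fake = {"output": "hello world", "cost": 0.01}
chain = SketchChain(fake).output_contains("hello").cost_under(0.05)
assert isinstance(chain, SketchChain)
```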
```python
from attest import expect

result = agent_function()
chain = expect(result)
```

The result object contains:

- `output` — The agent's text output
- `cost` — Token cost in dollars
- `latency_ms` — Execution time in milliseconds
- `trace` — Full execution trace
- `metadata` — Custom metadata
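For illustration only, the fields listed above could be modeled as a plain dataclass (`FakeResult` is a hypothetical stand-in, not attest's actual result class):

```python
from dataclasses import dataclass, field

@dataclass
class FakeResult:
    """Stand-in mirroring the documented result fields."""
    output: str
    cost: float          # token cost in dollars
    latency_ms: int      # execution time in milliseconds
    trace: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

r = FakeResult(output="hello", cost=0.002, latency_ms=340)
assert r.cost < 0.05 and r.latency_ms < 2000
```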

Attest assertions work across 8 layers:

| Layer | Methods | What it validates |
| --- | --- | --- |
| 1. Schema | `matches_schema()` | JSON schema validation |
| 2. Constraints | `cost_under()`, `latency_under()` | Performance metrics |
| 3. Trace | `trace_contains_model()`, `trace_contains_tool()` | Execution path |
| 4. Content | `output_contains()`, `output_matches()` | Text content |
| 5. Embedding | `semantically_similar_to()` | Semantic meaning |
| 6. LLM Judge | `passes_judge()` | Domain-specific evaluation |
| 7. Trace Tree | `trace_tree_valid()`, `tool_calls_valid()` | Execution structure |
| 8. Simulation | `all_pass()`, `success_rate_above()` | Multi-agent scenarios |

Validate output structure against a schema.

```python
expect(result).matches_schema({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"}
    },
    "required": ["name", "age"]
})
```
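Under the hood this is standard JSON Schema checking. A rough hand-rolled sketch of the idea (real validation should use a JSON Schema library; `rough_validate` and `TYPE_MAP` are illustrative names, and only a few keywords are covered):

```python
# Maps JSON Schema type names to the Python types they accept.
TYPE_MAP = {"object": dict, "string": str, "number": (int, float)}

def rough_validate(data, schema):
    """Check type, required keys, and property types -- nothing more."""
    if not isinstance(data, TYPE_MAP[schema["type"]]):
        return False
    for key in schema.get("required", []):
        if key not in data:
            return False
    for key, sub in schema.get("properties", {}).items():
        if key in data and not isinstance(data[key], TYPE_MAP[sub["type"]]):
            return False
    return True

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "number"}},
    "required": ["name", "age"],
}
assert rough_validate({"name": "Ada", "age": 36}, schema)
assert not rough_validate({"name": "Ada"}, schema)  # missing required "age"
```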

Check performance metrics and limits.

```python
# Cost in dollars
expect(result).cost_under(0.10)
expect(result).cost_equals(0.05)
expect(result).cost_between(0.01, 0.10)

# Latency in milliseconds
expect(result).latency_under(5000)
expect(result).latency_equals(1000)
expect(result).latency_between(100, 5000)
```

Inspect what models and tools the agent used.

```python
# Check model usage
expect(result).trace_contains_model("gpt-4o-mini")
expect(result).trace_contains_model("claude-3-sonnet")

# Check tool usage
expect(result).trace_contains_tool("google_search")
expect(result).trace_contains_tool("calculator")

# Verify no unexpected tools
expect(result).trace_contains_only_tools(["calculator", "wikipedia"])
```

Check the output text.

```python
import re

# Contains substring
expect(result).output_contains("hello")

# Does not contain
expect(result).output_not_contains("error")

# Exact match
expect(result).output_equals("exact output")

# Case-insensitive
expect(result).output_contains("HELLO", case_sensitive=False)

# Regex pattern (date format)
expect(result).output_matches(r"^\d{4}-\d{2}-\d{2}$")

# Contains all substrings
expect(result).output_contains_all(["hello", "world"])

# Contains any substring
expect(result).output_contains_any(["yes", "correct"])

# Starts/ends with
expect(result).output_starts_with("The")
expect(result).output_ends_with("?")

# Word counts
expect(result).word_count_equals(100)
expect(result).word_count_between(50, 200)
expect(result).word_count_under(500)
```

Check semantic meaning using embeddings.

```python
# Semantically similar to reference text
expect(result).semantically_similar_to(
    "This is a greeting",
    threshold=0.85
)

# Semantically different from reference
expect(result).semantically_different_from(
    "This is an error message",
    threshold=0.85
)
```
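Checks like these typically reduce to a cosine-similarity threshold over embedding vectors. A self-contained sketch of the comparison step (the embedding model itself is out of scope here, and `semantically_similar` is an illustrative helper, not attest's API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def semantically_similar(a, b, threshold=0.85):
    return cosine_similarity(a, b) >= threshold

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
assert abs(cosine_similarity([1.0, 0.0], [2.0, 0.0]) - 1.0) < 1e-9
assert abs(cosine_similarity([1.0, 0.0], [0.0, 1.0])) < 1e-9
assert semantically_similar([1.0, 2.0], [1.0, 2.0])
```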

Use an LLM to evaluate domain-specific correctness.

```python
# Simple binary judgment
expect(result).passes_judge(
    prompt="Is this response helpful?"
)

# Choose the judge model and scoring mode
expect(result).passes_judge(
    prompt="Is the math correct?",
    model="gpt-4o",
    scoring="binary"  # binary, scale_0_10, or enum
)

# Multiple judges, each with its own model
expect(result).passes_judges([
    ("Is this helpful?", "gpt-4o-mini"),
    ("Is this accurate?", "gpt-4o"),
])

# Rubric-based grading with a pass threshold
expect(result).passes_judge(
    prompt="Grade this response",
    rubric={
        "clarity": "Is the explanation clear?",
        "accuracy": "Are the facts correct?",
        "completeness": "Does it answer fully?"
    },
    threshold=0.8
)
```

Validate the structure of the execution trace.

```python
# Verify trace structure is valid
expect(result).trace_tree_valid()

# Verify specific tool calls
expect(result).tool_calls_valid()

# Check tool call count
expect(result).tool_call_count_equals(3)
expect(result).tool_call_count_between(1, 5)

# Verify no infinite loops
expect(result).trace_depth_under(10)
```
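A depth limit like `trace_depth_under(10)` amounts to measuring the height of the trace tree. A sketch, assuming a nested-dict trace shape (attest's real trace format may differ):

```python
def trace_depth(node):
    """Height of a trace tree whose children are nested under 'children'."""
    children = node.get("children", [])
    if not children:
        return 1
    return 1 + max(trace_depth(c) for c in children)

trace = {
    "name": "agent",
    "children": [
        {"name": "llm_call", "children": []},
        {"name": "tool_call", "children": [{"name": "retry", "children": []}]},
    ],
}
assert trace_depth(trace) == 3
assert trace_depth(trace) < 10  # analogous to trace_depth_under(10)
```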

Validate multi-agent scenario results.

```python
from attest import simulate

scenario = simulate.scenario()
scenario.add_agent(agent1)
scenario.add_agent(agent2)
results = scenario.run(repeat=5)

# All agents passed
expect(results).all_pass()

# Success rate
expect(results).success_rate_above(0.95)
expect(results).success_rate_equals(1.0)

# Average metrics
expect(results).avg_cost_under(0.10)
expect(results).avg_latency_under(2000)

# Agent-specific
expect(results).agent_success_rate("agent_1", above=0.90)
```
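The aggregate assertions reduce to simple statistics over per-run results. A sketch with hypothetical run data (the tuple layout is invented for illustration):

```python
# Hypothetical per-run results: (passed, cost_in_dollars, latency_ms)
runs = [
    (True, 0.04, 1200),
    (True, 0.06, 900),
    (False, 0.05, 2400),
    (True, 0.03, 700),
]

success_rate = sum(passed for passed, _, _ in runs) / len(runs)
avg_cost = sum(cost for _, cost, _ in runs) / len(runs)

assert success_rate == 0.75   # what success_rate_above(...) would compare
assert avg_cost < 0.10        # what avg_cost_under(0.10) would check
```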

Collect all failures instead of stopping at the first one.

```python
from attest import soft_fail

with soft_fail():
    expect(result).output_contains("hello")  # Failure recorded
    expect(result).cost_under(0.01)          # Failure recorded
    expect(result).passes_judge("...")       # Still executes

# After the context exits, all three failures are reported
```
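The collection pattern can be approximated with the standard library. In this sketch, `soft_check` is a hypothetical helper (attest's `soft_fail` presumably records failures inside a single context rather than one context per check):

```python
from contextlib import contextmanager

failures = []

@contextmanager
def soft_check():
    """Record an AssertionError instead of letting it propagate."""
    try:
        yield
    except AssertionError as e:
        failures.append(str(e))

# Each check runs in its own soft context, so later checks still execute.
with soft_check():
    assert "hello" in "goodbye world", "output missing 'hello'"
with soft_check():
    assert 0.05 < 0.01, "cost over budget"

assert len(failures) == 2  # both failures were recorded, not raised
```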

When an assertion fails, you get detailed error information:

```text
✗ Assertion failed: output_contains("goodbye")
  Expected output to contain: goodbye
  Actual output: hello world
  Suggestion: Check if the prompt was clear
```

Continue testing after failures to collect all issues:

```python
from attest import soft_fail

with soft_fail():
    expect(result).output_contains("hello")  # May fail
    expect(result).cost_under(0.01)          # May fail

# Both assertions run, collecting all failures
```

Use LLM evaluation for semantic correctness:

```python
(expect(result)
    .passes_judge(
        prompt="Is the response grammatically correct?",
        model="gpt-4o",
        scoring="binary"  # binary, scale_0_10, or enum
    ))
```

Test agents built with popular frameworks:

```python
from attest.adapters import langchain, crewai, llamaindex

# LangChain agents
agent = langchain.create_agent(...)

# CrewAI tasks
task = crewai.create_task(...)

# LlamaIndex query engines
engine = llamaindex.create_query_engine(...)
```