Methodology

How we test AI agents without giving away the trapdoor map.

Agent Torture Lab is built around a simple idea: the useful test is the one that produces evidence a team can act on. We pressure the agent the way a confused, impatient, policy-bending customer would, then turn the transcript into a launch report that explains what failed, why it matters, and what to rerun after the patch.

Last updated 2026-06-19.

Method

The test is a structured launch-readiness pass, not random bot teasing.

01

Scope the launch risk

We start from the job your agent is supposed to do: support, sales, booking, intake, onboarding, or another customer-facing workflow. The test is framed around likely customer harm, not abstract cleverness.

02

Apply scenario families

The run uses families of pressure such as policy bending, unsafe advice, privacy leakage, prompt injection, handoff failure, tone breakdown, accuracy drift, and conversion dead ends.

03

Validate the evidence

A finding needs a captured exchange or a clearly supplied transcript. Website runs are rejected when the captured text looks like page copy, cookie banners, legal text, or anything outside the chat surface. Deterministic detectors run first and stay authoritative; when AI assistance is enabled it is evidence-locked, so every AI finding must cite transcript quotes and collected website facts that actually exist.
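As an illustration of the evidence lock, here is a minimal sketch of how an AI finding could be checked against the captured transcript and the collected website facts. The `AIFinding` shape, the `is_evidence_locked` name, and the whitespace and case normalization are assumptions made for this example, not the product's actual schema or matching rules.

```python
from dataclasses import dataclass, field


@dataclass
class AIFinding:
    """Hypothetical shape of an AI-assisted finding (illustrative only)."""
    title: str
    transcript_quotes: list[str] = field(default_factory=list)
    website_facts: list[str] = field(default_factory=list)


def _normalize(text: str) -> str:
    # Assumption: collapse whitespace and case so trivial formatting
    # differences do not cause a genuine quote to be rejected.
    return " ".join(text.split()).lower()


def is_evidence_locked(finding: AIFinding, transcript: str,
                       collected_facts: list[str]) -> bool:
    """Return True only if every cited quote and fact actually exists."""
    haystack = _normalize(transcript)
    known_facts = {_normalize(fact) for fact in collected_facts}

    # Assumption: a finding that cites no transcript quote at all is rejected.
    if not finding.transcript_quotes:
        return False

    quotes_ok = all(
        _normalize(quote) and _normalize(quote) in haystack
        for quote in finding.transcript_quotes
    )
    facts_ok = all(_normalize(fact) in known_facts
                   for fact in finding.website_facts)
    return quotes_ok and facts_ok
```

A finding whose quote does not appear in the transcript is dropped in this sketch rather than repaired, which keeps the AI layer from adding evidence that was never captured.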

04

Score for decisions

The report converts failures into category scores, severity, confidence, and a launch recommendation. Deterministic results are the ground truth: the AI judge never overrides a deterministic failure and can only lower a finding's confidence. When the AI cross-judge is enabled and the run budget allows, high and critical findings get a second, independent AI cross-check before publication, so a serious result is corroborated, downgraded, or flagged for human review. When the cross-judge is unavailable, the deterministic finding still publishes. The score is useful only because the evidence and fix path sit beside it.
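The precedence rule can be sketched in a few lines. The `Finding` fields, the 0.0 to 1.0 confidence scale, and the `budget_remaining` counter below are assumptions for illustration; the only behavior taken from the description is that the AI judge may lower confidence but never clears a deterministic failure, and that only high and critical findings are sent for a cross-check.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record (illustrative field names)."""
    category: str
    severity: str       # assumed levels: "low", "medium", "high", "critical"
    confidence: float   # assumed scale: 0.0 to 1.0
    failed: bool        # set by the deterministic detector


def apply_ai_judge(det: Finding, ai_confidence: Optional[float]) -> Finding:
    """Merge an AI opinion into a deterministic finding.

    The AI judge can only lower confidence. It cannot raise it and it
    cannot change `failed`, so the deterministic result stays authoritative.
    """
    if ai_confidence is None:  # AI assistance disabled or unavailable
        return det
    return replace(det, confidence=min(det.confidence, ai_confidence))


def needs_cross_check(finding: Finding, cross_judge_enabled: bool,
                      budget_remaining: int) -> bool:
    """Decide whether a finding should get the second, independent check."""
    serious = finding.severity in {"high", "critical"}
    return (finding.failed and serious
            and cross_judge_enabled and budget_remaining > 0)
```

Using `min` makes the one-way rule hard to violate by accident: an over-confident AI opinion leaves the deterministic confidence unchanged.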

What stays private

We can be transparent about standards without publishing the exploit cookbook.

We explain categories, gates, and scoring principles without publishing private scenario prompts.

We do not claim a run covers every possible customer behavior or every model jailbreak.

We avoid leaking customer-owned transcripts, credentials, business rules, or private policy details.

We separate illustrative examples from real findings so the marketing page does not pretend to be a benchmark.

Report shape

The deliverable is designed for a launch decision.

Evidence

The question, agent reply, failed behavior, and expected safer behavior.

Severity

A practical risk level tied to customer harm, compliance exposure, revenue loss, or trust damage.

Fix path

A concrete change the builder, agency, or client can understand and retest.

Launch call

A plain-English recommendation: launch, launch with fixes, or do not launch yet.
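For teams that want to store or diff reports, the four parts could be modeled roughly as follows. This is a sketch under assumptions: the class and field names are hypothetical and the real report format is not specified here. Only the three launch recommendations are taken from the description.

```python
from dataclasses import dataclass
from enum import Enum


class LaunchCall(Enum):
    # The three plain-English recommendations named in the report shape.
    LAUNCH = "launch"
    LAUNCH_WITH_FIXES = "launch with fixes"
    DO_NOT_LAUNCH_YET = "do not launch yet"


@dataclass(frozen=True)
class ReportFinding:
    """One finding: evidence, severity, and fix path (hypothetical fields)."""
    question: str
    agent_reply: str
    failed_behavior: str
    expected_behavior: str
    severity: str
    fix_path: str


@dataclass(frozen=True)
class LaunchReport:
    findings: tuple[ReportFinding, ...]
    launch_call: LaunchCall

    def summary(self) -> str:
        # One-line summary suitable for a ticket title or status message.
        count = len(self.findings)
        return f"{count} finding(s); recommendation: {self.launch_call.value}"
```

Keeping the evidence fields on the same record as severity and fix path mirrors the principle that a score is only useful next to what produced it.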

Honest by design

Unsupported is a valid result. Fake confidence is not.

If a target cannot be reached, a website widget cannot be safely captured, or a transcript is not adequate evidence, the right answer is a clear limitation. That is also why the deeper scoring explanation lives on the scoring methodology page.