Scope the launch risk
We start from the job your agent is supposed to do: support, sales, booking, intake, onboarding, or another customer-facing workflow. The test is framed around likely customer harm, not abstract cleverness.
Agent Torture Lab is built around a simple idea: the useful test is the one that produces evidence a team can act on. We pressure the agent like a confused, impatient, policy-bending customer, then turn the transcript into a launch report that explains what failed, why it matters, and what to rerun after the patch.
Last updated 2026-06-19.
The run uses families of pressure such as policy bending, unsafe advice, privacy leakage, prompt injection, handoff failure, tone breakdown, accuracy drift, and conversion dead ends.
A finding needs a captured exchange or a clearly supplied transcript. Website runs are rejected when the captured text looks like page copy, cookie banners, legal text, or anything outside the chat surface. Deterministic detectors run first and stay authoritative; when AI assistance is enabled it is evidence-locked, so every AI finding must cite transcript quotes and collected website facts that actually exist.
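The evidence lock described above can be pictured as a simple containment check. The following is a minimal sketch, not the product's implementation: the `AIFinding` shape, the `normalize` helper, and the `is_evidence_locked` function are hypothetical names introduced for illustration, and it covers only transcript quotes (collected website facts would presumably need a similar lookup).

```python
from dataclasses import dataclass, field


@dataclass
class AIFinding:
    """Hypothetical shape of an AI-assisted finding (assumed, not a real schema)."""
    title: str
    quotes: list[str] = field(default_factory=list)


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial formatting differences do not matter."""
    return " ".join(text.split()).casefold()


def is_evidence_locked(finding: AIFinding, transcript: str) -> bool:
    """Accept a finding only if it cites at least one quote and
    every cited quote actually occurs in the captured transcript."""
    haystack = normalize(transcript)
    quotes = [normalize(q) for q in finding.quotes]
    if not quotes:
        return False  # no citation at all: nothing to verify
    return all(q != "" and q in haystack for q in quotes)
```

Under this sketch, a finding that quotes text the agent never said is dropped before it reaches the report, no matter how plausible the AI's summary sounds.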
The report converts failures into category scores, severity, confidence, and a launch recommendation. Deterministic results are the ground truth: the AI judge never overrides a deterministic failure and can only lower a finding's confidence. When the AI cross-judge is enabled and the run budget allows, high and critical findings get a second, independent AI cross-check before publication, so a serious result is corroborated, downgraded, or flagged for human review. When the cross-judge is unavailable, the deterministic finding still publishes. The score is useful only because the evidence and fix path sit beside it.
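The precedence rule in this paragraph can be sketched in a few lines. This is an illustration under stated assumptions, not the actual scoring code: the `Finding` fields, the severity labels, the 0.0 to 1.0 confidence scale, and both function names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Finding:
    failed: bool       # result of the deterministic detector (ground truth)
    severity: str      # assumed labels: "low", "medium", "high", "critical"
    confidence: float  # assumed scale: 0.0 to 1.0


def apply_ai_judge(finding: Finding, ai_confidence: Optional[float]) -> Finding:
    """Fold in the AI judge's opinion without letting it override the detector.

    The AI score can only lower confidence; it never raises it and never
    changes `failed`. If the judge is unavailable, the finding is unchanged.
    """
    if ai_confidence is None:
        return finding
    return replace(finding, confidence=min(finding.confidence, ai_confidence))


def needs_cross_check(finding: Finding, cross_judge_enabled: bool, budget_left: int) -> bool:
    """High and critical failures get a second check when enabled and affordable."""
    return (
        finding.failed
        and finding.severity in {"high", "critical"}
        and cross_judge_enabled
        and budget_left > 0
    )
```

Taking the minimum of the two confidence values is one simple way to guarantee the "lower only" property; a real system might combine the scores differently while keeping the same constraint.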
The question, agent reply, failed behavior, and expected safer behavior.
A practical risk level tied to customer harm, compliance exposure, revenue loss, or trust damage.
A concrete change the builder, agency, or client can understand and retest.
A plain-English recommendation: launch, launch with fixes, or do not launch yet.
If a target cannot be reached, a website widget cannot be safely captured, or a transcript is not adequate evidence, the right outcome is a clearly stated limitation, not a forced finding. For the same reason, the deeper scoring explanation lives on the scoring methodology page.
How Agent Torture Lab tests prompt-injection risk in customer-facing chatbots without publishing reusable exploit recipes.
How to test AI chatbots for private-data exposure, account-specific answers, over-collection, and unsafe identity assumptions.
How to test whether AI chatbots escalate to a human or approved workflow before customers get trapped in risky loops.
How to test whether AI chatbots follow refund, discount, warranty, eligibility, safety, and business-rule policies under pressure.
How scores, evidence standards, and report verdicts are interpreted.
A practical readiness list for teams preparing a customer-facing agent.
Useful scenario categories without publishing proprietary prompt recipes.
What the decision artifact should include after the testing pass.
How launch reports compare with manual QA, generic LLM evals, and red-team tools.
Definitions for the terms used across launch reports and methodology pages.