# Agent Torture Lab Capabilities

Last updated: 2026-06-19
Site: https://www.agenttorture.com/
Contact: hello@agenttorture.com

## What Agent Torture Lab Does

Agent Torture Lab crash-tests customer-facing AI agents before launch. It runs messy customer simulations and turns failures into transcript evidence, severity, confidence, recommended fixes, and a rerun plan.

## Live Public Capabilities

- Public website chatbot roast: attempts to open compatible public chat widgets in an isolated browser and capture replies
- Public API endpoint roast: sends fixed scenario messages to a public endpoint that follows the documented test contract
- Manual transcript fallback: analyzes pasted transcript text when a bot cannot be reached directly
- Guest report unlock: one-time payment unlocks the full report after Stripe webhook verification
- Sample reports: public examples show the report shape, evidence model, and fix guidance
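
The "Stripe webhook verification" step in the guest unlock can be illustrated with a minimal sketch. This document does not say how the site implements it, so the code below only shows Stripe's published signature scheme (an HMAC-SHA256 over `timestamp.payload`, compared against the `v1` value in the `Stripe-Signature` header) using the Python standard library. The function name `verify_stripe_signature`, the `now` parameter, and the 300-second tolerance are assumptions made for illustration.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(
    payload: bytes,
    sig_header: str,
    secret: str,
    tolerance_s: int = 300,          # assumed replay window, not from this document
    now: Optional[float] = None,     # injectable clock, for testing only
) -> bool:
    """Return True if sig_header carries a valid v1 signature for payload."""
    timestamp = None
    candidates = []
    # Header shape: "t=<unix seconds>,v1=<hex digest>[,v1=...]"
    for part in sig_header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)

    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False

    # Reject events whose timestamp is too far from the current time.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False

    # The signed message is the timestamp, a dot, and the raw request body.
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()

    # Constant-time comparison against every v1 candidate.
    return any(
        hmac.compare_digest(expected.encode(), c.encode()) for c in candidates
    )
```

In practice the official `stripe` library's `stripe.Webhook.construct_event(payload, sig_header, endpoint_secret)` performs this check. The point of the sketch is that the report should unlock only after the check passes on the raw, unmodified request body.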

## Core Test Coverage

- Refund and policy pressure
- Prompt-injection style pressure
- Privacy and private-data handling
- Unsafe advice and regulated-claim boundaries
- Missed escalation and human handoff failures
- Multilingual context drift
- Conversion blockers and sales handoff failures
- Tone failures under frustrated customer pressure

## Evaluation Methodology

Evaluation is deterministic-first: deterministic detectors run first and stay authoritative, and their scores and failure flags are the ground truth.

- When enabled, an evidence-locked AI judge adds coverage in the full product: every AI finding must cite transcript quotes and collected website facts that actually exist. The AI judge never overrides a deterministic failure and can only lower a finding's confidence.
- When the AI cross-judge is enabled and the run budget allows, high and critical findings are independently cross-checked before publication: the cross-judge corroborates the finding, lowers its confidence, or flags it for human review. When the cross-judge is disabled or over budget, the deterministic finding still publishes on its own. Deterministic results stay authoritative either way.
- The AI judge and cross-judge are an additive, bounded layer, not the primary scorer.
- The anonymous guest path (website chatbot roast, public API endpoint roast, manual transcript fallback, sample reports) is deterministic-only by design and does not use the AI judge or cross-judge.
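
The layering described above can be sketched in a few lines. This is an illustrative model under stated assumptions, not the product's actual code: the `Finding` and `JudgeOpinion` types, their field names, and the choice to discard an opinion whose quotes are missing from the transcript are all hypothetical. For brevity the evidence lock checks only transcript quotes and omits collected website facts.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """A deterministic detector result (hypothetical shape)."""
    rule: str
    failed: bool        # deterministic failure flag, never changed by the judge
    severity: str       # e.g. "low", "medium", "high", "critical"
    confidence: float   # 0.0 to 1.0


@dataclass(frozen=True)
class JudgeOpinion:
    """An AI judge's view of one finding (hypothetical shape)."""
    quotes: tuple[str, ...]   # transcript quotes the judge cites as evidence
    confidence: float


def quotes_exist(quotes: tuple[str, ...], transcript: str) -> bool:
    # Evidence lock: at least one quote, and every quote appears verbatim.
    return bool(quotes) and all(q in transcript for q in quotes)


def apply_judge(
    finding: Finding, opinion: Optional[JudgeOpinion], transcript: str
) -> Finding:
    # Guest path or judge disabled: the deterministic finding stands alone.
    if opinion is None:
        return finding
    # An opinion citing text that is not in the transcript is ignored (assumption).
    if not quotes_exist(opinion.quotes, transcript):
        return finding
    # The judge may lower confidence but never raise it or clear the failure.
    return replace(finding, confidence=min(finding.confidence, opinion.confidence))
```

For example, a judge opinion with confidence 0.99 leaves a 0.9 deterministic finding at 0.9, while one with 0.6 lowers it to 0.6. In both cases `failed` and `severity` are unchanged.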

## Not Currently Promised as Live

- Voice-agent execution
- Login-protected target testing
- WhatsApp, Instagram, or private-channel testing
- Continuous website monitoring
- Scheduled automated retests
- White-label delivery as a fully live public feature
- Full infrastructure penetration testing

## Best Fit

- Builders shipping support, sales, ecommerce, booking, checkout, or service chat agents
- Founders who need evidence before putting a bot in front of real customers
- Agencies that need a client-readable QA artifact before handoff
- Teams comparing manual chatbot QA, generic LLM evals, and red-team style testing

## Output

The primary output is an AI agent launch report. It includes the tested scope, transcript evidence, severity, confidence, recommended fixes, and a specific rerun path after changes.

Useful pages:

- Bot Roast: https://www.agenttorture.com/bot-roast
- AI agent launch report: https://www.agenttorture.com/reports/ai-agent-launch-report
- Methodology: https://www.agenttorture.com/methodology
- Pricing: https://www.agenttorture.com/pricing