How we do it

We ask more than three polite questions.

Agent Torture Lab runs adversarial customer simulations, validates the captured transcript, checks it against known failure patterns, and turns the evidence into a launch report. We are not trying to embarrass the bot. We are trying to find the problems a real customer would turn into refunds, risk, churn, or screenshots.

  • No API keys in endpoint URLs
  • No logins, cookies, uploads, or downloads for website runs
  • Unsupported widgets are blocked from paid scored reports
  • Manual transcript fallback uses fixed rules only
  • Deterministic detectors are authoritative and run first
  • AI assistance cannot publish uncited transcript claims
  • The AI judge only lowers confidence, never overrides a failure
  • High and critical findings get an independent AI cross-check when that pass is enabled
Full path

Every result has to pass through the same visible chain.

01
Target

Reachability gate

Can we safely talk to this endpoint or widget?

Allowed target or no-charge block
02
Context

Business facts

What does the site or supplied context say the bot should know?

Evidence notes and rules
03
Pressure

Scenario pack

Which customer behaviors should this agent survive?

Balanced test queue
04
Transcript

Reality check

Did we capture the chat surface, not page chrome or boilerplate?

Scoreable transcript
05
Report

Launch decision

What broke, why it matters, and what should be rerun?

Score, verdict, fix list
Under the hood

The roast is a sequence of gates, not one mystery score.

01

We check the target before the roast starts

API URLs go through public-URL safety checks, secret-looking query strings are rejected, and website tests get a compatibility precheck. If the bot cannot be reached cleanly, the run stops with a no-charge state instead of inventing a result.
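
A minimal sketch of what a target gate like this could look like, assuming a simple policy: http or https only, no loopback or private addresses, and no query parameter whose name hints at a credential. The function name `check_target`, the hint list, and the reason strings are hypothetical and are not taken from the product.

```python
import ipaddress
from urllib.parse import urlsplit, parse_qsl

# Hypothetical name fragments that suggest a credential in the query string.
SECRET_PARAM_HINTS = ("key", "token", "secret", "password", "auth")


def check_target(url: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a candidate endpoint URL."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False, "unsupported scheme"
    host = parts.hostname
    if not host:
        return False, "missing host"
    if host == "localhost" or host.endswith(".local"):
        return False, "non-public host"
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # a DNS name; a fuller check would also resolve and re-test it
    if ip is not None and not ip.is_global:
        return False, "non-public address"
    for name, _ in parse_qsl(parts.query, keep_blank_values=True):
        if any(hint in name.lower() for hint in SECRET_PARAM_HINTS):
            return False, f"secret-looking query parameter: {name}"
    return True, "ok"
```

A rejected target would map to the no-charge state described above rather than to a scored run.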

02

We choose pressure, not random chatter

Scenario packs are selected by channel and industry, then balanced across categories. A short run still touches privacy, escalation, safety, conversion, injection, accuracy, tone, and multilingual risk.
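
One way to picture the balancing step is a round-robin over categories, so that even a short queue takes one scenario from each category before any category gets a second. This is only a sketch: the `balanced_queue` helper and the dictionary shape of a scenario are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import zip_longest


def balanced_queue(scenarios, limit):
    """Interleave scenarios by category, then cut the queue to `limit` items."""
    by_category = defaultdict(list)
    for scenario in scenarios:
        by_category[scenario["category"]].append(scenario)
    queue = []
    # Take the first scenario of every category, then the second, and so on.
    for round_ in zip_longest(*by_category.values()):
        queue.extend(s for s in round_ if s is not None)
    return queue[:limit]
```

With this ordering, a limit equal to the number of categories yields exactly one scenario per category.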

03

We send realistic customer pressure

The simulated customer asks the awkward things real users ask: refund exceptions, policy challenges, unsafe advice, handoff demands, prompt probes, invented pricing, and confused multilingual turns.

04

We prove the transcript is real enough to score

Captured replies are rejected if they look like cookie banners, footer links, privacy text, page-body copy, repeated boilerplate, or text from outside the chat surface.
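
The rejection rule can be sketched as a filter that runs before scoring. The patterns below are hypothetical stand-ins for page chrome, and the repeat threshold is an assumption; a production check would also need to confirm that the text came from inside the chat surface, which plain string matching cannot do.

```python
import re

# Hypothetical patterns for page chrome; a real detector would be broader.
ARTIFACT_PATTERNS = [
    re.compile(r"\b(accept|manage) (all )?cookies\b", re.I),
    re.compile(r"\ball rights reserved\b", re.I),
    re.compile(r"\bterms (of|and) (service|use|conditions)\b", re.I),
]


def filter_replies(replies, max_repeats=2):
    """Split captured replies into (accepted, rejected-with-reason)."""
    accepted, rejected = [], []
    seen = {}
    for text in replies:
        norm = " ".join(text.lower().split())
        seen[norm] = seen.get(norm, 0) + 1
        if not norm:
            rejected.append((text, "empty"))
        elif any(p.search(norm) for p in ARTIFACT_PATTERNS):
            rejected.append((text, "page artifact"))
        elif seen[norm] > max_repeats:
            rejected.append((text, "repeated boilerplate"))
        else:
            accepted.append(text)
    return accepted, rejected
```

Keeping the rejection reason next to each dropped reply makes it possible to explain why a run produced no scoreable transcript.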

05

We turn replies into findings

Deterministic detectors run first and stay authoritative. When AI assistance is enabled for context, it is evidence-locked: cited transcript quotes must exist and cited website facts must come from collected page evidence. The AI judge never overrides a deterministic failure and can only lower confidence. When the AI cross-judge is enabled and budget allows, high and critical findings get a second AI check before publication. If that pass is unavailable, the deterministic finding still publishes.
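
Two of the rules above are easy to state as code: a cited quote has to appear in the transcript, and an AI opinion can move confidence down but never up, and never changes whether the finding failed. The sketch below assumes a hypothetical `Finding` record and whitespace-insensitive quote matching; neither is confirmed by the product.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Finding:
    rule: str
    failed: bool        # set by the deterministic detector
    confidence: float   # 0.0 to 1.0
    quote: str          # transcript text cited as evidence


def _normalize(text):
    return " ".join(text.lower().split())


def quote_is_cited(quote, transcript_rows):
    """True only if the quote appears in at least one transcript row."""
    needle = _normalize(quote)
    return bool(needle) and any(needle in _normalize(row) for row in transcript_rows)


def apply_ai_judge(finding, ai_confidence):
    """Lower confidence if the AI is less sure; never raise it, never flip `failed`."""
    return replace(finding, confidence=min(finding.confidence, ai_confidence))
```

Because `apply_ai_judge` only touches `confidence`, a deterministic failure stays a failure whatever the AI pass returns.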

06

We assemble the launch report

Findings are grouped by rule, severity, confidence, and category. The report says what failed, why it matters, what to fix, and which path to rerun after the patch.

Evidence gate

Bad transcript capture is treated as a product risk, not a scoring shortcut.

Website chat testing is useful only when the runner captures the agent itself. The page body can look like a conversation if you squint, so the pipeline fails closed when the evidence is weak.

Accepted

A bot reply inside the tested chat

The captured answer follows a sent scenario turn, includes the agent's actual wording, and can be cited back to a transcript row.

Rejected

A page artifact pretending to be a reply

Cookie banners, nav links, footer copy, privacy text, repeated placeholders, and static page paragraphs are rejected before scoring.

Fallback

Manual transcript with fixed rules

When live access is not possible, pasted transcripts can still be checked, and the report states what was and was not observed directly.

Failure coverage

We look for the failures that make agents expensive after launch.

Policy bending

  • Refund abuse
  • Discount pressure
  • Identity bypass
  • Invented policies

Safety and trust

  • Unsafe claims
  • Privacy leakage
  • Human handoff misses
  • Over-compliance

Agent control

  • Prompt injection
  • Hidden-instruction leaks
  • Off-domain answers
  • Silent failures

Customer experience

  • No next step
  • Bad tone
  • Multilingual loss
  • Repeated boilerplate
Score anatomy

The graphs are there to show the shape of risk, not hide it.

A high score can still carry a dangerous category. A low score should show exactly where the damage came from. The report breaks risk into categories so the next fix is obvious.

Category signal (illustrative run)

  • Privacy: 72
  • Injection: 58
  • Escalation: 76
  • Conversion: 61
  • Safety: 69
  • Accuracy: 84

Severity

Critical issues move the score most

Evidence

Weak or uncited claims do not get a free pass

Category

Privacy, safety, and injection are weighted harder

Reliability

Unsupported or partial captures change the verdict language

Why the report feels credible

Evidence first. Then the score. Then the fixes.

The report is meant to survive a skeptical founder, a client stakeholder, or the developer who has to patch the agent tomorrow. It does not hide behind a black-box grade. It shows the exchange, the failure type, the business risk, and the next test question to rerun.

Report anatomy: what you get back
  1. Verdict and score
  2. Transcript evidence
  3. Severity and confidence
  4. Business impact
  5. Fix recommendation
  6. Rerun question

Every serious issue needs evidence

A finding is more than a label. The report shows the test question, the bot answer, the failed rule, and the expected safer behavior.

The score is explainable

Reports start at 100 and deduct for severity, category, confidence, and endpoint reliability. Critical privacy, safety, and injection failures hit hardest.
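
As a rough illustration of a deduction model like this, the sketch below starts at 100 and subtracts per finding. Every number in it is an assumption chosen for the example (the severity points, the 1.5 multiplier for the heavier categories, and the flat reliability penalty); the real weights are not published here.

```python
# Hypothetical weights, for illustration only.
SEVERITY_POINTS = {"low": 2, "medium": 5, "high": 10, "critical": 20}
HEAVY_CATEGORIES = {"privacy", "safety", "injection"}
HEAVY_MULTIPLIER = 1.5


def score_report(findings, reliability_penalty=0.0):
    """Start at 100 and deduct per finding, scaled by category and confidence."""
    score = 100.0
    for finding in findings:
        points = SEVERITY_POINTS[finding["severity"]]
        if finding["category"] in HEAVY_CATEGORIES:
            points *= HEAVY_MULTIPLIER
        score -= points * finding["confidence"]
    score -= reliability_penalty  # e.g. partial or unreliable capture
    return max(0, round(score))
```

Under these assumed weights, one fully confident critical privacy finding costs 30 points, while a medium tone finding at 0.8 confidence costs 4.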

The verdict is a launch decision

The output is launch, launch with fixes, or do not launch. The threshold comes from the score and the number of high-risk findings.
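
A verdict rule of this kind can be sketched as a small function of the score and the high-risk count. The thresholds below (85, 60, and at most two high-risk findings) are invented for the example and should not be read as the product's actual cut-offs.

```python
def verdict(score, high_risk_count, launch_at=85, fixes_at=60, max_high_risk=2):
    """Map a score and a high-risk finding count to a launch decision."""
    if high_risk_count == 0 and score >= launch_at:
        return "launch"
    if score >= fixes_at and high_risk_count <= max_high_risk:
        return "launch with fixes"
    return "do not launch"
```

Note that a high score alone is not enough in this sketch: a single high-risk finding already rules out a clean "launch".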

The fix list is operational

Repeated failures collapse into one backlog item where possible, so teams fix the root behavior instead of chasing duplicate notes.
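
Collapsing duplicates can be as simple as grouping findings by the rule they broke and keeping the worst severity seen. The sketch below assumes each finding carries hypothetical `rule`, `severity`, and `turn` fields.

```python
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def collapse_findings(findings):
    """Group findings by rule into backlog items, worst severity first."""
    backlog = {}
    for finding in findings:
        item = backlog.setdefault(
            finding["rule"],
            {"rule": finding["rule"], "severity": finding["severity"],
             "occurrences": 0, "turns": []},
        )
        item["occurrences"] += 1
        item["turns"].append(finding["turn"])
        # Keep the highest severity observed for this rule.
        if SEVERITY_ORDER[finding["severity"]] > SEVERITY_ORDER[item["severity"]]:
            item["severity"] = finding["severity"]
    return sorted(backlog.values(),
                  key=lambda i: SEVERITY_ORDER[i["severity"]], reverse=True)
```

Each backlog item keeps the turns where the rule failed, so the evidence for every occurrence is still reachable from the single entry.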

Honest status

API endpoint and website chat roasting are live.

Website chat works best for simple public widgets, and manual transcript analysis is the fallback when a bot cannot be reached directly. That constraint is part of the product: a no-result is better than a confident-looking fake result.