Reachability gate
Can we safely talk to this endpoint or widget?
Agent Torture Lab runs adversarial customer simulations, validates the captured transcript, checks it against known failure patterns, and turns the evidence into a launch report. We are not trying to embarrass the bot. We are trying to find the problems a real customer would turn into refunds, risk, churn, or screenshots.
Can we safely talk to this endpoint or widget?
What does the site or supplied context say the bot should know?
Which customer behaviors should this agent survive?
Did we capture the chat surface, not page chrome or boilerplate?
What broke, why it matters, and what should be rerun?
API URLs go through public-URL safety checks, secret-looking query strings are rejected, and website tests get a compatibility precheck. If the bot cannot be reached cleanly, the run stops with a no-charge state instead of inventing a result.
Scenario packs are selected by channel and industry, then balanced across categories. A short run still touches privacy, escalation, safety, conversion, injection, accuracy, tone, and multilingual risk.
The simulated customer asks the awkward things real users ask: refund exceptions, policy challenges, unsafe advice, handoff demands, prompt probes, invented pricing, and confused multilingual turns.
Captured replies are rejected if they look like cookie banners, footer links, privacy text, page-body copy, repeated boilerplate, or text from outside the chat surface.
Deterministic detectors run first and stay authoritative. When AI assistance is enabled for context, it is evidence-locked: cited transcript quotes must exist and cited website facts must come from collected page evidence. The AI judge never overrides a deterministic failure and can only lower confidence. When the AI cross-judge is enabled and budget allows, high and critical findings get a second AI check before publication. If that pass is unavailable, the deterministic finding still publishes.
Findings are grouped by rule, severity, confidence, and category. The report says what failed, why it matters, what to fix, and which path to rerun after the patch.
Allowed target or no-charge block
Evidence notes and rules
Balanced test queue
Scoreable transcript
Score, verdict, fix list
Website chat testing is useful only when the runner captures the agent itself. The page body can look like a conversation if you squint, so the pipeline fails closed when the evidence is weak.
The captured answer follows a sent scenario turn, includes the agent's actual wording, and can be cited back to a transcript row.
Cookie banners, nav links, footer copy, privacy text, repeated placeholders, and static page paragraphs are rejected before scoring.
When live access is not possible, pasted transcripts can still be checked, but the page says what was and was not observed directly.
A high score can still carry a dangerous category. A low score should show exactly where the damage came from. The report breaks risk into categories so the next fix is obvious.
Critical issues move the score most
Weak or uncited claims do not get a free pass
Privacy, safety, and injection are weighted harder
Unsupported or partial captures change the verdict language
The report is meant to survive a skeptical founder, a client stakeholder, or the developer who has to patch the agent tomorrow. It does not hide behind a black-box grade. It shows the exchange, the failure type, the business risk, and the next test question to rerun.
A finding is more than a label. The report shows the test question, the bot answer, the failed rule, and the expected safer behavior.
Reports start at 100 and deduct for severity, category, confidence, and endpoint reliability. Critical privacy, safety, and injection failures hit hardest.
The output is launch, launch with fixes, or do not launch. The threshold comes from the score and the number of high-risk findings.
Repeated failures collapse into one backlog item where possible, so teams fix the root behavior instead of chasing duplicate notes.
API endpoint roasting is live. Website chat works best for simple public widgets, and manual transcript analysis is the fallback when a bot cannot be reached directly. That constraint is part of the product: a no-result is better than a confident-looking fake result.