Resource

Chatbot test scenarios should feel like customers, not puzzle-box stunts.

Good AI agent tests cover the pressure real users create: confusion, impatience, policy requests, unsafe assumptions, and the occasional attempt to make the bot do something it should not. These are the scenario families worth covering before launch.

Scenario families

Build coverage without leaking your test deck.

Policy pressure

Tests whether the agent respects business rules when a customer pushes for a special outcome.

  • Refund exception requests
  • Discount negotiation
  • Eligibility edge cases

Prompt and role attacks

Tests whether the agent stays inside its assigned role and refuses to expose internal instructions.

  • Ignore-policy attempts
  • Hidden-instruction requests
  • Tool misuse pressure

Privacy and identity

Tests whether the agent protects private information and avoids collecting unnecessary sensitive data.

  • Account-detail fishing
  • Consent confusion
  • Sensitive data oversharing

Safety and regulated claims

Tests whether the agent avoids risky claims and routes high-stakes situations to the right fallback.

  • Medical certainty
  • Legal overreach
  • Unsafe operational advice

Escalation and handoff

Tests whether the agent knows when it should stop improvising and move the customer to a human path.

  • Urgent human request
  • Repeated dissatisfaction
  • Complex edge case

Completion and conversion

Tests whether legitimate customers can finish the job instead of getting stuck in polite loops.

  • Booking friction
  • Checkout confusion
  • Lead qualification dead ends
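
One way to track coverage across these families is a plain lookup table. The sketch below is illustrative only: the family and scenario names come from the list above, but the dictionary layout and the `coverage_gaps` helper are hypothetical, not part of any particular tool.

```python
# Scenario families and scenario types as listed above.
# The dict layout and helper name are assumptions made for illustration.
SCENARIO_FAMILIES = {
    "Policy pressure": [
        "Refund exception requests",
        "Discount negotiation",
        "Eligibility edge cases",
    ],
    "Prompt and role attacks": [
        "Ignore-policy attempts",
        "Hidden-instruction requests",
        "Tool misuse pressure",
    ],
    "Privacy and identity": [
        "Account-detail fishing",
        "Consent confusion",
        "Sensitive data oversharing",
    ],
    "Safety and regulated claims": [
        "Medical certainty",
        "Legal overreach",
        "Unsafe operational advice",
    ],
    "Escalation and handoff": [
        "Urgent human request",
        "Repeated dissatisfaction",
        "Complex edge case",
    ],
    "Completion and conversion": [
        "Booking friction",
        "Checkout confusion",
        "Lead qualification dead ends",
    ],
}


def coverage_gaps(covered):
    """Return {family: [scenario types with no test yet]} for families with gaps."""
    gaps = {}
    for family, scenario_types in SCENARIO_FAMILIES.items():
        missing = [t for t in scenario_types if t not in covered]
        if missing:
            gaps[family] = missing
    return gaps
```

Calling `coverage_gaps` with the set of scenario types you already test returns only the families that still have holes, which keeps the public description at the family level while the exact prompts stay in your private deck.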

Safe public examples

Teach coverage, not bypass recipes.

  1. Describe scenario families publicly; keep exact proprietary prompts private.
  2. Keep real customer transcripts private unless they are intentionally shared and sanitized.
  3. Avoid teaching users how to bypass a specific deployed agent.
  4. Tie each example to the expected safer behavior, not just to the trick itself.

How this maps to reports

Every useful scenario needs an expected safer behavior.

A test is strongest when it says what should have happened: refuse the policy exception, ask for clarification, protect private data, route to a human, or give the next legitimate step. That is why Agent Torture Lab reports pair findings with expected behavior and retest guidance.
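
As a rough illustration of pairing a finding with its expected behavior, a scenario record might look like the sketch below. The expected behaviors are the five named in the paragraph above; the class names, fields, and `report_line` format are hypothetical and do not describe the actual Agent Torture Lab report format. The sketch also assumes a reviewer or grader has already labeled what the agent did.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Expected(Enum):
    """Expected safer behaviors, taken from the paragraph above."""
    REFUSE_POLICY_EXCEPTION = "refuse the policy exception"
    ASK_FOR_CLARIFICATION = "ask for clarification"
    PROTECT_PRIVATE_DATA = "protect private data"
    ROUTE_TO_HUMAN = "route to a human"
    GIVE_NEXT_STEP = "give the next legitimate step"


@dataclass(frozen=True)
class Scenario:
    family: str
    scenario_type: str
    expected: Expected


@dataclass(frozen=True)
class Finding:
    scenario: Scenario
    # Behavior a reviewer observed; None means none of the safer behaviors occurred.
    observed: Optional[Expected]

    @property
    def passed(self) -> bool:
        return self.observed is self.scenario.expected


def report_line(finding: Finding) -> str:
    """Format one finding with its expected behavior (hypothetical layout)."""
    s = finding.scenario
    status = "PASS" if finding.passed else "FAIL"
    line = f"{status} [{s.family} / {s.scenario_type}] expected: {s.expected.value}"
    if not finding.passed:
        line += "; retest after fix"
    return line
```

For example, a refund-exception scenario where the agent granted the exception would be recorded with `observed=None` and reported as a failure alongside the behavior that should have happened.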