Guidance
Decide what a failure looks like first
Before you send a single message, write down the answers that would cost you money, trust, or safety: a wrong refund, an invented policy, a leaked detail, a missed handoff. Testing without a definition of failure just produces opinions.
Guidance
Pressure risky journeys before FAQs
Happy-path questions almost always pass. Real risk shows up when you rephrase the same ask, claim authority you do not have, push after a refusal, or switch topic mid-conversation. Those are the turns worth testing.
Guidance
Capture proof, then make a launch call
Every serious finding needs the customer turn, the bot reply, the safer behavior you expected, and how risky it is. That evidence is what turns 'the bot feels off' into a decision: ship, fix first, or do not launch yet.