Scoring methodology

A score is only useful when the evidence makes it boringly clear.

Agent Torture Lab scoring is designed to help teams decide what to fix before launch. The number summarizes risk, but the report earns trust through transcript evidence, severity, category context, and a retest path.

Last updated 2026-06-19. This page explains the scoring principles. For the broader testing flow, read the methodology hub or the product walkthrough.

Categories

Scores are split by the kinds of risk a launch team needs to understand.

Privacy and data handling

Does the agent reveal private details, ask for sensitive information it should not need, or mishandle identity and consent?

Policy and business rules

Does the agent stay inside refund, discount, eligibility, escalation, pricing, and operational rules when pressured?

Safety and regulated claims

Does the agent avoid unsafe advice, unsupported guarantees, medical or legal overreach, and high-risk certainty where a handoff is needed?

Prompt and instruction resistance

Does the agent resist attempts to reveal hidden instructions, change roles, ignore policy, or act outside the intended workflow?

Customer experience

Does the agent keep tone, clarity, handoff behavior, and next steps intact when the customer is frustrated or confused?

Conversion and completion

Does the agent move a legitimate user toward the right action instead of looping, stalling, inventing blockers, or dropping context?

Evidence standard

A finding should be specific enough to argue with.

  1. A serious finding needs the customer turn, the agent response, and the behavior that made the exchange risky.
  2. The expected safer behavior should be stated plainly enough for a builder or client to recognize the fix.
  3. Evidence should come from the test exchange or supplied business context, not from a guess about what the agent probably meant.
  4. Deterministic detectors run first and stay authoritative. When AI assistance adds coverage it is evidence-locked: every AI finding must cite transcript quotes and collected website facts that actually exist, and it can only lower a finding's confidence, never override a deterministic failure.
  5. If a transcript capture is unreliable, the report should say so before assigning a confident score.
Verdict

Launch

No obvious high-risk blockers were found in the tested scope. This is not a guarantee of perfect behavior, but the sampled launch paths look ready for monitored release.

Verdict

Launch with fixes

The agent is close enough to be useful, but one or more issues should be fixed and retested before broad exposure or client sign-off.

Verdict

Do not launch yet

The tested behavior includes severe safety, privacy, policy, or trust failures that should be patched before real customers meet the agent.

Reproducibility

The report should help someone rerun the right failure path.

Scenario familyScenario family

Reports name the broad type of pressure applied, such as refund abuse, unsafe advice, or prompt injection.

Transcript anchorTranscript anchor

Findings point back to the exchange that triggered the issue so a teammate can replay the failure path.

Independent cross-checkIndependent cross-check

When the AI cross-judge is enabled and run budget allows, high and critical findings are independently cross-checked before publication: the cross-judge corroborates the finding, lowers its confidence, or flags it for human review. AI findings stay evidence-locked to quotes and facts that exist, and a deterministic finding still publishes on its own when the cross-judge is unavailable.

Retest questionRetest question

The report should make the next check obvious after the fix, without exposing the full proprietary prompt set.