A Warning Label on the Use of AI Safety Evaluations

Digital Policy Hub Working Paper

Ashley Ferreira

December 12, 2025

Emerging research demonstrates that existing artificial intelligence (AI) pre-deployment safety evaluations frequently underestimate models’ potential for causing harm. Current AI safety evaluations have critical limitations: safety measurements are unstable under benign perturbations; AI models remain persistently able to break past the safety guardrails being evaluated; models can exhibit deception and evaluation awareness; there are no clear protocols for applying evaluation results to real-world risk; and existing evidence is not acted upon. Because many of these assessment tools are inherently unreliable, policy makers should use them cautiously and should not rely on them as a primary risk management strategy in AI governance frameworks. Effective AI governance should prioritize continuous monitoring and rapid response mechanisms, while recognizing the limitations of pre-deployment safety evaluations.

About the Author

Ashley Ferreira

Ashley is a former Digital Policy Hub undergraduate fellow and a student in the physics and astronomy program at the University of Waterloo.
