Each new release of large language models (LLMs) often comes with claims of both improved performance and enhanced safety. However, safety assessments remain unstandardized, and little work has tracked these metrics over time. This working paper addresses that gap by analyzing performance on standardized safety benchmarks across LLMs released over the last three years to gauge whether models are becoming safer. Under this method of evaluation, newer models do score higher overall on these benchmarks; however, the improvements are not dramatic, and when newer models do fail, the failures are far more consequential because more capable models can cause greater harm. Going forward, safety benchmarks should account for this added dimension by quantifying how harmful an LLM's failures can be. It is recommended to devise a system in which the vulnerabilities of LLMs can be studied, shared, and addressed, while the specifics of how to exploit them are kept from bad actors. Finally, since improvements in safety do not seem to be naturally keeping pace with improvements in overall artificial intelligence capability, more external pressure is required to ensure we sufficiently guard against the release of dangerous models.
