We Need to Talk about AI Reproducibility

Some AI variants challenge long-standing conceptions of how researchers conduct rigorous science.

March 21, 2024
In 2023, Meta released Llama 2, an open-source model intended to help address research problems such as running LLMs on mobile devices and AI personalization. (Photo illustration via REUTERS)

There’s a contradiction at the heart of artificial intelligence (AI). On one hand, scientists increasingly rely on variants of the technology for scientific discovery. On the other hand, many variants of AI are black boxes — opaque and unaccountable to their users. One can provide them with input and receive an output, but the user cannot examine the system’s code, data sets or logic. Developers, deployers, policy makers and AI users must address this opacity if we want generative AI models to enable rigorous, reproducible research.

AI is clearly facilitating scientific breakthroughs. Meteorologists have used it to correct for biases in weather patterns. Physicists have used it to match the amount of dark matter to the amount of visible matter in galaxies. Molecular biologists have used generative AI to create new drugs and antibiotics, or to explore complex conditions such as Alzheimer’s disease.

Yet some AI variants, in particular generative AI, challenge long-standing conceptions of how researchers conduct rigorous science. The scientific method is collaborative and based on trial and error. Researchers develop a hypothesis; test their theories; revise their models, if necessary; and then publish their data and findings. These researchers are eager for their peers to reproduce their findings so their work will be deemed credible and reliable. But researchers can’t reproduce the findings about a given AI model (for example, that it can successfully translate a given text into X number of languages) when the model and its underlying data set are black boxes.

The companies creating generative AI often tout the outputs of their models but say little about how they built them. A recent Stanford University study of 10 such models found only a handful of developers had demonstrated the limitations of their tools or asked third parties to evaluate their claims about the models’ capabilities. While every developer in the study described the input and output modality of their model, only three disclosed the model’s inputs and only two disclosed the model size. Hence, external researchers can probe these models for capacity or performance from the outside. But without further information about these models and their underlying data sets, outside researchers cannot test the models for reliability or trustworthiness — the very metrics that these firms should want outsiders to reproduce.

Moreover, because of that opacity, other researchers cannot easily improve generative AI models. These models can also make mistakes or even fabricate information (i.e., “hallucinations”). Outside researchers cannot help these firms reduce these mistakes and hallucinations without greater information about how these models work and their underlying data.

Recently, some research organizations and private firms have created large language models (LLMs) that are more open, allowing outside researchers to test and reproduce their findings. For example, in 2022, Hugging Face announced the release of BLOOM, the first multilingual LLM available in some 46 natural languages. Any individual or institution that agrees to the terms of the model’s Responsible AI Licence can use and build upon the model on a local machine or on a cloud provider. In 2023, Meta released Llama 2, an open-source model that facilitated collective efforts to address research problems such as running LLMs on mobile devices and AI personalization. In February 2024, the Allen Institute for AI released OLMo, an LLM that provides outsiders with greater information about its data, training code, models and evaluation code than many previous models. The institute argued that with such information, researchers will be able to work faster and will no longer need to depend on qualitative assumptions about model performance.

Meanwhile, policy makers are taking some steps to incentivize greater transparency. The EU Artificial Intelligence Act, which was approved by the European Parliament on March 13, includes language to encourage the developers of high-risk AI systems to provide more information on the origins of data used for AI (data provenance). China requires developers of customer-facing AI to take effective measures to improve the quality, accuracy and diversity of the training data underpinning generative AI. The US Congress is considering the AI Foundation Model Transparency Act, which directs the US government to establish rules for how to report training data transparency.

These developments signal that AI developers and policy makers recognize the importance of greater transparency and reproducibility. The public deserves greater information about the data, models and algorithms that underpin various types of AI. Without such information, AI models will lack scientific rigour and public trust in AI will falter.

The opinions expressed in this article/multimedia are those of the author(s) and do not necessarily reflect the views of CIGI or its Board of Directors.

About the Authors

Susan Ariel Aaronson is a CIGI senior fellow, research professor of international affairs at George Washington University and co-principal investigator with the National Science Foundation/National Institute of Standards and Technology, where she leads research on data and AI governance.

Dhanaraj Thakur is research director of the Center for Democracy and Technology.