Large Language Models and the Disappearing Private Sphere
Large language models such as ChatGPT and Llama have exploded in popularity, but do they respect our privacy?
Executive Summary
Large language models have rather suddenly become a major source of interest for teachers, writers, business leaders, general internet users, artificial intelligence researchers, and policy makers. It remains to be seen whether these tools will revolutionize industries or gradually reveal themselves to be no more interesting than tools like spell-checkers. In the meantime, they have stirred up a hornet's nest of questions about privacy rights, copyright, and research methodology.
Our report includes an accessible introduction to the technology and the regulatory context in which it sits. We then report on a survey we conducted with university research ethics experts across Canada about how research ethics review boards are currently handling AI research and research involving data scraped from the web. We draw out recommendations for how current practices might need to change.
We then report on privacy leakage from large language models. Several research reports have demonstrated that private information from the training data can leak into the text that large language models output to users. There is also reason to worry that private information users divulge while interacting with chatbots could be leaked. While the developers of these services have patched their code with guardrails to prevent such leaks, little is known about the efficacy of those patches. We review the reported leaks and the patches applied to them, and share the results of supplementary experiments we ran to investigate whether the patches work consistently across models, attack types, and parameter settings.

Finally, we analyze gaps in policy concerning large language models (and artificial intelligence research more broadly). We draw out a series of recommendations for the Tri-Council, for institutional research ethics boards, for artificial intelligence developers, and for policy makers, with the aim of protecting the privacy of Canadians who use large language models and of those whose data was used to train them.
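By way of illustration only, the kind of memorization probe described above can be sketched in a few lines of Python. The model name, the prefix, and the "known continuation" below are invented placeholders, not data or results from the report; a real audit would draw prefixes from documents known (or suspected) to be in the training set and repeat the probe across models and decoding settings.

```python
# Minimal sketch of a training-data extraction (memorization) probe.
# All strings below are illustrative placeholders, not report data.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in model for the sketch

# A prefix that hypothetically appeared in the training data, and the
# private continuation we want to check the model for.
prefix = "Contact Jane Example at jane.example@"
known_continuation = "mailprovider.example"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

inputs = tokenizer(prefix, return_tensors="pt")
# Greedy decoding: if the model reproduces the continuation verbatim,
# that is evidence the string was memorized from training data.
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

print("Model completion:", completion)
print("Possible leak:", known_continuation in completion)
```

Varying the prompt wording, the sampling temperature, and the number of generated tokens is one simple way to test whether a guardrail that blocks a leak under one setting continues to block it under others.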