[Submitted on 17 Mar 2023]
Abstract: The rapid progress of Large Language Models (LLMs) has made them capable of
performing astonishingly well on various tasks including document completion
and question answering. The unregulated use of these models, however, can
lead to harmful consequences such as plagiarism, fake news, and spam.
Therefore, reliable detection of AI-generated text can be
critical to ensure the responsible use of LLMs. Recent works attempt to tackle
this problem either using certain model signatures present in the generated
text outputs or by applying watermarking techniques that imprint specific
patterns onto them. In this paper, both empirically and theoretically, we show
that these detectors are not reliable in practical scenarios. Empirically, we
show that paraphrasing attacks, where a light paraphraser is applied on top of
the generative text model, can break a whole range of detectors, including the
ones using the watermarking schemes as well as neural network-based detectors
and zero-shot classifiers. We then provide a theoretical impossibility result
indicating that for a sufficiently good language model, even the best-possible
detector can only perform marginally better than a random classifier. Finally,
we show that even LLMs protected by watermarking schemes can be vulnerable
to spoofing attacks, where adversarial humans can infer hidden watermarking
signatures and add them to their own text so that it is detected as text
generated by the LLMs, potentially causing reputational damage to their
developers. We believe these results can open an honest conversation in the
community regarding the ethical and reliable use of AI-generated text.
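
The impossibility result mentioned above can be made concrete with a small numerical sketch. The abstract does not state the bound itself; the form used here, AUROC <= 1/2 + TV - TV^2/2, where TV is the total variation distance between the distributions of machine-generated and human-written text, is taken from the body of the paper as best recalled, and the function name `auroc_upper_bound` is hypothetical. As TV shrinks toward zero (the model's output distribution approaches the human one), the best achievable AUROC approaches 0.5, the score of a random classifier.

```python
def auroc_upper_bound(tv: float) -> float:
    """Upper bound on any detector's AUROC for a given total variation distance.

    Assumes the bound AUROC <= 1/2 + TV - TV^2 / 2 (an assumption about the
    paper's theorem, not something stated in the abstract).
    """
    if not 0.0 <= tv <= 1.0:
        raise ValueError("total variation distance must lie in [0, 1]")
    return 0.5 + tv - tv ** 2 / 2


if __name__ == "__main__":
    # Smaller TV means the model is harder to tell apart from human text.
    for tv in (1.0, 0.5, 0.1, 0.0):
        print(f"TV = {tv:.2f} -> AUROC bound = {auroc_upper_bound(tv):.4f}")
```

Under this assumed bound, a detector facing a model with TV = 0.1 could not exceed an AUROC of roughly 0.6, which is the sense in which it performs only marginally better than chance.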
Submission history
From: Aounon Kumar
[v1]
Fri, 17 Mar 2023 17:53:19 UTC (926 KB)
