Verification and detection of generative AI
Watermarks, cryptographic verification of the products of AI
October 29, 2024 — November 25, 2024
Various interesting challenges in this domain.
How do we verify the authenticity of content? Can we tell if a piece of text was generated by an AI, and if so, which one? Can we do this adversarially? Sounds like an asymmetric arms race. Can we do this cooperatively, and have AI models sign their outputs with cryptographic signatures as proof that a particular model generated them, as a kind of quality assurance? Sounds like a classic standardisation problem.
These questions define the field of watermarking and cryptographic signing for AI outputs and AI models. The answers matter for academic integrity, misinformation campaigns, and intellectual property rights. If someone uses an AI model to generate a paper and passes it off as original work, that’s a headache for educators. On a larger scale, various actors could flood social media with AI-generated propaganda. How much can we mitigate these problems by verifying the origin of content?
Overviews in (Cui et al. 2024; Zhu et al. 2024).
Not covered: Data-blind methods such as homomorphic learning, federated learning…
1 Watermarking outputs
A.k.a. fingerprinting. I’m sceptical that this is of any practical use, but it is a good theoretical starting point.
Watermarking is about embedding a hidden signal within data that can later be used to verify its source. The trick is to do this without altering the human-perceptible content. For images, this might involve tweaking pixel values in a way that’s imperceptible to the human eye but detectable through analysis.
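To make the image case concrete, here is a minimal sketch of least-significant-bit (LSB) embedding, which is one classic way of tweaking pixel values and not necessarily what any production system does. The function names `embed_bits` and `extract_bits` and the toy pixel list are my own assumptions for illustration.

```python
def embed_bits(pixels, bits):
    """Return a copy of `pixels` with `bits` written into the least significant bits."""
    if len(bits) > len(pixels):
        raise ValueError("not enough pixels to hold the message")
    marked = list(pixels)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the message bit.
        marked[i] = (marked[i] & ~1) | bit
    return marked


def extract_bits(pixels, n_bits):
    """Read the first `n_bits` least significant bits back out."""
    return [p & 1 for p in pixels[:n_bits]]


# Toy 8-bit greyscale "image" and a hypothetical 8-bit watermark.
pixels = [200, 13, 77, 254, 0, 91, 128, 255]
message = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed_bits(pixels, message)
recovered = extract_bits(marked, len(message))
```

Each pixel changes by at most one intensity level, which is why the mark is invisible. It is also why it is fragile: re-compressing or resizing the image will usually destroy it.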
1.1 Text
For text, it’s trickier (Huang et al. 2024; Li et al. 2024).
One approach is to modify the probabilities in the language model’s output to favour certain token patterns. Suppose we’re using a language model that samples the next token from a probability distribution conditioned on the preceding text. By slightly biasing this distribution, we can make the model more likely to choose tokens that fit a particular statistical pattern.
For example, we could define a hash function \(h\) that maps the context and potential next tokens to a numerical space. We then adjust the probabilities so that tokens with hashes satisfying a certain condition (like being within a specific range) are more likely to be selected.
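A minimal sketch of that idea, assuming a keyed SHA-256 hash, a toy vocabulary, and a uniform stand-in for the language model. This is the "green list" flavour of logit biasing; the names `token_hash`, `is_green`, `watermarked_probs`, and the constants `GREEN_FRACTION` and `DELTA` are hypothetical choices of mine, not a reference implementation.

```python
import hashlib
import math
import random

VOCAB = ["the", "a", "cat", "dog", "sat", "ran", "on", "mat"]  # toy vocabulary
GREEN_FRACTION = 0.5  # tokens whose hash falls below this count as "green"
DELTA = 2.0           # logit boost given to green tokens


def token_hash(key, context, token):
    """Map (key, context, token) to a pseudo-random number in the unit interval."""
    digest = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def is_green(key, context, token):
    """The 'certain condition': the hash lands in the range [0, GREEN_FRACTION)."""
    return token_hash(key, context, token) < GREEN_FRACTION


def watermarked_probs(logits, key, context):
    """Softmax over logits after boosting the green tokens by DELTA."""
    boosted = {
        tok: logit + (DELTA if is_green(key, context, tok) else 0.0)
        for tok, logit in logits.items()
    }
    top = max(boosted.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(v - top) for tok, v in boosted.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def generate(n_tokens, key, rng):
    """Sample tokens, using the previous token as the hashing context."""
    tokens = ["the"]
    for _ in range(n_tokens):
        logits = {tok: 0.0 for tok in VOCAB}  # stand-in for a real model
        probs = watermarked_probs(logits, key, tokens[-1])
        tokens.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return tokens[1:]


sample = generate(20, "secret-key", random.Random(0))
```

Larger `DELTA` makes the pattern easier to detect but distorts the text more. Using only the previous token as context is the simplest choice; a longer context window makes the pattern harder to guess but more fragile under edits.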
2 Adversarial watermarking
A watermarking scheme must consider the possibility of adversaries trying to remove or alter the watermark. In practice, we need to design watermarking methods that are robust against attempts to detect and strip them. This leads us into adversarial models, i.e. game theory.
One way is to make the watermark indistinguishable from the natural output of the model. If the statistical patterns introduced are subtle and aligned with the model’s inherent probabilities, it becomes exceedingly difficult for an adversary to pinpoint and remove the watermark without degrading the text quality.
Suppose our language model defines a probability distribution \(P\) over sequences \(\mathbf{w} = (w_1, w_2, \dots, w_n)\). Our goal is to define a modified distribution \(P'(\mathbf{w})\) such that:
- The Kullback-Leibler divergence \(D_{\text{KL}}(P' \| P)\) is kept small, to maintain text quality. (Minimising it outright would give \(P' = P\), i.e. no watermark at all.)
- There exists a detection function \(D(\mathbf{w})\) that can, with high probability, verify whether \(\mathbf{w}\) was generated from \(P'\).
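As one hedged example of what a detection function \(D(\mathbf{w})\) could look like, suppose the watermark favours "green" tokens chosen by a keyed hash of the previous token. That scheme is my assumption here, as are the names `is_green`, `z_score`, `detect`, and the threshold of 4. Detection then reduces to a one-sided z-test on the count of green tokens.

```python
import hashlib
import math

GREEN_FRACTION = 0.5  # assumed share of tokens marked green for any context


def is_green(key, context, token):
    """Keyed hash test: is `token` on the green list given `context`?"""
    digest = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GREEN_FRACTION


def z_score(tokens, key):
    """Standardised green-token count; large positive values suggest a watermark."""
    n = len(tokens) - 1  # the first token has no context, so it is not scored
    if n < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(key, prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (green - expected) / std


def detect(tokens, key, threshold=4.0):
    """D(w): flag the text as watermarked if the z-score exceeds the threshold."""
    return z_score(tokens, key) > threshold
```

Without the key, the green list looks like an arbitrary half of the vocabulary, so unwatermarked text should score near zero. The threshold trades false positives against missed detections, and short texts give the test little power.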
This sets up an optimisation problem where we balance the fidelity of the text with the detectability of the watermark, and it sounds like a classic adversarial learning problem.
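One way to write that trade-off down, as a sketch under my own assumptions rather than a formulation taken from the cited papers: fix a distortion budget \(\varepsilon\) and a false-positive rate \(\alpha\) (both symbols are introduced here for illustration), treat \(D\) as a binary test, and maximise the detection rate subject to both constraints.

\[
\max_{P',\, D} \; \Pr_{\mathbf{w} \sim P'}\bigl[D(\mathbf{w}) = 1\bigr]
\quad \text{subject to} \quad
D_{\text{KL}}(P' \| P) \le \varepsilon,
\qquad
\Pr_{\mathbf{w} \sim P}\bigl[D(\mathbf{w}) = 1\bigr] \le \alpha .
\]

The adversarial version would add an attacker who edits \(\mathbf{w}\) before detection, which turns the maximisation into a max-min problem over watermark and attack.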
Where was I going with this?