Verification and detection of generative AI

Watermarks, cryptographic verification and other certificates of authenticity for our computation

October 29, 2024 — November 1, 2024

Tags: adversarial, approximation, Bayes, generative, likelihood free, Monte Carlo, neural nets, optimization, probabilistic algorithms, probability, security, statistics, unsupervised
Figure 1: Various interesting challenges in this domain.

How do we verify the authenticity of content? Can we tell if a piece of text was generated by an AI, and if so, which one? Can we do this adversarially? (Doubtful) Can we do this cooperatively, and have AI models sign their outputs with cryptographic signatures as proof that a particular model generated them, as a kind of quality assurance? (Less doubtful but still not certain)

Enter the world of watermarking and cryptographic signatures for AI outputs and AI models.

Consider scenarios like academic integrity, misinformation campaigns, or intellectual property rights. If someone uses an AI model to generate a paper and passes it off as original work, that’s a headache for educators. On a larger scale, various actors could flood social media with AI-generated propaganda. How much can we mitigate these problems by verifying the origin of content?

Overviews in (Cui et al. 2024; Zhu et al. 2024).

Not covered: Data-blind methods such as homomorphic learning, federated learning…

1 Ownership of models

Keywords: Proof-of-learning, …

(Garg et al. 2023; Goldwasser et al. 2022; Jia et al. 2021)

TBD

2 Proof of training

E.g. Abbaszadeh et al. (2024):

A zero-knowledge proof of training (zkPoT) enables a party to prove that they have correctly trained a committed model based on a committed dataset without revealing any additional information about the model or the dataset. An ideal zkPoT should offer provable security and privacy guarantees, succinct proof size and verifier runtime, and practical prover efficiency. In this work, we present a zkPoT targeted for deep neural networks (DNNs) that achieves all these goals at once. Our construction enables a prover to iteratively train their model via (mini-batch) gradient descent, where the number of iterations need not be fixed in advance; at the end of each iteration, the prover generates a commitment to the trained model parameters attached with a succinct zkPoT, attesting to the correctness of the executed iterations. The proof size and verifier time are independent of the number of iterations.
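To make the shape of that protocol concrete, here is a purely illustrative Python sketch of the iterative commit-and-prove loop. Nothing in it is actually zero-knowledge or succinct: `commit` and `prove_iteration` are hash-based placeholders standing in for real commitments and proofs, and the "model" is a one-parameter least-squares toy.

```python
import hashlib
import json


def commit(obj) -> str:
    """Toy stand-in for a cryptographic commitment (NOT hiding, NOT zero-knowledge)."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def sgd_step(params, batch, lr=0.1):
    """One gradient-descent step on a one-parameter least-squares toy model."""
    grad = sum(2 * (params["w"] * x - y) * x for x, y in batch) / len(batch)
    return {"w": params["w"] - lr * grad}


def prove_iteration(prev_commit, new_commit, batch_commit):
    """Placeholder 'proof' that new_commit follows from prev_commit and batch_commit.
    A real zkPoT would emit a succinct zero-knowledge proof here."""
    return commit({"prev": prev_commit, "new": new_commit, "batch": batch_commit})


params = {"w": 0.0}
transcript = []
data = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]  # two mini-batches

for batch in data:  # number of iterations need not be fixed in advance
    prev = commit(params)
    params = sgd_step(params, batch)
    new = commit(params)
    transcript.append(prove_iteration(prev, new, commit(batch)))

print(params, transcript[-1][:16])
```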

3 Watermarking outputs

A.k.a. fingerprinting. I’m sceptical that this is of any practical use, but it is a good theoretical starting point.

Watermarking is about embedding a hidden signal within data that can later be used to verify its source. The trick is to do this without altering the human-perceptible content. For images, this might involve tweaking pixel values in a way that’s imperceptible to the human eye but detectable through analysis.
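As a concrete, if old-fashioned, illustration of "tweak pixels imperceptibly, detect by analysis", here is a least-significant-bit embedding sketch in numpy. Modern image watermarking is considerably more robust than this, so treat it purely as a toy; `embed_lsb` and `extract_lsb` are names invented for the sketch.

```python
import numpy as np


def embed_lsb(image: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Hide a bit string in the least significant bit of each pixel (8-bit greyscale).
    Changing the LSB perturbs intensities by at most 1/255, imperceptible to the eye."""
    flat = image.flatten()
    flat[: bits.size] = (flat[: bits.size] & 0xFE) | bits
    return flat.reshape(image.shape)


def extract_lsb(image: np.ndarray, n_bits: int) -> np.ndarray:
    """Recover the first n_bits hidden bits."""
    return image.flatten()[:n_bits] & 1


rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
payload = rng.integers(0, 2, size=128, dtype=np.uint8)

marked = embed_lsb(img, payload)
assert np.array_equal(extract_lsb(marked, 128), payload)
assert np.max(np.abs(marked.astype(int) - img.astype(int))) <= 1
```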

3.1 Text

For text, it’s trickier (Huang et al. 2024; Li et al. 2024).

One approach is to modify the probabilities in the language model’s output to favour certain token patterns. Suppose we’re using a language model that predicts the next word based on some probability distribution. By slightly biasing this distribution, we can make the model more likely to choose words that fit a particular statistical pattern.

For example, we could define a hash function \(h\) that maps the context and potential next tokens to a numerical space. We then adjust the probabilities so that tokens with hashes satisfying a certain condition (like being within a specific range) are more likely to be selected.
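A minimal sketch of that idea, in the style of the "green list" soft watermarks discussed in the text-watermarking literature: hash the context to pseudo-randomly partition the vocabulary, then nudge the logits of the favoured partition before sampling. The function names and numpy scaffolding are mine; in practice the logits would come from the actual language model.

```python
import hashlib

import numpy as np


def green_mask(context: str, vocab_size: int, gamma: float = 0.5) -> np.ndarray:
    """Pseudo-randomly mark a fraction gamma of the vocabulary as 'green',
    seeded by a hash of the context (this plays the role of h in the text)."""
    seed = int.from_bytes(hashlib.sha256(context.encode()).digest()[:8], "big")
    return np.random.default_rng(seed).random(vocab_size) < gamma


def watermarked_sample(logits: np.ndarray, context: str, delta: float = 2.0, rng=None) -> int:
    """Add delta to the logits of green tokens, renormalise, and sample.
    Small delta means small distortion; large delta means easy detection."""
    rng = rng or np.random.default_rng()
    biased = logits + delta * green_mask(context, logits.size)
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))
```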

4 Cryptographic signatures on outputs

TBD

5 Adversarial signing

Of course, any watermarking scheme must consider the possibility of adversaries trying to remove or alter the watermark. This leads us into game theory and adversarial models. We need to design watermarking methods that are robust against attempts to detect and strip them.

One way is to make the watermark indistinguishable from the natural output of the model. If the statistical patterns introduced are subtle and aligned with the model’s inherent probabilities, it becomes exceedingly difficult for an adversary to pinpoint and remove the watermark without degrading the text quality.

Suppose our language model defines a probability distribution \(P(\mathbf{w})\) over sequences \(\mathbf{w} = (w_1, w_2, \dots, w_n)\). Our goal is to define a modified distribution \(P'(\mathbf{w})\) such that:

  1. The Kullback-Leibler divergence \(D_{\text{KL}}(P' \| P)\) is minimised to maintain text quality.
  2. There exists a detection function \(D(\mathbf{w})\) that can, with high probability, verify whether \(\mathbf{w}\) was generated from \(P'\).

This sets up an optimisation problem, trading off the fidelity of the text against the detectability of the watermark; it has the flavour of a classic adversarial learning problem.
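Continuing the toy green-list sketch from §3.1 (and reusing its hypothetical `green_mask` helper), the detection function \(D(\mathbf{w})\) can be as simple as a z-test on the number of green tokens; the logit shift `delta` in that sketch is exactly the knob that trades KL distortion against detection power.

```python
import math


def detect(tokens, contexts, vocab_size: int, gamma: float = 0.5) -> float:
    """D(w): count how many emitted tokens landed in the green list of their
    context, and return a one-sided z-score against the null hypothesis that
    the text is unwatermarked (green fraction roughly gamma)."""
    hits = sum(int(green_mask(c, vocab_size)[t]) for t, c in zip(tokens, contexts))
    n = len(tokens)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```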

6 References

Abbaszadeh, Pappas, Katz, et al. 2024. “Zero-Knowledge Proofs of Training for Deep Neural Networks.”
Cui, Wang, Fu, et al. 2024. “Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems.”
Garg, Goel, Jha, et al. 2023. “Experimenting with Zero-Knowledge Proofs of Training.” In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. CCS ’23.
Goldwasser, Kim, Vaikuntanathan, et al. 2022. “Planting Undetectable Backdoors in Machine Learning Models.”
Huang, Zhu, Zhu, et al. 2024. “Towards Optimal Statistical Watermarking.”
Jia, Yaghini, Choquette-Choo, et al. 2021. “Proof-of-Learning: Definitions and Practice.” In 2021 IEEE Symposium on Security and Privacy (SP).
Li, Ruan, Wang, et al. 2024. “A Statistical Framework of Watermarks for Large Language Models: Pivot, Detection Efficiency and Optimal Rules.”
Ngo, Chan, and Mindermann. 2024. “The Alignment Problem from a Deep Learning Perspective.”
Zhu, Mu, Jiao, et al. 2024. “Generative AI Security: Challenges and Countermeasures.”