Aligning AI systems

Practical approaches to domesticating wild models: RLHF, Constitutional AI, etc.

January 19, 2025 — February 10, 2025


Notes on how to implement alignment in AI systems. This is necessarily a fuzzy business, because both alignment and AI are fuzzy concepts. We need to make peace with the frustrations of that fuzziness and move on.

1 Fine-tuning to do nice stuff

Think RLHF, Constitutional AI, and the like. The usual RLHF recipe has two stages: first fit a reward model to human preference comparisons, then fine-tune the policy against that learned reward, with a KL penalty keeping it close to the reference model. Constitutional AI (Bai et al. 2022) swaps the human raters for AI feedback steered by a written constitution. I’m not greatly persuaded that these are the right way to go, but they are interesting.
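
As a concrete anchor, here is a minimal sketch of the first stage, reward modelling from pairwise preferences via the Bradley-Terry loss. Everything here is a toy of my own construction: the embeddings, dimensions, and the little MLP scoring head are stand-ins, where a real system would put the head on a pretrained language model and feed it response tokens.

```python
# Minimal reward-modelling sketch (Bradley-Terry preference loss).
# All data and dimensions are synthetic stand-ins, not any real pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Score a fixed-size response embedding; in practice this head
    would sit on top of a pretrained language model."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)


def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: maximise log sigma(r_chosen - r_rejected), i.e. the
    # model's probability that the human-preferred response scores higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()


torch.manual_seed(0)
dim = 16
# Toy "embeddings" of preferred vs dispreferred responses, with a planted
# offset so there is actually a preference signal to learn.
chosen = torch.randn(256, dim) + 0.5
rejected = torch.randn(256, dim)

model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(200):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final preference loss: {loss.item():.3f}")
```

The second stage then fine-tunes the policy to maximise this learned reward under a KL penalty towards the reference model, which is what stops the policy from reward-hacking its way into degenerate text.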

2 Classifying models as unaligned

I’m familiar only with mechanistic interpretability at the moment; I’m sure there is other work in this vein. A toy sketch of the probing flavour of the idea follows below.
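
Here is a minimal sketch of one such detector, a linear probe: fit a classifier on hidden activations and ask whether the property of interest is linearly decodable from them. Everything below is synthetic; the activations stand in for, say, residual-stream vectors, and the label stands in for whatever behaviour we are screening for.

```python
# Minimal linear-probe sketch on synthetic "activations".
# High probe accuracy suggests the concept is linearly represented.
import torch

torch.manual_seed(0)
n, dim = 1000, 32
# Pretend these are hidden activations, with a weak planted linear signal.
direction = torch.randn(dim)
acts = torch.randn(n, dim)
labels = ((acts @ direction + 0.5 * torch.randn(n)) > 0).float()

probe = torch.nn.Linear(dim, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = torch.nn.BCEWithLogitsLoss()
for _ in range(500):
    loss = loss_fn(probe(acts).squeeze(-1), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

acc = ((probe(acts).squeeze(-1) > 0).float() == labels).float().mean().item()
print(f"probe accuracy: {acc:.2%}")
```

A caveat on interpretation: the probe only tells us that some direction correlates with the label on this data; whether the model actually uses that direction is the harder, properly mechanistic question.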

3 Incoming

4 References

Aguirre, Dempsey, Surden, et al. 2020. “AI Loyalty: A New Paradigm for Aligning Stakeholder Interests.” IEEE Transactions on Technology and Society.
Bai, Kadavath, Kundu, et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.”
Barez, Fu, Prabhu, et al. 2025. “Open Problems in Machine Unlearning for AI Safety.”
Burns, Izmailov, Kirchner, et al. 2023. “Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision.”
Christiano, Shlegeris, and Amodei. 2018. “Supervising Strong Learners by Amplifying Weak Experts.”
Greenblatt, Denison, Wright, et al. 2024. “Alignment Faking in Large Language Models.”
Greenblatt, Shlegeris, Sachan, et al. 2024. “AI Control: Improving Safety Despite Intentional Subversion.”
Irving, Christiano, and Amodei. 2018. “AI Safety via Debate.”
Khan, Hughes, Valentine, et al. 2024. “Debating with More Persuasive LLMs Leads to More Truthful Answers.”
Ngo, Chan, and Mindermann. 2024. “The Alignment Problem from a Deep Learning Perspective.”
Stray, Vendrov, Nixon, et al. 2021. “What Are You Optimizing for? Aligning Recommender Systems with Human Values.”
Taylor, Yudkowsky, LaVictoire, et al. 2020. “Alignment for Advanced Machine Learning Systems.” In Ethics of Artificial Intelligence.
Zhuang and Hadfield-Menell. 2021. “Consequences of Misaligned AI.”
Zou, Wang, Kolter, et al. 2023. “Universal and Transferable Adversarial Attacks on Aligned Language Models.”