AI Alignment Fast-Track Course
Scattered notes from the floor
January 10, 2025 — January 13, 2025
Tags: adversarial, economics, faster pussycat, innovation, language, machine learning, mind, neural nets, NLP, security, tail risk, technology
1 Session 1
- What is AI alignment? – BlueDot Impact
- More Is Different for AI
- Paul Christiano, What failure looks like 👈 my favourite. Cannot believe I had not read this.
- AI Could Defeat All Of Us Combined
- Why AI alignment could be hard with modern deep learning
Terminology I should have already known but did not: Convergent Instrumental Goals.
- Self-Preservation
- Goal Preservation
- Resource Acquisition
- Self-Improvement
Ajeya Cotra’s intuitive taxonomy of different failure modes:
- Saints
- Sycophants
- Schemers
2 Session 2
RLHF and Constitutional AI (a toy reward-modelling sketch follows the link list below)
- Problems with Reinforcement Learning from Human Feedback (RLHF) for AI safety – BlueDot Impact
- RLAIF vs. RLHF: the technology behind Anthropic’s Claude (Constitutional AI Explained) - YouTube
- The True Story of How GPT-2 Became Maximally Lewd - YouTube
- OpenAI blog post: https://openai.com/research/fine-tuni…
- OpenAI paper behind the blog post: https://arxiv.org/pdf/1909.08593.pdf
- RLHF explainer on Hugging Face: https://huggingface.co/blog/rlhf
- RLHF explainer on aisafety.info: https://aisafety.info/?state=88FN_904…
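To make the RLHF links above slightly more concrete: the reward-modelling step fits a scalar reward to pairwise human preferences, typically with a Bradley-Terry style loss of the form -log sigmoid(r_chosen - r_rejected), as in the Ziegler et al. paper linked above. Below is a minimal toy sketch of just that step in PyTorch; the “responses” are random feature vectors and the “human” is a hidden linear utility, so everything except the loss itself is an illustrative assumption rather than an actual pipeline.

```python
# Toy sketch of the reward-modelling step in RLHF.
# Real pipelines score transformer outputs; here "responses" are random
# feature vectors and the "human" preference is a hidden linear utility,
# purely so the example runs anywhere.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

n_pairs, dim = 256, 16
true_w = torch.randn(dim)               # hidden utility standing in for the human
chosen = torch.randn(n_pairs, dim)
rejected = torch.randn(n_pairs, dim)
# Relabel pairs so that `chosen` really is preferred under the hidden utility.
swap = (rejected @ true_w) > (chosen @ true_w)
chosen[swap], rejected[swap] = rejected[swap].clone(), chosen[swap].clone()

# Reward model: a single linear layer mapping features to a scalar reward.
reward_model = nn.Linear(dim, 1)
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

for step in range(200):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Pairwise logistic (Bradley-Terry) loss: push r_chosen above r_rejected.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    acc = (reward_model(chosen) > reward_model(rejected)).float().mean()
print(f"pairwise loss {loss.item():.3f}, preference accuracy {acc.item():.2%}")
```

The full RLHF loop would then optimize the policy (the language model) against this learned reward, usually with PPO plus a KL penalty keeping it close to the base model; that part is omitted here.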
3 Session 3
- Can we scale human feedback for complex AI tasks? An intro to scalable oversight. – BlueDot Impact
- Robert Miles on Using Dangerous AI, But Safely?
- [2307.15043] Universal and Transferable Adversarial Attacks on Aligned Language Models
- [1810.08575] Supervising strong learners by amplifying weak experts
- Factored Cognition | Ought
- [2402.06782] Debating with More Persuasive LLMs Leads to More Truthful Answers
- Adversarial Machine Learning explained! | With examples. - YouTube
- AI Control: Improving Safety Despite Intentional Subversion — AI Alignment Forum
- [2312.06942] AI Control: Improving Safety Despite Intentional Subversion
- [2312.09390] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (toy sketch after this list)
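Since the weak-to-strong paper closes the list, here is a cartoon of its setup: a small “weak supervisor” is fit on a modest labelled set, a larger “strong student” is trained only on the weak model’s noisy labels, and both are compared against a strong model trained on ground truth. The concrete choices below (scikit-learn logistic regression vs. MLP on synthetic data) are my own stand-ins, not the paper’s GPT-based setup.

```python
# Toy sketch of the weak-to-strong generalization setup (Burns et al. 2023),
# heavily simplified: weak supervisor -> pseudo-labels -> strong student,
# compared against a strong model trained on ground truth.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=40, n_informative=10,
                           random_state=0)
# Small labelled set for the weak supervisor, the rest split into train/test.
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, train_size=500,
                                                random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest,
                                                    test_size=0.5,
                                                    random_state=0)

# Weak supervisor: a small model fit on a small labelled set.
weak = LogisticRegression(max_iter=1000).fit(X_sup, y_sup)
weak_labels = weak.predict(X_train)          # noisy pseudo-labels

# Strong student: a larger model trained only on the weak labels.
strong = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=500,
                       random_state=0).fit(X_train, weak_labels)

# Strong ceiling: the same architecture trained on ground-truth labels.
ceiling = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=500,
                        random_state=0).fit(X_train, y_train)

print("weak supervisor acc:  ", accuracy_score(y_test, weak.predict(X_test)))
print("strong-on-weak acc:   ", accuracy_score(y_test, strong.predict(X_test)))
print("strong-on-truth acc:  ", accuracy_score(y_test, ceiling.predict(X_test)))
```

The number to watch is how much of the gap between the weak supervisor and the strong-on-truth ceiling the student recovers; the paper calls this the performance gap recovered.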
4 Misc things learned
5 References
Bai, Kadavath, Kundu, et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.”
Bao, and Ullah. 2009. “Expectation of Quadratic Forms in Normal and Nonnormal Variables with Econometric Applications.” 200907. Working Papers.
Barez, Fu, Prabhu, et al. 2025. “Open Problems in Machine Unlearning for AI Safety.”
Burns, Izmailov, Kirchner, et al. 2023. “Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision.”
Christiano, Shlegeris, and Amodei. 2018. “Supervising Strong Learners by Amplifying Weak Experts.”
Cloud, Goldman-Wetzler, Wybitul, et al. 2024. “Gradient Routing: Masking Gradients to Localize Computation in Neural Networks.”
Everitt, Carey, Langlois, et al. 2021. “Agent Incentives: A Causal Perspective.” In Proceedings of the AAAI Conference on Artificial Intelligence.
Everitt, Kumar, Krakovna, et al. 2019. “Modeling AGI Safety Frameworks with Causal Influence Diagrams.”
Greenblatt, Shlegeris, Sachan, et al. 2024. “AI Control: Improving Safety Despite Intentional Subversion.”
Hammond, Fox, Everitt, et al. 2023. “Reasoning about Causality in Games.” Artificial Intelligence.
Hubinger, Jermyn, Treutlein, et al. 2023. “Conditioning Predictive Models: Risks and Strategies.”
Irving, Christiano, and Amodei. 2018. “AI Safety via Debate.”
Khan, Hughes, Valentine, et al. 2024. “Debating with More Persuasive LLMs Leads to More Truthful Answers.”
Leech, Garfinkel, Yagudin, et al. 2024. “Ten Hard Problems in Artificial Intelligence We Must Get Right.”
Richens, and Everitt. 2024. “Robust Agents Learn Causal World Models.”
Wang, Variengien, Conmy, et al. 2022. “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small.”
Ward, MacDermott, Belardinelli, et al. 2024. “The Reasons That Agents Act: Intention and Instrumental Goals.”
Zou, Wang, Kolter, et al. 2023. “Universal and Transferable Adversarial Attacks on Aligned Language Models.”