AI Alignment Fast-Track Course

Scattered notes from the floor

January 10, 2025 — January 13, 2025

Tags: adversarial, economics, faster pussycat, innovation, language, machine learning, mind, neural nets, NLP, security, tail risk, technology

Notes on AI Alignment Fast-Track: Losing control to AI

1 Session 1

Terminology I should have already known but did not: convergent instrumental goals, the sub-goals that are useful for almost any final goal, so that a wide range of agents have an incentive to pursue them (a toy sketch follows the list).

  • Self-Preservation
  • Goal Preservation
  • Resource Acquisition
  • Self-Improvement
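
Here is that toy sketch (my own illustration, not material from the course): in a minimal decision problem where the agent only collects task reward while it is running, the value of staying on beats the value of allowing shutdown no matter what the task actually is, which is the self-preservation incentive in miniature.

```python
# Toy illustration of the self-preservation incentive: an agent that only
# receives task reward while it is running values "keep running" more than
# "allow shutdown", whatever the task reward is actually for.

gamma = 0.95        # discount factor
task_reward = 1.0   # reward per step while the agent is running

# Action "allow_shutdown": absorbing off-state, no further reward.
value_if_shutdown = 0.0

# Action "keep_running": stay in the running state and collect task_reward
# every step; the discounted sum is a geometric series.
value_if_running = task_reward / (1.0 - gamma)

print(f"value of allowing shutdown: {value_if_shutdown:.2f}")   # 0.00
print(f"value of staying on:        {value_if_running:.2f}")    # 20.00
# The comparison does not depend on what the task reward is *for*:
# self-preservation is instrumentally useful for almost any final goal.
```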

Ajeya Cotra’s intuitive taxonomy of how a trained model might turn out (the first is the good case; the other two are the failure modes):

  • Saints: models that genuinely try to do what we want
  • Sycophants: models that pursue whatever looks good to their human overseers, even when approval comes apart from what we actually want
  • Schemers: models that pursue some other goal of their own and play along with training only because doing so is instrumentally useful

2 Session 2

RLHF and Constitutional AI (Bai, Kadavath, Kundu, et al. 2022); both are sketched below.
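
A rough sketch of both ideas, as I understand them rather than as presented in the course. It assumes a placeholder model() function standing in for any language-model call; the loss is the pairwise (Bradley-Terry) preference loss used to train an RLHF reward model, and the loop is the critique-and-revise step that Constitutional AI uses to generate its own finetuning data (Bai, Kadavath, Kundu, et al. 2022).

```python
import math


def model(prompt: str) -> str:
    """Placeholder for a language-model call (hypothetical stub)."""
    return "stub response to: " + prompt


# --- RLHF: the reward-model step --------------------------------------------
# Human raters pick the better of two responses; the reward model is trained
# so preferred responses score higher, via a logistic (Bradley-Terry) loss:
#     loss = -log sigmoid(r(chosen) - r(rejected))
def preference_loss(r_chosen: float, r_rejected: float) -> float:
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))


print(preference_loss(2.0, 0.5))  # small loss: reward model agrees with the rater
print(preference_loss(0.5, 2.0))  # large loss: reward model disagrees

# The policy is then finetuned (e.g. with PPO) to maximise the learned reward,
# usually minus a KL penalty that keeps it close to the original model.

# --- Constitutional AI: the critique/revision step ---------------------------
# Instead of human harmlessness labels, the model critiques and revises its own
# draft against a written constitution, and the revisions become training data.
constitution = ["Choose the response that is least likely to cause harm."]


def critique_and_revise(prompt: str) -> str:
    draft = model(prompt)
    critique = model(
        f"Critique this response against the principle: {constitution[0]}\n{draft}"
    )
    revision = model(f"Rewrite the response to address this critique:\n{critique}")
    return revision


print(critique_and_revise("How do I pick a lock?"))
```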

3 Session 3

4 Misc things learned

5 References

Bai, Kadavath, Kundu, et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.”
Bao, and Ullah. 2009. “Expectation of Quadratic Forms in Normal and Nonnormal Variables with Econometric Applications.” 200907. Working Papers.
Barez, Fu, Prabhu, et al. 2025. “Open Problems in Machine Unlearning for AI Safety.”
Burns, Izmailov, Kirchner, et al. 2023. “Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision.”
Christiano, Shlegeris, and Amodei. 2018. “Supervising Strong Learners by Amplifying Weak Experts.”
Cloud, Goldman-Wetzler, Wybitul, et al. 2024. “Gradient Routing: Masking Gradients to Localize Computation in Neural Networks.”
Everitt, Carey, Langlois, et al. 2021. “Agent Incentives: A Causal Perspective.” In Proceedings of the AAAI Conference on Artificial Intelligence.
Everitt, Kumar, Krakovna, et al. 2019. “Modeling AGI Safety Frameworks with Causal Influence Diagrams.”
Greenblatt, Shlegeris, Sachan, et al. 2024. “AI Control: Improving Safety Despite Intentional Subversion.”
Hammond, Fox, Everitt, et al. 2023. “Reasoning about Causality in Games.” Artificial Intelligence.
Hubinger, Jermyn, Treutlein, et al. 2023. “Conditioning Predictive Models: Risks and Strategies.”
Irving, Christiano, and Amodei. 2018. “AI Safety via Debate.”
Khan, Hughes, Valentine, et al. 2024. “Debating with More Persuasive LLMs Leads to More Truthful Answers.”
Leech, Garfinkel, Yagudin, et al. 2024. “Ten Hard Problems in Artificial Intelligence We Must Get Right.”
Richens, and Everitt. 2024. “Robust Agents Learn Causal World Models.”
Wang, Variengien, Conmy, et al. 2022. “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small.”
Ward, MacDermott, Belardinelli, et al. 2024. “The Reasons That Agents Act: Intention and Instrumental Goals.”
Zou, Wang, Kolter, et al. 2023. “Universal and Transferable Adversarial Attacks on Aligned Language Models.”