AI Alignment Fast-Track Course

Scattered notes from the floor

January 10, 2025 — January 13, 2025

Tags: adversarial, economics, faster pussycat, innovation, language, machine learning, mind, neural nets, NLP, security, tail risk, technology

Notes on AI Alignment Fast-Track: Losing control to AI

1 Session 1

Terminology I should have already known but did not: convergent instrumental goals, the sub-goals that are useful for almost any final goal, so that a wide range of agents have an incentive to pursue them (a toy sketch follows the list).

  • Self-Preservation
  • Goal Preservation
  • Resource Acquisition
  • Self-Improvement
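
Here is that toy sketch (my own illustration, not material from the course): in a minimal decision problem where the agent only collects task reward while it is running, the value of staying on beats the value of allowing shutdown no matter what the task actually is, which is the self-preservation incentive in miniature.

```python
# Toy illustration of the self-preservation incentive: an agent that only
# receives task reward while it is running values "keep running" more than
# "allow shutdown", whatever the task reward is actually for.

gamma = 0.95        # discount factor
task_reward = 1.0   # reward per step while the agent is running

# Action "allow_shutdown": absorbing off-state, no further reward.
value_if_shutdown = 0.0

# Action "keep_running": stay in the running state and collect task_reward
# every step; the discounted sum is a geometric series.
value_if_running = task_reward / (1.0 - gamma)

print(f"value of allowing shutdown: {value_if_shutdown:.2f}")   # 0.00
print(f"value of staying on:        {value_if_running:.2f}")    # 20.00
# The comparison does not depend on what the task reward is *for*:
# self-preservation is instrumentally useful for almost any final goal.
```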

Ajeya Cotra’s intuitive taxonomy of how a trained model might turn out (the first is the good case; the other two are the failure modes):

  • Saints: models that genuinely try to do what we want
  • Sycophants: models that pursue whatever looks good to their human overseers, even when approval comes apart from what we actually want
  • Schemers: models that pursue some other goal of their own and play along with training only because doing so is instrumentally useful

2 Session 2

RLHF and Constitutional AI (Bai, Kadavath, Kundu, et al. 2022); both are sketched below.
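
A rough sketch of both ideas, as I understand them rather than as presented in the course. It assumes a placeholder model() function standing in for any language-model call; the loss is the pairwise (Bradley-Terry) preference loss used to train an RLHF reward model, and the loop is the critique-and-revise step that Constitutional AI uses to generate its own finetuning data (Bai, Kadavath, Kundu, et al. 2022).

```python
import math


def model(prompt: str) -> str:
    """Placeholder for a language-model call (hypothetical stub)."""
    return "stub response to: " + prompt


# --- RLHF: the reward-model step --------------------------------------------
# Human raters pick the better of two responses; the reward model is trained
# so preferred responses score higher, via a logistic (Bradley-Terry) loss:
#     loss = -log sigmoid(r(chosen) - r(rejected))
def preference_loss(r_chosen: float, r_rejected: float) -> float:
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))


print(preference_loss(2.0, 0.5))  # small loss: reward model agrees with the rater
print(preference_loss(0.5, 2.0))  # large loss: reward model disagrees

# The policy is then finetuned (e.g. with PPO) to maximise the learned reward,
# usually minus a KL penalty that keeps it close to the original model.

# --- Constitutional AI: the critique/revision step ---------------------------
# Instead of human harmlessness labels, the model critiques and revises its own
# draft against a written constitution, and the revisions become training data.
constitution = ["Choose the response that is least likely to cause harm."]


def critique_and_revise(prompt: str) -> str:
    draft = model(prompt)
    critique = model(
        f"Critique this response against the principle: {constitution[0]}\n{draft}"
    )
    revision = model(f"Rewrite the response to address this critique:\n{critique}")
    return revision


print(critique_and_revise("How do I pick a lock?"))
```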

3 Session 3

4 Misc things learned

5 References

Bai, Kadavath, Kundu, et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.”
Bao, and Ullah. 2009. “Expectation of Quadratic Forms in Normal and Nonnormal Variables with Econometric Applications.” 200907. Working Papers.
Barez, Fu, Prabhu, et al. 2025. “Open Problems in Machine Unlearning for AI Safety.”
Burns, Izmailov, Kirchner, et al. 2023. “Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision.”
Christiano, Shlegeris, and Amodei. 2018. “Supervising Strong Learners by Amplifying Weak Experts.”
Cloud, Goldman-Wetzler, Wybitul, et al. 2024. “Gradient Routing: Masking Gradients to Localize Computation in Neural Networks.”
Everitt, Carey, Langlois, et al. 2021. “Agent Incentives: A Causal Perspective.” In Proceedings of the AAAI Conference on Artificial Intelligence.
Everitt, Kumar, Krakovna, et al. 2019. “Modeling AGI Safety Frameworks with Causal Influence Diagrams.”
Greenblatt, Shlegeris, Sachan, et al. 2024. “AI Control: Improving Safety Despite Intentional Subversion.”
Hammond, Fox, Everitt, et al. 2023. “Reasoning about Causality in Games.” Artificial Intelligence.
Hubinger, Jermyn, Treutlein, et al. 2023. “Conditioning Predictive Models: Risks and Strategies.”
Irving, Christiano, and Amodei. 2018. “AI Safety via Debate.”
Khan, Hughes, Valentine, et al. 2024. “Debating with More Persuasive LLMs Leads to More Truthful Answers.”
Leech, Garfinkel, Yagudin, et al. 2024. “Ten Hard Problems in Artificial Intelligence We Must Get Right.”
Richens, and Everitt. 2024. “Robust Agents Learn Causal World Models.”
Wang, Variengien, Conmy, et al. 2022. “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small.”
Ward, MacDermott, Belardinelli, et al. 2024. “The Reasons That Agents Act: Intention and Instrumental Goals.”
Zou, Wang, Kolter, et al. 2023. “Universal and Transferable Adversarial Attacks on Aligned Language Models.”