AI Alignment Fast-Track Course
Scattered notes from the floor
January 10, 2025 — January 13, 2025
Tags: adversarial, economics, faster pussycat, innovation, language, machine learning, mind, neural nets, NLP, security, tail risk, technology
1 Session 1: Losing control to AI
- What is AI alignment? – BlueDot Impact
- More Is Different for AI
- Paul Christiano, What failure looks like 👈 my favourite. Cannot believe I had not read this.
- AI Could Defeat All Of Us Combined
- Why AI alignment could be hard with modern deep learning
Terminology I should have already known but did not: Convergent Instrumental Goals.
- Self-Preservation
- Goal Preservation
- Resource Acquisition
- Self-Improvement
Ajeya Cotra’s intuitive taxonomy of different failure modes:
- Saints
- Sycophants
- Schemers
2 Session 2
RLHF and Constitutional AI (a minimal reward-model sketch follows the links below)
- Problems with Reinforcement Learning from Human Feedback (RLHF) for AI safety – BlueDot Impact
- RLAIF vs. RLHF: the technology behind Anthropic’s Claude (Constitutional AI Explained) - YouTube
- The True Story of How GPT-2 Became Maximally Lewd - YouTube
- OpenAI blog post: https://openai.com/research/fine-tuni…
- OpenAI paper behind the blog post: https://arxiv.org/pdf/1909.08593.pdf
- RLHF explainer on Hugging Face: https://huggingface.co/blog/rlhf
- RLHF explainer on aisafety.info: https://aisafety.info/?state=88FN_904…
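To fix the mechanics in my head: a minimal sketch of the reward-modelling stage of RLHF, in PyTorch. The tiny bag-of-embeddings backbone and the random preference data are my own placeholders to keep it self-contained (in practice the backbone is the pretrained LM itself); the load-bearing part is the Bradley-Terry pairwise loss, which pushes the chosen completion's scalar reward above the rejected one's.

```python
# Minimal sketch of RLHF reward modelling: fit a scalar reward so that
# human-preferred completions outscore rejected ones (Bradley-Terry loss).
# The backbone and data here are placeholder toys, not any real recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # toy stand-in for an LM backbone
        self.head = nn.Linear(dim, 1)               # scalar reward head

    def forward(self, token_ids):                   # (batch, seq) -> (batch,)
        pooled = self.embed(token_ids).mean(dim=1)
        return self.head(pooled).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder preference data: pairs of (chosen, rejected) token sequences.
chosen = torch.randint(0, 1000, (32, 16))
rejected = torch.randint(0, 1000, (32, 16))

for step in range(100):
    # P(chosen beats rejected) = sigmoid(r_c - r_r); maximise its log-likelihood.
    loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The second stage then optimises the policy against this learned reward (PPO in the classic recipe), with a KL penalty back to the reference model so the policy cannot drift arbitrarily far to game it. As I understand the GPT-2 video above, a flipped sign on exactly this reward signal during training is what produced the maximally lewd model.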
3 Session 3
- Can we scale human feedback for complex AI tasks? An intro to scalable oversight. – BlueDot Impact
- Robert Miles on Using Dangerous AI, But Safely?
- [2307.15043] Universal and Transferable Adversarial Attacks on Aligned Language Models
- [1810.08575] Supervising strong learners by amplifying weak experts
- Factored Cognition | Ought
- [2402.06782] Debating with More Persuasive LLMs Leads to More Truthful Answers
- Adversarial Machine Learning explained! | With examples. - YouTube
- AI Control: Improving Safety Despite Intentional Subversion — AI Alignment Forum
- [2312.06942] AI Control: Improving Safety Despite Intentional Subversion (toy protocol sketch after this list)
- [2312.09390] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (toy sketch below)
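The Redwood "AI Control" paper above caricatures nicely in code. Below is a toy of its trusted-monitoring protocol, with all three components stubbed out by me (the paper uses GPT-4 as the untrusted model, GPT-3.5 as the trusted monitor, and a small human-audit budget rather than a fixed threshold), so the numbers are illustrative only; the point is the decision structure of audit / defer-to-trusted / accept.

```python
# Toy of the trusted-monitoring protocol from the AI Control paper
# (arXiv:2312.06942). Every component below is a hypothetical stub;
# only the control-flow structure is from the paper.
import random

random.seed(0)

def untrusted_solve(task):
    # Stub: a capable but possibly scheming model that backdoors
    # 5% of its solutions, giving the monitor something to catch.
    return {"code": f"solution({task})", "backdoored": random.random() < 0.05}

def trusted_suspicion(solution):
    # Stub: a weaker trusted model scores how suspicious the code looks.
    # Imperfect by design: backdoors score higher on average, with overlap.
    base = random.gauss(0.7, 0.15) if solution["backdoored"] else random.gauss(0.3, 0.15)
    return min(max(base, 0.0), 1.0)

def trusted_solve(task):
    # Stub: the trusted model's own less capable, but safe, solution.
    return {"code": f"simple_solution({task})", "backdoored": False}

AUDIT_THRESHOLD = 0.9   # top slice goes to scarce human auditing
DEFER_THRESHOLD = 0.6   # middling suspicion: discard U's answer, use T's

caught, slipped_through = 0, 0
for task in range(10_000):
    sol = untrusted_solve(task)
    s = trusted_suspicion(sol)
    if s >= AUDIT_THRESHOLD:        # human audit catches backdoors reliably;
        if sol["backdoored"]:       # in the paper this is a win for the blue team
            caught += 1
        final = trusted_solve(task)
    elif s >= DEFER_THRESHOLD:      # defer-to-trusted: safe but less useful
        final = trusted_solve(task)
    else:
        final = sol                 # accept the untrusted output as-is
        if sol["backdoored"]:
            slipped_through += 1    # a safety failure

print(f"backdoors caught: {caught}, slipped through: {slipped_through}")
```

The red-team/blue-team game is then over how often backdoors can be made to land below the audit threshold, and how much usefulness the blue team gives up by deferring.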
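And for the last link, a toy of the weak-to-strong setup: a capacity-limited weak supervisor is trained on ground truth, a larger student is trained only on the supervisor's labels, and both are scored against held-out truth. The task, models, and sizes are all invented. Note that in a toy like this the student mostly just imitates the weak labels; the paper's result, that a pretrained strong model can partially recover ground truth beyond its supervisor, leans on representation quality and tricks like an auxiliary confidence loss that I am not reproducing here.

```python
# Toy of the weak-to-strong experimental protocol (arXiv:2312.09390).
# All specifics here are invented for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
D = 20
w_true = torch.randn(D)                       # hidden ground-truth direction

def make_data(n):
    X = torch.randn(n, D)
    return X, (X @ w_true > 0).float()        # binary labels

class WeakModel(nn.Module):
    """Capacity-limited supervisor: sees only the first k features,
    so its labels contain systematic errors, not just noise."""
    def __init__(self, k=5):
        super().__init__()
        self.k = k
        self.lin = nn.Linear(k, 1)
    def forward(self, X):
        return self.lin(X[:, :self.k])

def train(model, X, y, epochs=300):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        loss = F.binary_cross_entropy_with_logits(model(X).squeeze(-1), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

X_sup, y_sup = make_data(500)                 # ground truth the weak model sees
X_pool, _ = make_data(5_000)                  # unlabelled pool for the student
X_test, y_test = make_data(2_000)

weak = train(WeakModel(), X_sup, y_sup)
with torch.no_grad():
    pseudo = (weak(X_pool).squeeze(-1) > 0).float()   # weak labels, errors and all

strong = train(nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, 1)),
               X_pool, pseudo)

def acc(m):
    with torch.no_grad():
        return ((m(X_test).squeeze(-1) > 0).float() == y_test).float().mean().item()

print(f"weak supervisor: {acc(weak):.3f}  strong student: {acc(strong):.3f}")
```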