AI Alignment Fast-Track Course

Scattered notes from the floor

January 10, 2025 — January 13, 2025

Notes on AI Alignment Fast-Track - Losing control to AI

1 Session 1

Terminology I should have already known but did not: Convergent Instrumental Goals.

  • Self-Preservation
  • Goal Preservation
  • Resource Acquisition
  • Self-Improvement

Ajeya Cotra’s intuitive taxonomy of the models that training could produce (only the last two are failure modes):

  • Saints
  • Sycophants
  • Schemers

2 Session 2

RLHF and Constitutional AI

3 Session 3

4 Misc things learned