AI Safety

Getting ready for the grown-ups to arrive

October 31, 2024 — January 19, 2025

adversarial

catastrophe

economics

faster pussycat

innovation

language

machine learning

mind

neural nets

NLP

security

tail risk

technology

Suspiciously similar content

Forked from superintelligence because the risk mitigation strategies are a field in themselves. Or rather, several distinct fields, which I need to map out in this notebook.

1 X-risk

X-risk is a term used in, e.g. the rationalist community to discuss risks of a possible AI explosion.

FWIW: I personally think that (various kinds of) AI catastrophic risk are plausible and serious enough to worry about, even if they are not the most likely option, because even if they are moderately unlikely they are very high impact. In decision theory terms, the expected value of the risk is high. If the possibility is that everyone dies, then we should be worried about it, even if it is only a 1% chance.

This kind of thing is empirically difficult for people to reason about..

2 Background

3 X-risk risk

There are people who think that focusing on X-risk is itself a risky distraction from more pressing problems, especially accelerationists.

e.g. what if we do not solve the climate crisis because we put effort into the AI risks instead? Or so much effort that it slowed down the AI that could have saved us? Or so much effort that we got distracted from other more pressing risks?

Example: Superintelligence: The Idea That Eats Smart People ¹ See also the currently-viral school of X-risk-risk-critique that classifies as a tribal marker of TESCREALism

AFAICT, the distinctions here are mostly sociological? The activities to manage X-risk are not necessarily in conflict with other activities to manage other risks. Moreover, getting the human species ready to deal with catastrophes in general seems like a feasible intermediate goal. TBC.

3.1 Most-important century model

Holden Karnofsky, The “most important century” blog post series
Robert Wiblin’s analysis: This could be the most important century

4 Intuitionistic AI safety

Reasoning about harms of AI is weird.

In some ways, arguing that chatbots should not be able to simulate hateful speech is like arguing we shouldn’t simulate car crashes. In my line of work, simulating things is precisely how we learn to prevent them. Generally, if something is terrible, it is very important to understand it in order to avoid it. It seems to me that understanding how hateful content arises can be achieved through simulation, just as car crashes can be understood through simulation. I would like to avoid both hate and car crashes.

OTOH if a car crash simulator ends up in the hands of terrorists there is a limit to how much damage it can do, because simulated car crashes are not real. By contrast, there does not seem to be a distinction between simulated speech and real speech, just so long as someone hears it.

Kareem Carr more rigorously describes what he thinks people imagine the machines should do. He does, IMO, articulate crisply what is going on when we talk vaguely about biased models in public discourse.

He calls the recommendation a solution. I resist that because I think the problem is ill-defined. Equity is contextual, and in this ill-defined domain where the purpose of the models is not clear (placating pundits? fuelling Twitter discourse? making people feel good? Making some people feel good? generating advertising copy? Fuelling intellectual debate?), we cannot be definitive.

5 Theoretical tools

5.1 Courses

5.2 SLT

Singular learning theory has been pitched to me as a tool with applications to AI safety.

6 In Australia

See AI Safety in Australia.

7 Governance issues

See AI in governance

Figure 4: Looks like AI Safety is going fine.

8 Incoming

AI Safety Asia (AISA)
Writing Doom – Award-Winning Short Film on Superintelligence (2024) (video)
AiSafety.com’s landscape map: https://aisafety.world/
Deep atheism and AI risk - Joe Carlsmith
Wong and Bartlett (2022)

we hypothesize that once a planetary civilization transitions into a state that can be described as one virtually connected global city, it will face an ‘asymptotic burnout’, an ultimate crisis where the singularity-interval time scale becomes smaller than the.env time scale of innovation. If a civilization develops the capability to understand its own trajectory, it will have a window of time to affect a fundamental change to prioritize long-term homeostasis and well-being over unyielding growth—a consciously induced trajectory change or ‘homeostatic awakening’. We propose a new resolution to the Fermi paradox: civilizations either collapse from burnout or redirect themselves to prioritising homeostasis, a state where cosmic expansion is no longer a goal, making them difficult to detect remotely.
Dario Amodei — Machines of Loving Grace
MATS Winter 2023-24 Retrospective
Ten Hard Problems in and around AI

We finally published our big 90-page intro to AI. Its likely effects, from ten perspectives, ten camps. The whole gamut: ML, scientific applications, social applications, access, safety and alignment, economics, AI ethics, governance, and classical philosophy of life.
Machine Dreams - Dreaming Machines - Joscha Bach
2022 Expert Survey on Progress in AI at the AI Impacts Wiki
The follow-on 2024 Survey of 2,778 AI authors: six parts in pictures
Douglas Hofstadter changes his mind on Deep Learning & AI risk
François Chollet, The implausibility of intelligence explosion
Stuart Russell on Making Artificial Intelligence Compatible with Humans, an interview on various themes in his book (Russell 2019)
Attempted Gears Analysis of AGI Intervention Discussion With Eliezer
Kevin Scott argues for trying to find a unifying notion of what knowledge work is to unify what humans and machines can do (Scott 2022).
Planned Obsolescence
Call for the Special Track on AI Alignment - AAAI
OpenAI’s CBRN tests seem unclear
Frontier AI systems have surpassed the self-replicating red line — EA Forum
Dan Hendrycks
Center for AI Safety (CAIS)
Deadly by Default - by Duncan Sabien - Homo Sabiens

9 References

Bengio. 2024. “International Scientific Report on the Safety of Advanced AI - Interim Report.”

Bostrom. 2014. Superintelligence: Paths, Dangers, Strategies.

Ecoffet, and Lehman. 2021. “Reinforcement Learning Under Moral Uncertainty.” In Proceedings of the 38th International Conference on Machine Learning.

Grace, Stewart, Sandkühler, et al. 2024. “Thousands of AI Authors on the Future of AI.”

Greenblatt, Shlegeris, Sachan, et al. 2024. “AI Control: Improving Safety Despite Intentional Subversion.”

Hendrycks, Mazeika, and Woodside. 2023. “An Overview of Catastrophic AI Risks.”

Manheim, and Garrabrant. 2019. “Categorizing Variants of Goodhart’s Law.”

Nathan, and Hyams. 2021. “Global Policymakers and Catastrophic Risk.” Policy Sciences.

Ngo, Chan, and Mindermann. 2024. “The Alignment Problem from a Deep Learning Perspective.”

Omohundro. 2008. “The Basic AI Drives.” In Proceedings of the 2008 Conference on Artificial General Intelligence 2008: Proceedings of the First AGI Conference.

Russell. 2019. Human Compatible: Artificial Intelligence and the Problem of Control.

Sastry, Heim, Belfield, et al. n.d. “Computing Power and the Governance of Artificial Intelligence.”

Scott. 2022. “I Do Not Think It Means What You Think It Means: Artificial Intelligence, Cognitive Work & Scale.” American Academy of Arts & Sciences.

Taylor, Yudkowsky, LaVictoire, et al. 2020. “Alignment for Advanced Machine Learning Systems.” In Ethics of Artificial Intelligence.

Weidinger, Uesato, Rauh, et al. 2022. “Taxonomy of Risks Posed by Language Models.” In 2022 ACM Conference on Fairness, Accountability, and Transparency.

Wong, and Bartlett. 2022. “Asymptotic Burnout and Homeostatic Awakening: A Possible Solution to the Fermi Paradox?” Journal of The Royal Society Interface.

Zhuang, and Hadfield-Menell. 2021. “Consequences of Misaligned AI.”

Footnotes

I thought that effective altruism meta criticism was the idea that ate smart people.↩︎