Marginalium

A note in the margins

May 21, 2023



My commentary on something from elsewhere on the web.

On the alien characteristics of LLMs: the Waluigi effect.

Short version:

After you train an LLM to satisfy a desirable property P, then it’s easier to elicit the chatbot into satisfying the exact opposite of property P.

Why?

When you spend many bits-of-optimisation locating a character, it only takes a few extra bits to specify their antipode.
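To make the bit-counting concrete, here is a toy sketch. It is my own illustration, not anything from the original post, and it assumes a deliberately crude model: a "character" is a tuple of binary traits, every combination is equally likely, and the names NUM_TRAITS, bits_to_locate and antipode are hypothetical.

```python
import math

def bits_to_locate(num_candidates: int) -> float:
    """Bits needed to single out one item among equally likely candidates."""
    return math.log2(num_candidates)

# Assumption: a character is fully described by 20 on/off traits.
NUM_TRAITS = 20
character = (1,) * NUM_TRAITS  # the persona that training worked hard to find

def antipode(traits):
    """Flip every trait to get the exact opposite character."""
    return tuple(1 - t for t in traits)

# Finding the character from scratch means choosing among 2**20 options.
cost_from_scratch = bits_to_locate(2 ** NUM_TRAITS)

# Once the character is known, its opposite is a two-way choice:
# "this one" or "this one, flipped".
cost_of_flip = bits_to_locate(2)

print(cost_from_scratch, cost_of_flip)
```

In this toy model the expensive part is describing the character at all. Once that description exists, pointing at its opposite costs a single extra bit, which is the asymmetry the quote is gesturing at. Real models are not lists of independent binary traits, so treat the numbers as illustrative only.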

filed under:

absit-omnia animal-sentience betterment digital-architecture on-ethics on-thinking-and-reasoning somatic-architecture
