Team Misalignment Examples

Training large language models on narrow tasks can lead to broad misalignment

Previous works on finetuning safety largely target misuse-related finetuning attacks that make models comply with harmful requests (‘jailbreak finetuning’ 17). We ran head-to-head evaluations between ...

The Conversation

Getting AIs working toward human goals − study shows how to measure misalignment

Ideally, artificial intelligence agents aim to help humans, but what does that mean when humans want conflicting things? My colleagues and I have come up with a way to measure the alignment of the ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results

Training large language models on narrow tasks can lead to broad misalignment

Getting AIs working toward human goals − study shows how to measure misalignment

Trending now