Silicon Sonnets

Weak-to-Strong Jailbreaking on Large Language Models

February 01, 2024
Show Notes

This study presents a new method for jailbreaking aligned large language models (LLMs) in which smaller, weaker models guide the attack on a stronger one, raising a significant safety concern. The study also proposes a defense strategy, but building more advanced defenses remains challenging.