Silicon Sonnets

Universal Jailbreak Backdoors from Poisoned Human Feedback

February 12, 2024
Show Notes

This paper shows how an attacker can embed a universal jailbreak backdoor in a large language model (LLM) by poisoning the human feedback it is trained on, so that adding a trigger word to a prompt elicits harmful responses.