Silicon Sonnets

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

October 9, 2023
Show Notes

The paper examines the safety risks of fine-tuning aligned large language models. It shows that safety alignment can be compromised with as few as ten adversarially designed training examples, and that even fine-tuning on benign, commonly used datasets can unintentionally degrade a model's safety guardrails, just as the title warns. The authors call for further research into reinforcing safety protocols for custom fine-tuning.