Linguistic Obfuscation Attacks and Large Language Model Uncertainty

Sebastian Steindl, Ulrich Schäfer, Bernd Ludwig, Patrick Levi


Abstract
Large Language Models (LLMs) have taken the research field of Natural Language Processing by storm. Researchers are not only investigating their capabilities and possible applications, but also their weaknesses and how they may be exploited. This has resulted in various attacks and “jailbreaking” approaches that have gained large interest within the community. The vulnerability of LLMs to certain types of input may pose major risks regarding the real-world usage of LLMs in productive operations. We therefore investigate the relationship between an LLM’s uncertainty and its vulnerability to jailbreaking attacks. To this end, we focus on a probabilistic point of view of uncertainty and employ a state-of-the-art open-source LLM. We investigate an attack that is based on linguistic obfuscation. Our results indicate that the model is subject to a higher level of uncertainty when confronted with manipulated prompts that aim to evade security mechanisms. This study lays the foundation for future research into the link between model uncertainty and its vulnerability to jailbreaks.
Anthology ID:
2024.uncertainlp-1.4
Volume:
Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024)
Month:
March
Year:
2024
Address:
St Julians, Malta
Editors:
Raúl Vázquez, Hande Celikkanat, Dennis Ulmer, Jörg Tiedemann, Swabha Swayamdipta, Wilker Aziz, Barbara Plank, Joris Baan, Marie-Catherine de Marneffe
Venues:
UncertaiNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
35–40
URL:
https://aclanthology.org/2024.uncertainlp-1.4
Cite (ACL):
Sebastian Steindl, Ulrich Schäfer, Bernd Ludwig, and Patrick Levi. 2024. Linguistic Obfuscation Attacks and Large Language Model Uncertainty. In Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024), pages 35–40, St Julians, Malta. Association for Computational Linguistics.
Cite (Informal):
Linguistic Obfuscation Attacks and Large Language Model Uncertainty (Steindl et al., UncertaiNLP-WS 2024)
PDF:
https://aclanthology.org/2024.uncertainlp-1.4.pdf